Parallelisation
Multithreading can be done in multiple ways, even using a single programming language.
The easiest and best solution in many cases is to use a library that does what you need and is already parallelised. Then someone else has done the hard work and hopefully has done it well. With that said:
Measure the scaling/speedup of your parallel program
Don't trust people's words about what is fast and what is parallel. They're probably lying or mistaken or talking about a different problem size or assumed different settings. You must time your workload.
- Get a baseline timing with a single thread on a representative input set. Ideally, use a good single-threaded implementation for this.
- Run with 2, 4, 8, etc threads using the same input. This gives you the strong scaling, the speedup for a given problem. You'll usually find that 8 threads don't give you 8x speedup, even on a machine with 8 cores.
- Run with 2, 4, 8, etc threads, also doubling the size of the input. This shows you the weak scaling, which is the marginal capacity for doing larger problems.
A Parallell boid simulation
Here I try to present an instructive example of how a relatively simple program turned out to be difficult to parallelise.
Boids are self-propelled particles in a 2D or 3D space. They move about in the space and change direction when they interact with nearby boids.
A boid simulation looks like this:
Finding the neighbours naively involves checking whether pair of boids is close, which is O(N2). Since boids only interact with nearby boids, a spatial subdivision scheme that uses a hierarchy to discard entire groups of boids from consideration at once seems like a better algorithm. In 2D, a quadtree reduces the complexity to O(N log N).
Such a tree can be built with
Then, finding the nearest neighbour looks like this:
The fast boid algorithm is then:
Profiling a C implementation of this shows that 96% of the runtime is spent in the force interaction loop between lines 7-11. Amdahl’s Law suggests that we should be able to achieve a maximum speedup of 25x with infinite parallelism, and about 6.25x speedup on 8 threads or 10x speedup on 16 threads. Moreover, this loop is embarrassingly parallel.
This was quickly done, and we measured the speedup with glee.
Of course, this wouldn't be a lesson in this course if it turned out as we'd hoped...
What happened??
More profiling showed that multithreading the interaction loop slowed down the tree building step by 65%. Additionally, the multithreaded code segment itself had 4% inefficiency. Some of this time is overhead from starting and waiting for threads.
The bottom line
Test, test, and test your assumptions about what makes things quicker (or not).