Parallelisation

Multithreading can be done in multiple ways, even using a single programming language. 

The easiest and best solution in many cases is to use a library that does what you need and is already parallelised. Then someone else has done the hard work and hopefully has done it well. With that said:

Measure the scaling/speedup of your parallel program

Don't trust people's word about what is fast and what is parallel. They're probably lying, or mistaken, or talking about a different problem size, or assuming different settings. You must time your own workload.

  1. Get a baseline timing with a single thread on a representative input set. Ideally, use a good single-threaded implementation for this.
  2. Run with 2, 4, 8, etc. threads on the same input. This gives you the strong scaling: the speedup for a fixed problem size. You'll usually find that 8 threads don't give you an 8x speedup, even on a machine with 8 cores. (A minimal timing harness is sketched after this list.)
  3. Run with 2, 4, 8, etc. threads while scaling the input size in proportion. This shows you the weak scaling: how much extra capacity the added threads give you for tackling larger problems.
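
As a starting point, here is a minimal sketch of such a measurement. It assumes OpenMP and times with omp_get_wtime(); the work() function and the problem size are stand-ins for your actual workload. Compile with -fopenmp.

    #include <stdio.h>
    #include <omp.h>

    /* Stand-in workload: sum of squares over n elements. */
    static double work(long n)
    {
        double s = 0.0;
        #pragma omp parallel for reduction(+:s)
        for (long i = 0; i < n; i++)
            s += (double)i * (double)i;
        return s;
    }

    int main(void)
    {
        const long n = 100000000;        /* representative problem size */

        /* 1. Baseline with a single thread. */
        omp_set_num_threads(1);
        double t0 = omp_get_wtime();
        volatile double r = work(n);
        double base = omp_get_wtime() - t0;
        printf(" 1 thread : %.3f s\n", base);

        /* 2. Strong scaling: same input, more threads. */
        for (int p = 2; p <= 16; p *= 2) {
            omp_set_num_threads(p);
            t0 = omp_get_wtime();
            r = work(n);
            double t = omp_get_wtime() - t0;
            printf("%2d threads: %.3f s, speedup %.2fx\n", p, t, base / t);
        }
        (void)r;
        return 0;
    }

For weak scaling, the same loop works with n scaled in proportion to the thread count; the interesting number is then how close each run stays to the baseline time.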

A parallel boid simulation

Here I try to present an instructive example of how a relatively simple program turned out to be difficult to parallelise. 

Boids are self-propelled particles in a 2D or 3D space. They move about in the space and change direction when they interact with nearby boids.

A boid simulation looks like this:

[listing: the boid algorithm]
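
The original listing is not reproduced here. As a rough stand-in, one simulation step might look like the sketch below; the struct layout, the neighbour radius and the placeholder steering rule are assumptions, not the actual course code.

    typedef struct { float x, y, vx, vy; } Boid;

    /* One naive simulation step: every boid reacts to every other nearby boid. */
    void boid_step(Boid *b, int n, float radius, float dt)
    {
        for (int i = 0; i < n; i++) {
            float fx = 0.0f, fy = 0.0f;
            for (int j = 0; j < n; j++) {          /* check every other boid */
                if (i == j) continue;
                float dx = b[j].x - b[i].x;
                float dy = b[j].y - b[i].y;
                if (dx * dx + dy * dy < radius * radius) {
                    fx += dx;                      /* placeholder steering rule */
                    fy += dy;
                }
            }
            b[i].vx += fx * dt;
            b[i].vy += fy * dt;
        }
        for (int i = 0; i < n; i++) {              /* move the boids */
            b[i].x += b[i].vx * dt;
            b[i].y += b[i].vy * dt;
        }
    }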

Finding the neighbours naively involves checking whether each pair of boids is close, which is O(N^2). Since boids only interact with nearby boids, a spatial subdivision scheme that uses a hierarchy to discard entire groups of boids from consideration at once seems like a better algorithm. In 2D, a quadtree reduces the complexity to O(N log N).

Such a tree can be built like this:

[listing: quadtree construction algorithm]
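
Again, this is not the course's listing, but here is a sketch of one common way to build such a tree: a point-region quadtree where each leaf stores at most one point, and a full leaf is subdivided into four children on insertion.

    #include <stdlib.h>

    typedef struct { float x, y; } Point;

    typedef struct QuadNode {
        float cx, cy, half;           /* square cell: centre and half-width */
        const Point *pt;              /* point stored in a leaf, or NULL */
        struct QuadNode *child[4];    /* four quadrants; all NULL in a leaf */
    } QuadNode;

    static QuadNode *quad_new(float cx, float cy, float half)
    {
        QuadNode *q = calloc(1, sizeof *q);
        q->cx = cx; q->cy = cy; q->half = half;
        return q;
    }

    /* Index of the quadrant of q that contains p. */
    static int quad_index(const QuadNode *q, const Point *p)
    {
        return (p->x >= q->cx) + 2 * (p->y >= q->cy);
    }

    static void quad_insert(QuadNode *q, const Point *p)
    {
        if (q->child[0] == NULL && q->pt == NULL) {   /* empty leaf: store here */
            q->pt = p;
            return;
        }
        if (q->child[0] == NULL) {                    /* occupied leaf: subdivide */
            float h = q->half * 0.5f;
            q->child[0] = quad_new(q->cx - h, q->cy - h, h);
            q->child[1] = quad_new(q->cx + h, q->cy - h, h);
            q->child[2] = quad_new(q->cx - h, q->cy + h, h);
            q->child[3] = quad_new(q->cx + h, q->cy + h, h);
            const Point *old = q->pt;
            q->pt = NULL;
            quad_insert(q->child[quad_index(q, old)], old);
        }
        quad_insert(q->child[quad_index(q, p)], p);   /* descend into the right quadrant */
    }

Building the tree is then one quad_insert per boid, which is O(N log N) for reasonably spread-out points. (Coincident points would recurse forever in this bare-bones version; a real implementation caps the depth or keeps a small bucket of points per leaf.)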

Then, finding a boid's nearest neighbours looks like this:

[listing: nearest-neighbour search algorithm]
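
For the boid force loop, what is actually needed is every neighbour within the interaction radius rather than the single nearest one, so here is a sketch of a range query over the quadtree above (the original listing may differ). The point of the pruning test is that whole quadrants too far from the query circle are skipped, which is where the win over the brute-force scan comes from.

    #include <math.h>

    /* Collect every point within radius r of (x, y) into out[]; returns the count. */
    static int quad_range(const QuadNode *q, float x, float y, float r,
                          const Point **out, int max, int count)
    {
        if (q == NULL)
            return count;
        /* Distance from the query point to this cell along each axis
           (negative if the query point lies inside the cell's extent). */
        float dx = fabsf(x - q->cx) - q->half;
        float dy = fabsf(y - q->cy) - q->half;
        if (dx > r || dy > r)
            return count;                             /* cell is too far away: prune it */
        if (q->pt != NULL) {
            float px = q->pt->x - x, py = q->pt->y - y;
            if (px * px + py * py <= r * r && count < max)
                out[count++] = q->pt;
        }
        for (int i = 0; i < 4; i++)
            count = quad_range(q->child[i], x, y, r, out, max, count);
        return count;
    }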

The fast boid algorithm is then:

[listing: the fast boid algorithm]
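
Pulling the sketches above together, the fast step could look roughly like this. WORLD_HALF and MAX_NEIGHBOURS are made-up constants, and memory management is mostly waved away; the course's actual listing will differ.

    #define WORLD_HALF     1000.0f    /* assumed half-width of the simulation area */
    #define MAX_NEIGHBOURS 64         /* assumed cap on neighbours per boid */

    void fast_boid_step(Boid *b, int n, float radius, float dt)
    {
        /* 1. Build the quadtree over the current boid positions. */
        QuadNode *root = quad_new(0.0f, 0.0f, WORLD_HALF);
        Point *pts = malloc(n * sizeof *pts);
        for (int i = 0; i < n; i++) {
            pts[i].x = b[i].x;
            pts[i].y = b[i].y;
            quad_insert(root, &pts[i]);
        }

        /* 2. Force interaction loop: this is the part that dominates the profile. */
        for (int i = 0; i < n; i++) {
            const Point *near[MAX_NEIGHBOURS];
            int k = quad_range(root, b[i].x, b[i].y, radius,
                               near, MAX_NEIGHBOURS, 0);
            float fx = 0.0f, fy = 0.0f;
            for (int j = 0; j < k; j++) {
                fx += near[j]->x - b[i].x;            /* placeholder steering rule */
                fy += near[j]->y - b[i].y;
            }
            b[i].vx += fx * dt;
            b[i].vy += fy * dt;
        }

        /* 3. Move the boids. */
        for (int i = 0; i < n; i++) {
            b[i].x += b[i].vx * dt;
            b[i].y += b[i].vy * dt;
        }

        free(pts);
        /* Freeing the tree nodes is omitted from this sketch. */
    }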

Profiling a C implementation of this shows that 96% of the runtime is spent in the force interaction loop (lines 7-11 of the listing above). Amdahl's Law then puts the maximum achievable speedup at 25x with infinite parallelism, at about 6.25x on 8 threads and at 10x on 16 threads. Moreover, this loop is embarrassingly parallel.
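
Those figures come straight from Amdahl's Law with a parallel fraction p = 0.96:

    speedup(n)        = 1 / ((1 - p) + p/n)

    speedup(8)        = 1 / (0.04 + 0.96/8)  = 1 / 0.16 = 6.25
    speedup(16)       = 1 / (0.04 + 0.96/16) = 1 / 0.10 = 10
    speedup(infinity) = 1 / 0.04             = 25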

Parallelising that loop was quickly done, and we measured the speedup with glee.
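
The parallel version isn't shown here; assuming OpenMP was used on the loop from the sketch above (an assumption, not something the text states), the change can be as small as a single pragma:

    /* Hypothetical parallelisation of the force interaction loop from the
       sketch above. Each iteration only reads the shared tree and writes
       its own boid, which is what makes it embarrassingly parallel. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) {
        const Point *near[MAX_NEIGHBOURS];
        int k = quad_range(root, b[i].x, b[i].y, radius,
                           near, MAX_NEIGHBOURS, 0);
        float fx = 0.0f, fy = 0.0f;
        for (int j = 0; j < k; j++) {
            fx += near[j]->x - b[i].x;
            fy += near[j]->y - b[i].y;
        }
        b[i].vx += fx * dt;
        b[i].vy += fy * dt;
    }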

Of course, this wouldn't be a lesson in this course if it turned out as we'd hoped...

[figure: speedup of the parallel fast boid simulation, showing poor results]

What happened??

More profiling showed that multithreading the interaction loop slowed the tree-building step down by 65%. On top of that, the multithreaded section itself lost about 4% to parallel inefficiency, some of which is the overhead of starting threads and waiting for them to finish.

The bottom line

Test, test, and test your assumptions about what makes things quicker (or not).