Performance analysis
Once you know the hotspots in your code, it may not be immediately evident what is making them slow.
Generally speaking, the execution of code is limited by the rate of arithmetic operations or by the rate of data flowing into the CPU. But which is it, and how can we tell?
The Roofline model
https://www.sciencedirect.com/topics/computer-science/roofline-model
In the Roofline model, the arithmetic intensity of a given code kernel is calculated (or estimated) and plotted on a roofline graph for the test hardware. The focus is on the kernel of the code: the small body of the innermost loop, where the CPU spends most of its time. For performance issues that stem from other sources, the Roofline model is not useful.
Arithmetic intensity (AI)
AI, sometimes called numeric intensity or operational intensity, is typically given in flops/word or flops/byte: the number of arithmetic operations performed in the ALU divided by the number of words or bytes transferred from memory. A high AI corresponds to a code that does a lot of arithmetic on every data element; such a code will usually be limited by the speed of the arithmetic operations and sits somewhere on the flat (compute-bound) part of the graph. A low AI means that the ALU is partly idle during execution, and the code sits somewhere on the slanted (bandwidth-bound) part of the roofline.
The arithmetic intensity of a code is a consequence of the algorithm and of the CPU's instruction set (i.e. its assembly language). Sometimes poor coding can artificially inflate or deflate it, but (for example) a correctly written kernel that sums two arrays will have an AI of 0.5 flops/word counting only the words it loads, and there's nothing you can do about it.
Often, the algorithm will tell you roughly what the AI of your kernel is, but you can be even more sure by looking at the assembly code and counting instructions. This requires some knowledge of assembly... which isn't actually so bad.
From a naive matrix-matrix multiply (single precision; movss, mulss and addss are scalar single-precision instructions):
40: f3 0f 10 01 movss (%rcx),%xmm0 :4 B
44: 83 c2 01 add $0x1,%edx :1 op (for loop bookkeeping)
47: f3 0f 59 c1 mulss %xmm1,%xmm0 :1 op
4b: 48 83 c1 04 add $0x4,%rcx :1 op (for loop bookkeeping)
4f: f3 0f 58 00 addss (%rax),%xmm0 :1 op 4 B
53: f3 0f 11 00 movss %xmm0,(%rax) :4 B
57: 48 83 c0 04 add $0x4,%rax :1 op (for loop bookkeeping)
5b: 81 fa 00 08 00 00 cmp $0x800,%edx :1 op (for loop bookkeeping)
61: 75 dd jne 40 <matmul_ref+0x40>
Total (ignoring bookkeeping and counting only the floating-point operations): 2 ops and 12 bytes, giving us 0.167 ops/byte or 0.667 ops/word.
The same program also contained an auto-vectorised (SIMD) version of the function:
b0: 0f 28 04 01 movaps (%rcx,%rax,1),%xmm0 :16 B
b4: 0f 59 c1 mulps %xmm1,%xmm0 :4 op
b7: 0f 58 04 02 addps (%rdx,%rax,1),%xmm0 :4 op 16 B
bb: 0f 29 04 02 movaps %xmm0,(%rdx,%rax,1) :16 B
bf: 48 83 c0 10 add $0x10,%rax :1 op (for loop bookkeeping)
c3: 48 3d 00 20 00 00 cmp $0x2000,%rax :1 op (for loop bookkeeping)
c9: 75 e5 jne b0 <matmul_autovec+0x30>
Total (ignoring bookkeeping): 8 ops and 48 bytes, giving us 0.167 ops/byte or 0.667 ops/word.
Drawing a roofline graph
This can be challenging, especially if you lack certain information about your computer's hardware. On Linux machines, /proc/cpuinfo is a good start; on a Mac, the command sysctl -n machdep.cpu is helpful. You need the clock speed, but also the available vector instruction sets (look for e.g. avx2 among the "flags"); a web search can then tell you which SIMD instructions those sets provide.
You also need to find the maximum bandwidth to memory, and this information is not always easy to look up. I've found that a simple test in C using memcpy() provides a usable estimate.
In the example above, borrowed from the best Roofline article ("Roofline: An Insightful Visual Performance Model for Multicore Architectures"), you can see the computational ceilings for a certain computer. With "just" perfect thread-level parallelism (TLP), a program can attain at most 2 GFlops. With SIMD instructions doing 4 effective operations per instruction, the maximum performance increases accordingly. The last doubling of speed comes from "balancing" multiplies and adds, allowing the compiler to issue fused multiply-add instructions that effectively do two operations at once. Note that the height of some of these ceilings depends on whether you're doing single-precision or double-precision math.
This example shows the memory-bandwidth ceilings. A code that relies only on decent memory performance from accessing addresses sequentially, so that entire cache lines are consumed efficiently and cache misses are minimised, will reach the peak floating-point capacity only if it performs 4 floating-point operations on each byte (32 operations for every 8-byte double!). By making the code NUMA-aware, ensuring that threads access only the nearest memory, performance can be more than doubled. The last bit of efficiency is gained by introducing a clever prefetching scheme that uses your knowledge of the algorithm to ensure that the right data is always in the right place when needed.
A naive implementation of a code, even if nicely multithreaded, will typically perform somewhere in the light blue region of the above plot.