Performance bottlenecks
The rate of execution of a given process (i.e. a running program) is typically limited by one piece of hardware, perhaps two. To increase performance, you have to know what that bottleneck is.
So let's look at some typical hardware:
In order to do any work, the CPU cores must execute instructions and move data to/from memory, disk, and network.
Disk, network, and other system call bottlenecks
We will only briefly cover the scenario when a program is limited by network and file I/O, because if there is a solution it is usually pretty obvious. Other system calls, like printing to console, are similarly simple. When we get to profiling, we'll also show a couple of ways of discovering that this is the bottleneck.
A common cause of disk I/O bottlenecks is opening and closing files rapidly. Every time a process asks the operating system to open or close a file, it takes about 0.1 ms; just writing to an already-open file is roughly 100x faster. We'll see a demonstration of this later.
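As a quick illustration, here is a minimal sketch in standard C++ that times both patterns (the file name and iteration count are arbitrary; exact numbers depend on your OS, filesystem, and library buffering, but the reopen-per-write version is typically the slow one):

```cpp
#include <chrono>
#include <fstream>
#include <iostream>

int main() {
    using clock = std::chrono::steady_clock;
    const int n = 1000;

    // Pattern 1: reopen the file for every write
    // (one open/close syscall pair per iteration).
    auto t0 = clock::now();
    for (int i = 0; i < n; ++i) {
        std::ofstream f("out.txt", std::ios::app);
        f << i << '\n';
    }
    auto t1 = clock::now();

    // Pattern 2: open once, write n times (writes are buffered).
    std::ofstream f("out.txt", std::ios::app);
    for (int i = 0; i < n; ++i)
        f << i << '\n';
    f.close();
    auto t2 = clock::now();

    std::cout << "reopen each time: "
              << std::chrono::duration<double, std::milli>(t1 - t0).count()
              << " ms\n"
              << "open once:        "
              << std::chrono::duration<double, std::milli>(t2 - t1).count()
              << " ms\n";
}
```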
An exception to the rule that these bottlenecks are easy to see and resolve can be found in large distributed memory codes, where time spent communicating between nodes can limit the performance. If you find yourself in such a situation (a parallel profiler will tell you this), find an expert to help you and be prepared to spend a few months or years developing a new code...
Cache
Before we can discuss the optimisation of the actual instructions that do work, we must examine the CPU a bit more closely.
As this diagram of a 4-core Intel i7 shows, a fair amount of CPU space is dedicated to a large cache, which is used to invisibly (to the programmer) preload data from RAM onto the CPU. This image actually understates the space dedicated to cache, because each core has L2 and L1 caches as well! Probably >50% of the chip is occupied by cache.
This should tell you that proper use of cache is important. The details of how caches work in practice vary a lot between chips, but there are general rules that we can follow to improve most codes. In particular, when a byte of data is accessed from RAM, an entire cache line is fetched, usually 64 or 128 bytes.
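The practical consequence: access data in the order it is laid out in memory, so that every byte of each fetched cache line gets used. Here is a minimal sketch (the matrix size is arbitrary) that traverses the same array twice, once along its memory layout and once against it:

```cpp
#include <chrono>
#include <iostream>
#include <vector>

int main() {
    const std::size_t n = 4096;            // 4096 x 4096 doubles = 128 MiB
    std::vector<double> a(n * n, 1.0);
    using clock = std::chrono::steady_clock;

    // Stride-1 traversal: consecutive elements share cache lines, so a
    // 64-byte line fetched from RAM is fully used (8 doubles per fetch).
    double sum1 = 0.0;
    auto t0 = clock::now();
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            sum1 += a[i * n + j];
    auto t1 = clock::now();

    // Stride-n traversal: each access lands on a different cache line,
    // so only one double of every fetched line is used before eviction.
    double sum2 = 0.0;
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < n; ++i)
            sum2 += a[i * n + j];
    auto t2 = clock::now();

    std::cout << sum1 << ' ' << sum2 << '\n'   // keep the sums "live"
              << "row-major:    "
              << std::chrono::duration<double>(t1 - t0).count() << " s\n"
              << "column-major: "
              << std::chrono::duration<double>(t2 - t1).count() << " s\n";
}
```

The two loops do identical arithmetic; only the memory access pattern differs, and that difference alone typically costs several-fold in runtime.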
Note that the cache, while present on the chip, is not directly operated on. The CPU does arithmetic operations on registers, which get loaded with the output of other operations or with data from cache/memory.
The CPU pipeline
"We have to deeper."
All modern CPUs are pipelined. At the end of the single-core era, some CPUs had 30-stage pipelines, but modern chips have brought this back down to fewer than 20. We can illustrate the idea with a generic 4-step "fetch-decode-execute-write" pipeline:
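Schematically, with one instruction entering the pipeline per cycle (F = fetch, D = decode, E = execute, W = write):

```
cycle:    1   2   3   4   5   6   7
instr 1:  F   D   E   W
instr 2:      F   D   E   W
instr 3:          F   D   E   W
instr 4:              F   D   E   W
```

Once the pipeline is full (cycle 4 here), one instruction completes every cycle and four are in flight at once.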
Because different parts of a CPU are designed to do different things anyway, several instructions can be executed at once: a new instruction is fetched (the next line of assembly is read) while the previous instruction is being decoded (its binary encoding is translated into operations), and so on. This is known as instruction-level parallelism (ILP). The advantage of this setup is better utilisation of the CPU's hardware and (in this case) up to four sequential instructions of a single program executing in parallel. In real systems, a CPU core only runs at full performance if 10-20 instructions are in the pipeline at once.
In order to fill the pipeline, the CPU must be able to guess at the instructions that will follow e.g. an "if" statement. Additionally, all of the instructions must have their data ready (i.e. in a register if doing arithmetic, in L1 cache if loading a register, etc.). If there is a cache miss, or the CPU guesses wrong, everything in the pipeline must wait or be purged entirely, reducing the effective ILP. Modern CPUs mitigate these stalls to some extent by executing some instructions out of order.
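A classic way to see branch misprediction cost is to run the same branchy loop over random data and over sorted data; the sketch below does exactly that. (A caveat: depending on the compiler and optimisation level, the branch may be compiled to a branchless conditional move, which hides the effect.)

```cpp
#include <algorithm>
#include <chrono>
#include <iostream>
#include <random>
#include <vector>

// Time a branchy loop over the given data.
static double time_sum(const std::vector<int>& v) {
    auto t0 = std::chrono::steady_clock::now();
    long long sum = 0;
    for (int x : v)
        if (x >= 128)      // this is the branch the predictor must guess
            sum += x;
    auto t1 = std::chrono::steady_clock::now();
    std::cout << "(sum = " << sum << ") ";
    return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> dist(0, 255);
    std::vector<int> v(1 << 24);
    for (int& x : v) x = dist(rng);

    // Random order: the branch goes either way ~50% of the time,
    // so the predictor is often wrong and the pipeline gets flushed.
    std::cout << "unsorted: " << time_sum(v) << " s\n";

    // Sorted order: long runs of taken/not-taken, near-perfect prediction.
    std::sort(v.begin(), v.end());
    std::cout << "sorted:   " << time_sum(v) << " s\n";
}
```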
Vector/SIMD operations
One of the main workhorses of a CPU core is the Arithmetic Logic Unit (ALU), which does the math. Modern ALUs are equipped with wide "vector" registers, which enable SIMD parallelism. A typical double-precision floating point number uses 8 bytes (64 bits), and the widest current vector registers are 512 bits, so one register can store 8 doubles (or 16 single-precision floats), and the associated vector instructions can do operations on all the elements simultaneously.
Needless to say, this is super handy if you have an algorithm that does a lot of the same operations on adjacent data. But even then, unless you write the vector instructions by hand, the compiler often needs to be coaxed into issuing them. In either case you will be subject to the whims of the machine spirits; prayers and offerings can go a long way.
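Here is a sketch of a textbook vectorisable loop, assuming GCC or Clang; `__restrict` is a compiler extension (not standard C++), and the diagnostic flags shown in the comment are the GCC and Clang spellings respectively:

```cpp
#include <cstddef>

// The same multiply-add applied to adjacent elements, with no
// loop-carried dependency: ideal for SIMD. "__restrict" promises the
// compiler that the arrays don't overlap; without it, the compiler may
// refuse to vectorise (or emit runtime overlap checks), since aliasing
// could change the result.
void saxpy(std::size_t n, float a,
           const float* __restrict x, float* __restrict y) {
    for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

// To check whether the compiler actually issued vector instructions:
//   g++     -O3 -march=native -fopt-info-vec          saxpy.cpp
//   clang++ -O3 -march=native -Rpass=loop-vectorize   saxpy.cpp
```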
Parallelisation
Of course, if you only issue instructions to one CPU core at a time with a single-threaded program, you're potentially missing out on a lot of performance. We'll discuss this more later.