Profiling
In many cases, a program will spend the majority of its time in a small fraction of its code. Clearly, this is the part of the code you need to improve, and profiling is how you find it.
Amdahl's Law
This law usually comes up in the context of parallelisation, but it applies equally to any other optimisation technique. Amdahl's law points out that the overall speedup gained by optimising a single section of a program is limited by the fraction of the total runtime that the section accounts for.
More specifically, if you are optimising a section of code that accounts for p% of the total runtime, the best overall speedup you can achieve is a factor of 100/(100-p), which is what you would get if the section's runtime dropped all the way to zero.
Knowing the expected outcome of an optimisation effort is helpful when deciding whether to spend time on it.
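As a quick sanity check of the formula, here is a few-line Python sketch (the percentages are arbitrary examples):

def max_speedup(p):
    # Upper bound on overall speedup (Amdahl's law) when optimising
    # a section that accounts for p percent of the total runtime.
    return 100 / (100 - p)

print(max_speedup(50))  # 2.0: perfectly optimising half the runtime at most doubles the speed
print(max_speedup(90))  # 10.0: a section worth 90% of the runtime allows up to 10x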
Time measurement
A simple way to time a whole program is the UNIX time command.
user@system:~/mydir$ time ./myprogram
real 0m0.008s
user 0m0.000s
sys 0m0.002s
This is super handy for checking whether a change made a difference, but can also hint at potential bottlenecks.
real is the "wall-clock time": the actual physical time that the program took to run. This is what you ultimately want to improve.
user is time spent in "user mode", i.e. normal program execution. Multithreaded processes report the sum of the times of all threads, so user can exceed real.
sys is time spent in system calls. Programs that spend their time here are usually limited by memory I/O, disk accesses, network, console output, or the like.
To see this in action, save a short Python program as fiotest.py with the following lines:
for i in range(10000):
    with open("test.txt", "w") as f:
        f.write(str(i) + "\n")
Try running "time python3 fiotest.py" and examine the output. Then move the "with open..." line above the loop (as sketched below) and compare, paying particular attention to the sys figure.
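For reference, the rearranged version looks like this; the file is now opened (and truncated) only once, so the writes also behave differently:

# Open the file once, then write all 10000 lines to it.
with open("test.txt", "w") as f:
    for i in range(10000):
        f.write(str(i) + "\n")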
For timing individual parts of your code, Python has the timeit module; the closest equivalent in C is the clock() function from the <time.h> header.
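As a minimal sketch, timeit runs a small snippet many times and reports the total elapsed seconds (the snippet and repeat count here are arbitrary):

import timeit

# Time 10000 executions of a small statement.
elapsed = timeit.timeit("str(12345) + '\\n'", number=10000)
print(elapsed, "seconds")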
Profilers
Profilers are tools, either internal or external, that can tell you more details about where your program is spending its time.
Try profiling the fiotest.py script:
marcusl$ python3 -m cProfile -s 'tottime' fiotest.py
50003 function calls in 1.587 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
10000 0.957 0.000 0.989 0.000 {built-in method io.open}
1 0.596 0.596 1.587 1.587 fiotest.py:2(<module>)
10000 0.017 0.000 0.017 0.000 {built-in method _locale.nl_langinfo}
10000 0.012 0.000 0.028 0.000 _bootlocale.py:33(getpreferredencoding)
10000 0.005 0.000 0.005 0.000 codecs.py:186(__init__)
10000 0.002 0.000 0.002 0.000 {method 'write' of '_io.TextIOWrapper' objects}
1 0.000 0.000 1.587 1.587 {built-in method builtins.exec}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
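Note how nearly all of the time (0.957 of 1.587 seconds) goes into the 10000 calls to io.open: the bottleneck is the repeated opening of the file, not the writing itself.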
On SNIC systems, you have access to Intel VTune, which can do the same thing for C, Fortran, and other codes. Many IDEs have a built-in profiler, including Xcode.
Basically all Linux systems with gcc also have the profiling tool gprof and the memory checker valgrind, which may not be super nice to use but are perfectly serviceable in a pinch; a basic gprof session is sketched below.
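To use gprof, compile with -pg so the program records profile data, run it (this writes gmon.out), and then feed both to gprof (myprogram is a placeholder name):

user@system:~/mydir$ gcc -pg -O2 myprogram.c -o myprogram
user@system:~/mydir$ ./myprogram
user@system:~/mydir$ gprof ./myprogram gmon.out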
Valgrind is actually an entire collection of tools. If you want to know something about the behaviour of your program, chances are good that something in valgrind can tell you. For instance, the callgrind tool can give a clear report on cache misses and branch prediction misses at the granularity of a function; a sketch of such a run follows below.
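Cache and branch simulation are off by default and must be switched on explicitly; callgrind writes its results to a file named after the process id, which callgrind_annotate then turns into a readable report (again, myprogram is a placeholder):

user@system:~/mydir$ valgrind --tool=callgrind --cache-sim=yes --branch-sim=yes ./myprogram
user@system:~/mydir$ callgrind_annotate callgrind.out.<pid>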