Loop Optimizations II

Osman Yasal
5 min readAug 4, 2023


If you’re a game developer, operating system developer, system-architecture IT professional, real-time system designer, or just a hobbyist who considers the performance of their code, keep reading.

To Create the Illusion, you need to be faster

Although this topic flirts (💋) closely with computer and CPU architecture, and considering that readers may not have a background in those areas, everything I write here will stay fairly superficial yet still be useful for speeding up your programs.

Almost all programs rely heavily on loops, which are segments of code executed repeatedly. Consequently, a significant portion of a program’s execution time is consumed within these loops. Even minor modifications to this crucial code can greatly affect the program’s performance. Therefore, it is of utmost importance to thoroughly assess the performance of these hot loops and explore potential enhancements.

We can divide loop optimizations into two categories: low-level and high-level optimizations. Low-level optimizations are small adjustments inside the loop’s code; think of them as “reordering things a little” so that performance improves. High-level optimizations restructure the loop itself so that it delivers the best performance.

Menu

Low-level optimizations (Previous Article)
- Loop Invariant Code Motion
- Loop Unrolling
- Loop Strength Reduction
- Loop Unswitching

High-Level Optimizations (This Article)
- Loop Interchange
- Loop Blocking
- Loop Fusion and Fission

Afiyet olsun (bon appétit)! :)

High-Level Optimizations

High-level optimizations restructure loops so that they deliver the best performance. Compilers “may” not be able to perform these automatically, so in the end it is the developer’s responsibility to look for them.

Loop Interchange: This is about changing the loop order. Say you have two nested loops, one inside the other; we can swap their order to increase cache hits.

// BEFORE LOOP INTERCHANGE
void matmul(int N, float a[][2048], float b[][2048], float c[][2048]) {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            for (int k = 0; k < N; k++) {
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
                // b[k][j] walks down a column, which cannot exploit the cache.
            }
        }
    }
}
// AFTER LOOP INTERCHANGE
void matmul(int N, float a[][2048], float b[][2048], float c[][2048]) {
    for (int i = 0; i < N; i++) {
        for (int k = 0; k < N; k++) { // loops j and k are swapped: j is now the innermost loop
            for (int j = 0; j < N; j++) {
                c[i][j] = c[i][j] + a[i][k] * b[k][j]; // every access now walks along a row
            }
        }
    }
}

In the C language family, arrays are mapped to memory in row-major order. You can check that by printing the addresses of a[0][0] and a[0][1] in a single statement:

#include <iostream>

int main()
{
    int a[5][5];
    std::cout << &a[0][0] << "\n" << &a[0][1] << std::endl;
    return 0;
}
// OUTPUT (the exact addresses will differ from run to run)
0x7ffd29a4fe00
0x7ffd29a4fe04

Since the elements of a row are consecutive in memory, iterating along a row (varying the last index fastest) gives better cache performance than walking down a column.
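As a small illustration of that point, here is a hedged sketch (the array name and the 1024x1024 size are made up for the example) that sums the same matrix in the two traversal orders; the row-wise version touches consecutive addresses, while the column-wise version jumps N floats on every access.

constexpr int N = 1024;
static float m[N][N]; // row-major: m[i][0..N-1] sit next to each other in memory

// Row-wise traversal: consecutive addresses, so one cache line serves many iterations.
float sum_by_rows() {
    float s = 0.0f;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

// Column-wise traversal: a stride of N floats, so most accesses miss the cache.
float sum_by_columns() {
    float s = 0.0f;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i][j];
    return s;
}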

Here’s a quick benchmark for 50x50 matrix multiplication (remember this size is quite small compared to real-world examples)

https://quick-bench.com/q/y50plSFtYamifPauGF114QneYR0
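In case the link ever goes stale, here is roughly what that comparison looks like with Google Benchmark (the library quick-bench runs). Treat it as a sketch: the function names, the 50x50 size, and the harness details are assumptions, not the exact code behind the link.

#include <benchmark/benchmark.h>

// The two versions from above, renamed only so they can live in one file.
void matmul_ijk(int N, float a[][2048], float b[][2048], float c[][2048]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j]; // b[k][j] walks down a column
}

void matmul_ikj(int N, float a[][2048], float b[][2048], float c[][2048]) {
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                c[i][j] += a[i][k] * b[k][j]; // every access now walks along a row
}

constexpr int kN = 50; // same small size as the linked benchmark
static float a[kN][2048], b[kN][2048], c[kN][2048];

static void BM_before_interchange(benchmark::State& state) {
    for (auto _ : state) {
        matmul_ijk(kN, a, b, c);
        benchmark::ClobberMemory(); // keep the stores from being optimized away
    }
}
BENCHMARK(BM_before_interchange);

static void BM_after_interchange(benchmark::State& state) {
    for (auto _ : state) {
        matmul_ikj(kN, a, b, c);
        benchmark::ClobberMemory();
    }
}
BENCHMARK(BM_after_interchange);

BENCHMARK_MAIN();

On quick-bench itself you paste only the functions and the BENCHMARK registrations; the site provides the main for you, so BENCHMARK_MAIN() is only needed when you build this locally.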

Loop Blocking: Loop blocking is another high-level optimization that aims to exploit the caches as much as possible. Loop interchange is usually the first step; when the data we operate on (for example, an unavoidable column walk) does not fit into the CPU cache, we use loop blocking to make the working set fit.

The idea of this transformation is to split the multi-dimensional execution range into smaller chunks (blocks or tiles) so that each block will fit in the CPU caches

https://www.intel.com/content/www/us/en/developer/articles/technical/how-to-use-loop-blocking-to-optimize-memory-use-on-32-bit-intel-architecture.html
// BEFORE LOOP BLOCKING
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        a[i][j] += b[j][i];
// AFTER LOOP BLOCKING
int block_size = 64 / sizeof(int); // 64 is the typical cache line size in bytes
for (int ii = 0; ii < N; ii += block_size)
    for (int jj = 0; jj < N; jj += block_size)
        for (int i = ii; i < ii + block_size && i < N; i++)     // the "&& i < N" / "&& j < N" guards
            for (int j = jj; j < jj + block_size && j < N; j++) // handle N not divisible by block_size
                a[i][j] += b[j][i];

Here’s a quick benchmark for a 150x150 2d array

https://quick-bench.com/q/0dk7GDqJddQ_USBcXkaZpaZGFyA

Loop Fusion and Fission: Separate loops can be fused together when they iterate over the same range and do not reference each other’s data. The opposite procedure, splitting one loop into separate loops, is called Loop Distribution (Fission).

// BEFORE FUSION
for (int i = 0; i < N; i++)
    a[i].x = b[i].x;
for (int j = 0; j < N; j++)
    a[j].y = b[j].y;
// AFTER FUSION
for (int i = 0; i < N; i++) {
    a[i].x = b[i].x;
    a[i].y = b[i].y;
}

Thus we sweep over the same data once instead of twice, which saves loop overhead and improves cache reuse.
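The code above only shows fusion, so here is a hedged sketch of the opposite direction as well. The Point struct, the histogram array, and the function names are made up for illustration; the point is that after fission the simple copy loop can be vectorized on its own, while the data-dependent histogram loop stays scalar instead of holding the whole loop back.

#include <cstdint>

struct Point { float x; std::uint8_t key; }; // hypothetical type, not from the article

// BEFORE FISSION: one loop mixes a streaming copy with an indirect update.
void process_fused(int N, Point* a, const Point* b, int* histogram) {
    for (int i = 0; i < N; i++) {
        a[i].x = b[i].x;       // simple copy, vectorizes well on its own
        histogram[b[i].key]++; // indirect access, blocks vectorization of the whole loop
    }
}

// AFTER FISSION: each loop can now be optimized independently.
void process_split(int N, Point* a, const Point* b, int* histogram) {
    for (int i = 0; i < N; i++)
        a[i].x = b[i].x;
    for (int i = 0; i < N; i++)
        histogram[b[i].key]++;
}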

Takeaways

  • There are lots of optimization and monitoring techniques for improving the performance of our programs.
  • Keep in mind that the more you optimize your code, the closer you get to the system and hardware your program runs on.
  • Loops are generally the hotspots in your program; optimizing them can significantly boost your application’s performance.
  • There’s a saying in Turkey, “Tie your donkey to a solid stake,” meaning do everything you can, then trust God. In the same spirit, do all the optimizations you can, then trust the compiler.

Examples From:
Denis Bakhvalov — Performance Analysis and Tuning on Modern CPUs

Have a nice one.
