Loop Optimizations I

5 min readAug 3, 2023

If you’re a game developer, operating system developer, system-architecture IT professional, real-time system designer or just a hobbyist who considers the performance of their code, keep reading.

To Create the Illusion, you need to be faster

Although this topic zips closely with The Computer and CPU architecture, considering the readers may not have a background on those topics, anything I’ll write here is gonna be superficial yet useful to speed up your programs.

Almost all of the programs rely heavily on loops, which are segments of code executed repeatedly. Consequently, a significant portion of the program’s execution time is consumed within these loops. Even minor modifications to this crucial code can greatly affect the program’s performance. Therefore, it is of utmost importance to thoroughly assess the performance of these hot loops and explore potential enhancements.

We can divide loop optimizations into two categories; Low-level and High-level optimizations respectively. Low-level optimizations are small adjustments in the code of the loop, you may think “reordering things a little” such that it improves the performance. High-level optimizations are the subject of restructuring the aforementioned loop such that it gives the best performance.

Menu
Low-level optimizations (This Article)
- Loop Invariant Code Motion
- Loop Unrolling
- Loop Strength Reduction
- Loop Unswitching
High-Level Optimizations (Next Article)
- Loop Interchange
- Loop Blocking
- Loop Fusion and Fission
Afiyet Olsun :)

To effectively optimize a loop, it is crucial to know what is the bottleneck of the loop, once you examine your program with a profiler such as the intel VTune or perf (in Linux) and find the loops that most of the execution time is attributed to, you can start determining the bottlenecks. Such bottlenecks mainly span the Memory Latency — Bandwidth and Compute capabilities of the machine. The roofline model is a good start for comparing the theoretical HW(hardware) maximum with the performance you get.

from the book Performance Analysis and Tuning on Modern CPUs

Low-Level Optimizations

Optimizations in this umbrella are about re-ordering the code in the loop. Generally, compilers are good at doing such transformations;
however, there are still cases when a compiler might need a developer’s support.

Loop Invariant Code Motion (LICM): Expressions evaluated in a loop that never change are called loop invariants. Since their value doesn’t change across loop iterations, we can move loop invariants expressions outside of the loop

// BEFORE LICM
for (int i = 0; i < N; ++i)
  for (int j = 0; j < N; ++j)
    a[j] = b[j] * c[i];

// AFTER LICM
for (int i = 0; i < N; ++i)
  auto cval = c[i];
  for (int j = 0; j < N; ++j)
    a[j] = b[j] * cval;

** By doing so, we saved N times memory read operation and the cache line occupation for array c (when needed it can be evicted)

Loop Unrolling: An induction variable is a variable in a loop, whose value is a function of the loop iteration number. the i variable is an example of it and the loop unrolling aims to do as much operations in a loop as possible by manipulating the “i” variable.

//BEFORE UNROLLING
for (int i = 0; i < N; ++i)
  a[i] = b[i] * c[i];

// AFTER UNROLLING BY FACTOR 2
for (int i = 0; i < N; i+=2){
  a[i] = b[i] * c[i];
  a[i+1] = b[i+1] * c[i+1];
}

A little Detail HERE: whenever we access a loop variable say it’s a[0], this value migrates from dram to caches and to the CPU itself. Since this operation is expensive, the CPU fetches a group of data(enough to fill a single cache line which is 64 bytes of data). Say the array a is an integer array (4bytes) meaning when we access a[0], all the data to a[15] is stored in the CPU cache. Since caches are very fast, we can extend this factor 2 to factor 16. Thus in a single loop, we can operate on the data stored in the cache at once and the next operation requires the fetch of a new cache line.

However, I suggest no developer should unroll a loop by hand.

It’s a very well-studied optimization technique all the compilers have highly optimized this moreover, some processors (intel) have an “embedded unroller” in their out-of-order execution engine.
While the processor is waiting for the load from the first iteration to finish, it may speculatively start executing the load from the second iteration. This
spans to multiple iterations ahead

Loop Strength Reduction:

The idea of LSR is to replace expensive instructions with cheaper ones. Such transformation can be applied to all expressions that use an induction variable.

Cpu load of operations
Double-precision floating point (FP) >> Single-Precision FP >> Integer calculations >> Constants

// BEFORE STRENGTH REDUCTION 
for (int i = 0; i < N; ++i)
  a[i] = b[i * 10] * c[i];

// AFTER STRENGTH REDUCTION
int j = 0;
for (int i = 0; i < N; ++i) {
  a[i] = b[j] * c[i];
  j += 10;
}

Loop Unswitching:

If a loop has a conditional inside and it is invariant, we can move it
outside of the loop. Remember any if — else statement contains at least a single jump and multiple comparison operations. Doing that per iteration is expensive and may potentially cause “bad speculation”

// BEFORE UNSWITCHING
for (i = 0; i < N; i++) {
  a[i] += b[i];
  if (c)
    b[i] = 0;
}

// AFTER UNSWITCHING
if (c)
  for (i = 0; i < N; i++) {
    a[i] += b[i];  
    b[i] = 0;
  }
else
  for (i = 0; i < N; i++) {
    a[i] += b[i];
  }

The downside, as you can see, is code duplication but this downside can be balanced with the idea of “ now we can perform different optimizations to the same code”

PS: Compilers are very good at performing low-level optimizations you can do you basic trics but you can aslo rely on the compiler on obvious ones. However it’s quite often required a developer intervention to perform for high-level optimizations

Since I don’t wanna bloat up this article, I’m gonna share it as a separate paper.

Have a nice one.

Loop Optimizations I

Low-Level Optimizations

Written by Osman Yasal

No responses yet