Loop tiling

In computer science, loop tiling, also known as loop blocking, or strip mine and interchange, is a loop optimization technique used by programmers or compilers to make the execution of certain types of loops more efficient, by increasing the locality of reference.

Overview

Loop tiling partitions a loop's iteration space into smaller chunks or blocks, so as to help ensure data used in a loop stays in the cache until it is reused. The partitioning of loop iteration space leads to partitioning of large array into smaller blocks, thus fitting accessed array elements into cache size, enhancing cache reuse and eliminating cache size requirements. An ordinary loop

for(i=0; i<N; ++i){
  ...
}

can be blocked with a block size B by replacing it with

for(j=0; j<N; j+=B)
  for(i=j; i<min(N, j+B); ++i){
    ....
  }

where min() is a function returning the minimum of its arguments.

Example

The following is an example of matrix vector multiplication. There are three arrays, each with 100 elements. The code does not partition the arrays into smaller sizes.

  int i, j, a[100][100], b[100], c[100];
  int n = 100;
  for (i = 0; i < n; i++) {
    c[i] = 0;
    for (j = 0; j < n; j++) {
      c[i] = c[i] + a[i][j] * b[j];
    }
  }

After we apply loop tiling using 2 * 2 blocks, our code looks like:

  int i, j, x, y, a[100][100], b[100], c[100];
  int n = 100;
  for (i = 0; i < n; i += 2) {
    c[i] = 0;
    c[i + 1] = 0;
    for (j = 0; j < n; j += 2) {
      for (x = i; x < min(i + 2, n); x++) {
        for (y = j; y < min(j + 2, n); y++) {
          c[x] = c[x] + a[x][y] * b[y];
        }
      }
    }
  }

The original loop iteration space is n by n. The accessed chunk of array a[i, j] is also n by n. When n is too large and the cache size of the machine is too small, the accessed array elements in one loop iteration (for example, i = 1, j = 1 to n) may cross cache lines, causing cache misses.

Tiling size

It is not always easy to decide what value of tiling size is optimal for one loop because it demands an accurate estimate of accessed array regions in the loop and the cache size of the target machine. The order of loop nests (loop interchange) also plays an important role in achieving better cache performance. Explicit blocking requires choosing a tile size based on these factors. By contrast, cache-oblivious algorithms are designed to make efficient use of cache without explicit blocking.

Loop tiling

Overview

Example

Tiling size

Further reading