2011年11月24日木曜日

Acceleration of programs using CPU caches effectively


Effective use of CPU caches make the performance of programs improved. This entry shows how much improvement is given with effective use of CPU caches.

Motivation
Computer data storages have hierarchy which consists of registers, cache, main memory and so on, distinguished by response time. Registers and cache memory are smaller but faster than main memory.

CPUs can directly operate registers only. So, between registers and main memory, there is about x100 gap in access speed, and to fill the gap, CPUs have cache memory inside. With effectively using CPU cache, we can improve the performance of programs.

However, programmers can not make CPUs directly to use the cache memory in the way they want, so they must write programs considering how the CPUs read/write the data on the memory to indicate CPUs indirectly how to use the cache.

* SPARC64 VIIIfx used in the K computer of RIKEN in Japan has "sector cache" functionality and programmers can directly indicate it how to use the cache. With sector cache, the cache in a CPU is divided in two sectors and programmers can instruct which sector they use to store data.

This entry demonstrates an example of performance improvement using CPU cache effectively.

Example
I show the optimization in efficiency of a program which computes the multiplication of 1000x1000 matrices using cache blocking technique.

The code before optimized (code1.c):
#include <stdio.h>
#include <stdlib.h>
 
#define N 1000
 
int main ()
{
  double A[N][N], B[N][N], C[N][N];
  int i, j, k;
  
  // initialize the matrices
  for ( i = 0; i < N; i++ )
    for ( j = 0; j < N; j++ )
      {
        A[i][j] = 1.0;
        B[i][j] = 1.0;
        C[i][j] = 0.0;
      }
  
  // matrix multiplication
  for ( i = 0; i < N; i++ )
    for ( j = 0; j < N; j++ )
      for (k = 0; k < N; k++ )
        C[i][j] += A[i][k] * B[k][j];
  
  printf ( "%f\n", C[0][0] );
  
  return 0;
}

I rewrite this code as below with cache blocking (code2.c). Of course, the result of multiplication is same as to that of code1.c.
$ cat code2.c
#include <stdio.h>
#include <stdlib.h>
 
#define N 1000
 
int main ()
{
  double A[N][N], B[N][N], C[N][N];
  int i, j, k;
  
  // initialize the matrices
  for ( i = 0; i < N; i++ )
    for ( j = 0; j < N; j++ )
      {
        A[i][j] = 1.0;
        B[i][j] = 1.0;
        C[i][j] = 0.0;
      }
  
  // matrix multiplication
  int ibl = 100;
  int ib, jb, kb;
  for (ib=0; ib<N; ib+=ibl)
    for (jb=0; jb<N; jb+=ibl)
      for (kb=0; kb<N; kb+=ibl)
        for (i=ib; i<ib+ibl; i++)
          for (j=jb; j<jb+ibl; j++)
            for (k=kb; k<kb+ibl; k++)
              C[i][j] += A[i][k] * B[k][j];
  
  printf ( "%f\n", C[0][0] );
  
  return 0;
}

Demonstration
The times taken to execute the unoptimized code and the optimized code are as below respectively:
$ gcc -Wl,-stack_size,1000000000 -O2 -o code1 code1.c
$ time ./code1 > /dev/null

real    0m7.595s
user    0m7.494s
sys     0m0.035s

$ gcc -Wl,-stack_size,1000000000 -O2 -o code2 code2.c
$ time ./code2 > /dev/null

real    0m2.509s
user    0m2.462s
sys     0m0.030s
* The -Wl option is to designate ld to allocate large arrays on the stack

The result represented as a chart:



Cache blocking improve the performance by about x3.

Comparing the performance as the size of cache blocking
Next, I show how the performance changes as the size of block. I changed the size of the block to 100, 200, 250, 500, then timed to calculation.

As a chart:


With the block size of 500, the performance degrades massively, which represents that the expansion of the block size leads to overflow of data from the cache.

The bytes of the 500x500 sized block is 2MB, with double typed array, so, considering other programs running in background, it agrees with the CPU cache size of my computer :)

--

0 件のコメント: