**Unformatted text preview:**

Optimizing for the Memory HierarchyTopicsTopics Impact of caches on performance Memory hierarchy considerationsSystems I2Cache Performance MetricsMiss RateMiss Rate Fraction of memory references not found in cache(misses/references) Typical numbers: 3-10% for L1 can be quite small (e.g., < 1%) for L2, depending on size, etc.Hit TimeHit Time Time to deliver a line in the cache to the processor (includestime to determine whether the line is in the cache) Typical numbers: 1-3 clock cycle for L1 5-12 clock cycles for L2Miss PenaltyMiss Penalty Additional time required because of a miss Typically 100-300 cycles for main memory3Writing Cache Friendly CodeRepeated references to variables are good (temporalRepeated references to variables are good (temporallocality)locality)Stride-1 reference patterns are good (spatial locality)Stride-1 reference patterns are good (spatial locality)Examples:Examples: cold cache, 4-byte words, 4-word cache blocksint sumarrayrows(int a[M][N]){ int i, j, sum = 0; for (i = 0; i < M; i++) for (j = 0; j < N; j++) sum += a[i][j]; return sum;}int sumarraycols(int a[M][N]){ int i, j, sum = 0; for (j = 0; j < N; j++) for (i = 0; i < M; i++) sum += a[i][j]; return sum;}Miss rate = Miss rate = 1/4 = 25% 100%4The Memory MountainEmpirical studies of memory system behaviorEmpirical studies of memory system behaviorRead throughput (read bandwidth)Read throughput (read bandwidth) Number of bytes read from memory per second (MB/s)Memory mountainMemory mountain Measured read throughput as a function of spatial andtemporal locality. Compact way to characterize memory system performance.5Memory Mountain Test Function/* The test function */void test(int elems, int stride) { int i, result = 0; volatile int sink; for (i = 0; i < elems; i += stride)result += data[i]; sink = result; /* So compiler doesn't optimize away the loop */}/* Run test(elems, stride) and return read throughput (MB/s) */double run(int size, int stride, double Mhz){ double cycles; int elems = size / sizeof(int); test(elems, stride); /* warm up the cache */ cycles = fcyc2(test, elems, stride, 0); /* call test(elems,stride) */ return (size / stride) / (cycles / Mhz); /* convert cycles to MB/s */}6Memory Mountain Main Routine/* mountain.c - Generate the memory mountain. */#define MINBYTES (1 << 10) /* Working set size ranges from 1 KB */#define MAXBYTES (1 << 23) /* ... up to 8 MB */#define MAXSTRIDE 16 /* Strides range from 1 to 16 */#define MAXELEMS MAXBYTES/sizeof(int) int data[MAXELEMS]; /* The array we'll be traversing */int main(){ int size; /* Working set size (in bytes) */ int stride; /* Stride (in array elements) */ double Mhz; /* Clock frequency */ init_data(data, MAXELEMS); /* Initialize each element in data to 1 */ Mhz = mhz(0); /* Estimate the clock frequency */ for (size = MAXBYTES; size >= MINBYTES; size >>= 1) {for (stride = 1; stride <= MAXSTRIDE; stride++) printf("%.1f\t", run(size, stride, Mhz));printf("\n"); } exit(0);}7The Memory Mountains1s3s5s7s9s11s13s158m2m512k128k32k8k2k020040060080010001200read throughput (MB/s)stride (words)working set size (bytes)Pentium III Xeon550 MHz16 KB on-chip L1 d-cache16 KB on-chip L1 i-cache512 KB off-chip unifiedL2 cacheRidges ofTemporalLocalityL1L2memSlopes ofSpatialLocalityxe8Ridges of Temporal LocalitySlice through the memory mountain with stride=1Slice through the memory mountain with stride=1 illuminates read throughputs of different caches andmemory0200400600800100012008m4m2m1024k512k256k128k64k32k16k8k4k2k1kworking set size (bytes)read througput (MB/s)L1 cacheregionL2 cacheregionmain memoryregion9A Slope of Spatial LocalitySlice through memory mountain with size=256KBSlice through memory mountain with size=256KB shows cache block size.0100200300400500600700800s1 s2 s3 s4 s5 s6 s7 s8 s9 s10 s11 s12 s13 s14 s15 s16stride (words)read throughput (MB/s)one access per cache line10Matrix Multiplication ExampleMajor Cache Effects to ConsiderMajor Cache Effects to Consider Total cache size Exploit temporal locality and keep the working set small (e.g., by usingblocking) Block size Exploit spatial localityDescription:Description: Multiply N x N matrices O(N3) total operations Accesses N reads per source element N values summed per destination» but may be able to hold in register/* ijk */for (i=0; i<n; i++) { for (j=0; j<n; j++) { sum = 0.0; for (k=0; k<n; k++) sum += a[i][k] * b[k][j]; c[i][j] = sum; }}Variable sumheld in register11Miss Rate Analysis for Matrix MultiplyAssume:Assume: Line size = 32B (big enough for 4 64-bit words) Matrix dimension (N) is very large Approximate 1/N as 0.0 Cache is not even big enough to hold multiple rowsAnalysis Method:Analysis Method: Look at access pattern of inner loopCAkiBkjij12Layout of C Arrays in Memory(review)C arrays allocated in row-major orderC arrays allocated in row-major order each row in contiguous memory locationsStepping through columns in one row:Stepping through columns in one row: for (i = 0; i < N; i++)sum += a[0][i]; accesses successive elements if block size (B) > 4 bytes, exploit spatial locality compulsory miss rate = 4 bytes / BStepping through rows in one column:Stepping through rows in one column: for (i = 0; i < n; i++)sum += a[i][0]; accesses distant elements no spatial locality! compulsory miss rate = 1 (i.e. 100%)13Matrix Multiplication (ijk)/* ijk */for (i=0; i<n; i++) { for (j=0; j<n; j++) { sum = 0.0; for (k=0; k<n; k++) sum += a[i][k] * b[k][j]; c[i][j] = sum; }}A B C(i,*)(*,j)(i,j)Inner loop:Column-wiseRow-wise FixedMisses per Inner Loop Iteration:Misses per Inner Loop Iteration:A B C0.25 1.0 0.014Matrix Multiplication (jik)/* jik */for (j=0; j<n; j++) { for (i=0; i<n; i++) { sum = 0.0; for (k=0; k<n; k++) sum += a[i][k] * b[k][j]; c[i][j] = sum }}A B C(i,*)(*,j)(i,j)Inner loop:Row-wise Column-wiseFixedMisses per Inner Loop Iteration:Misses per Inner Loop Iteration:A B C0.25 1.0 0.015Matrix Multiplication (kij)/* kij */for (k=0; k<n; k++) { for (i=0; i<n; i++) { r = a[i][k]; for (j=0; j<n; j++) c[i][j] += r * b[k][j]; }}A B C(i,*)(i,k) (k,*)Inner loop:Row-wise Row-wiseFixedMisses per Inner Loop Iteration:Misses per Inner Loop Iteration:A B C0.0 0.25 0.2516Matrix Multiplication (ikj)/* ikj

View Full Document