Cache Performance
October 3, 2007
class11.ppt
15-213, F'07: "The course that gives CMU its Zip!"

Topics
- Impact of caches on performance
- The memory mountain

Intel Pentium III Cache Hierarchy
- Processor chip:
  - Registers
  - L1 Data: 16 KB, 4-way associative, write-through, 32 B lines, 1-cycle latency
  - L1 Instruction: 16 KB, 4-way associative, 32 B lines
  - L2 Unified: 128 KB - 2 MB, 4-way associative, write-back, write-allocate, 32 B lines
- Main memory: up to 4 GB

Cache Performance Metrics
- Miss rate
  - Fraction of memory references not found in the cache (misses / references)
  - Typical numbers: 3-10% for L1; can be quite small (e.g., < 1%) for L2, depending on size, etc.
- Hit time
  - Time to deliver a line in the cache to the processor (includes the time to determine whether the line is in the cache)
  - Typical numbers: 1-2 clock cycles for L1; 5-20 clock cycles for L2
- Miss penalty
  - Additional time required because of a miss; typically 50-200 cycles for main memory (trend: increasing!)
- Aside for architects: what happens with increasing cache size? increasing block size? increasing associativity?

Writing Cache-Friendly Code
- Repeated references to variables are good (temporal locality)
- Stride-1 reference patterns are good (spatial locality)
- Example (cold cache, 4-byte words, 4-word cache blocks):

    int sum_array_rows(int a[M][N])
    {
        int i, j, sum = 0;

        for (i = 0; i < M; i++)
            for (j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }
    /* Miss rate = 1/4 = 25% */

    int sum_array_cols(int a[M][N])
    {
        int i, j, sum = 0;

        for (j = 0; j < N; j++)
            for (i = 0; i < M; i++)
                sum += a[i][j];
        return sum;
    }
    /* Miss rate = 100% */
The Memory Mountain
- Read throughput (read bandwidth): the number of bytes read from memory per second (MB/s)
- Memory mountain: measured read throughput as a function of spatial and temporal locality; a compact way to characterize memory system performance

Memory Mountain Test Function

    /* The test function */
    void test(int elems, int stride)
    {
        int i, result = 0;
        volatile int sink;

        for (i = 0; i < elems; i += stride)
            result += data[i];
        sink = result; /* So compiler doesn't optimize away the loop */
    }

    /* Run test(elems, stride) and return read throughput (MB/s) */
    double run(int size, int stride, double Mhz)
    {
        double cycles;
        int elems = size / sizeof(int);

        test(elems, stride);                     /* warm up the cache */
        cycles = fcyc2(test, elems, stride, 0);  /* call test(elems,stride) */
        return (size / stride) / (cycles / Mhz); /* convert cycles to MB/s */
    }
Memory Mountain Main Routine

    /* mountain.c - Generate the memory mountain. */
    #define MINBYTES  (1 << 10)   /* Working set size ranges from 1 KB ... */
    #define MAXBYTES  (1 << 23)   /* ... up to 8 MB */
    #define MAXSTRIDE 16          /* Strides range from 1 to 16 */
    #define MAXELEMS  MAXBYTES/sizeof(int)

    int data[MAXELEMS];           /* The array we'll be traversing */

    int main()
    {
        int size;        /* Working set size (in bytes) */
        int stride;      /* Stride (in array elements) */
        double Mhz;      /* Clock frequency */

        init_data(data, MAXELEMS); /* Initialize each element in data to 1 */
        Mhz = mhz(0);              /* Estimate the clock frequency */
        for (size = MAXBYTES; size >= MINBYTES; size >>= 1) {
            for (stride = 1; stride <= MAXSTRIDE; stride++)
                printf("%.1f\t", run(size, stride, Mhz));
            printf("\n");
        }
        exit(0);
    }

The Memory Mountain
[Figure: 3-D surface of read throughput (MB/s, 0-1200) vs. stride (words, s1-s15) and working set size (2 KB-8 MB), showing slopes of spatial locality, ridges of temporal locality, and L1/L2/memory regions. Machine: Pentium III, 550 MHz; 16 KB on-chip L1 d-cache; 16 KB on-chip L1 i-cache; 512 KB off-chip unified L2 cache.]

x86-64 Memory Mountain
[Figure: memory mountain with read throughput up to ~6000 MB/s vs. stride (s1-s29) and working set size (4 KB-128 MB), again showing slopes of spatial locality, ridges of temporal locality, and L1/L2/memory regions. Machine: Pentium Nocona Xeon x86-64, 3.2 GHz; 12 Kuop on-chip L1 trace cache; 16 KB on-chip L1 d-cache; 1 MB off-chip unified L2 cache.]

Opteron Memory Mountain
[Figure: memory mountain with read throughput up to ~3000 MB/s vs. stride and working set size, with L1/L2/memory regions. Machine: AMD Opteron, 2 GHz.]

Ridges of Temporal Locality
- A slice through the memory mountain at stride = 1 illuminates the read throughputs of the different caches and of main memory
[Figure: read throughput (MB/s, 0-1200) vs. working set size (1 KB-8 MB), showing distinct L1 cache, L2 cache, and main memory regions.]
A Slope of Spatial Locality
- A slice through the memory mountain at size = 256 KB shows the cache block size
[Figure: read throughput (MB/s, 0-800) vs. stride (words, s1-s16); throughput falls as stride grows until there is one access per cache line.]

Matrix Multiplication Example
- Major cache effects to consider:
  - Total cache size: exploit temporal locality and keep the working set small (e.g., use blocking)
  - Block size: exploit spatial locality
- Description:
  - Multiply N x N matrices
  - O(N^3) total operations
  - Accesses: N reads per source element; N values summed per destination (but these may be held in a register)
- Code (the variable sum is held in a register):

    /* ijk */
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            sum = 0.0;
            for (k = 0; k < n; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }

Miss Rate Analysis for Matrix Multiply
- Assume:
  - Line size = 32 B (big enough for four 64-bit words)
  - Matrix dimension N is very large (approximate 1/N as 0.0)
  - The cache is not even big enough to hold multiple rows
- Analysis method: look at the access pattern of the inner loop