15-213: The course that gives CMU its Zip!

Cache Performance
October 3, 2007
class11.ppt

Topics
- Impact of caches on performance
- The memory mountain


Intel Pentium III Cache Hierarchy

Processor chip: registers, L1 data cache, L1 instruction cache; the L2 cache and main memory sit off chip.
- L1 Data: 1-cycle latency, 16 KB, 4-way set associative, write-through, 32B lines
- L1 Instruction: 16 KB, 4-way set associative, 32B lines
- L2 Unified: 128 KB - 2 MB, 4-way set associative, write-back, write-allocate, 32B lines
- Main Memory: up to 4 GB


Cache Performance Metrics

Miss rate
- Fraction of memory references not found in the cache (misses / references)
- Typical numbers: 3-10% for L1; can be quite small (e.g., < 1%) for L2, depending on size, etc.

Hit time
- Time to deliver a line in the cache to the processor (includes time to determine whether the line is in the cache)
- Typical numbers: 1-2 clock cycles for L1; 5-20 clock cycles for L2

Miss penalty
- Additional time required because of a miss
- Typically 50-200 cycles for main memory (trend: increasing!)

Aside for architects
- Increasing cache size
- Increasing block size
- Increasing associativity


Writing Cache-Friendly Code

- Repeated references to variables are good (temporal locality)
- Stride-1 reference patterns are good (spatial locality)

Examples: assume a cold cache, 4-byte words, 4-word cache blocks.

    int sum_array_rows(int a[M][N])
    {
        int i, j, sum = 0;

        for (i = 0; i < M; i++)
            for (j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }

Miss rate = 1/4 = 25%

    int sum_array_cols(int a[M][N])
    {
        int i, j, sum = 0;

        for (j = 0; j < N; j++)
            for (i = 0; i < M; i++)
                sum += a[i][j];
        return sum;
    }

Miss rate = 100%


The Memory Mountain

- Read throughput (read bandwidth): number of bytes read from memory per second (MB/s)
- Memory mountain: measured read throughput as a function of spatial and temporal locality
- A compact way to characterize memory system performance


Memory Mountain Test Function

    /* The test function */
    void test(int elems, int stride)
    {
        int i, result = 0;
        volatile int sink;

        for (i = 0; i < elems; i += stride)
            result += data[i];
        sink = result; /* So compiler doesn't optimize away the loop */
    }

    /* Run test(elems, stride) and return read throughput (MB/s) */
    double run(int size, int stride, double Mhz)
    {
        double cycles;
        int elems = size / sizeof(int);

        test(elems, stride);                     /* warm up the cache */
        cycles = fcyc2(test, elems, stride, 0);  /* call test(elems,stride) */
        return (size / stride) / (cycles / Mhz); /* convert cycles to MB/s */
    }


Memory Mountain Main Routine

    /* mountain.c - Generate the memory mountain. */
    #define MINBYTES (1 << 10)  /* Working set size ranges from 1 KB */
    #define MAXBYTES (1 << 23)  /* ... up to 8 MB */
    #define MAXSTRIDE 16        /* Strides range from 1 to 16 */
    #define MAXELEMS MAXBYTES/sizeof(int)

    int data[MAXELEMS];         /* The array we'll be traversing */

    int main()
    {
        int size;        /* Working set size (in bytes) */
        int stride;      /* Stride (in array elements) */
        double Mhz;      /* Clock frequency */

        init_data(data, MAXELEMS);  /* Initialize each element in data to 1 */
        Mhz = mhz(0);               /* Estimate the clock frequency */
        for (size = MAXBYTES; size >= MINBYTES; size >>= 1) {
            for (stride = 1; stride <= MAXSTRIDE; stride++)
                printf("%.1f\t", run(size, stride, Mhz));
            printf("\n");
        }
        exit(0);
    }


The Memory Mountain (Pentium III)

- Pentium III, 550 MHz; 16 KB on-chip L1 d-cache; 16 KB on-chip L1 i-cache; 512 KB off-chip unified L2 cache
- [Figure: 3D surface of read throughput (MB/s, peaking near 1200) over stride (s1-s15, in words) and working set size (2 KB - 8 MB). The L1, L2, and mem plateaus form ridges of temporal locality; throughput falls with stride along the slopes of spatial locality.]


X86-64 Memory Mountain

- Pentium Nocona Xeon x86-64, 3.2 GHz; 12 Kuop on-chip L1 trace cache; 16 KB on-chip L1 d-cache; 1 MB off-chip unified L2 cache
- [Figure: the same plot for the Xeon; read throughput peaks near 6000 MB/s, again showing ridges of temporal locality and slopes of spatial locality.]


Opteron Memory Mountain

- AMD Opteron, 2 GHz
- [Figure: the same plot for the Opteron; read throughput peaks near 2500 MB/s, with L1, L2, and Mem ridges over strides s1-s29 and working set sizes 4 KB - 128 MB.]
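The test()/run() measurement above leans on fcyc2() and mhz() from the course's cycle-timing library, which aren't reproduced here. As a rough, self-contained sketch of the same measurement (my substitution, not the course code), a POSIX clock_gettime() timer can stand in for the cycle counter; it loses fcyc2()'s cycle-level precision and times a single run per point, but it still traces out the mountain's shape. It assumes 4-byte ints, as the slides do.

    /* mountain_lite.c - timer-based approximation of mountain.c (a sketch,
     * not the CS:APP original). Build with, e.g., gcc -O1 mountain_lite.c */
    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <time.h>

    #define MINBYTES (1 << 10)
    #define MAXBYTES (1 << 23)
    #define MAXSTRIDE 16
    #define MAXELEMS (MAXBYTES / sizeof(int))

    static int data[MAXELEMS];
    static volatile int sink;   /* so the compiler can't drop the read loop */

    static void test(int elems, int stride)
    {
        int i, result = 0;

        for (i = 0; i < elems; i += stride)
            result += data[i];
        sink = result;
    }

    /* Read throughput in MB/s: each run touches size/stride bytes. */
    static double run_mbs(int size, int stride)
    {
        struct timespec t0, t1;
        int elems = size / sizeof(int);
        double secs;

        test(elems, stride);                  /* warm up the cache */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        test(elems, stride);                  /* timed run */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        return ((double)size / stride) / secs / 1e6;
    }

    int main(void)
    {
        int size, stride, i;

        for (i = 0; i < (int)MAXELEMS; i++)   /* initialize each element to 1 */
            data[i] = 1;
        for (size = MAXBYTES; size >= MINBYTES; size >>= 1) {
            for (stride = 1; stride <= MAXSTRIDE; stride++)
                printf("%.1f\t", run_mbs(size, stride));
            printf("\n");
        }
        return 0;
    }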
Ridges of Temporal Locality

- Slice through the memory mountain with stride = 1: illuminates the read throughputs of the different caches and of main memory
- [Figure: read throughput (MB/s, 0-1200) versus working set size (1 KB - 8 MB) at stride 1, with distinct L1 cache, L2 cache, and main memory regions.]


A Slope of Spatial Locality

- Slice through the memory mountain with size = 256 KB: shows the cache block size
- [Figure: read throughput (MB/s, 0-800) versus stride (s1-s16, in words) at a 256 KB working set; throughput levels off once there is one access per cache line.]


Matrix Multiplication Example

Major cache effects to consider:
- Total cache size: exploit temporal locality and keep the working set small (e.g., use blocking)
- Block size: exploit spatial locality

Description:
- Multiply N x N matrices
- O(N^3) total operations
- Accesses: N reads per source element; N values summed per destination (but may be able to hold the running sum in a register)


Miss Rate Analysis for Matrix Multiply

Assume:
- Line size = 32B (big enough for four 64-bit words)
- Matrix dimension N is very large: approximate 1/N as 0.0
- Cache is not even big enough to hold multiple rows

Analysis method:
- Look at the access pattern of the inner loop: it reads row i of A and column j of B to produce element (i, j) of C


Matrix Multiplication (ijk)

    /* ijk */
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            sum = 0.0;
            for (k = 0; k < n; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }

- Variable sum is held in a register


Layout of C Arrays in Memory (review)

- C arrays are allocated in row-major order: each row occupies contiguous memory locations
- Stepping through columns in one row:

    for (i = 0; i < N; i++)
        sum += a[0][i];

  accesses ...
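The ijk loop above is a fragment with the declarations elided. A minimal complete version (my scaffolding around the slide's loop; N and the fill values are arbitrary choices) makes the inner loop's access pattern concrete: a[i][k] walks along a row (stride 1, good spatial locality in row-major C), while b[k][j] walks down a column (stride N, so for large N each access touches a new cache line).

    /* matmul_ijk.c - the slide's ijk multiply wrapped in a runnable program. */
    #include <stdio.h>

    #define N 512   /* arbitrary; large relative to a 32B cache line */

    static double a[N][N], b[N][N], c[N][N];

    int main(void)
    {
        int i, j, k;
        double sum;

        for (i = 0; i < N; i++)        /* arbitrary test values */
            for (j = 0; j < N; j++) {
                a[i][j] = 1.0;
                b[i][j] = 2.0;
            }

        /* ijk: row of A (stride 1), column of B (stride N); sum stays in
         * a register, so each c[i][j] is written exactly once. */
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++) {
                sum = 0.0;
                for (k = 0; k < N; k++)
                    sum += a[i][k] * b[k][j];
                c[i][j] = sum;
            }

        printf("c[0][0] = %.1f\n", c[0][0]);  /* expect N * 1.0 * 2.0 = 1024.0 */
        return 0;
    }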