Unformatted text preview:

Systems I Optimizing for the Memory Hierarchy Topics Impact of caches on performance Memory hierarchy considerations Cache Performance Metrics Miss Rate Fraction of memory references not found in cache misses references Typical numbers 3 10 for L1 can be quite small e g 1 for L2 depending on size etc Hit Time Time to deliver a line in the cache to the processor includes time to determine whether the line is in the cache Typical numbers 1 3 clock cycle for L1 5 12 clock cycles for L2 Miss Penalty Additional time required because of a miss Typically 100 300 cycles for main memory 2 Writing Cache Friendly Code Repeated references to variables are good temporal locality Stride 1 reference patterns are good spatial locality Examples cold cache 4 byte words 4 word cache blocks int sumarrayrows int a M N int i j sum 0 int sumarraycols int a M N int i j sum 0 for i 0 i M i for j 0 j N j sum a i j return sum for j 0 j N j for i 0 i M i sum a i j return sum Miss rate 1 4 25 Miss rate 100 3 The Memory Mountain Empirical studies of memory system behavior Read throughput read bandwidth Number of bytes read from memory per second MB s Memory mountain Measured read throughput as a function of spatial and temporal locality Compact way to characterize memory system performance 4 Memory Mountain Test Function The test function void test int elems int stride int i result 0 volatile int sink for i 0 i elems i stride result data i sink result So compiler doesn t optimize away the loop Run test elems stride and return read throughput MB s double run int size int stride double Mhz double cycles int elems size sizeof int test elems stride warm up the cache cycles fcyc2 test elems stride 0 call test elems stride return size stride cycles Mhz convert cycles to MB s 5 Memory Mountain Main Routine mountain c Generate the memory mountain define MINBYTES 1 10 Working set size ranges from 1 KB define MAXBYTES 1 23 up to 8 MB define MAXSTRIDE 16 Strides range from 1 to 16 define MAXELEMS MAXBYTES sizeof int int data MAXELEMS int main int size int stride double Mhz The array we ll be traversing Working set size in bytes Stride in array elements Clock frequency init data data MAXELEMS Initialize each element in data to 1 Mhz mhz 0 Estimate the clock frequency for size MAXBYTES size MINBYTES size 1 for stride 1 stride MAXSTRIDE stride printf 1f t run size stride Mhz printf n exit 0 6 The Memory Mountain Pentium III Xeon 550 MHz 16 KB on chip L1 d cache 16 KB on chip L1 i cache 512 KB off chip unified L2 cache 1000 L1 800 600 400 L2 200 2k 8k 32k 128k 512k 2m s15 s13 s11 s7 s9 stride words mem s5 s3 0 s1 Slopes of Spatial Locality Ridges of Temporal Locality xe 8m read throughput MB s 1200 working set size bytes 7 Ridges of Temporal Locality Slice through the memory mountain with stride 1 illuminates read throughputs of different caches and memory 1200 main memory region L2 cache region L1 cache region 1000 800 600 400 200 working set size bytes 1k 2k 4k 8k 16k 32k 64k 128k 256k 512k 1024k 2m 4m 0 8m read througput MB s 8 A Slope of Spatial Locality Slice through memory mountain with size 256KB shows cache block size 800 700 read throughput MB s 600 500 one access per cache line 400 300 200 100 0 s1 s2 s3 s4 s5 s6 s7 s8 s9 s10 s11 s12 s13 s14 s15 s16 stride words 9 Matrix Multiplication Example Major Cache Effects to Consider Total cache size Exploit temporal locality and keep the working set small e g by using blocking Block size Exploit spatial locality Description Multiply N x N matrices O N3 total operations Accesses ijk Variable sum for i 0 i n i held in register for j 0 j n j sum 0 0 for k 0 k n k sum a i k b k j c i j sum N reads per source element N values summed per destination but may be able to hold in register 10 Miss Rate Analysis for Matrix Multiply Assume Line size 32B big enough for 4 64 bit words Matrix dimension N is very large Approximate 1 N as 0 0 Cache is not even big enough to hold multiple rows Analysis Method Look at access pattern of inner loop k i j j k A i B C 11 Layout of C Arrays in Memory review C arrays allocated in row major order each row in contiguous memory locations Stepping through columns in one row for i 0 i N i sum a 0 i accesses successive elements if block size B 4 bytes exploit spatial locality compulsory miss rate 4 bytes B Stepping through rows in one column for i 0 i n i sum a i 0 accesses distant elements no spatial locality compulsory miss rate 1 i e 100 12 Matrix Multiplication ijk ijk for i 0 i n i for j 0 j n j sum 0 0 for k 0 k n k sum a i k b k j c i j sum Inner loop j i j i A B Row wise Columnwise C Fixed Misses per Inner Loop Iteration A B C 0 25 1 0 0 0 13 Matrix Multiplication jik jik for j 0 j n j for i 0 i n i sum 0 0 for k 0 k n k sum a i k b k j c i j sum Misses per Inner Loop Iteration A B C 0 25 1 0 0 0 Inner loop j i j i A B Row wise Columnwise C Fixed 14 Matrix Multiplication kij kij for k 0 k n k for i 0 i n i r a i k for j 0 j n j c i j r b k j Inner loop i k A Fixed k B i C Row wise Row wise Misses per Inner Loop Iteration A B C 0 0 0 25 0 25 15 Matrix Multiplication ikj ikj for i 0 i n i for k 0 k n k r a i k for j 0 j n j c i j r b k j Inner loop i k A Fixed k B i C Row wise Row wise Misses per Inner Loop Iteration A B C 0 0 0 25 0 25 16 Matrix Multiplication jki jki for j 0 j n j for k 0 k n k r b k j for i 0 i n i c i j a i k r Inner loop k j k j A Column wise B Fixed C Columnwise Misses per Inner Loop Iteration A B C 1 0 0 0 1 0 17 Matrix Multiplication kji kji for k 0 k …


View Full Document

UT CS 429H - Optimizing for the Memory Hierarchy

Loading Unlocking...
Login

Join to view Optimizing for the Memory Hierarchy and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Optimizing for the Memory Hierarchy and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?