Carnegie Mellon Introduction to Computer Systems 15 213 18 243 spring 2009 11th Lecture Feb 17th Instructors Gregory Kesden and Markus P schel Carnegie Mellon Last Time Memory layout FF Buffer overflow worms and viruses Stack foo stack frame B 80 Heap 08 00 Data Text data written by gets B pad exploit code bar stack frame Carnegie Mellon Last Time Program Optimization 160x O Carnegie Mellon Last Time Program optimization Overview Removing unnecessary procedure calls Code motion precomputation Strength reduction Sharing of common subexpressions Optimization blocker Procedure calls for i 0 i n i get vec element v i val res val void lower char s int i for i 0 i strlen s i if s i A s i Z s i A a Carnegie Mellon Today Program optimization Optimization blocker Memory aliasing Out of order processing Instruction level parallelism Understanding branch prediction Carnegie Mellon Optimization Blocker Memory Aliasing Sums rows of n x n matrix a and stores in vector b void sum rows1 double a double b long n long i j for i 0 i n i b i 0 for j 0 j n j b i a i n j sum rows1 inner loop L53 addsd rcx xmm0 addq 8 rcx decq rax movsd xmm0 rsi r8 8 jne L53 b a FP add FP store Code updates b i memory access on every iteration Why couldn t compiler optimize this away Carnegie Mellon Reason If memory is accessed compiler assumes the possibility of side effects Example Sums rows of n x n matrix a and stores in vector b void sum rows1 double a double b long n long i j for i 0 i n i b i 0 for j 0 j n j b i a i n j Value of B double A 9 0 1 2 4 8 16 32 64 128 init 4 8 16 i 0 3 8 16 double B 3 A 3 i 1 3 22 16 sum rows1 A B 3 i 2 3 22 224 Carnegie Mellon Removing Aliasing Sums rows of n x n matrix a and stores in vector b void sum rows2 double a double b long n long i j for i 0 i n i double val 0 for j 0 j n j val a i n j b i val sum rows2 inner loop L66 addsd rcx xmm0 addq 8 rcx decq rax jne L66 FP Add Scalar replacement Copy array elements that are reused into temporary variables Assumes no memory aliasing otherwise possibly incorrect Carnegie Mellon Unaliased Version When Aliasing Happens Sum rows is of n X n matrix a and store in vector b void sum rows2 double a double b long n long i j for i 0 i n i double val 0 for j 0 j n j val a i n j b i val double A 9 0 1 2 4 8 16 32 64 128 Value of B 4 8 16 i 0 3 8 16 double B 3 A 3 i 1 3 27 16 sum rows1 A B 3 i 2 3 27 224 Aliasing still creates interference Result different than before init Carnegie Mellon Optimization Blocker Memory Aliasing Memory aliasing Two different memory references write to the same location Easy to have happen in C Since allowed to do address arithmetic Direct access to storage structures Hard to analyze compiler cannot figure it out Hence is conservative Solution Scalar replacement in innermost loop Copy memory variables that are reused into local variables Basic scheme Load t1 a i t2 b i 1 Compute t4 t1 t2 Store a i t12 b i 1 t7 Carnegie Mellon More Difficult Example Matrix multiplication C A B C c double calloc sizeof double n n Multiply n x n matrices a and b void mmm double a double b double c int n int i j k for i 0 i n i for j 0 j n j for k 0 k n k c i n j a i n k b k n j j c a i b c Which array elements are reused All of them But how to take advantage Carnegie Mellon Step 1 Blocking Here 2 x 2 Blocking also called tiling partial unrolling loop exchange Assumes associativity compiler will never do it c double calloc sizeof double n n Multiply n x n matrices a and b void mmm double a double b double c int n int i j k for i 0 i n i 2 for j 0 j n j 2 for k 0 k n k 2 for i1 i i1 i 2 i1 for j1 j j1 j 2 j1 for k1 k k1 k 2 k1 c i1 n j1 a i1 n k1 b k1 n j1 j1 c a i1 b c Carnegie Mellon Step 2 Unrolling Inner Loops c double calloc sizeof double n n Multiply n x n matrices a and b void mmm double a double b double c int n int i j k for i 0 i n i 2 for j 0 j n j 2 for k 0 k n k 2 body body c i n j c i 1 n j a i n k b k n j a i n k 1 b k 1 n j c i n j a i 1 n k b k n j a i 1 n k 1 b k 1 n j c i 1 n j c i n j 1 a i n k b k n j 1 a i n k 1 b k 1 n j 1 c i n j 1 c i 1 n j 1 a i 1 n k b k n j 1 a i 1 n k 1 b k 1 n j 1 c i 1 n j 1 Every array element a b c used twice Now scalar replacement can be applied Carnegie Mellon Today Program optimization Optimization blocker Memory aliasing Out of order processing Instruction level parallelism Understanding branch prediction Carnegie Mellon Example Compute Factorials int rfact int n if n 1 return 1 return n rfact n 1 int fact int n int i int result 1 for i n i 0 i result result i return result Machines Intel Pentium 4 Nocona 3 2 GHz Fish Machines Intel Core 2 2 7 GHz Compiler Versions GCC 3 4 2 current on Fish machines Cycles per element or per mult Machine Nocona Core 2 rfact 15 5 6 0 fact 10 0 3 0 Something changed from Pentium 4 to Core Details later Carnegie Mellon Optimization 1 Loop Unrolling int fact u3a int n int i int result 1 for i n i 3 i 3 result result i i 1 i 2 for i 0 i result i return result Cycles per element or per mult Machine Nocona Core 2 Compute more values per iteration Does not help here Why Branch prediction details later rfact 15 5 6 0 fact 10 0 …
View Full Document