Carnegie MellonIntroduction to Computer Systems15-213/18-243, spring 200911thLecture, Feb. 17thInstructors:Gregory Kesden and Markus PüschelCarnegie MellonLast Time Memory layout Buffer overflow, worms, and virusesFF00StackTextDataHeap0880Bfoo stack framebar stack frameBexploitcodepaddata writtenby gets()Carnegie MellonLast Time Program Optimization-O !160xCarnegie MellonLast Time Program optimization Overview Removing unnecessary procedure calls Code motion/precomputation Strength reduction Sharing of common subexpressions Optimization blocker: Procedure callsfor (i = 0; i < n; i++) {get_vec_element(v, i, &val);*res += val;}void lower(char *s){int i;for (i = 0; i < strlen(s); i++)if (s[i] >= 'A' && s[i] <= 'Z')s[i] -= ('A' - 'a');}Carnegie MellonToday Program optimization Optimization blocker: Memory aliasing Out of order processing: Instruction level parallelism Understanding branch predictionCarnegie MellonOptimization Blocker: Memory Aliasing Code updates b[i] (= memory access) on every iteration Why couldn’t compiler optimize this away?# sum_rows1 inner loop.L53:addsd (%rcx), %xmm0 # FP addaddq $8, %rcxdecq %raxmovsd %xmm0, (%rsi,%r8,8) # FP storejne .L53/* Sums rows of n x n matrix a and stores in vector b */void sum_rows1(double *a, double *b, long n) {long i, j;for (i = 0; i < n; i++) {b[i] = 0;for (j = 0; j < n; j++)b[i] += a[i*n + j];}}abΣCarnegie MellonReason If memory is accessed, compiler assumes the possibility of side effects Example:double A[9] = { 0, 1, 2,4, 8, 16},32, 64, 128};double B[3] = A+3;sum_rows1(A, B, 3);i = 0: [3, 8, 16]init: [4, 8, 16]i = 1: [3, 22, 16]i = 2: [3, 22, 224]Value of B:/* Sums rows of n x n matrix a and stores in vector b */void sum_rows1(double *a, double *b, long n) {long i, j;for (i = 0; i < n; i++) {b[i] = 0;for (j = 0; j < n; j++)b[i] += a[i*n + j];}}Carnegie MellonRemoving Aliasing Scalar replacement: Copy array elements that are reused into temporary variables Assumes no memory aliasing (otherwise possibly incorrect)# sum_rows2 inner loop.L66:addsd (%rcx), %xmm0 # FP Addaddq $8, %rcxdecq %raxjne .L66/* Sums rows of n x n matrix aand stores in vector b */void sum_rows2(double *a, double *b, long n) {long i, j;for (i = 0; i < n; i++) {double val = 0;for (j = 0; j < n; j++)val += a[i*n + j];b[i] = val;}}Carnegie MellonUnaliased Version When Aliasing Happens Aliasing still creates interference Result different than beforedouble A[9] = { 0, 1, 2,4, 8, 16},32, 64, 128};double B[3] = A+3;sum_rows1(A, B, 3);i = 0: [3, 8, 16]init: [4, 8, 16]i = 1: [3, 27, 16]i = 2: [3, 27, 224]Value of B:/* Sum rows is of n X n matrix aand store in vector b */void sum_rows2(double *a, double *b, long n) {long i, j;for (i = 0; i < n; i++) {double val = 0;for (j = 0; j < n; j++)val += a[i*n + j];b[i] = val;}}Carnegie MellonOptimization Blocker: Memory Aliasing Memory aliasing: Two different memory references writeto the same location Easy to have happen in C Since allowed to do address arithmetic Direct access to storage structures Hard to analyze = compiler cannot figure it out Hence is conservative Solution: Scalar replacement in innermost loop Copy memory variables that are reused into local variables Basic scheme: Load: t1 = a[i], t2 = b[i+1], …. Compute: t4 = t1 * t2; …. Store: a[i] = t12, b[i+1] = t7, …Carnegie MellonMore Difficult Example Matrix multiplication: C = A*B + C Which array elements are reused? All of them! But how to take advantage?c = (double *) calloc(sizeof(double), n*n);/* Multiply n x n matrices a and b */void mmm(double *a, double *b, double *c, int n) {int i, j, k;for (i = 0; i < n; i++)for (j = 0; j < n; j++)for (k = 0; k < n; k++)c[i*n+j] += a[i*n + k]*b[k*n + j];}a bij*c=c+Carnegie MellonStep 1: Blocking (Here: 2 x 2) Blocking, also called tiling = partial unrolling + loop exchange Assumes associativity (= compiler will never do it)c = (double *) calloc(sizeof(double), n*n);/* Multiply n x n matrices a and b */void mmm(double *a, double *b, double *c, int n) {int i, j, k;for (i = 0; i < n; i+=2)for (j = 0; j < n; j+=2)for (k = 0; k < n; k+=2)for (i1 = i; i1 < i+2; i1++)for (j1 = j; j1 < j+2; j1++)for (k1 = k; k1 < k+2; k1++)c[i1*n+j1] += a[i1*n + k1]*b[k1*n + j1];}a bi1j1*c=c+Carnegie MellonStep 2: Unrolling Inner Loopsc = (double *) calloc(sizeof(double), n*n);/* Multiply n x n matrices a and b */void mmm(double *a, double *b, double *c, int n) {int i, j, k;for (i = 0; i < n; i+=2)for (j = 0; j < n; j+=2)for (k = 0; k < n; k+=2)<body>} Every array element a[…], b[…],c[…] used twice Now scalar replacement can be applied<body>c[i*n + j] = a[i*n + k]*b[k*n + j] + a[i*n + k+1]*b[(k+1)*n + j] + c[i*n + j]c[(i+1)*n + j] = a[(i+1)*n + k]*b[k*n + j] + a[(i+1)*n + k+1]*b[(k+1)*n + j] + c[(i+1)*n + j]c[i*n + (j+1)] = a[i*n + k]*b[k*n + (j+1)] + a[i*n + k+1]*b[(k+1)*n + (j+1)] + c[i*n + (j+1)]c[(i+1)*n + (j+1)] = a[(i+1)*n + k]*b[k*n + (j+1)] + a[(i+1)*n + k+1]*b[(k+1)*n + (j+1)] + c[(i+1)*n + (j+1)]Carnegie MellonToday Program optimization Optimization blocker: Memory aliasing Out of order processing: Instruction level parallelism Understanding branch predictionCarnegie MellonExample: Compute Factorials Machines Intel Pentium 4 Nocona, 3.2 GHz Fish Machines Intel Core 2, 2.7 GHz Compiler Versions GCC 3.4.2 (current on Fish machines)int rfact(int n){if (n <= 1)return 1;return n * rfact(n-1);}int fact(int n){int i;int result = 1;for (i = n; i > 0; i--)result = result * i;return result;}Machine Nocona Core 2rfact15.5 6.0fact10.0 3.0Cycles per element (or per mult)Something changed from Pentium 4 to Core: Details laterCarnegie MellonOptimization 1: Loop Unrolling Compute more values per iteration Does not help here Why? Branch prediction – details laterint fact_u3a(int n){int i;int result = 1;for (i = n; i >= 3; i-=3) {result =result * i * (i-1) * (i-2);}for (; i > 0; i--)result *= i;return result;}Machine Nocona Core 2rfact 15.5 6.0fact 10.0 3.0fact_u3a 10.0 3.0Cycles per element (or per mult)Carnegie MellonOptimization 2: Multiple Accumulators That seems to help. Can one get even faster? Explanation: instruction level parallelism – details laterint fact_u3b(int n){int i;int result0 = 1;int result1 = 1;int result2 = 1;for (i = n; i >= 3; i-=3) {result0 *= i;result1 *= (i-1);result2 *= (i-2);}for (;
View Full Document