Slide 1Last TimeLast TimeLast TimeTodayOptimization Blocker: Memory AliasingReasonRemoving AliasingUnaliased Version When Aliasing HappensOptimization Blocker: Memory AliasingMore Difficult ExampleStep 1: Blocking (Here: 2 x 2)Step 2: Unrolling Inner LoopsTodayExample: Compute FactorialsOptimization 1: Loop UnrollingOptimization 2: Multiple AccumulatorsModern CPU DesignSuperscalar ProcessorPentium 4 Nocona CPULatency versus ThroughputHard BoundsPerformance in Numerical ComputingNocona vs. Core 2Instruction ControlTranslating into Micro-OperationsTraditional View of Instruction ExecutionDataflow View of Instruction ExecutionExample ComputationCycles Per Element (CPE)x86-64 Compilation of Combine4Combine4 = Serial Computation (OP = *)Loop UnrollingEffect of Loop UnrollingLoop Unrolling with ReassociationEffect of ReassociationReassociated ComputationLoop Unrolling with Separate AccumulatorsEffect of Separate AccumulatorsSeparate AccumulatorsUnrolling & AccumulatingUnrolling & Accumulating: Intel FP *Unrolling & Accumulating: Intel FP +Unrolling & Accumulating: Intel Int *Unrolling & Accumulating: Intel Int +FP *: Nocona versus Core 2Can We Go Faster?TodayWhat About Branches?Branch OutcomesBranch PredictionBranch Prediction Through LoopBranch Misprediction InvalidationBranch Misprediction RecoveryDetermining Misprediction PenaltyForcing ConditionalTesting MethodologyTesting OutcomesGetting High Performance So FarCarnegie MellonIntroduction to Computer Systems15-213/18-243, spring 200911th Lecture, Feb. 17th Instructors: Gregory Kesden and Markus PüschelCarnegie MellonLast TimeMemory layoutBuffer overflow, worms, and virusesFF00StackTextDataHeap0880Bfoo stack framebar stack frameBexploitcodepaddata writtenby gets()Carnegie MellonLast TimeProgram Optimization-O !160xCarnegie MellonLast TimeProgram optimizationOverviewRemoving unnecessary procedure callsCode motion/precomputationStrength reductionSharing of common subexpressionsOptimization blocker: Procedure callsfor (i = 0; i < n; i++) { get_vec_element(v, i, &val); *res += val;}void lower(char *s){ int i; for (i = 0; i < strlen(s); i++) if (s[i] >= 'A' && s[i] <= 'Z') s[i] -= ('A' - 'a');}Carnegie MellonTodayProgram optimizationOptimization blocker: Memory aliasingOut of order processing: Instruction level parallelismUnderstanding branch predictionCarnegie MellonOptimization Blocker: Memory AliasingCode updates b[i] (= memory access) on every iterationWhy couldn’t compiler optimize this away?# sum_rows1 inner loop.L53:addsd (%rcx), %xmm0 # FP addaddq $8, %rcxdecq %raxmovsd %xmm0, (%rsi,%r8,8) # FP storejne .L53/* Sums rows of n x n matrix a and stores in vector b */void sum_rows1(double *a, double *b, long n) { long i, j; for (i = 0; i < n; i++) {b[i] = 0;for (j = 0; j < n; j++) b[i] += a[i*n + j]; }}abΣCarnegie MellonReasonIf memory is accessed, compiler assumes the possibility of side effectsExample:double A[9] = { 0, 1, 2, 4, 8, 16}, 32, 64, 128};double B[3] = A+3;sum_rows1(A, B, 3);i = 0: [3, 8, 16]init: [4, 8, 16]i = 1: [3, 22, 16]i = 2: [3, 22, 224]Value of B:/* Sums rows of n x n matrix a and stores in vector b */void sum_rows1(double *a, double *b, long n) { long i, j; for (i = 0; i < n; i++) {b[i] = 0;for (j = 0; j < n; j++) b[i] += a[i*n + j]; }}Carnegie MellonRemoving AliasingScalar replacement:Copy array elements that are reused into temporary variablesAssumes no memory aliasing (otherwise possibly incorrect)# sum_rows2 inner loop.L66:addsd (%rcx), %xmm0 # FP Addaddq $8, %rcxdecq %raxjne .L66/* Sums rows of n x n matrix a and stores in vector b */void sum_rows2(double *a, double *b, long n) { long i, j; for (i = 0; i < n; i++) {double val = 0;for (j = 0; j < n; j++) val += a[i*n + j]; b[i] = val; }}Carnegie MellonUnaliased Version When Aliasing HappensAliasing still creates interferenceResult different than beforedouble A[9] = { 0, 1, 2, 4, 8, 16}, 32, 64, 128};double B[3] = A+3;sum_rows1(A, B, 3);i = 0: [3, 8, 16]init: [4, 8, 16]i = 1: [3, 27, 16]i = 2: [3, 27, 224]Value of B:/* Sum rows is of n X n matrix a and store in vector b */void sum_rows2(double *a, double *b, long n) { long i, j; for (i = 0; i < n; i++) {double val = 0;for (j = 0; j < n; j++) val += a[i*n + j]; b[i] = val; }}Carnegie MellonOptimization Blocker: Memory AliasingMemory aliasing: Two different memory references writeto the same locationEasy to have happen in C Since allowed to do address arithmetic Direct access to storage structuresHard to analyze = compiler cannot figure it outHence is conservativeSolution: Scalar replacement in innermost loopCopy memory variables that are reused into local variablesBasic scheme:Load: t1 = a[i], t2 = b[i+1], ….Compute: t4 = t1 * t2; ….Store: a[i] = t12, b[i+1] = t7, …Carnegie MellonMore Difficult ExampleMatrix multiplication: C = A*B + CWhich array elements are reused?All of them! But how to take advantage?c = (double *) calloc(sizeof(double), n*n);/* Multiply n x n matrices a and b */void mmm(double *a, double *b, double *c, int n) { int i, j, k; for (i = 0; i < n; i++)for (j = 0; j < n; j++) for (k = 0; k < n; k++) c[i*n+j] += a[i*n + k]*b[k*n + j];}a bij*c=c+Carnegie MellonStep 1: Blocking (Here: 2 x 2)Blocking, also called tiling = partial unrolling + loop exchangeAssumes associativity (= compiler will never do it)c = (double *) calloc(sizeof(double), n*n);/* Multiply n x n matrices a and b */void mmm(double *a, double *b, double *c, int n) { int i, j, k; for (i = 0; i < n; i+=2)for (j = 0; j < n; j+=2) for (k = 0; k < n; k+=2) for (i1 = i; i1 < i+2; i1++) for (j1 = j; j1 < j+2; j1++) for (k1 = k; k1 < k+2; k1++) c[i1*n+j1] += a[i1*n + k1]*b[k1*n + j1];}a bi1j1*c=c+Carnegie MellonStep 2: Unrolling Inner Loopsc = (double *) calloc(sizeof(double), n*n);/* Multiply n x n matrices a and b */void mmm(double *a, double *b, double *c, int n) { int i, j, k; for (i = 0; i < n; i+=2)for (j = 0; j < n; j+=2) for (k = 0; k < n; k+=2) <body>}Every array element a[…], b[…],c[…] used twiceNow scalar replacement can be applied<body>c[i*n + j] = a[i*n + k]*b[k*n + j] + a[i*n + k+1]*b[(k+1)*n + j]
View Full Document