Carnegie Mellon
Introduction to Computer Systems
15-213/18-243, Spring 2009
13th Lecture, Feb. 26th
Instructors: Gregory Kesden and Markus Püschel

First Exam (This Lecture)
• (Score histogram omitted; highest bin: ≥ 90)

Research Opportunities
www.spiral.net
• Research:
  • Interdisciplinary
  • High performance
  • Mathematical libraries
  • Automatic source code generation
• Full-time summer research or honors project
• Excellent junior or exceptional sophomore
• Contact: pueschel@ece

Last Time: Memory Hierarchy (Here: Core 2 Duo)

  Level              Size      Throughput      Latency
  L1 I-/D-cache      32 KB     16 B/cycle      3 cycles
  L2 unified cache   ~4 MB     8 B/cycle       14 cycles
  Main memory        ~4 GB     2 B/cycle       100 cycles
  Disk               ~500 GB   1 B/30 cycles   millions of cycles

Last Time: Locality
• Temporal locality: recently referenced items are likely to be referenced again in the near future
• Spatial locality: items with nearby addresses tend to be referenced close together in time

Last Time: Caches
• S = 2^s sets, E = 2^e lines per set
• Each line: a valid bit, a tag, and B = 2^b bytes per cache block (the data)
• Address of a word: [ t tag bits | s set-index bits | b block-offset bits ]
• The data begins at the block offset

Strided Access Question
• What happens if arrays are accessed in two-power strides? Example on the next slide.
• Recall: S = 2^s sets, E = 2^e lines per set; address = [ tag | set index | block offset ]

The Strided Access Problem (Blackboard?)
• Example: L1 cache, Core 2 Duo: 32 KB, 8-way associative, 64-byte cache block size
• What are S, E, B? Answer: B = 2^6, E = 2^3, S = 2^6
• Consider an array of ints accessed at stride 2^i, i ≥ 0. What is the smallest i such that only one set is used? Answer: i = 10
• What happens if the stride is 2^9? Answer: two sets are used
• Source of two-power strides? Example: column access of 2-D arrays (images!)

Today
• Program optimization: cache optimizations
• Linking

Optimizations for the Memory Hierarchy
• Write code that has locality
  • Spatial: access data contiguously
  • Temporal: make sure accesses to the same data are not too far apart in time
• How to achieve this?
  • Proper choice of algorithm
  • Loop transformations
• Cache versus
register-level optimization:
  • In both cases locality is desirable
  • Register space is much smaller and requires scalar replacement to exploit temporal locality
  • Register-level optimizations include exhibiting instruction-level parallelism (which conflicts with locality)

Example: Matrix Multiplication
c = a * b; element c[i][j] accumulates row i of a times column j of b:

c = (double *) calloc(sizeof(double), n*n);

/* Multiply n x n matrices a and b */
void mmm(double *a, double *b, double *c, int n) {
    int i, j, k;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                c[i*n + j] += a[i*n + k] * b[k*n + j];
}

Cache Miss Analysis
• Assume: matrix elements are doubles; cache block = 8 doubles (64 B, as in Core 2 Duo); cache size C << n (much smaller than n)
• First iteration (one i, j pair, full k loop): n/8 misses on the row of a + n misses on the column of b = 9n/8 misses
• Afterwards in cache (schematic): the row of a and an 8-wide strip of b

Cache Miss Analysis (cont.)
• Second iteration: again n/8 + n = 9n/8 misses (the cache is too small to retain what the first iteration loaded)
• Total misses: 9n/8 * n^2 = (9/8) n^3

Blocked Matrix Multiplication

c = (double *) calloc(sizeof(double), n*n);

/* Multiply n x n matrices a and b */
void mmm(double *a, double *b, double *c, int n) {
    int i, j, k, i1, j1, k1;
    for (i = 0; i < n; i += B)
        for (j = 0; j < n; j += B)
            for (k = 0; k < n; k += B)
                /* B x B mini matrix multiplications */
                for (i1 = i; i1 < i+B; i1++)
                    for (j1 = j; j1 < j+B; j1++)
                        for (k1 = k; k1 < k+B; k1++)
                            c[i1*n + j1] += a[i1*n + k1] * b[k1*n + j1];
}

Block size B x B.

Cache Miss Analysis (Blocked)
• Assume: cache block = 8 doubles; cache size C << n (much smaller than n); three blocks fit into the cache: 3B^2 < C
• First (block) iteration: B^2/8 misses for each block; 2n/B blocks of a and b are loaded, so 2n/B * B^2/8 = nB/4 misses (omitting matrix c)
• Afterwards in cache (schematic): block size B x B, n/B blocks

Cache Miss Analysis (Blocked, cont.)
• Second (block) iteration: same as the first, 2n/B * B^2/8 = nB/4 misses
• Total misses: nB/4 * (n/B)^2 = n^3/(4B)
• Schematic: block
size B x B, n/B blocks

Summary
• No blocking: (9/8) n^3 misses
• Blocking: 1/(4B) n^3 misses
• Use the largest possible block size B, subject to the limit 3B^2 < C (can possibly be relaxed a bit, but there is a limit on B)
• Reason for the dramatic difference: matrix multiplication has inherent temporal locality:
  • Input data: 3n^2 elements; computation: 2n^3 operations
  • Every array element is used O(n) times!
• But the program has to be written properly

Today
• Program optimization: cache optimizations
• Linking

Example C Program

main.c:

int buf[2] = {1, 2};

int main()
{
    swap();
    return 0;
}

swap.c:

extern int buf[];

static int *bufp0 = &buf[0];
static int *bufp1;

void swap()
{
    int temp;

    bufp1 = &buf[1];
    temp = *bufp0;
    *bufp0 = *bufp1;
    *bufp1 = temp;
}

Static Linking
Programs are translated and linked using a compiler driver:

unix> gcc -O2 -g -o p main.c swap.c
unix>