Carnegie MellonIntroduction to Computer Systems15-213/18-243, spring 200912thLecture, Feb. 19thInstructors:Gregory Kesden and Markus PüschelCarnegie MellonLast Time Program optimization Optimization blocker: Memory aliasing One solution: Scalar replacement of array accesses that are reusedfor (i = 0; i < n; i++) {b[i] = 0;for (j = 0; j < n; j++)b[i] += a[i*n + j];}for (i = 0; i < n; i++) {double val = 0;for (j = 0; j < n; j++)val += a[i*n + j];b[i] = val;}Carnegie MellonLast Time Instruction level parallelism Latency versus throughputFunctionalUnitsInteger/BranchFPAddFPMult/DivLoad StoreGeneralIntegerStep 11 cycleStep 21 cycleStep 101 cyclelatency cycles/issueInteger Multiply 10 1Carnegie MellonLast Time Consequence**1 d0d1*d2*d3*d4*d5*d6*d7**1 d1d3*d5*d7***1 d0d2*d4*d6Twice as fastCarnegie MellonToday Memory hierarchy, caches, locality Cache organization Program optimization: Cache optimizationsCarnegie MellonProblem: Processor-Memory BottleneckMain MemoryCPU RegProcessor performancedoubled about every 18 monthsBus bandwidthevolved much slowerCore 2 Duo:Can process at least256 Bytes/cycle(1 SSE two operand add and mult)Core 2 Duo:Bandwidth2 Bytes/cycleLatency100 cyclesSolution: CachesCarnegie MellonCache Definition: Computer memory with short access time used for the storage of frequently or recently used instructions or dataCarnegie MellonGeneral Cache Mechanics0 1 2 34 5 6 78 9 10 1112 13 14 158 9 14 3CacheMemoryLarger, slower, cheaper memoryviewed as partitioned into “blocks”Data is copied in block-sized transfer unitsSmaller, faster, more expensivememory caches a subset ofthe blocks444101010Carnegie MellonGeneral Cache Concepts: Hit0 1 2 34 5 6 78 9 10 1112 13 14 158 9 14 3CacheMemoryData in block b is neededRequest: 1414Block b is in cache:Hit!Carnegie MellonGeneral Cache Concepts: Miss0 1 2 34 5 6 78 9 10 1112 13 14 158 9 14 3CacheMemoryData in block b is neededRequest: 12Block b is not in cache:Miss!Block b is fetched frommemoryRequest: 12121212Block b is stored in cache• Placement policy:determines where b goes• Replacement policy:determines which blockgets evicted (victim)Carnegie MellonCache Performance Metrics Miss Rate Fraction of memory references not found in cache (misses / accesses)= 1 – hit rate Typical numbers (in percentages): 3-10% for L1 can be quite small (e.g., < 1%) for L2, depending on size, etc. Hit Time Time to deliver a line in the cache to the processor includes time to determine whether the line is in the cache Typical numbers: 1-2 clock cycle for L1 5-20 clock cycles for L2 Miss Penalty Additional time required because of a miss typically 50-200 cycles for main memory (Trend: increasing!)Carnegie MellonLets think about those numbers Huge difference between a hit and a miss Could be 100x, if just L1 and main memory Would you believe 99% hits is twice as good as 97%? Consider: cache hit time of 1 cyclemiss penalty of 100 cycles Average access time:97% hits: 1 cycle + 0.03 * 100 cycles = 4 cycles99% hits: 1 cycle + 0.01 * 100 cycles = 2 cycles This is why “miss rate” is used instead of “hit rate”Carnegie MellonTypes of Cache Misses Cold (compulsory) miss Occurs on first access to a block Conflict miss Most hardware caches limit blocks to a small subset (sometimes a singleton) of the available cache slots e.g., block i must be placed in slot (i mod 4) Conflict misses occur when the cache is large enough, but multiple data objects all map to the same slot e.g., referencing blocks 0, 8, 0, 8, ... would miss every time Capacity miss Occurs when the set of active cache blocks (working set) is larger than the cacheCarnegie MellonWhy Caches Work Locality: Programs tend to use data and instructions with addresses near or equal to those they have used recently Temporal locality: Recently referenced items are likely to be referenced again in the near future Spatial locality: Items with nearby addresses tend to be referenced close together in timeblockblockCarnegie MellonExample: Locality? Data: Temporal: sum referenced in each iteration Spatial: array a[] accessed in stride-1 pattern Instructions: Temporal: cycle through loop repeatedly Spatial: reference instructions in sequence Being able to assess the locality of code is a crucial skill for a programmersum = 0;for (i = 0; i < n; i++)sum += a[i];return sum;Carnegie MellonLocality Example #1int sum_array_rows(int a[M][N]){int i, j, sum = 0;for (i = 0; i < M; i++)for (j = 0; j < N; j++)sum += a[i][j];return sum;}Carnegie MellonLocality Example #2int sum_array_cols(int a[M][N]){int i, j, sum = 0;for (j = 0; j < N; j++)for (i = 0; i < M; i++)sum += a[i][j];return sum;}Carnegie MellonLocality Example #3int sum_array_3d(int a[M][N][N]){int i, j, k, sum = 0;for (i = 0; i < M; i++)for (j = 0; j < N; j++)for (k = 0; k < N; k++)sum += a[k][i][j];return sum;} How can it be fixed?Carnegie MellonMemory Hierarchies Some fundamental and enduring properties of hardware and software systems: Faster storage technologies almost always cost more per byte and have lower capacity The gaps between memory technology speeds are widening True of registers ↔ DRAM, DRAM ↔ disk, etc. Well-written programs tend to exhibit good locality These properties complement each other beautifully They suggest an approach for organizing memory and storage systems known as a memory hierarchyCarnegie MellonAn Example Memory Hierarchyregisterson-chip L1cache (SRAM)main memory(DRAM)local secondary storage(local disks)Larger, slower, cheaper per byteremote secondary storage(tapes, distributed file systems, Web servers)Local disks hold files retrieved from disks on remote network serversMain memory holds disk blocks retrieved from local disksoff-chip L2cache (SRAM)L1 cache holds cache lines retrieved from L2 cacheCPU registers hold words retrieved fromL1 cacheL2 cache holds cache lines retrieved from main memoryL0:L1:L2:L3:L4:L5:Smaller,faster,costlierper byteCarnegie MellonExamples of Caching in the HierarchyHardware0On-Chip TLBAddress translationsTLBWeb browser10,000,000Local diskWeb pagesBrowser cacheWeb cacheNetwork buffer cacheBuffer cacheVirtual MemoryL2 cacheL1 cacheRegistersCache TypeWeb pagesParts of filesParts of files4-KB page64-bytes block64-bytes block4-byte wordsWhat is Cached?Web proxy server1,000,000,000Remote server disksOS100Main
View Full Document