1Uniprocessor Optimizations Arvind KrishnamurthyFall 2004Idealized Uniprocessor Model Processor names bytes, words, etc. in its address space These represent integers, floats, pointers, arrays, etc. Exist in the program stack, static region, or heap Operations include Read and write (given an address/pointer) Arithmetic and other logical operations Order specified by program Read returns the most recently written data Compiler and architecture translate high level expressions into “obvious”lower level instructions Hardware executes instructions in order specified by compiler Cost Each operation has the same cost(read, write, add, multiply, etc.)2Uniprocessors in the real world Real processors have registers and caches small amounts of fast memory store values of recently used or nearby data different memory ops can have very different costs parallelism multiple “functional units” that can run in parallel different orders, different instruction mixes have different costs pipelining a form of parallelism, like an assembly line in a factory Why is this your problem?In theory, compilers understand all of this and can optimize your program; in practice they don’t. In this example: Sequential execution takes 4 * 90min = 6 hours Pipelined execution takes 30+4*40+20 = 3.3 hours Pipelining helps throughput, but not latency Pipeline rate limited by slowest pipeline stage Potential speedup = Number pipe stages Time to “fill” pipeline and time to “drain”reduces speedup (overhead)ABCD6 PM789TaskOrderTime30 40 40 40 40 204 people doing laundry: wash (30 min) + dry (40 min) + fold (20 min)Pipelining Review3Example: 5 Steps of MIPS DatapathFigure 3.4, Page 134 , CA:AQA 2e by Patterson and HennessyMemoryAccessWriteBackInstructionFetchInstr. DecodeReg. FetchExecuteAddr. CalcALUMemoryReg FileMUX MUXDataMemoryMUXSignExtendZero?IF/IDID/EXMEM/WBEX/MEM4AdderNext SEQ PCNext SEQ PCRD RD RDWB Data• Pipelining is also used within arithmetic units– a fp multiply may have latency 10 cycles, but throughput of 1/cycleNext PCAddressRS1RS2ImmMUXLimits to Instruction Level Parallelism (ILP) Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle Structural hazards: HW cannot support this combination of instructions (single dryer) Data hazards: Instruction depends on result of prior instruction still in the pipeline (missing sock) Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps). The hardware and compiler will try to reduce hazards Reordering instructions, multiple issue, dynamic branch prediction, speculative execution… You can also enable parallelism by careful coding4Dependences Limit Parallelism A dependence or data hazard is one of the following: true or flow dependence: X writes a location that Y later reads (read-after write or RAW hazard) anti-dependence X reads a location that Y later writes (write-after-read or WAR hazard) output dependence X writes a location that Y later writes (write-after-write or WAW hazard)a = 1= aa = a = = aa = 3.4*b/12.2outputantitrue Most programs have a high degree of locality in their accesses spatial locality: accessing things nearby previous accesses temporal locality: reusing an item that was previously accessed Memory hierarchy tries to exploit localityon-chip cacheregistersdatapathcontrolprocessorSecond level cache (SRAM)Main memory(DRAM)Secondary storage (Disk)Tertiary storage(Disk/Tape)Speed (ns): 1 10 100 10 ms 10 secSize (bytes): 100s Ks Ms Gs Ts Memory Hierarchy5µProc60%/yr.DRAM7%/yr.110100100019801981198319841985198619871988198919901991199219931994199519961997199819992000DRAMCPU1982Processor-MemoryPerformance Gap:(grows 50% / year)PerformanceTime“Moore’s Law” Memory hierarchies are getting deeper Processors get faster more quickly than memoryProcessor-DRAM Gap (latency)Tag Values Cache contains tag (upper bits of address) and values Cache hit: if tag matches—cheap Cache miss: non-cached memory access—expensive Consider cache with 4 entries and 4 words per entry• Cache line length: # of bytes loaded together in one entry• Provides spatial locality• Associativity• direct-mapped: only one address (line) in a given range in cache• n-way: 2 or more lines with different addresses existCache Basics1010Address:10100110X6 Microbenchmark for memory system performancetime the following program for each size(A) and stride s(repeat to obtain confidence and mitigate timer resolution)for array A of size from 4KB to 8MB by 2xfor stride s from 8 Bytes (1 word) to size(A)/2 by 2xfor i from 0 to size by sload A[i] from memory (4 Bytes)Experimental Study of MemorySee lambda.cs.yale.edu/~arvind/papers/t3d-isca95.ps for detailsArray sizeMemory Hierarchy of a Sun Ultra-IIi(333MHz)7 Actual performance of a simple program can be a complicated function of the architecture Slight changes in the architecture or program change the performance significantly To write fast programs, need to (sometimes) consider architecture We would like simple models to help us design efficient algorithms Is this possible? We will illustrate with a common technique for improving cache performance Idea: use divide-and-conquer to define a problem that fits in register/L1-cache/L2-cacheSome Architecture Lessons A matrix is a 2-D array of elements, but memory addresses are “1-D” Conventions for matrix layout by column, or “column major” (Fortran default) by row, or “row major” (C default)012345678910111213141516171819048121615913172610141837111519Column majorRow majorMatrix Storage8 Assume just 2 levels in the hierarchy, fast and slow All data initially in slow memory m = number of memory elements (words) moved between fast and slow memory tm= time per slow memory operation f = number of arithmetic operations tf= time per arithmetic operation << tm q = f / m average number of flops per slow element access Minimum possible time = f* tfwhen all data in fast memory Actual time Larger q means Time closer to minimum f * tfSimple Model of MemoryKey to algorithm efficiencyf * tf+ m * tm= f * tf* (1
View Full Document