Getting CPI under 1

CMSC 411 Computer Systems Architecture
Lecture 12: Instruction Level Parallelism 6 (Improving CPI)
CMSC 411 - 11 (from Patterson)
CS252 S05

Outline
- More ILP
  - VLIW
  - branch target buffer
  - return address predictor
  - superscalar
  - more register renaming
  - value prediction
  - conditional instructions
  - speculative loads
  - superscalar
- Limits to ILP
- Threading
  - fine
  - coarse
  - simultaneous multithreading

Getting CPI below 1
- CPI ≥ 1 if issue only 1 instruction every clock cycle
- Multiple-issue processors come in 3 flavors:
  1. Statically scheduled superscalar processors
  2. Dynamically scheduled superscalar processors
  3. VLIW (very long instruction word) processors
- The 2 types of superscalar processors issue varying numbers of instructions per clock
  - Use in-order execution if they are statically scheduled, or
  - Out-of-order execution if they are dynamically scheduled
- VLIW processors, in contrast, issue a fixed number of instructions formatted either as one large instruction or as a fixed instruction packet, with the parallelism among instructions explicitly indicated by the instruction (Intel/HP Itanium)

VLIW: Very Long Instruction Word
- Each "instruction" has explicit coding for multiple operations
  - In IA-64, grouping called a "packet"
  - In Transmeta, grouping called a "molecule" (with "atoms" as ops)
- Tradeoff instruction space for simple decoding
  - The long instruction word has room for many operations
  - By definition, all the operations the compiler puts in the long instruction word are independent => execute in parallel
  - E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch
    - 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide
  - Need compiling technique that schedules across several branches

Recall: Unrolled Loop that Minimizes Stalls for Scalar

   1 Loop: L.D    F0,0(R1)
   2       L.D    F6,-8(R1)
   3       L.D    F10,-16(R1)
   4       L.D    F14,-24(R1)
   5       ADD.D  F4,F0,F2
   6       ADD.D  F8,F6,F2
   7       ADD.D  F12,F10,F2
   8       ADD.D  F16,F14,F2
   9       S.D    0(R1),F4
  10       S.D    -8(R1),F8
  11       DSUBUI R1,R1,#32
  12       S.D    16(R1),F12
  13       BNEZ   R1,LOOP
  14       S.D    8(R1),F16    ; 8-32 = -24

- L.D to ADD.D: 1 cycle
- ADD.D to S.D: 2 cycles
- 14 clock cycles, or 3.5 per iteration

Loop Unrolling in VLIW

Memory reference 1  Memory reference 2  FP operation 1      FP op 2             Int op / branch     Clock
L.D F0,0(R1)        L.D F6,-8(R1)                                                                   1
L.D F10,-16(R1)     L.D F14,-24(R1)                                                                 2
L.D F18,-32(R1)     L.D F22,-40(R1)     ADD.D F4,F0,F2      ADD.D F8,F6,F2                          3
L.D F26,-48(R1)                         ADD.D F12,F10,F2    ADD.D F16,F14,F2                        4
                                        ADD.D F20,F18,F2    ADD.D F24,F22,F2                        5
S.D 0(R1),F4        S.D -8(R1),F8       ADD.D F28,F26,F2                                            6
S.D -16(R1),F12     S.D -24(R1),F16                                                                 7
S.D 16(R1),F20      S.D 8(R1),F24                                               DSUBUI R1,R1,#48    8
S.D 0(R1),F28                                                                   BNEZ R1,LOOP        9

- Unrolled 7 times to avoid delays
- 7 results in 9 clocks, or 1.3 clocks per iteration (1.8X)
- Average: 2.5 ops per clock, 50% efficiency
- Note: need more registers in VLIW (15 vs. 6 in SS)

Problems with 1st Generation VLIW
- Increase in code size
  - Generating enough operations in a straight-line code fragment requires ambitiously unrolling loops
  - Whenever VLIW instructions are not full, unused functional units translate to wasted bits in instruction encoding
- Operated in lock-step; no hazard detection HW
  - A stall in any functional unit pipeline caused the entire processor to stall, since all functional units must be kept synchronized
  - Caches hard to predict
- Binary code compatibility
  - Pure VLIW => different numbers of functional units and unit latencies require different versions of the code

Intel/HP IA-64 "Explicitly Parallel Instruction Computer (EPIC)"
- IA-64: instruction set architecture
- 128 64-bit integer regs + 128 82-bit floating point regs
  - Not separate register files per functional unit as in old VLIW
- Hardware checks dependencies (interlocks => binary compatibility over time)
- Predicated execution (select 1 out of 64 1-bit flags) => 40% fewer mispredictions
- Itanium was first implementation (2001)
  - Highly parallel and deeply pipelined hardware at 800 MHz
  - 6-wide, 10-stage pipeline at 800 MHz on 0.18 µm process
- Itanium 2 is name of 2nd implementation (2005)
  - 6-wide, 8-stage pipeline at 1666 MHz on 0.13 µm process
  - Caches: 32 KB I, 32 KB D, 128 KB L2I, 128 KB L2D, 9216 KB L3

IF BW: Return Address Predictor
- Small buffer of return addresses acts as a stack
- Caches most recent return addresses
- Call => push a return address on stack
- Return => pop an address off stack and predict as new PC

[Figure 2.25 (SPEC95): misprediction frequency (0% to 70%) versus return address buffer entries (0, 1, 2, 4, 8, 16) for go, m88ksim, cc1, compress, xlisp, ijpeg, perl, and vortex]

More Instruction Fetch Bandwidth
- Integrated branch prediction: branch predictor is part of instruction fetch unit and is constantly predicting branches
- Instruction prefetch: instruction fetch units prefetch to deliver multiple instructions per clock, integrating it with branch prediction
- Instruction memory access and buffering: fetching multiple instructions per cycle
  - May require accessing multiple cache blocks (prefetch to hide cost of crossing cache blocks)
  - Provides buffering, acting as an on-demand unit to provide instructions to the issue stage as needed and in the quantity needed

Speculation: Register Renaming vs. ROB
- Alternative to ROB is a larger physical set of registers combined with register renaming
  - Extended registers replace function of both ROB and reservation stations
- Instruction issue maps names of architectural registers to physical register numbers in extended register set
  - On issue, allocates a new unused register for the destination (which avoids WAW and WAR hazards)
  - Speculation recovery easy because a physical register holding an instruction destination does not become the architectural register until the instruction commits
- Most out-of-order processors today use extended registers with renaming

Value Prediction
- Attempts to predict value produced by instruction
  - E.g., loads a value that changes infrequently
- Value prediction is useful only if it significantly increases ILP
  - Focus of research has been on loads; so-so results, no processor uses value prediction
- Related topic is address aliasing prediction
  - RAW for load and store or WAW for 2 stores
- Address alias prediction is both more stable and simpler, since need not actually predict the address values, only whether such values conflict
  - Has

Conditional/Predicated Instructions
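As a rough illustration of the predicated execution mentioned on the IA-64 slide (an instruction selects 1 out of 64 one-bit flags), the sketch below converts an if/else into two guarded moves, so no branch needs to be predicted. The class `PredicatedMachine`, its methods, and the register name `"s"` are hypothetical names chosen for this example; treating predicate 0 as always true follows IA-64 convention and is an assumption beyond what the slides state.

```python
NUM_PREDICATES = 64  # the IA-64 slide mentions 64 one-bit predicate flags


class PredicatedMachine:
    """Toy model of guarded (predicated) instructions; names are hypothetical."""

    def __init__(self):
        self.pred = [False] * NUM_PREDICATES
        self.pred[0] = True  # assumption: predicate 0 is hardwired to true
        self.reg = {}

    def cmp_eq(self, p_true, p_false, a, b):
        """Compare a and b; write the outcome and its complement to two predicates."""
        outcome = (a == b)
        self.pred[p_true] = outcome
        self.pred[p_false] = not outcome

    def mov(self, qp, dest, value):
        """Guarded move: always issued, but the result is kept only if pred[qp] is set.

        The Python `if` here models the hardware nullifying the instruction;
        it is not a program branch that a predictor would have to guess.
        """
        if self.pred[qp]:
            self.reg[dest] = value


def select_without_branch(a, t, u):
    """If-converted form of: if a == 0: s = t  else: s = u."""
    m = PredicatedMachine()
    m.cmp_eq(1, 2, a, 0)  # p1 = (a == 0), p2 = not p1
    m.mov(1, "s", t)      # takes effect only when p1 is set
    m.mov(2, "s", u)      # takes effect only when p2 is set
    return m.reg["s"]
```

Both moves are issued on every call and exactly one of them commits, which is how predication trades a possible misprediction for one nullified instruction slot.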
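The return address predictor slide describes a small buffer of return addresses that acts as a stack: a call pushes a return address, and a return pops one and predicts it as the new PC. A minimal sketch of that behavior follows. The class name `ReturnAddressStack`, the default of 8 entries, and the choice to drop the oldest entry on overflow are assumptions made for illustration, not details taken from the lecture.

```python
from collections import deque


class ReturnAddressStack:
    """Small fixed-size stack of predicted return addresses (hypothetical model)."""

    def __init__(self, entries=8):
        # deque(maxlen=n) silently discards the oldest item when a push
        # would exceed n entries; that overflow policy is an assumption.
        self.stack = deque(maxlen=entries)

    def on_call(self, return_address):
        """Call => push the return address on the stack."""
        self.stack.append(return_address)

    def on_return(self):
        """Return => pop an address and predict it as the new PC.

        Returns None when the buffer is empty, meaning no prediction is available.
        """
        return self.stack.pop() if self.stack else None
```

With `entries=2`, three nested calls overflow the buffer, so the two innermost returns are predicted and the outermost one gets no prediction. That is one way a buffer that is too small for the call depth leads to mispredicted returns.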