CMSC 411 Computer Systems Architecture
Lecture 11a: Instruction Level Parallelism (cont.)
Alan Sussman
als@cs.umd.edu
CMSC 411 - 11a (from Patterson)

Administrivia
- Homework 3 due Thursday, March 26
- Finish Chapter 3 of H&P and start Appendix C

Memory reference speculation with hardware support
- Used to move loads across stores when the compiler can't be sure it is legal
- Use a speculative load instruction
  - hardware saves the address of the memory location
  - if a subsequent store changes that location before the check that ends the speculation, then the speculation failed; otherwise it succeeded
  - on failure, need to redo the load and re-execute all speculated instructions after the speculative load

Superscalar execution
- Predication helps with scheduling
- Example: superscalar that can issue 1 memory reference and 1 ALU op per cycle, or just 1 branch

  1st instruction       2nd instruction
  LW   R1,40(R2)        ADD  R3,R4,R5
                        ADD  R6,R3,R7
  BEQZ R10,L
  LW   R8,0(R10)
  LW   R9,0(R8)

- LWC loads if its 3rd operand is not 0

  1st instruction       2nd instruction
  LW   R1,40(R2)        ADD  R3,R4,R5
  LWC  R8,0(R10),R10    ADD  R6,R3,R7
  BEQZ R10,L
  LW   R9,0(R8)

Limitations to ILP

Limits to ILP
- Conflicting studies of the amount of ILP, depending on:
  - Benchmarks (vectorized Fortran FP vs. integer C programs)
  - Hardware sophistication
  - Compiler sophistication
- How much ILP is available using existing mechanisms with increasing HW budgets?
- Do we need to invent new HW/SW mechanisms to keep on the processor performance curve?
  - Intel MMX, SSE (Streaming SIMD Extensions): 64-bit ints
  - Intel SSE2: 128 bit, including 2 64-bit Fl. Pt. per clock
  - Motorola AltiVec: 128-bit ints and FPs
  - Supersparc multimedia ops, etc.

Limits to ILP
- Initial HW model here; MIPS compilers
- Assumptions for ideal/perfect machine to start:
  1. Register renaming: infinite virtual registers, so all register WAW and WAR hazards are avoided
  2. Branch prediction: perfect; no mispredictions
  3. Jump prediction: all jumps perfectly predicted (returns, case statements)
     2 and 3 together: no control dependencies; perfect speculation and an unbounded buffer of instructions available
  4. Memory-address alias analysis: addresses known, and a load can be moved before a store provided the addresses are not equal
     1 and 4 together eliminate all but RAW hazards
- Also: perfect caches; 1-cycle latency for all instructions (including FP); unlimited instructions issued per clock cycle

Overcoming Limits
- Advances in compiler technology plus significantly new and different hardware techniques may be able to overcome the limitations assumed in studies
- However, it is unlikely that such advances, when coupled with realistic hardware, will overcome these limits in the near future

Limits to ILP: HW Model comparison

                                 Model      Power 5
  Instructions issued per clock  Infinite   4
  Instruction window size        Infinite   200
  Renaming registers             Infinite   48 integer + 40 Fl. Pt.
  Branch prediction              Perfect    2% to 6% misprediction (Tournament Branch Predictor)
  Cache                          Perfect    64KI, 32KD, 1.92MB L2, 36 MB L3
  Memory alias analysis          Perfect

Upper Limit to ILP: Ideal Machine (Figure 3.1, SPEC92)
[Bar chart. y-axis: Instructions Per Clock; x-axis: Programs. gcc 54.8, espresso 62.6, li 17.9, fpppp 75.2, doduc 118.7, tomcatv 150.1. Integer: 18 - 60; FP: 75 - 150.]

Limits to ILP: HW Model comparison

                                 New Model                     Model      Power 5
  Instructions issued per clock  Infinite                      Infinite   4
  Instruction window size        Infinite, 2K, 512, 128, 32    Infinite   200
  Renaming registers             Infinite                      Infinite   48 integer + 40 Fl. Pt.
  Branch prediction              Perfect                       Perfect    2% to 6% misprediction (Tournament Branch Predictor)
  Cache                          Perfect                       Perfect    64KI, 32KD, 1.92MB L2, 36 MB L3
  Memory alias                   Perfect                       Perfect

More Realistic HW: Window Impact (Figure 3.2, SPEC92)
- Change from infinite window: 2048, 512, 128, 32
[Bar chart. y-axis: Instructions Per Clock (IPC); x-axis: gcc, espresso, li, fpppp, doduc, tomcatv; one bar per window size: Infinite, 2048, 512, 128, 32. Integer: 8 - 63; FP: 9 - 150.]

Limits to ILP: HW Model comparison

                                 New Model                                  Model      Power 5
  Instructions issued per clock  64                                         Infinite   4
  Instruction window size        2048                                       Infinite   200
  Renaming registers             Infinite                                   Infinite   48 integer + 40 Fl. Pt.
  Branch prediction              Perfect vs. 8K Tournament vs. 512 2-bit    Perfect    2% to 6% misprediction (Tournament Branch Predictor)
                                 vs. profile vs. none
  Cache                          Perfect                                    Perfect    64KI, 32KD, 1.92MB L2, 36 MB L3
  Memory alias                   Perfect                                    Perfect

More Realistic HW: Branch Impact (Figure 3.3, SPEC92)
- Change from an infinite window to a 2048-entry window and a maximum issue of 64 instructions per clock cycle
[Bar chart. y-axis: IPC; x-axis: Program (gcc, espresso, li, fpppp, doduc, tomcatv); one bar per predictor: Perfect, Selective predictor (Tournament), Standard 2-bit (BHT (512)), Static (Profile), None (No prediction). Integer: 6 - 12; FP: 15 - 45.]

Misprediction Rates (Figure 3.4, SPEC92)
[Bar chart. y-axis: Misprediction Rate; x-axis: tomcatv, doduc, fpppp, li, espresso, gcc; one bar per predictor: Profile-based, 2-bit counter, Tournament.]

Limits to ILP: HW Model comparison

                                 New Model                            Model      Power 5
  Instructions issued per clock  64                                   Infinite   4
  Instruction window size        2048                                 Infinite   200
  Renaming registers             Infinite v. 256, 128, 64, 32, none   Infinite   48 integer + 40 Fl. Pt.
  Branch prediction              8K 2-bit                             Perfect    Tournament Branch Predictor
  Cache                          Perfect                              Perfect    64KI, 32KD, 1.92MB L2, 36 MB L3
  Memory alias                   Perfect                              Perfect    Perfect

More Realistic HW: Renaming Register Impact (N int + N fp) (Figure 3.5, SPEC92)
- Change: 2048-instruction window, 64-instruction issue, 8K 2-level prediction
[Bar chart. y-axis: IPC; x-axis: programs (gcc, …). Integer: 5 - 15; FP: 11 - 45.]

Limits to ILP: HW Model comparison

                                 New Model                                 Model      Power 5
  Instructions issued per clock  64                                        Infinite   4
  Instruction window size        2048                                      Infinite   200
  Renaming registers             256 Int + 256 FP                          Infinite   48 integer + 40 Fl. Pt.
  Branch prediction              8K 2-bit                                  Perfect    Tournament
  Cache                          Perfect                                   Perfect    64KI, 32KD, 1.92MB L2, 36 MB L3
  Memory alias                   Perfect v. Stack v. Inspect v. none       Perfect    Perfect
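
To make the ideal-machine assumptions and the window-size comparison more concrete, here is a minimal Python sketch, not the simulator behind the figures. It assumes only RAW dependences remain (renaming removes WAW/WAR, branches are perfectly predicted, so BEQZ imposes no control dependence), unit latency, and unlimited issue width. The names raw_producers, ipc, and PROGRAM are hypothetical, and the six-instruction input is the sequence from the superscalar example, encoded by hand as (destination, sources).

```python
# Toy model (assumption-laden sketch): count cycles for a straight-line
# instruction sequence when only true (RAW) dependences constrain issue.

def raw_producers(program):
    """For each instruction, return the set of earlier instruction indices
    whose results it reads (its RAW producers)."""
    last_writer = {}          # register name -> index of most recent writer
    deps = []
    for i, (dest, srcs) in enumerate(program):
        deps.append({last_writer[r] for r in srcs if r in last_writer})
        if dest is not None:
            last_writer[dest] = i
    return deps

def ipc(program, window_size=None):
    """Instructions per clock with unit latency and unlimited issue width.
    window_size=None models the infinite window; otherwise only the
    window_size oldest unissued instructions are candidates each cycle."""
    if window_size is not None and window_size < 1:
        raise ValueError("window_size must be at least 1")
    if not program:
        return 0.0
    deps = raw_producers(program)
    issued = set()
    pending = list(range(len(program)))   # unissued, in program order
    cycles = 0
    while pending:
        cycles += 1
        window = pending if window_size is None else pending[:window_size]
        # An instruction may issue once all of its producers issued in an
        # EARLIER cycle; 'issued' is only updated after this scan.
        ready = [i for i in window if deps[i] <= issued]
        issued.update(ready)
        pending = [i for i in pending if i not in issued]
    return len(program) / cycles

# Hand-encoded version of the example sequence: (destination, sources).
PROGRAM = [
    ("R1", ("R2",)),        # LW   R1,40(R2)
    ("R3", ("R4", "R5")),   # ADD  R3,R4,R5
    ("R6", ("R3", "R7")),   # ADD  R6,R3,R7
    (None, ("R10",)),       # BEQZ R10,L   (perfectly predicted)
    ("R8", ("R10",)),       # LW   R8,0(R10)
    ("R9", ("R8",)),        # LW   R9,0(R8)
]

if __name__ == "__main__":
    for w in (None, 2, 1):
        print("window", w, "IPC", ipc(PROGRAM, w))
```

With this input the sketch gives an IPC of 3.0 for the infinite window, 1.5 for a window of 2, and 1.0 for a window of 1, which mirrors in miniature how shrinking the window hides independent instructions from the issue logic. Real limit studies also model issue width, latencies, and prediction accuracy, all of which this toy model leaves out.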