CMSC 411 Computer Systems Architecture
Lecture 13: Instruction Level Parallelism 7 (Limits to ILP, Threading)
CMSC 411 11a (from Patterson); CS252 S05


Limits to ILP

- Conflicting studies of amount of ILP, differing in:
    - Benchmarks (vectorized Fortran FP vs. integer C programs)
    - Hardware sophistication
    - Compiler sophistication
- How much ILP is available using existing mechanisms with increasing HW budgets?
- Do we need to invent new HW/SW mechanisms to keep on the processor performance curve?
    - Intel MMX, SSE (Streaming SIMD Extensions): 64-bit ints
    - Intel SSE2: 128 bit, including 2 64-bit FP per clock
    - Motorola AltiVec: 128-bit ints and FPs
    - Supersparc multimedia ops, etc.


Overcoming Limits

- Advances in compiler technology, together with significantly new and different hardware techniques, may be able to overcome the limitations assumed in these studies.
- However, it is unlikely that such advances, when coupled with realistic hardware, will overcome these limits in the near future.


Limits to ILP

Initial HW model here; MIPS compilers.
Assumptions for the ideal/perfect machine to start:

1. Register renaming: infinite virtual registers, so all register WAW and WAR hazards are avoided.
2. Branch prediction: perfect; no mispredictions.
3. Jump prediction: all jumps perfectly predicted (returns, case statements).
   2 and 3 together: no control dependencies; perfect speculation and an unbounded buffer of instructions available.
4. Memory-address alias analysis: addresses known, and a load can be moved before a store provided the addresses are not equal.
   1 and 4 together eliminate all but RAW.

Also: perfect caches; 1-cycle latency for all instructions (including FP); unlimited instructions issued per clock cycle.


Limits to ILP: HW Model comparison

                                 Model      Power 5
  Instructions issued per clock  Infinite   4
  Instruction window size        Infinite   200
  Renaming registers             Infinite   48 integer + 40 Fl. Pt.
  Branch prediction              Perfect    2% to 6% misprediction (Tournament Branch Predictor)
  Cache                          Perfect    64KI, 32KD, 1.92MB L2, 36 MB L3
  Memory alias analysis          Perfect    (not stated)


Upper Limit to ILP: Ideal Machine (Figure 3.1, SPEC92)

Instructions per clock (IPC). Integer: 18 - 60. FP: 75 - 150.

  Program    IPC
  gcc        54.8
  espresso   62.6
  li         17.9
  fpppp      75.2
  doduc      118.7
  tomcatv    150.1


More Realistic HW: Window Impact (Figure 3.2, SPEC92)

Change from infinite window to 2048, 512, 128, 32.
New Model (other rows as in the ideal Model above):
  Instructions issued per clock  Infinite
  Instruction window size        Infinite, 2K, 512, 128, 32
  Renaming registers             Infinite
  Branch prediction              Perfect
  Cache                          Perfect
  Memory alias                   Perfect

Integer: 8 - 63. FP: 9 - 150.
[Figure 3.2: IPC by program (gcc, espresso, li, fpppp, doduc, tomcatv); bars for window sizes Infinite, 2048, 512, 128, 32]


More Realistic HW: Branch Impact (Figure 3.3, SPEC92)

Change from infinite window to examine to 2048, and maximum issue of 64 instructions per clock cycle.
New Model:
  Instructions issued per clock  64
  Instruction window size        2048
  Renaming registers             Infinite
  Branch prediction              Perfect vs. 8K Tournament vs. 512 2-bit vs. profile vs. none
  Cache                          Perfect
  Memory alias                   Perfect

Integer: 6 - 12. FP: 15 - 45.
[Figure 3.3: IPC by program; bars for Perfect, Selective predictor (Tournament), Standard 2-bit (BHT 512), Static (Profile), None (No prediction)]


Misprediction Rates (Figure 3.4, SPEC92)

[Figure 3.4: misprediction rate by program (tomcatv, doduc, fpppp, li, espresso, gcc); bars for Profile-based, 2-bit counter, Tournament]


More Realistic HW: Renaming Register Impact (N int + N fp) (Figure 3.5, SPEC92)

Change: 2048 instr window, 64 instr issue, 8K 2-level prediction.
New Model:
  Instructions issued per clock  64
  Instruction window size        2048
  Renaming registers             Infinite v. 256, 128, 64, 32, none
  Branch prediction              8K 2-bit
  Cache                          Perfect
  Memory alias                   Perfect

Integer: 5 - 15. FP: 11 - 45.
[Figure 3.5: IPC by program; bars for Infinite, 256, 128, 64, 32, None renaming registers]


More Realistic HW: Memory Address Alias Impact (Figure 3.6, SPEC92)

Change: 2048 instr window, 64 instr issue, 8K 2-level prediction, 256 renaming registers.
New Model:
  Instructions issued per clock  64
  Instruction window size        2048
  Renaming registers             256 Int + 256 FP
  Branch prediction              8K 2-bit
  Cache                          Perfect
  Memory alias                   Perfect v. Stack v. Inspect v. none

Integer: 4 - 9. FP: 4 - 45 (Fortran, no heap).
[Figure 3.6: IPC by program; bars for Perfect, Global/Stack perfect with heap conflicts, Inspection (Assem.), None]


How to Exceed ILP Limits of this study?

- These are not laws of physics; they are just practical limits for today, and may be overcome via research.
- Compiler and ISA advances could change the results.
- WAR and WAW hazards through memory: the study eliminated WAW and WAR hazards through register renaming, but not in memory usage.
    - Conflicts can arise via allocation of stack frames, as a called procedure reuses the memory addresses of a previous frame on the stack.


HW v. SW to increase ILP

- Memory disambiguation: HW best
- Speculation:
    - HW best when dynamic branch prediction is better than compile-time prediction
    - Exceptions easier for HW
    - HW doesn't need bookkeeping code or compensation code
    - Very complicated to get


Performance beyond single thread ILP
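
As a rough illustration of the limit-study idea used throughout these slides, the sketch below schedules a tiny instruction trace under the ideal-machine assumptions (perfect renaming so that only RAW dependences matter, perfect prediction, unit latency, unlimited issue) and then again with a finite instruction window. The trace, the function names `schedule` and `ipc`, and the sliding-window rule are all hypothetical simplifications chosen for this example; they are not the simulator or the data behind Figures 3.1 to 3.6.

```python
def schedule(trace, window=None):
    """Return the issue cycle (1-based) of each instruction in the trace.

    trace  : list of (dest, srcs) register names in program order.
    window : None for an unbounded window; otherwise instruction i may not
             issue until every instruction at position <= i - window has
             issued (a simplified sliding window, assumed for this sketch).

    Assumed ideal machine: perfect renaming (only RAW dependences are
    tracked), perfect branch prediction, 1-cycle latency, unlimited issue.
    """
    last_writer = {}   # register name -> index of its most recent producer
    cycles = []        # issue cycle of each instruction
    oldest_done = 0    # latest issue cycle among instructions that left the window

    for i, (dest, srcs) in enumerate(trace):
        ready = 1
        # RAW: wait one cycle after the true producer of each source.
        for s in srcs:
            if s in last_writer:
                ready = max(ready, cycles[last_writer[s]] + 1)
        # Finite window: wait until instruction i - window has issued.
        if window is not None and i >= window:
            oldest_done = max(oldest_done, cycles[i - window])
            ready = max(ready, oldest_done + 1)
        cycles.append(ready)
        # WAW/WAR are ignored: a later write simply becomes the new producer.
        if dest is not None:
            last_writer[dest] = i
    return cycles


def ipc(trace, window=None):
    """Instructions per clock = instruction count / last issue cycle."""
    return len(trace) / max(schedule(trace, window))


# Hypothetical 8-instruction trace; r1 is written twice to show that
# renaming removes the WAW/WAR hazards between instructions 0, 1 and 2.
trace = [
    ("r1", []),
    ("r2", ["r1"]),
    ("r1", []),
    ("r3", ["r1"]),
    ("r4", ["r2", "r3"]),
    ("r5", []),
    ("r6", ["r5"]),
    ("r7", []),
]

print(f"ideal IPC:    {ipc(trace):.2f}")            # 2.67
print(f"window=2 IPC: {ipc(trace, window=2):.2f}")  # 1.60
```

Shrinking the window lowers IPC here for the same reason as in the window-impact study: independent instructions later in the trace cannot be seen until older ones have issued.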