UMD CMSC 411 - Lecture 13 Instruction Level Parallelism 7

CS252 S05
CMSC 411 Computer Systems Architecture
Lecture 13: Instruction Level Parallelism 7 (Limits to ILP & Threading)
CMSC 411 - 11a (from Patterson)

Limits to ILP
• Conflicting studies of amount of ILP
  – Benchmarks
    » vectorized Fortran FP vs. integer C programs
  – Hardware sophistication
  – Compiler sophistication
• How much ILP is available using existing mechanisms with increasing HW budgets?
• Do we need to invent new HW/SW mechanisms to keep on processor performance curve?
  – Intel MMX, SSE (Streaming SIMD Extensions): 64 bit ints
  – Intel SSE2: 128 bit, including 2 64-bit FP per clock
  – Motorola AltiVec: 128 bit ints and FPs
  – Supersparc Multimedia ops, etc.

Overcoming Limits
• Advances in compiler technology + significantly new and different hardware techniques may be able to overcome limitations assumed in studies
• However, unlikely such advances when coupled with realistic hardware will overcome these limits in near future

Limits to ILP
• Initial HW Model here; MIPS compilers.
• Assumptions for ideal/perfect machine to start:
  1. Register renaming – infinite virtual registers ⇒ all register WAW & WAR hazards are avoided
  2. Branch prediction – perfect; no mispredictions
  3. Jump prediction – all jumps perfectly predicted (returns, case statements)
     2 & 3 ⇒ no control dependencies; perfect speculation & an unbounded buffer of instructions available
  4. Memory-address alias analysis – addresses known & a load can be moved before a store provided addresses not equal; 1 & 4 eliminates all but RAW
• Also: perfect caches; 1 cycle latency for all instructions (FP *,/); unlimited instructions issued/clock cycle

Limits to ILP HW Model comparison
                                Model      Power 5
Instructions Issued per clock   Infinite   4
Instruction Window Size         Infinite   200
Renaming Registers              Infinite   48 integer + 40 Fl. Pt.
Branch Prediction               Perfect    2% to 6% misprediction (Tournament Branch Predictor)
Cache                           Perfect    64KI, 32KD, 1.92MB L2, 36 MB L3
Memory Alias Analysis           Perfect    ??

Upper Limit to ILP: Ideal Machine
Figure 3.1 – SPEC92. Instructions Per Clock by program:
  gcc 54.8, espresso 62.6, li 17.9, fpppp 75.2, doduc 118.7, tomcatv 150.1
Integer: 18 - 60
FP: 75 - 150

Limits to ILP HW Model comparison
                                New Model                     Model      Power 5
Instructions Issued per clock   Infinite                      Infinite   4
Instruction Window Size         Infinite, 2K, 512, 128, 32    Infinite   200
Renaming Registers              Infinite                      Infinite   48 integer + 40 Fl. Pt.
Branch Prediction               Perfect                       Perfect    2% to 6% misprediction (Tournament Branch Predictor)
Cache                           Perfect                       Perfect    64KI, 32KD, 1.92MB L2, 36 MB L3
Memory Alias                    Perfect                       Perfect    ??

More Realistic HW: Window Impact
Change from Infinite window to 2048, 512, 128, 32
[Figure 3.2 – SPEC92: Instructions Per Clock for gcc, espresso, li, fpppp, doduc, tomcatv; one bar per window size: Infinite, 2048, 512, 128, 32]
Integer: 8 - 63
FP: 9 - 150

Limits to ILP HW Model comparison
                                New Model                                                      Model      Power 5
Instructions Issued per clock   64                                                             Infinite   4
Instruction Window Size         2048                                                           Infinite   200
Renaming Registers              Infinite                                                       Infinite   48 integer + 40 Fl. Pt.
Branch Prediction               Perfect vs. 8K Tournament vs. 512 2-bit vs. profile vs. none   Perfect    2% to 6% misprediction (Tournament Branch Predictor)
Cache                           Perfect                                                        Perfect    64KI, 32KD, 1.92MB L2, 36 MB L3
Memory Alias                    Perfect                                                        Perfect    ??

More Realistic HW: Branch Impact
Change from Infinite window to examine to 2048 and maximum issue of 64 instructions per clock cycle
[Figure 3.3 – SPEC92: IPC for gcc, espresso, li, fpppp, doduc, tomcatv; one bar per predictor: Perfect, Selective predictor (Tournament), Standard 2-bit (BHT (512)), Static (Profile), None (No prediction)]
Integer: 6 - 12
FP: 15 - 45

Misprediction Rates
Figure 3.4 – SPEC92. Misprediction Rate by program and predictor:
                tomcatv  doduc  fpppp  li   espresso  gcc
Profile-based   1%       5%     14%    12%  14%       12%
2-bit counter   1%       16%    18%    23%  18%       30%
Tournament      0%       3%     2%     2%   4%        6%

Limits to ILP HW Model comparison
                                New Model                             Model      Power 5
Instructions Issued per clock   64                                    Infinite   4
Instruction Window Size         2048                                  Infinite   200
Renaming Registers              Infinite v. 256, 128, 64, 32, none    Infinite   48 integer + 40 Fl. Pt.
Branch Prediction               8K 2-bit                              Perfect    Tournament Branch Predictor
Cache                           Perfect                               Perfect    64KI, 32KD, 1.92MB L2, 36 MB L3
Memory Alias                    Perfect                               Perfect    Perfect

More Realistic HW: Renaming Register Impact (N int + N fp)
Change 2048 instr window, 64 instr issue, 8K 2 level Prediction
[Figure 3.5 – SPEC92: IPC for gcc, espresso, li, fpppp, doduc, tomcatv; one bar per renaming register count: Infinite, 256, 128, 64, 32, None]
Integer: 5 - 15
FP: 11 - 45

Limits to ILP HW Model comparison
                                New Model                                 Model      Power 5
Instructions Issued per clock   64                                        Infinite   4
Instruction Window Size         2048                                      Infinite   200
Renaming Registers              256 Int + 256 FP                          Infinite   48 integer + 40 Fl. Pt.
Branch Prediction               8K 2-bit                                  Perfect    Tournament
Cache                           Perfect                                   Perfect    64KI, 32KD, 1.92MB L2, 36 MB L3
Memory Alias                    Perfect v. Stack v. Inspect v. none       Perfect    Perfect

More Realistic HW: Memory Address Alias Impact
Change 2048 instr window, 64 instr issue, 8K 2 level Prediction, 256 renaming registers
[Figure 3.6 – SPEC92: IPC for gcc, espresso, li, fpppp, doduc, tomcatv; one bar per alias analysis model: Perfect; Global/Stack perf, heap conflicts; Inspec. Assem.; None]
Integer: 4 - 9
FP: 4 - 45 (Fortran, no heap)

How to Exceed ILP Limits of this study?
• These are not laws of physics; just practical limits for today, and perhaps overcome via research
• Compiler and ISA advances could change results
• WAR and WAW hazards through memory: eliminated WAW and WAR hazards through register renaming, but not in memory usage
  – Can get conflicts via allocation of stack frames as a called procedure reuses the memory addresses of a previous frame on the stack

HW v. SW to increase ILP
• Memory disambiguation: HW best
• Speculation:
  – HW best when dynamic branch prediction better than compile time prediction
  – Exceptions easier for HW
  – HW
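
As a rough illustration of the ideal-machine assumptions listed earlier (perfect renaming, perfect branch and jump prediction, perfect alias analysis, 1-cycle latency, unlimited issue), the Python sketch below estimates IPC for a tiny hand-made instruction trace when only RAW dependences and the instruction window size limit issue. The names Instr and ideal_ipc and the eight-instruction trace are hypothetical, invented for this illustration; they do not come from the lecture or from the SPEC92 study. The window model (refill in program order at the start of every cycle) is also a simplifying assumption.

```python
from collections import namedtuple

# Hypothetical trace record: one destination register, zero or more source registers.
Instr = namedtuple("Instr", ["dst", "srcs"])


def ideal_ipc(trace, window=None):
    """Estimate IPC assuming RAW-only dependences, 1-cycle latency, unlimited issue.

    window=None models an infinite instruction window; otherwise at most
    `window` un-issued instructions are visible to the scheduler at once.
    """
    n = len(trace)
    if n == 0:
        return 0.0

    # Perfect renaming: an instruction depends only on the most recent earlier
    # writer of each source register (RAW). WAW and WAR are ignored.
    last_writer = {}
    producers = []
    for i, ins in enumerate(trace):
        producers.append([last_writer[s] for s in ins.srcs if s in last_writer])
        last_writer[ins.dst] = i

    issue_cycle = [None] * n   # cycle in which each instruction issued
    in_window = []             # indices fetched but not yet issued
    next_fetch = 0
    issued = 0
    cycle = 0
    limit = n if window is None else window

    while issued < n:
        cycle += 1
        # Refill the window in program order.
        while next_fetch < n and len(in_window) < limit:
            in_window.append(next_fetch)
            next_fetch += 1
        # An instruction is ready when every producer issued in an earlier cycle.
        ready = [
            i for i in in_window
            if all(issue_cycle[p] is not None and issue_cycle[p] < cycle
                   for p in producers[i])
        ]
        for i in ready:
            issue_cycle[i] = cycle
        in_window = [i for i in in_window if issue_cycle[i] is None]
        issued += len(ready)

    return n / cycle


# Made-up trace: a 4-instruction dependence chain followed by 4 independent
# instructions. The fifth instruction reuses r1, which creates only WAW/WAR
# hazards, so under perfect renaming it does not wait for the chain.
trace = [
    Instr("r1", ()),
    Instr("r2", ("r1",)),
    Instr("r3", ("r2",)),
    Instr("r4", ("r3",)),
    Instr("r1", ()),
    Instr("r6", ()),
    Instr("r7", ()),
    Instr("r8", ()),
]

for w in (None, 2, 1):
    print("window", w, "IPC", round(ideal_ipc(trace, w), 2))
```

On this toy trace the estimate falls from 2.0 IPC with an unbounded window to 1.0 IPC with a one-entry window. The direction matches the window-size experiment in the slides, but the magnitudes say nothing about real programs.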

