UMD CMSC 411 - Lecture 11a Instruction Level Parallelism - D2956778

School name University of Maryland, College Park

Course Cmsc 411- Computer Systems Architecture

Pages 8


CMSC 411 Computer Systems Architecture
Lecture 11a
Instruction Level Parallelism (cont.)
Alan Sussman
CMSC 411 - 11a (from Patterson)

• Homework #3 due Thursday, March 26
• Finish Chapter 3 of H&P, and start Appendix C

Memory reference speculation with hardware support
• To move loads across stores, when compiler can't be sure it is legal
• Use a speculative load instruction
  – hardware saves address of memory location
  – if a subsequent store changes that location before the check (to end the speculation), then the speculation failed, otherwise it succeeded
  – on failure, need to redo load and re-execute all speculated instructions after the speculative load

Superscalar execution
• Predication helps with scheduling
• Example: superscalar that can issue 1 memory reference and 1 ALU op per cycle, or just 1 branch

  1st instruction      2nd instruction
  LW   R1,40(R2)       ADD R3,R4,R5
                       ADD R6,R3,R7
  BEQZ R10,L
  LW   R8,0(R10)
  LW   R9,0(R8)

  LWC loads if 3rd operand not 0:

  1st instruction      2nd instruction
  LW   R1,40(R2)       ADD R3,R4,R5
  LWC  R8,0(R10),R10   ADD R6,R3,R7
  BEQZ R10,L
  LW   R9,0(R8)

Limitations to ILP

Limits to ILP
• Conflicting studies of amount
  – Benchmarks (vectorized Fortran FP vs. integer C programs)
  – Hardware sophistication
  – Compiler sophistication
• How much ILP is available using existing mechanisms with increasing HW budgets?
• Do we need to invent new HW/SW mechanisms to keep on processor performance curve?
  – Intel MMX, SSE (Streaming SIMD Extensions): 64 bit ints
  – Intel SSE2: 128 bit, including 2 64-bit Fl. Pt. per clock
  – Motorola AltiVec: 128 bit ints and FPs
  – Supersparc Multimedia ops, etc.

Overcoming Limits
• Advances in compiler technology + significantly new and different hardware techniques may be able to overcome limitations assumed in studies
• However, unlikely such advances when coupled with realistic hardware will overcome these limits in near future

Limits to ILP
Initial HW Model here; MIPS compilers.
Assumptions for ideal/perfect machine to start:
1. Register renaming – infinite virtual registers => all register WAW & WAR hazards are avoided
2. Branch prediction – perfect; no mispredictions
3. Jump prediction – all jumps perfectly predicted (returns, case statements)
   2 & 3 ⇒ no control dependencies; perfect speculation & an unbounded buffer of instructions available
4. Memory-address alias analysis – addresses known & a load can be moved before a store provided addresses not equal; 1 & 4 eliminates all but RAW
Also: perfect caches; 1 cycle latency for all instructions (FP *,/); unlimited instructions issued/clock cycle

Limits to ILP HW Model comparison

                               Model      Power 5
  Instructions Issued per clock  Infinite   4
  Instruction Window Size        Infinite   200
  Renaming Registers             Infinite   48 integer + 40 Fl. Pt.
  Branch Prediction              Perfect    2% to 6% misprediction (Tournament Branch Predictor)
  Cache                          Perfect    64KI, 32KD, 1.92MB L2, 36 MB L3
  Memory Alias Analysis          Perfect    ??

Upper Limit to ILP: Ideal Machine
Figure 3.1 – SPEC92. Instructions Per Clock by program:

  gcc       54.8
  espresso  62.6
  li        17.9
  fpppp     75.2
  doduc     118.7
  tomcatv   150.1

Integer: 18 - 60
FP: 75 - 150

Limits to ILP HW Model comparison

                               New Model                   Model      Power 5
  Instructions Issued per clock  Infinite                    Infinite   4
  Instruction Window Size        Infinite, 2K, 512, 128, 32  Infinite   200
  Renaming Registers             Infinite                    Infinite   48 integer + 40 Fl. Pt.
  Branch Prediction              Perfect                     Perfect    2% to 6% misprediction (Tournament Branch Predictor)
  Cache                          Perfect                     Perfect    64KI, 32KD, 1.92MB L2, 36 MB L3
  Memory Alias                   Perfect                     Perfect    ??

More Realistic HW: Window Impact
Change from Infinite window 2048, 512, 128, 32
[Figure 3.2 – SPEC92: Instructions Per Clock (IPC) for gcc, espresso, li, fpppp, doduc, tomcatv; one bar per window size: Infinite, 2048, 512, 128, 32]
Integer: 8 - 63
FP: 9 - 150

Limits to ILP HW Model comparison

                               New Model                                                     Model      Power 5
  Instructions Issued per clock  64                                                            Infinite   4
  Instruction Window Size        2048                                                          Infinite   200
  Renaming Registers             Infinite                                                      Infinite   48 integer + 40 Fl. Pt.
  Branch Prediction              Perfect vs. 8K Tournament vs. 512 2-bit vs. profile vs. none  Perfect    2% to 6% misprediction (Tournament Branch Predictor)
  Cache                          Perfect                                                       Perfect    64KI, 32KD, 1.92MB L2, 36 MB L3
  Memory Alias                   Perfect                                                       Perfect    ??

More Realistic HW: Branch Impact
Change from Infinite window to examine to 2048 and maximum issue of 64 instructions per clock cycle
[Figure 3.3 – SPEC92: IPC for gcc, espresso, li, fpppp, doduc, tomcatv; one bar per predictor: Perfect, Tournament (selective predictor), BHT (512) (standard 2-bit), Profile (static), No prediction]
Integer: 6 - 12
FP: 15 - 45

Misprediction Rates
[Figure 3.4 – SPEC92: Misprediction Rate for tomcatv, doduc, fpppp, li, espresso, gcc; one bar per predictor: Profile-based, 2-bit counter, Tournament]

Limits to ILP HW Model comparison

                               New Model                            Model      Power 5
  Instructions Issued per clock  64                                   Infinite   4
  Instruction Window Size        2048                                 Infinite   200
  Renaming Registers             Infinite v. 256, 128, 64, 32, none   Infinite   48 integer + 40 Fl. Pt.
  Branch Prediction              8K 2-bit                             Perfect    Tournament Branch Predictor
  Cache                          Perfect                              Perfect    64KI, 32KD, 1.92MB L2, 36 MB L3
  Memory Alias                   Perfect                              Perfect    Perfect

More Realistic HW: Renaming Register Impact (N int + N fp)
Change 2048 instr window, 64 instr issue, 8K 2 level Prediction
[Figure 3.5 – SPEC92: IPC for gcc, espresso, li, fpppp, doduc, tomcatv; one bar per renaming register count: Infinite, 256, 128, 64, 32, None]
Integer: 5 - 15
FP: 11 - 45

Limits to ILP HW Model comparison
New
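
The window-size experiment (Figure 3.2) can be imitated with a toy model. The sketch below is not the simulator behind the figures. It assumes the ideal machine from the slides (perfect renaming so only RAW dependences remain, 1 cycle latency, unlimited issue per clock) and adds a single knob, the instruction window size. The function name `schedule_cycles`, the `(dest, sources)` encoding of an instruction, and the sample program are all hypothetical, chosen only for illustration.

```python
def schedule_cycles(instrs, window=None):
    """Return the cycles needed to issue `instrs` on a toy ideal machine.

    instrs: list of (dest, [sources]) pairs in program order (assumed encoding).
    window: how many of the oldest unissued instructions the hardware can
            examine each cycle; None models an infinite window.
    """
    if window is not None and window < 1:
        raise ValueError("window must be at least 1")

    # Perfect renaming: the only dependence kept is RAW, i.e. each source
    # depends on the most recent earlier instruction that wrote it.
    last_writer = {}
    producers = []
    for i, (dest, srcs) in enumerate(instrs):
        producers.append({last_writer[s] for s in srcs if s in last_writer})
        last_writer[dest] = i

    issued_in = {}                      # instruction index -> issue cycle
    pending = list(range(len(instrs)))  # unissued instructions, program order
    cycle = 0
    while pending:
        cycle += 1
        visible = pending if window is None else pending[:window]
        # Ready = every producer issued in an earlier cycle (1-cycle latency).
        ready = [i for i in visible
                 if all(issued_in.get(p, cycle) < cycle for p in producers[i])]
        for i in ready:                 # unlimited issue width
            issued_in[i] = cycle
        pending = [i for i in pending if i not in issued_in]
    return cycle


# Two independent 4-instruction dependence chains, laid out one after the other.
chain_a = [("a0", []), ("a1", ["a0"]), ("a2", ["a1"]), ("a3", ["a2"])]
chain_b = [("b0", []), ("b1", ["b0"]), ("b2", ["b1"]), ("b3", ["b2"])]
program = chain_a + chain_b

for w in (None, 4, 2):
    cycles = schedule_cycles(program, w)
    print(w, cycles, round(len(program) / cycles, 2))
# prints:
# None 4 2.0
# 4 5 1.6
# 2 7 1.14
```

With an infinite window both chains overlap fully (2 instructions per clock). As the window shrinks, chain B stays hidden behind chain A for longer and the achieved instructions per clock drops, which is the same qualitative trend the window-impact slide reports for the SPEC92 programs.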
