NYU CSCI-GA 2243 - Multiple Issue Processors - D548666



School name New York University

Course Csci-Ga 2243- High Performance Computer Arch


G22.2243-001 High Performance Computer Architecture
Lecture 6: Multiple Issue Processors
October 10, 2007

Outline
• Announcements
  – Assignment 2 due back now
  – Lab 1 due next week @ 5pm
• Multiple Issue Processors
  [Hennessy/Patterson CA:AQA (4th Edition): Chapters 2 and 3]

Multiple Issue Processors

Issuing Multiple Instructions Per Cycle
• Why?
  – All of the schemes described so far can at best achieve 1 instruction/cycle
  – Increasing transistor budgets and parallelism in instruction streams (independent instructions) have pushed for multiple instructions/cycle
Two variations
• Superscalar: varying number of instructions/cycle (1 to 8),
  – static (requires compiler support for most benefit), or dynamic
  – with or without speculation (implies hardware scheduling)
  – IBM Power2, Sun UltraSPARC, Pentium III/4, DEC Alpha, HP 8000
• (Very) Long Instruction Words (V)LIW: fixed set of instructions (4-16)
  – scheduled by the compiler: put ops into wide templates
  – i860, IA64 (with some hardware support)
• New metric of performance: Instructions Per Clock cycle (IPC) vs. CPI

Statically Scheduled Superscalar MIPS Processor
Superscalar MIPS: 2 instructions; 1 FP op, 1 other
• Instruction issue
  – Fetch 64 bits/clock cycle
    • Need to handle cache-line complications
  – Hardware determines whether 0, 1, or 2 instructions can be issued
    • Can only issue 2nd instruction if 1st instruction issues
• Hazard detection
  – Likelihood of hazards between two instructions in a packet
    • Simple solution: treat this as a structural hazard (issue only 1 of them)
  – 1-instruction load delay can expand to 3-instruction delay in 2-way SS
    • 2nd instruction in the pair can't use it without stall, nor 2 instructions in next slot
  – Branch delay becomes 2 cycles (2nd inst. is branch) or 3 cycles (1st inst. is branch)
• Execution
  – Additional (or pipelined) functional units to derive benefit
  – Additional port for FP registers to do FP load or FP store and FP op
  – More bypass paths

2-way Static Superscalar MIPS Pipeline

    Instruction Type   Pipe Stages
    Integer            IF  ID  EX  MEM WB
    FP                 IF  ID  EX  MEM WB
    Integer                IF  ID  EX  MEM WB
    FP                     IF  ID  EX  MEM WB
    Integer                    IF  ID  EX  MEM WB
    FP                         IF  ID  EX  MEM WB
    Integer                        IF  ID  EX  MEM WB
    FP                             IF  ID  EX  MEM WB

Pipeline Scheduling and Loop Unrolling
• Rare to find the ideal instruction mix of the previous slide
• Modern-day compilers apply several optimizations so as to expose Instruction Level Parallelism (ILP)

    for (i=1000; i>0; i--)
        x[i] = x[i] + s

    L1: L.D    F0, 0(R1)
        ADD.D  F4, F0, F2
        S.D    F4, 0(R1)
        DADDUI R1, R1, #-8
        BNE    R1, R2, L1

    Instruction               Issue Cycle
    L1: L.D    F0, 0(R1)      1
        stall                 2
        ADD.D  F4, F0, F2     3
        stall                 4
        stall                 5
        S.D    F4, 0(R1)      6
        DADDUI R1, R1, #-8    7
        stall                 8
        BNE    R1, R2, L1     9
        stall                 10

    [Figure annotation: 3 cycles]

Pipeline Scheduling and Loop Unrolling (cont'd)
• Loop unrolling optimization: replicate the loop body multiple times, adjusting the loop termination code

    L1: L.D    F0, 0(R1)
        ADD.D  F4, F0, F2
        S.D    F4, 0(R1)
        L.D    F6, -8(R1)
        ADD.D  F8, F6, F2
        S.D    F8, -8(R1)
        L.D    F10, -16(R1)
        ADD.D  F12, F10, F2
        S.D    F12, -16(R1)
        L.D    F14, -24(R1)
        ADD.D  F16, F14, F2
        S.D    F16, -24(R1)
        DADDUI R1, R1, #-32
        BNE    R1, R2, L1

    Instruction               Issue Cycle
    L1: L.D    F0, 0(R1)      1
        L.D    F6, -8(R1)     2
        L.D    F10, -16(R1)   3
        L.D    F14, -24(R1)   4
        ADD.D  F4, F0, F2     5
        ADD.D  F8, F6, F2     6
        ADD.D  F12, F10, F2   7
        ADD.D  F16, F14, F2   8
        S.D    F4, 0(R1)      9
        S.D    F8, -8(R1)     10
        DADDUI R1, R1, #-32   11
        S.D    F12, 16(R1)    12
        BNE    R1, R2, L1     13
        S.D    F16, 8(R1)     14

Loop Unrolling for Superscalar Processors
• Unroll loop 5 times to avoid extra 1-cycle delays in 2-way SS

    L1: L.D    F0, 0(R1)
        ADD.D  F4, F0, F2
        S.D    F4, 0(R1)
        L.D    F6, -8(R1)
        ADD.D  F8, F6, F2
        S.D    F8, -8(R1)
        L.D    F10, -16(R1)
        ADD.D  F12, F10, F2
        S.D    F12, -16(R1)
        L.D    F14, -24(R1)
        ADD.D  F16, F14, F2
        S.D    F16, -24(R1)
        L.D    F18, -32(R1)
        ADD.D  F20, F18, F2
        S.D    F20, -32(R1)
        DADDUI R1, R1, #-40
        BNE    R1, R2, L1

    Integer Instruction       FP Instruction        Issue Cycle
    L1: L.D    F0, 0(R1)                            1
        L.D    F6, -8(R1)                           2
        L.D    F10, -16(R1)   ADD.D F4, F0, F2      3
        L.D    F14, -24(R1)   ADD.D F8, F6, F2      4
        L.D    F18, -32(R1)   ADD.D F12, F10, F2    5
        S.D    F4, 0(R1)      ADD.D F16, F14, F2    6
        S.D    F8, -8(R1)     ADD.D F20, F18, F2    7
        S.D    F12, -16(R1)                         8
        DADDUI R1, R1, #-40                         9
        S.D    F16, 16(R1)                          10
        BNE    R1, R2, L1                           11
        S.D    F20, 8(R1)                           12

Loop unrolling: 10 to 3.5 cycles/iteration
SS: 3.5 cycles/iteration to 2.4 (1.5 times improvement)

Dynamic Scheduling in Superscalar Processors
• How to extend Tomasulo's algorithm?
• General solution:
  – Allow issue stage to work faster than rest of architecture
    • Achieved by both pipelining and widening the issue logic
    • Instructions issued, reservation stations allocated in order
  – Rest of the design already supports overlapped execution
  – Need wider CDB to store multiple results/cycle
• Need to allow multiple instruction commits per clock cycle

Limits of Superscalar Processors
• While the Integer/FP split is simple for the HW, get CPI of 0.5 only for programs with:
  – Exactly 50% FP operations
  – No hazards
• Need more instructions to issue at the same time to get improved performance
  – However, greater difficulty of decode and issue
  – Even 2-way superscalar => examine 2 opcodes, 6 register specifiers, and decide if 1 or 2 instructions can issue
• Issue rates of modern processors vary between 2 and 8 instructions/cycle
• Motivation for VLIW and EPIC processors …

VLIW/EPIC Architectures
• Very Long Instruction Word (VLIW)
  – Processor can initiate multiple operations per cycle
  – Specified completely by the compiler (unlike superscalar machines)
  – Hardware is simple: issues the packet given to it by the compiler
• Explicitly Parallel Instruction Computing (EPIC)
  – VLIW + new features
    • predication, rotating registers, speculation, etc.
• More later about compiling for VLIW/EPIC

    r1 = L r4 | r2 = Add r1,M | f1 = Mul f1,f2 | r5 = Add r5,4

Studies of the Limitations of ILP
• Is there that much parallelism in programs?
• Start off with a hardware model of an ideal processor
  1. Register renaming – infinite virtual registers and all WAW & WAR hazards are avoided
  2. Branch prediction – perfect; no mispredictions
  3. Jump prediction – all jumps perfectly predicted
     2 and 3 => no control dependencies == machine with perfect speculation == an unbounded buffer of instructions available for execution
  4. Memory-address alias analysis – addresses are known and a load can be moved before a store provided addresses not equal
     1 and 4 => only true data dependencies
• 1 cycle latency for all instructions
• Perfect caches (loads and stores complete in one cycle)

ILP Limit for Six SPEC92 Benchmarks
• Fair bit of instruction-level
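
The ideal-processor model from the "Studies of the Limitations of ILP" slide can be made concrete with a small sketch. If assumptions 1 to 4 hold, only true data dependences remain, and with 1-cycle latency each instruction can execute one cycle after the last of its producers. The available ILP is then the instruction count divided by the length of the longest dependence chain. The Python below is illustrative only: the function name `ideal_ilp`, the (dest, sources) encoding, and the decision to ignore memory dependences entirely are assumptions made here, not part of the lecture.

```python
def ideal_ilp(instrs):
    """Estimate ILP on an idealized machine.

    instrs: list of (dest, sources) tuples in program order, where dest is a
    register name or None and sources is a list of register names.
    Assumes (hypothetically) perfect renaming, perfect prediction, no memory
    dependences, and 1-cycle latency for every instruction.
    """
    produced_in = {}  # register -> cycle in which its most recent value is produced
    depth = 0         # length of the longest true-dependence chain seen so far
    for dest, sources in instrs:
        # An instruction executes one cycle after its latest producer;
        # registers never written in this trace are ready at cycle 0.
        cycle = 1 + max((produced_in.get(s, 0) for s in sources), default=0)
        if dest is not None:
            # Overwriting models renaming: WAW and WAR never delay anything.
            produced_in[dest] = cycle
        depth = max(depth, cycle)
    return len(instrs) / depth if instrs else 0.0


# One iteration of the x[i] = x[i] + s loop from the earlier slides.
loop_body = [
    ("F0", ["R1"]),        # L.D    F0, 0(R1)
    ("F4", ["F0", "F2"]),  # ADD.D  F4, F0, F2
    (None, ["F4", "R1"]),  # S.D    F4, 0(R1)
    ("R1", ["R1"]),        # DADDUI R1, R1, #-8
    (None, ["R1", "R2"]),  # BNE    R1, R2, L1
]

# The longest chain in one iteration is L.D -> ADD.D -> S.D (3 cycles).
print(ideal_ilp(loop_body))
# Two back-to-back iterations expose more overlap across iterations.
print(ideal_ilp(loop_body * 2))
```

Processing instructions in program order and keeping only the latest producer per register is what stands in for unlimited renaming here; a real limit study would also track memory addresses and window size.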
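
The cycles-per-iteration figures quoted on the loop-unrolling slides are ratios of schedule length to the number of original loop iterations the schedule covers. A minimal Python sketch of that arithmetic follows; the function name `cycles_per_iteration` and the variable names are chosen here for illustration.

```python
def cycles_per_iteration(schedule_cycles, iterations_covered):
    """Cycles spent per original loop iteration for a given schedule."""
    return schedule_cycles / iterations_covered


# Schedule lengths taken from the issue-cycle tables in the slides.
unscheduled = cycles_per_iteration(10, 1)   # original loop, with stalls
unrolled_x4 = cycles_per_iteration(14, 4)   # unrolled 4 times, single issue
unrolled_ss = cycles_per_iteration(12, 5)   # unrolled 5 times, 2-way superscalar

print(unscheduled, unrolled_x4, unrolled_ss)
# Improvement of the superscalar schedule over the unrolled single-issue one;
# the slides round this ratio to 1.5.
print(round(unrolled_x4 / unrolled_ss, 2))
```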
