Lecture 11: ILP Innovations and SMT

• Today: out-of-order example, ILP innovations, SMT (Sections 3.9-3.10 and supplementary notes)
• HW 4 due on Tuesday

OOO Example

• Assumptions same as HW 4, except there are 36 physical registers and 32 logical registers, and width is 4
• Estimate the issue time, completion time, and commit time for the sample code

Assumptions

• Perfect branch prediction, instruction fetch, and caches
• An instruction dependent on an ADD has no stall; an instruction dependent on a LD has a one-cycle stall
• An instruction is placed in the issue queue (IQ) at the end of its 5th stage; an instruction takes 5 more stages after leaving the IQ (ld/st instructions take 6 more stages after leaving the IQ)

OOO Example: Renaming

  Original code      Renamed code
  ADD R1, R2, R3     ADD P33, P2, P3
  LD  R2, 8(R1)      LD  P34, 8(P33)
  ADD R2, R2, 8      ADD P35, P34, 8
  ST  R1, (R3)       ST  P33, (P3)
  SUB R1, R1, R5     SUB P36, P33, P5
  LD  R1, 8(R2)      must wait (all 36 physical registers are in use)
  ADD R1, R1, R2     must wait

OOO Example: Timing

  Original code      Renamed code       InQ   Iss   Comp  Comm
  ADD R1, R2, R3     ADD P33, P2, P3    i     i+1   i+6   i+6
  LD  R2, 8(R1)      LD  P34, 8(P33)    i     i+2   i+8   i+8
  ADD R2, R2, 8      ADD P35, P34, 8    i     i+4   i+9   i+9
  ST  R1, (R3)       ST  P33, (P3)      i     i+2   i+8   i+9
  SUB R1, R1, R5     SUB P36, P33, P5   i+1   i+2   i+7   i+9
  LD  R1, 8(R2)      LD  P1, 8(P35)     i+7   i+8   i+14  i+14
  ADD R1, R1, R2     ADD P2, P1, P35    i+9   i+10  i+15  i+15

Reducing Stalls in Rename/Regfile

• Larger ROB/register file/issue queue
• Runahead: while a long-latency instruction waits, let a thread run ahead to prefetch (this thread can deallocate resources more aggressively than a processor supporting precise execution)
• Two-level register files: values being kept around in the register file for
precise exceptions can be moved to the 2nd level

Stalls in Issue Queue

• Two-level issue queues: the 2nd level contains instructions that are less likely to be woken up in the near future
• Value prediction: tries to circumvent RAW hazards
• Memory dependence prediction: allows a load to execute even if there are prior stores with unresolved addresses
• Load hit prediction: dependent instructions are scheduled early, assuming that the load will hit in the cache

Functional Units

• Clustering: allows quick bypass among a small group of functional units; FUs can also be associated with a subset of the register file and issue queue

Thread-Level Parallelism

• Motivation: a single thread leaves a processor under-utilized most of the time; doubling processor area barely improves single-thread performance
• Strategies for thread-level parallelism:
  – multiple threads share the same large processor → reduces under-utilization and enables efficient resource allocation: Simultaneous Multi-Threading (SMT)
  – each thread executes on its own mini processor → simple design, low interference between threads: Chip Multi-Processing (CMP), or multi-core

How are Resources Shared?

Each box represents an issue slot for a functional unit.
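The issue-slot comparison in this figure can be sketched as a toy simulation. The per-thread ready-instruction distribution below is an illustrative assumption (not from the slides), and the model deliberately omits cache misses, so it only captures the slot-filling argument:

```python
# Toy model of the issue-slot figure: a 4-wide machine over many cycles.
# Each thread is modeled only by how many ready instructions it has per
# cycle (a made-up uniform distribution, purely for illustration).
import random

WIDTH = 4          # peak throughput: 4 IPC
CYCLES = 10_000
THREADS = 4

random.seed(0)

def ready(thread, cycle):
    # Hypothetical workload: each thread has 0-3 ready instructions per cycle.
    return random.randint(0, 3)

def superscalar():
    # Only thread 0 exists; issue up to WIDTH of its ready instructions.
    return sum(min(WIDTH, ready(0, c)) for c in range(CYCLES))

def fine_grained():
    # One thread per cycle, round-robin; other threads' work is invisible
    # that cycle (this toy model omits cache misses, so FGMT matches
    # superscalar here; its real benefit is miss tolerance).
    return sum(min(WIDTH, ready(c % THREADS, c)) for c in range(CYCLES))

def smt():
    # Any thread may fill any slot: pool all ready instructions each cycle.
    return sum(min(WIDTH, sum(ready(t, c) for t in range(THREADS)))
               for c in range(CYCLES))

for name, fn in [("superscalar", superscalar),
                 ("fine-grained MT", fine_grained),
                 ("SMT", smt)]:
    print(f"{name:16s} IPC = {fn() / CYCLES:.2f}")
```

Under these assumptions SMT's pooled pool of ready instructions fills nearly all four slots, while the single-thread-per-cycle schemes leave slots idle.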
Peak throughput is 4 IPC. [Figure: rows are cycles; each box shows which thread (1-4) fills that issue slot, or Idle, for Superscalar, Fine-Grained Multithreading, and Simultaneous Multithreading.]

• A superscalar processor has high under-utilization – not enough work every cycle, especially when there is a cache miss
• Fine-grained multithreading can only issue instructions from a single thread in a cycle – it cannot find maximal work every cycle, but cache misses can be tolerated
• Simultaneous multithreading can issue instructions from any thread every cycle – it has the highest probability of finding work for every issue slot

What Resources are Shared?

• Multiple threads are simultaneously active (in other words, a new thread can start without a context switch)
• For correctness, each thread needs its own PC, IFQ, and logical registers (and its own mappings from logical to physical registers)
• For performance, each thread could have its own ROB/LSQ (so that a stall in one thread does not stall commit in other threads), I-cache, branch predictor, D-cache, etc. (for low interference), although note that more sharing → better utilization of resources
• Each additional thread costs a PC, IFQ, rename tables, and ROB – cheap!

Pipeline Structure

[Diagram: private per-thread front ends (I-cache, bpred, rename, ROB) feed a shared execution engine (registers, IQ, FUs, D-cache).]

Resource Sharing

Two threads are renamed into a shared physical register file:

  Thread-1 code      Renamed
  R1 ← R1 + R2       P65 ← P1 + P2
  R3 ← R1 + R4       P66 ← P65 + P4
  R5 ← R1 + R3       P67 ← P65 + P66

  Thread-2 code      Renamed
  R2 ← R1 + R2       P76 ← P33 + P34
  R5 ← R1 + R2       P77 ← P33 + P76
  R3 ← R5 + R3       P78 ← P77 + P35

After renaming, instructions from both threads share the issue queue, register file, and FUs.

Performance Implications of SMT

• Single thread performance is likely to go down (caches, branch predictors, registers, etc.
are shared) – this effect can be mitigated by trying to prioritize one thread
• While fetching instructions, thread priority can dramatically influence total throughput – a widely accepted heuristic (ICOUNT): fetch such that each thread has an equal share of processor resources
• With eight threads in a processor with many resources, SMT yields throughput improvements of roughly 2-4

Pentium4 Hyper-Threading

• Two threads – the Linux operating system operates as if it is executing on a two-processor system
• When there is only one available thread, it behaves like a regular single-threaded superscalar processor
• Statically divided resources: ROB, LSQ, issue queue – a slow thread will not cripple throughput (but static partitioning might not scale)
• Dynamically shared: trace cache and decode (fine-grained multi-threaded, round-robin), FUs, data cache, branch predictor

Multi-Programmed
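The ICOUNT heuristic above equalizes resource shares by fetching from the thread with the fewest in-flight (fetched but not yet issued) instructions. A minimal sketch, where the per-thread counts are made-up values:

```python
# Sketch of the ICOUNT fetch policy: each cycle, fetch from the thread
# holding the fewest instructions in the pre-issue pipeline stages
# (decode/rename/issue queue), so no thread hoards front-end resources.

def icount_pick(in_flight):
    """Return the id of the thread with the lowest in-flight count."""
    return min(range(len(in_flight)), key=lambda t: in_flight[t])

# Hypothetical counts: thread 2 has the fewest un-issued instructions,
# so it wins the fetch slot this cycle.
counts = [12, 7, 3, 9]
print(icount_pick(counts))  # → 2
```

Ties here resolve to the lowest thread id; a real front end would break ties (and fetch from more than one thread per cycle) according to its fetch-width policy.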



U of U CS 6810 - Lecture Notes
