U of U CS 6810 - ILP Innovations and SMT

Lecture 11: ILP Innovations and SMT

• Today: out-of-order example, ILP innovations, SMT (Sections 3.9-3.10 and supplementary notes)
• HW4 due on Tuesday

OOO Example

• Assumptions are the same as in HW 4, except that there are 36 physical registers and 32 logical registers, and the width is 4
• Estimate the issue time, completion time, and commit time for the sample code

Assumptions

• Perfect branch prediction, instruction fetch, and caches
• ADD → dependent has no stall; LD → dependent has one stall
• An instruction is placed in the IQ at the end of its 5th stage; an instruction takes 5 more stages after leaving the IQ (ld/st instructions take 6 more stages after leaving the IQ)

OOO Example: Renaming

Original code        Renamed code
ADD R1, R2, R3       ADD P33, P2, P3
LD  R2, 8(R1)        LD  P34, 8(P33)
ADD R2, R2, 8        ADD P35, P34, 8
ST  R1, (R3)         ST  P33, (P3)
SUB R1, R1, R5       SUB P36, P33, P5
LD  R1, 8(R2)        must wait
ADD R1, R1, R2

With 36 physical and 32 logical registers, only four physical registers (P33-P36) are free at the start, so the sixth instruction must wait in rename until a physical register is released at commit: when ADD P33 (which redefines R1) commits in cycle i+6, R1's old mapping P1 is reclaimed, and the LD can be renamed and enter the IQ in cycle i+7. Likewise, the final ADD receives P2 when LD P34 (which redefines R2) commits in cycle i+8.

OOO Example: Timing

Original code        Renamed code        InQ   Iss    Comp   Comm
ADD R1, R2, R3       ADD P33, P2, P3     i     i+1    i+6    i+6
LD  R2, 8(R1)        LD  P34, 8(P33)     i     i+2    i+8    i+8
ADD R2, R2, 8        ADD P35, P34, 8     i     i+4    i+9    i+9
ST  R1, (R3)         ST  P33, (P3)       i     i+2    i+8    i+9
SUB R1, R1, R5       SUB P36, P33, P5    i+1   i+2    i+7    i+9
LD  R1, 8(R2)        LD  P1, 8(P35)      i+7   i+8    i+14   i+14
ADD R1, R1, R2       ADD P2, P1, P35     i+9   i+10   i+15   i+15

(InQ = cycle the instruction enters the issue queue; Iss = issue; Comp = completion; Comm = commit. Because the machine is only 4 wide, the first four instructions enter the IQ in cycle i and the SUB enters in cycle i+1.)
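
To make the renaming step concrete, here is a minimal Python sketch of a rename table backed by a free list. It reproduces the mappings above, including the stall at the sixth instruction; the function names and data layout are illustrative assumptions, and the commit-time freeing of old mappings (which eventually unblocks the stalled LD) is omitted.

# Logical registers R1..R32 start mapped to P1..P32, so P33..P36 are free,
# matching the example above. Address offsets (e.g., 8(R1)) are dropped.
from collections import deque

rename_table = {f"R{i}": f"P{i}" for i in range(1, 33)}
free_list = deque(f"P{i}" for i in range(33, 37))

def rename(op, dst, srcs):
    # Look up sources in the table (immediates pass through unchanged).
    psrcs = [rename_table.get(s, s) for s in srcs]
    if op == "ST":                      # stores write no register
        return f"{op} {rename_table[dst]}, {', '.join(psrcs)}"
    if not free_list:                   # no free physical register: stall
        raise RuntimeError(f"{op} {dst} must wait for a free physical register")
    rename_table[dst] = free_list.popleft()
    return f"{op} {rename_table[dst]}, {', '.join(psrcs)}"

program = [("ADD", "R1", ["R2", "R3"]),
           ("LD",  "R2", ["R1"]),       # LD R2, 8(R1)
           ("ADD", "R2", ["R2", "8"]),
           ("ST",  "R1", ["R3"]),       # ST R1, (R3)
           ("SUB", "R1", ["R1", "R5"]),
           ("LD",  "R1", ["R2"])]       # P33..P36 all taken -> must wait
for instr in program:
    try:
        print(rename(*instr))
    except RuntimeError as err:
        print(err)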
Reducing Stalls in Rename/Regfile

• Larger ROB/register file/issue queue
• Runahead: while a long-latency instruction waits, let the thread run ahead to prefetch (this thread can deallocate resources more aggressively than a processor supporting precise execution)
• Two-level register files: values being kept in the register file only for precise exceptions can be moved to a second level

Stalls in Issue Queue

• Two-level issue queues: the second level holds instructions that are unlikely to be woken up in the near future
• Value prediction: tries to circumvent RAW hazards (a sketch appears at the end of these notes)
• Memory dependence prediction: allows a load to execute even if there are prior stores with unresolved addresses (also sketched at the end of these notes)
• Load hit prediction: dependent instructions are scheduled early, assuming that the load will hit in the cache

Functional Units

• Clustering: allows quick bypass among a small group of functional units; FUs can also be associated with a subset of the register file and issue queue

Thread-Level Parallelism

• Motivation:
  - a single thread leaves a processor under-utilized most of the time
  - doubling the processor area barely improves single-thread performance
• Strategies for thread-level parallelism:
  - multiple threads share the same large processor → Simultaneous Multi-Threading (SMT): reduces under-utilization, efficient resource allocation
  - each thread executes on its own mini processor → Chip Multi-Processing (CMP), or multi-core: simple design, low interference between threads

How are Resources Shared?

(Figure: each box is an issue slot for a functional unit, one row per cycle; peak throughput is 4 IPC. Three columns – Superscalar, Fine-Grained Multithreading, Simultaneous Multithreading – show slots filled by Thread 1-4 or left idle.)

• A superscalar processor has high under-utilization – there is not enough work every cycle, especially when there is a cache miss
• Fine-grained multithreading can issue instructions from only a single thread in a given cycle – it cannot fill every issue slot, but cache misses can be tolerated
• Simultaneous multithreading can issue instructions from any thread every cycle – it has the highest probability of finding work for every issue slot

What Resources are Shared?

• Multiple threads are simultaneously active (in other words, a new thread can start without a context switch)
• For correctness, each thread needs its own PC, IFQ, and logical registers (and its own mappings from logical to physical registers)
• For performance, each thread could have its own ROB/LSQ (so that a stall in one thread does not stall commit in the others), I-cache, branch predictor, D-cache, etc. (for low interference) – although note that more sharing → better utilization of resources
• Each additional thread costs only a PC, an IFQ, rename tables, and a ROB – cheap!

Pipeline Structure

(Figure: per-thread front ends – each with an I-cache, branch predictor, rename logic, and ROB – feed an execution engine containing the registers, issue queue, FUs, and D-cache; the front end may be private per thread or shared, while the execution engine is shared.)

Resource Sharing

Two threads are fetched and renamed by separate front-end structures, then share the issue queue, register file, and functional units:

Thread 1             Thread 2
R1 ← R1 + R2         R2 ← R1 + R2
R3 ← R1 + R4         R5 ← R1 + R2
R5 ← R1 + R3         R3 ← R5 + R3

After renaming (each thread has its own logical-to-physical mappings):

Thread 1             Thread 2
P65 ← P1 + P2        P76 ← P33 + P34
P66 ← P65 + P4       P77 ← P33 + P76
P67 ← P65 + P66      P78 ← P77 + P35

Performance Implications of SMT

• Single-thread performance is likely to go down (caches, branch predictors, registers, etc. are shared) – this effect can be mitigated by prioritizing one thread
• While fetching instructions, thread priority can dramatically influence total throughput – a widely accepted heuristic (ICOUNT) is to fetch such that each thread has an equal share of processor resources (see the sketch below)
• With eight threads in a processor with many resources, SMT yields throughput improvements of roughly 2-4x

Pentium4 Hyper-Threading

• Two threads – the Linux operating system operates as if it is executing on a two-processor system
• When there is only one available thread, it behaves like a regular single-threaded superscalar processor
• Statically divided resources: ROB, LSQ, issue queue – a slow thread will not cripple throughput (though this might not scale)
• Dynamically shared: trace cache and decode (fine-grained multi-threaded, round-robin), ...
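
The ICOUNT heuristic mentioned under "Performance Implications of SMT" can be sketched in a few lines of Python: each cycle, steer fetch to the thread with the fewest instructions currently in the front end and issue queue. The counter array and function name below are illustrative assumptions, not a specific hardware interface.

def icount_pick(in_flight):
    # in_flight[t] = instructions thread t currently has in decode/rename/IQ.
    # Fetching from the least-represented thread keeps issue-queue shares
    # roughly equal and favors threads that are draining quickly.
    return min(range(len(in_flight)), key=lambda t: in_flight[t])

# Example: thread 2 holds the fewest in-flight instructions, so it fetches next.
print(icount_pick([12, 7, 3, 9]))    # -> 2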


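As a concrete illustration of the value prediction idea from the "Stalls in Issue Queue" slide, here is a sketch of the simplest variant, a last-value predictor: predict that an instruction will produce the same result it produced last time, let dependents issue speculatively, and replay them on a misprediction. The class and method names are illustrative assumptions, not a specific published design.

class LastValuePredictor:
    # PC-indexed table that predicts an instruction's result so dependents
    # can issue without waiting out the RAW hazard.
    def __init__(self):
        self.table = {}                  # PC -> last observed result

    def predict(self, pc):
        # Returns (hit, value); on a hit, dependents may issue speculatively.
        return (pc in self.table), self.table.get(pc)

    def update(self, pc, actual):
        # Called when the real result arrives; True means the prediction was
        # wrong and speculatively issued dependents must be replayed.
        mispredicted = pc in self.table and self.table[pc] != actual
        self.table[pc] = actual
        return mispredicted

vp = LastValuePredictor()
vp.update(0x400, 42)           # first execution: learn the value
print(vp.predict(0x400))       # -> (True, 42): dependents need not wait
print(vp.update(0x400, 42))    # value repeated -> False, no replay needed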
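
Memory dependence prediction, from the same slide, can be sketched in the same style. This minimal scheme lets a load issue past stores with unresolved addresses unless that load has previously been caught aliasing one – loosely in the spirit of store-set predictors, though the structure below is an illustrative assumption rather than a real design.

class MemDepPredictor:
    # Tracks load PCs that have previously aliased an older, unresolved store.
    def __init__(self):
        self.conflicting = set()

    def may_issue_early(self, load_pc):
        # True: let the load bypass stores whose addresses are still unknown.
        return load_pc not in self.conflicting

    def report_violation(self, load_pc):
        # An early load turned out to alias an older store; the pipeline
        # squashes and replays, and we become conservative for this PC.
        self.conflicting.add(load_pc)

mdp = MemDepPredictor()
print(mdp.may_issue_early(0x500))   # -> True: speculate past unresolved stores
mdp.report_violation(0x500)         # the speculation was wrong once
print(mdp.may_issue_early(0x500))   # -> False: now wait for store addresses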