ILP Ends, TLP Begins
CS 6810, School of Computing, University of Utah

Today's topics
- Explore a perfect machine (unlimited budget) to see where ILP goes; the answer: not far enough
- Look to TLP (multi-threading) for help; everything has its issues, and we'll look at some of them
- Apology: a bit more data than usual, so try not to yawn LOUDLY

ILP Limits via an Oracle
- Suspend reality and think of a perfect machine
  - infinite number of rename registers, so no WAR/WAW hazards; for a window size of n, that means n^2 - n comparisons for each register field
  - perfect branch/jump prediction
  - unbounded buffer of instructions available for execution
  - perfect address alias analysis: independent loads can be moved ahead of stores
  - perfect L1's that hit in 1 cycle
  - as many XUs (execution units) as will ever be needed, so no structural stalls
- Infinite cost is unrealistic, so simulate rather than build
  - allows exploration: just how far can we get with ILP on a perfect machine and sequential code?

IBM Power5
- Most advanced superscalar processor to date
  - 4 fetch, 6 issue
  - 88 integer rename regs, 88 float rename regs
  - pipeline has over 200 instructions in flight, including 32 loads and 32 stores
- Not quite the Oracle, but on the way
  - consumes a lot of power; target is the blade-server segment

ILP w/ Infinite Window Size
- Looks great when compared w/ today's IPC of about 3, if you can ignore the infinite cost

Limit Window Size
- ILP shrinks rapidly
  - 2K window: roughly 4M 5-bit compares
  - 512: 785K compares
  - 128: 49K compares
  - 32: 3K compares
- Plus, the compares happen every cycle
- Conclusion: 32 is doable but watty (power hungry); can improve by maybe 3x at great cost
  - remember this is still a perfect machine, just w/ a limited window

Switch to Half-Infinity
- You do the math; for the remaining data assume
  - 2K window size: 12M 5-bit compares every clock, 10x bigger than anything that's been built
  - 64 issue: 10x more than anything real
- Why choose this?
  - given the other restrictions it won't be a limit
  - can you say "easier to simulate"? I knew you could

Look at Semi-Real Branch Prediction
- Tournament predictor with 8K entries plus a jump predictor with 2K entries: 48K bits and a 3% mispredict rate, which is very good, just expensive
- Standard 512-entry 2-bit predictor
- Conclusion: you have to predict
  - integer codes are a problem, yet highly important in modern data-center apps
  - nobody makes much money on floating point; sad reality

Limiting Rename Registers
- Integer codes remain problematic
- The problem with FP remains, but ILP looks good if you don't care about ...
- Conclusion: need around 64 renamed registers to make much of a difference

Alias Analysis Influence
- GL/STK model: heap refs conflict, but nothing else does
- Inspection: what can the compiler do?
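The slide's closing question (what can the compiler do?) is easiest to see in code. Below is a small, hypothetical C sketch (not from the slides; the function and variable names are invented) of the decision alias analysis has to make before a load can be hoisted above an earlier store.

    /* Hypothetical sketch of the alias question (names invented for illustration).
     * Before a load can be moved ahead of an earlier store, the hardware or the
     * compiler has to know that the two references cannot touch the same address. */

    int g;                      /* global (GL) reference */

    int f(int *p, int x)
    {
        int local = 10;         /* stack (STK) reference */

        *p = x;                 /* store through a pointer with an unknown target */

        /* local never has its address taken, so it provably cannot alias *p;
         * this load is safe to hoist above the store. */
        int a = local;

        /* g might be exactly what p points to; without stronger analysis
         * (or the oracle's perfect disambiguation) this load must wait. */
        int b = g;

        return a + b;
    }

Perfect disambiguation, as assumed by the oracle machine, would hoist both loads whenever they are in fact independent; models like GL/STK or compiler inspection recover only part of that freedom.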
Ambitious but Possible
- "HAL 1": better than IBM in all letters
- 64 issue, no restrictions
  - this one is actually ridiculous, but it does focus on ILP limits rather than structural stalls
- 1K-entry tournament predictor: this has been done
- Perfect disambiguation
  - note: close to possible for small window sizes, impractical for large windows
- 64-register rename pool: 100 have been done
- What do we get? Al the Harpy says: a really good heater

And Voilà
- A 3-4x improvement, for a machine nobody will buy: too hot, too costly
- Interesting study, clear conclusion
  - ILP is already past the point of diminishing returns
  - the programmer is going to need to help out w/ exposing parallelism
  - we need a different type of HW support for parallelism

Enter TLP
- Again, not a new idea; it has been around for 10+ years
  - Tullsen (UW, 1995) publishes the SMT idea
  - TERA MTA and IBM Pulsar show up in the late 90's, both MT
- Thread vs. process confusion
  - a process runs in its own virtual memory space: no shared memory, lots of OS protection overhead, and communication via message-like channels, e.g. pipes in Unix
  - threads share memory, and therefore synchronization is needed (a short pthreads sketch at the end of these notes makes this concrete)
  - both are independent entities with their own sets of registers and process state
- The TLP difference: multiple threads can run concurrently or interleaved on the same processor, whereas processes run one at a time with a context switch in between

Multi-Threading
- 2 variants
  - fine-grained MT, e.g. TERA
    - round-robin walk through the threads: next cycle, next thread
    - TERA: 128 threads, with a built-in 128-cycle load-use delay
    - the basic idea was to cover main-memory latency and do away w/ caches
    - great if you can put every app into a 128-thread mold; it failed, and Burton goes to the dark side, a.k.a. Microsoft
  - coarse-grained MT, e.g. IBM Pulsar, sometimes called "switch on miss"
    - basic idea: anytime something bad happens (a TLB or L2 miss), switch to the next runnable thread
    - some sort of fairness policy is required; usually just round robin suffices
    - similar goal: hide the performance effect of long stalls

Simultaneous Multithreading (SMT)
- Idea: multiple independent threads increase the number of parallel instructions to issue
- [Figure: issue slots over cycles for Threads 1-4 plus idle slots, comparing a superscalar, fine-grained multithreading, and SMT]
- superscalar: not enough ILP, and idle on a cache miss
- fine-grained MT: not enough ILP in any one thread
- SMT improves on both since it draws from a broader set of independent instructions
  - programmer-supplied parallelism
  - takes advantage of dynamic-issue superscalar tactics

SMT Resource Perspective
- Each thread has its own
  - PC and next-PC (the next-PC is needed for exceptions)
  - private logical registers and their mapping to renamed physical registers
  - ROB: if shared, a stall in one thread will stall the others
- Shared
  - branch predictor: a larger size will be needed
  - main memory, ports, TLB, and page table: an artifact of shared memory
  - more threads does increase memory pressure; the biggest problem is single-ported L1's

SMT Pipeline Structure
- [Figure: SMT pipeline options, contrasting a shared front end with private per-thread front ends (PC, I-cache, branch predictor, rename, ROB), both feeding a shared execution engine (registers, issue queue, FUs, D-cache)]
- What about the RAS and the LSQ?

SMT Issues
- Single-thread performance goes down: competition w/ the other threads for resources
- Resource utilization goes up, hence throughput goes up
- Fetch: who (which thread) has priority?
  - unless set by the user, the dynamic critical path can ...
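To make the thread-vs-process distinction from the "Enter TLP" slide concrete, here is a minimal pthreads sketch (hypothetical code, not from the slides). Both threads live in one address space, so both can reach counter directly, and that is exactly why the mutex is needed; two processes would each see a private copy and would have to communicate through a pipe or a similar channel instead.

    /* Minimal pthreads sketch (hypothetical, not from the slides): two threads
     * share the process's address space, so both see counter; the mutex is
     * the synchronization that shared memory makes necessary. */
    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;                       /* shared: one copy for all threads */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000000; i++) {
            pthread_mutex_lock(&lock);             /* without this, the updates race */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);   /* each thread: own registers, PC, stack */
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);        /* 2000000 only because of the mutex */
        return 0;
    }

Each thread still carries its own registers, PC, and stack, which is what lets an SMT core interleave or co-issue instructions from both of them.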

