Carnegie Mellon, 15740 Fall 03

Simultaneous Multithreading
Pratyusa Manadhata (pratyus@cs)
Vyas Sekar (vyass@cs)

References
- Susan Eggers, Joel Emer, Henry Levy, Jack Lo, Rebecca Stamm, and Dean Tullsen. "Simultaneous Multithreading: A Platform for Next-Generation Processors." IEEE Micro, September/October 1997, pages 12-18.
- Jack Lo, Susan Eggers, Joel Emer, Henry Levy, Rebecca Stamm, and Dean Tullsen. "Converting Thread-Level Parallelism into Instruction-Level Parallelism via Simultaneous Multithreading." ACM Transactions on Computer Systems, August 1997, pages 322-354.
- Dean Tullsen, Susan Eggers, Joel Emer, Henry Levy, Jack Lo, and Rebecca Stamm. "Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor." Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages 191-202.

Motivation
- For significant performance improvement, improving the memory subsystem or increasing system integration is not sufficient.
- So, increase parallelism in all its available forms:
  - Instruction-Level Parallelism (ILP)
  - Thread-Level Parallelism (TLP)

Architectural Alternatives
- Superscalar
- Multithreaded superscalar
- Multiprocessors
- Neither superscalars nor SMPs can capture ILP and TLP in their entirety; both are incapable of adapting to dynamic levels of ILP and TLP.

Simultaneous Multithreading
- TLP from either multithreaded parallel programs or a multiprogramming workload
- ILP from each thread
- Characteristics of SMT processors:
  - from superscalar: issue multiple instructions per cycle
  - from multithreaded: hardware state for multiple threads

[Figure: issue slots over time (processor cycles) for superscalar, multithreaded, and SMT processors; each slot is either unutilized or filled by one of Threads 1-5.]

Comparison
- Superscalar: looks at multiple instructions from the same process; suffers both horizontal and vertical waste.
- Multithreaded: minimizes vertical waste; tolerates long-latency operations.
- SMT: selects instructions from any "ready" thread.

SMT Model
- Minimal extension of a superscalar processor
- Changes in the instruction fetch stage and register files only
- No static partitioning of resources
- Most of the hardware is still available to a single thread.

SMT Model (continued)
- Per-thread state for each hardware context (PC, registers)
- Per-thread instruction retirement, trapping, and subroutine return
- Per-thread ID in the BTB and TLB
- An I-cache port
- Large register file:
  - number of physical registers = 8 * 32 + registers for renaming
  - longer access time

[Figure: pipeline diagrams for the superscalar and SMT processors.]

Fetch Mechanism (2.8 scheme)
- Select 2 threads not incurring an I-cache miss; read 8 instructions from each thread.
- Choose as many as possible from the first thread and the rest from the second, up to 8 total.
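The 2.8 fetch scheme above can be sketched as a small simulation. This is a minimal illustration, not hardware: the thread/instruction representation (dicts with an `icache_miss` flag and an instruction list) is invented here, and a real front end would operate on cache lines and PCs.

```python
def fetch_2_8(threads, fetch_width=8):
    """Sketch of the 2.8 fetch scheme: each cycle, pick two threads that
    are not stalled on an I-cache miss, read up to 8 instructions from
    each, then fill the shared 8-slot fetch bandwidth taking as many as
    possible from the first thread and the rest from the second.

    threads: list of dicts with 'icache_miss' (bool) and
    'instructions' (list of fetchable instructions)."""
    # Select the first two threads not incurring an I-cache miss.
    candidates = [t for t in threads if not t["icache_miss"]][:2]

    fetched = []
    for t in candidates:
        # Read at most 8 instructions from this thread...
        window = t["instructions"][:8]
        # ...but only take what still fits in the shared fetch bandwidth.
        take = min(len(window), fetch_width - len(fetched))
        fetched.extend(window[:take])
        if len(fetched) == fetch_width:
            break
    return fetched
```

For example, if the first selected thread can supply only 6 instructions, the remaining 2 fetch slots are filled from the second selected thread, which is how the 2.8 scheme recovers bandwidth that a single-thread fetcher would waste.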
- Alternative partitionings: 1.8, 2.4, 4.2 (number of threads . instructions fetched per thread).

I Count (ICOUNT)
- Which thread to fetch from? The threads that have the fewest instructions in the decode, rename, and queue pipeline stages.
- Gives an even distribution of instructions among threads and prevents starvation.

Results/Observations
- Superscalars give an IPC of only about 1-2.
- SMT: IPC significantly higher than the values reported for superscalars.
- Longer latency for a single thread? Why? Not a significant performance effect.

Results/Observations (continued)
- SMT absorbs additional conflicts: greater ability to hide latency by using multiple issues from multiple threads.
- SMP: the MP2 and MP4 configurations are hindered by static resource partitioning.
- SMT dynamically partitions resources among threads.

Results/Observations (continued)
- Multithreading can increase cache misses/conflicts, memory requirements, and stress on the branch prediction hardware.
- The impact on program performance is not significant: SMT plus hardware and compiler optimizations can hide the latency.

Future Directions
- Each processor in an SMP can use SMT.
- Next-generation architectures: SMP on chip instead of wider superscalars.
- Is the performance gain adequate for the additional resource cost?
- Processor cycle time and design time: cost vs. performance.
- Writing optimizing compilers to take advantage of SMT.
- OS support for thread scheduling, thread priority, etc.

Q & A?

Thank You
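As a closing illustration, the ICOUNT heuristic from the "I Count" slide can be sketched in a few lines. This is an assumption-laden simplification: a real implementation tracks per-thread occupancy counters in hardware, while here the counts are simply passed in as a list.

```python
def icount_priority(inflight_counts):
    """Sketch of the ICOUNT fetch policy: rank threads by the number of
    instructions they currently have in the decode, rename, and queue
    pipeline stages, fewest first. Fetching from the front of this list
    evens out front-end occupancy and prevents starvation.

    inflight_counts: inflight_counts[t] is thread t's count of
    instructions in the decode/rename/queue stages.
    Returns thread ids, highest fetch priority first."""
    # sorted() is stable, so threads with equal counts keep their
    # original (lower-id-first) order.
    return sorted(range(len(inflight_counts)), key=lambda t: inflight_counts[t])
```

For example, with in-flight counts of 12, 3, and 7 for threads 0-2, thread 1 is fetched from first: it has the fewest instructions clogging the front end, so giving it fetch slots is least likely to fill the queues with instructions from a stalled thread.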