Simultaneous Multithreading

Pratyusa Manadhata (pratyus@cs)
Vyas Sekar (vyass@cs)
Carnegie Mellon 15-740, Fall 03

References

- Susan Eggers, Joel Emer, Henry Levy, Jack Lo, Rebecca Stamm, and Dean Tullsen. "Simultaneous Multithreading: A Platform for Next-generation Processors." IEEE Micro, September/October 1997, pages 12-18.
- Jack Lo, Susan Eggers, Joel Emer, Henry Levy, Rebecca Stamm, and Dean Tullsen. "Converting Thread-Level Parallelism Into Instruction-Level Parallelism via Simultaneous Multithreading." ACM Transactions on Computer Systems, August 1997, pages 322-354.
- Dean Tullsen, Susan Eggers, Joel Emer, Henry Levy, Jack Lo, and Rebecca Stamm. "Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor." Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages 191-202.

Motivation

- For significant performance improvement, improving the memory subsystem or increasing system integration is not sufficient
- So increase parallelism in all its available forms
  - Instruction-Level Parallelism (ILP)
  - Thread-Level Parallelism (TLP)

Architectural Alternatives

- Superscalar
- Multithreaded superscalar
- Multiprocessors
- Neither superscalar nor SMP can capture ILP/TLP in its entirety
  - Incapable of adapting to dynamic levels of ILP and TLP

Simultaneous Multithreading

- TLP from either multithreaded parallel programs or from a multiprogramming workload
- ILP from each thread
- Characteristics of SMT processors
  - From superscalar: issue multiple instructions per cycle
  - From multithreaded: h/w state for multiple threads

[Figure: issue-slot usage over time (processor cycles) for Superscalar, Multithreaded, and SMT. Horizontal axis: issue slots. Legend: Unutilized, Thread 1, Thread 2, Thread 3, Thread 4, Thread 5.]

Comparison

- Superscalar
  - Looks at multiple instructions from the same process
  - Both horizontal and vertical waste
- Multithreaded
  - Minimizes vertical waste: tolerates long-latency operations
- SMT
  - Selects instructions from any ready thread

SMT Model

- Minimal extension of a superscalar processor
  - Changes in the IF stage and register files only
- No static partitioning of resources
  - Most of the hardware is still available to a single thread

SMT Model

- Per-thread state for hardware context
  - PC, registers
  - Instruction retirement, trapping, subroutine return
  - Per-thread id in BTB and TLB
  - I-cache port
- Large register file
  - No. of physical registers = 8 * 32 + registers for renaming
  - Longer access time

Pipeline

[Figure: pipeline stages, superscalar vs. SMT.]

Fetch Mechanism

- 2.8 scheme
  - Select 2 threads not incurring an I-cache miss
  - Read 8 instructions from each thread
  - Choose as many as possible from the first thread and the rest from the second, up to 8
- Alternatives: 1.8, 2.4, 4.2

I-Count

- Which thread to fetch from?
  - Threads that have the least number of instructions in the decode, rename, and queue pipeline stages
- Even distribution
- Prevents starvation

Results/Observations

- Superscalars give an IPC of about 1-2
- SMT: significantly higher than the values reported for superscalar
- Longer latency for a single thread
  - Why not a significant performance effect?

Results/Observations

- SMT absorbs additional conflicts
  - Greater ability to hide latency by using multiple issues from multiple threads
- SMP (MP2 and MP4) hindered by static resource partitioning
  - SMT dynamically partitions resources among threads

Results/Observations

- Multithreading can increase cache misses/conflicts
  - More memory requirement
- More stress on branch prediction h/w
- Impact on program performance is not significant
  - SMT h/w + compiler opts can hide latency

Future Directions

- Each processor in an SMP can use SMT
- Next-generation architectures: SMP on chip instead of wider superscalars
- Is the performance gain adequate given the additional resource cost?
  - Processor cycle
  - Design time
  - Cost vs. performance
- Writing optimizing compilers to take advantage of SMT
- OS support for thread scheduling, thread priority, etc.

Q & A

Thank You
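
The Comparison slide separates horizontal from vertical waste of issue slots. As a rough illustration, assuming the usual reading of those terms (vertical waste: every slot of a cycle in which nothing issues; horizontal waste: the unused slots of a cycle in which at least one instruction issues), the two can be counted from a per-cycle issue trace. The function name and the trace below are made up for this sketch and are not taken from the referenced papers.

```python
def issue_slot_waste(issued_per_cycle, issue_width):
    """Count wasted issue slots in a per-cycle trace.

    issued_per_cycle: number of instructions issued in each cycle.
    issue_width: issue slots available per cycle.
    Returns (horizontal_waste, vertical_waste) in slots.
    """
    horizontal = 0
    vertical = 0
    for issued in issued_per_cycle:
        if not 0 <= issued <= issue_width:
            raise ValueError("issued count must be between 0 and issue_width")
        if issued == 0:
            # Completely idle cycle: all slots count as vertical waste.
            vertical += issue_width
        else:
            # Partially filled cycle: leftover slots are horizontal waste.
            horizontal += issue_width - issued
    return horizontal, vertical


# Hypothetical 6-cycle trace on a 4-wide machine.
trace = [3, 0, 1, 4, 0, 2]
horizontal, vertical = issue_slot_waste(trace, issue_width=4)
print(horizontal, vertical)  # 6 8
```

In this trace, 10 of the 24 slots are used. A multithreaded processor attacks the 8 vertically wasted slots by switching threads on idle cycles, while SMT can also fill the 6 horizontally wasted slots with instructions from other ready threads.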
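
The Fetch Mechanism and I-Count slides together describe a thread-selection policy, which can be sketched as follows. This is a minimal model, not the implementation from the referenced papers: ThreadState and its fields (in_flight, icache_miss, fetchable), the function names, and the tie-break on thread id are all assumptions made for illustration. The fetchable field stands in for whatever limits a thread's fetch block in a given cycle.

```python
from dataclasses import dataclass

FETCH_WIDTH = 8  # total instructions fetched per cycle in the 2.8 scheme


@dataclass
class ThreadState:
    """Hypothetical per-thread bookkeeping for this sketch."""
    tid: int
    in_flight: int     # instructions in the decode, rename, and queue stages
    icache_miss: bool  # True while the thread waits on an I-cache miss
    fetchable: int     # instructions the thread can supply this cycle


def icount_select(threads, num_threads=2):
    """Pick up to num_threads threads with the fewest in-flight
    instructions, skipping threads stalled on an I-cache miss."""
    ready = [t for t in threads if not t.icache_miss]
    # Ties are broken by thread id (an assumption of this sketch).
    ready.sort(key=lambda t: (t.in_flight, t.tid))
    return ready[:num_threads]


def fetch_2_8(threads):
    """Return {tid: count}: take as many instructions as possible from
    the first selected thread, then fill the rest from the second."""
    fetched = {}
    slots = FETCH_WIDTH
    for t in icount_select(threads, 2):
        take = min(slots, t.fetchable)
        if take > 0:
            fetched[t.tid] = take
        slots -= take
    return fetched


threads = [
    ThreadState(tid=0, in_flight=12, icache_miss=False, fetchable=8),
    ThreadState(tid=1, in_flight=3, icache_miss=False, fetchable=5),
    ThreadState(tid=2, in_flight=1, icache_miss=True, fetchable=8),
    ThreadState(tid=3, in_flight=7, icache_miss=False, fetchable=8),
]
print(fetch_2_8(threads))  # {1: 5, 3: 3}
```

Thread 2 has the lowest count but is skipped because of its I-cache miss. Thread 1 supplies its 5 available instructions and thread 3 fills the remaining 3 slots. Favoring threads with few in-flight instructions is what gives the even distribution and starvation avoidance noted on the I-Count slide.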