CPE 631: Multithreading: Thread-Level Parallelism Within a Processor
Electrical and Computer Engineering, University of Alabama in Huntsville
Aleksandar Milenkovic, milenka@ece.uah.edu
http://www.ece.uah.edu/~milenka

Outline
- Trends in microarchitecture
- Exploiting thread-level parallelism
- Exploiting TLP within a processor
  - Resource sharing
  - Performance implications
  - Design challenges
  - Intel's HT technology

AM, LaCASA

Trends in microarchitecture
- Higher clock speeds
  - To achieve a high clock frequency, make the pipeline deeper (superpipelining)
  - Events that disrupt the pipeline (branch mispredictions, cache misses, etc.) become very expensive in terms of lost clock cycles
- ILP (Instruction-Level Parallelism)
  - Extract parallelism in a single program
  - Superscalar processors have multiple execution units working in parallel
  - Challenge: finding enough instructions that can be executed concurrently
  - Out-of-order execution: instructions are sent to execution units based on instruction dependencies rather than program order

Trends in microarchitecture
- Cache hierarchies
  - Processor-memory speed gap
  - Use caches to reduce memory latency
  - Multiple levels of caches: smaller and faster closer to the processor core
- Thread-level parallelism
  - Multiple programs execute concurrently
  - Web servers have an abundance of software threads
  - Users: surfing the web, listening to music, encoding/decoding video streams, etc.

Exploiting thread-level parallelism
- CMP (Chip Multiprocessing)
  - Multiple processors, each with a full set of architectural resources, reside on the same die
  - Processors may share an on-chip cache, or each can have its own cache
  - Examples: HP Mako, IBM Power4
  - Challenges: power, die area (cost)
- Time-slice multithreading
  - The processor switches between software threads after a predefined time slice
  - Can minimize the effects of long-lasting events
  - Still, some execution slots are wasted

Multithreading Within a Processor
- Until now, we have executed multiple threads of an application on different processors. Can multiple threads execute concurrently on the same processor?
- Why is this desirable?
  - Inexpensive: one CPU, no external interconnects
  - No remote or coherence misses (but more capacity misses)
- Why does this make sense?
  - Most processors can't find enough work: peak IPC is 6, average IPC is 1.5
  - Threads can share resources: we can increase the number of threads without a corresponding linear increase in area

What Resources are Shared?
- Multiple threads are simultaneously active (in other words, a new thread can start without a context switch)
- For correctness, each thread needs its own PC and its own logical registers (and its own mapping from logical to physical registers)
- For performance, each thread could have its own ROB (so that a stall in one thread does not stall commit in other threads), I-cache, branch predictor, D-cache, etc. (for low interference), although note that more sharing means better utilization of resources
- Each additional thread costs a PC, a rename table, and a ROB: cheap!

Approaches to Multithreading Within a Processor
- Fine-grained multithreading: switches threads on every clock cycle
  - Pro: hides the latency of both short and long stalls
  - Con: slows down execution of individual threads that are ready to go
- Coarse-grained multithreading: switches threads only on costly stalls (e.g., L2 stalls)
  - Pros: no switching each clock cycle, no slowdown for ready-to-go threads
  - Con: limited ability to hide shorter stalls
- Simultaneous multithreading: exploits TLP at the same time it exploits ILP

How Resources are Shared
[Figure: issue-slot usage over cycles for four designs: Superscalar, Coarse-grained Multithreading, Fine-grained Multithreading, Simultaneous Multithreading. Each box represents an issue slot for a functional unit. Peak throughput is 4 IPC. Legend: Thread 1, Thread 2, Thread 3, Thread 4, Idle.]
- A superscalar processor has high under-utilization: there is not enough work every cycle, especially when there is a cache miss
- Fine-grained multithreading can only issue instructions from a single thread in a cycle: it cannot find the maximum work every cycle, but cache misses can be tolerated
- Simultaneous multithreading can issue instructions from any thread every cycle: it has the highest probability of finding work for every issue slot

Resource Sharing
[Figure: each thread has its own Instr Fetch and Instr Rename stages; the renamed instructions of both threads then share one Issue Queue, one Register File, and four functional units (FU).]

  Thread 1 (before rename)      Thread 1 (after rename)
    R1 <- R1 + R2                 P73 <- P1 + P2
    R3 <- R1 + R4                 P74 <- P73 + P4
    R5 <- R1 + R3                 P75 <- P73 + P74

  Thread 2 (before rename)      Thread 2 (after rename)
    R2 <- R1 + R2                 P76 <- P33 + P34
    R5 <- R1 + R2                 P77 <- P33 + P76
    R3 <- R5 + R3                 P78 <- P77 + P35

  Issue Queue (shared)
    P73 <- P1 + P2
    P74 <- P73 + P4
    P75 <- P73 + P74
    P76 <- P33 + P34
    P77 <- P33 + P76
    P78 <- P77 + P35

Performance Implications of SMT
- Single-thread performance is likely to go down (caches, branch predictors, registers, etc. are shared); this effect can be mitigated by prioritizing one thread
- While fetching instructions, thread priority can dramatically influence total throughput; a widely accepted heuristic (ICOUNT): fetch such that each thread has an equal share of processor resources
- With eight threads in a processor with many resources, SMT yields throughput improvements of roughly 2-4x
- Alpha 21464 and Intel Pentium 4 are examples of SMT

Design Challenges
- How many threads?
  - Many, to find enough parallelism
  - However, mixing many threads will compromise execution of individual threads
- Processor front-end (instruction fetch)
  - Fetch as far as possible in a single thread (to maximize thread performance)
  - However, this limits the number of instructions available for scheduling from other threads
- Larger register files (multiple contexts)
- Minimize clock cycle time
- Cache conflicts

Pentium 4: Hyperthreading architecture
- One physical processor appears as multiple logical processors
- The HT implementation on the NetBurst microarchitecture has 2 logical processors
[Figure: two Architectural State blocks on top of one block of shared Processor execution resources.]
- Architectural state: general-purpose registers, control registers, APIC (advanced programmable interrupt controller)

Pentium 4: Hyperthreading architecture
- Main processor resources are shared: caches, branch predictors, execution units, buses, control logic
- Duplicated resources
  - Register alias tables (map the architectural registers to physical rename registers)
  - Next instruction pointer and associated control logic
  - Return stack pointer
  - Instruction streaming buffer and trace cache fill buffers

Pentium 4: Die Size and Complexity
[Figure: Pentium 4 die size and complexity.]

Pentium 4: Resources sharing schemes
- Partition: dedicate equal resources to each logical processor
  - Good when utilization is expected to be high and somewhat unpredictable
- Threshold: flexible resource sharing with a limit on maximum resource usage
  - Good
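The renaming example on the Resource Sharing slide can be reproduced with a short sketch. It assumes each thread keeps its own logical-to-physical map while destinations come from one shared pool of physical registers; the class name Renamer, the simple counter used as a free list, and the tuple encoding of instructions are illustrative choices, not part of the slides. The initial mappings (R1 to P1, R2 to P2, and so on) are read off the slide.

```python
class Renamer:
    """Per-thread rename tables that draw destinations from one shared pool."""

    def __init__(self, first_free):
        # Assumption: free physical registers are handed out by a simple counter.
        self.next_free = first_free
        self.tables = {}  # thread id -> {logical register: physical register}

    def add_thread(self, tid, initial_map):
        self.tables[tid] = dict(initial_map)

    def rename(self, tid, dest, src1, src2):
        table = self.tables[tid]
        # Read sources through this thread's own map before updating the destination.
        p1, p2 = table[src1], table[src2]
        pd = f"P{self.next_free}"
        self.next_free += 1
        table[dest] = pd
        return (pd, p1, p2)


# Instructions are (dest, src1, src2) tuples, taken from the slide.
renamer = Renamer(first_free=73)
renamer.add_thread(1, {"R1": "P1", "R2": "P2", "R4": "P4"})
renamer.add_thread(2, {"R1": "P33", "R2": "P34", "R3": "P35"})

t1 = [renamer.rename(1, *ins)
      for ins in [("R1", "R1", "R2"), ("R3", "R1", "R4"), ("R5", "R1", "R3")]]
t2 = [renamer.rename(2, *ins)
      for ins in [("R2", "R1", "R2"), ("R5", "R1", "R2"), ("R3", "R5", "R3")]]

# Both threads' renamed instructions can now sit in one shared issue queue,
# because no two in-flight instructions write the same physical register.
issue_queue = t1 + t2
```

Because the two threads never share a physical destination, the issue queue and functional units do not need to know which thread an instruction came from; only the rename tables are per-thread.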
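The ICOUNT heuristic from the Performance Implications slide can be sketched as follows. ICOUNT is usually described as fetching from the thread with the fewest instructions waiting in the front end and issue queues; everything else here is an assumption made for illustration: the function names, the one-thread-per-cycle fetch, the fixed per-thread drain rates, and the tie-break on the lower thread id.

```python
def icount_pick(in_flight):
    """Return the thread with the fewest in-flight instructions (ties: lower id)."""
    return min(in_flight, key=lambda tid: (in_flight[tid], tid))


def simulate(cycles, fetch_width, drain_rate, in_flight):
    """Toy model: each cycle one thread fetches, then every thread drains."""
    counts = dict(in_flight)
    fetched = {tid: 0 for tid in counts}
    for _ in range(cycles):
        tid = icount_pick(counts)
        counts[tid] += fetch_width
        fetched[tid] += fetch_width
        # Assumption: a thread retires a fixed number of instructions per cycle.
        for t in counts:
            counts[t] = max(0, counts[t] - drain_rate[t])
    return fetched, counts


# Thread 0 drains quickly; thread 1 is slow (for example, waiting on cache misses).
fetched, counts = simulate(cycles=8, fetch_width=4,
                           drain_rate={0: 3, 1: 1},
                           in_flight={0: 0, 1: 0})
```

In this toy run the fast-draining thread receives most of the fetch bandwidth, while the slow thread is kept from filling the shared queues with instructions that cannot make progress.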
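The difference between the Partition and Threshold sharing schemes can be illustrated with a small allocation check. This is a sketch under stated assumptions rather than a description of the Pentium 4 hardware: the function may_allocate, the buffer size of 8 entries, and the threshold of 6 are all hypothetical.

```python
def may_allocate(policy, used, tid, total_entries, threshold=None):
    """Decide whether thread `tid` may take one more entry of a shared buffer."""
    if sum(used.values()) >= total_entries:
        return False  # the buffer is full, whatever the policy
    if policy == "partition":
        # Each logical processor gets an equal, fixed share.
        cap = total_entries // len(used)
    elif policy == "threshold":
        # Entries are shared flexibly, up to a per-thread maximum.
        if threshold is None:
            raise ValueError("threshold policy needs a threshold")
        cap = threshold
    else:
        raise ValueError(f"unknown policy: {policy}")
    return used[tid] < cap


# Thread 0 already holds 4 of 8 entries; thread 1 is idle.
used = {0: 4, 1: 0}
partition_ok = may_allocate("partition", used, 0, total_entries=8)
threshold_ok = may_allocate("threshold", used, 0, total_entries=8, threshold=6)
```

With partitioning, thread 0 is stopped at its fixed half even though thread 1 is idle; with a threshold it may keep growing until it reaches the per-thread limit.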