
IBM POWER5 CHIP: A DUAL-CORE MULTITHREADED PROCESSOR

FEATURING SINGLE- AND MULTITHREADED EXECUTION, THE POWER5 PROVIDES HIGHER PERFORMANCE IN THE SINGLE-THREADED MODE THAN ITS POWER4 PREDECESSOR AT EQUIVALENT FREQUENCIES. ENHANCEMENTS INCLUDE DYNAMIC RESOURCE BALANCING TO EFFICIENTLY ALLOCATE SYSTEM RESOURCES TO EACH THREAD, SOFTWARE-CONTROLLED THREAD PRIORITIZATION, AND DYNAMIC POWER MANAGEMENT TO REDUCE POWER CONSUMPTION WITHOUT AFFECTING PERFORMANCE.

Ron Kalla, Balaram Sinharoy, Joel M. Tendler
IBM

IBM introduced Power4-based systems in 2001 [1]. The Power4 design integrates two processor cores on a single chip, a shared second-level cache, a directory for an off-chip third-level cache, and the necessary circuitry to connect it to other Power4 chips to form a system. The dual-processor chip provides natural thread-level parallelism at the chip level. Additionally, the Power4's out-of-order execution design lets the hardware bypass instructions whose operands are not yet available (perhaps because of an earlier cache miss during register loading) and execute other instructions whose operands are ready. Later, when the operands become available, the hardware can execute the skipped instruction. Coupled with a superscalar design, out-of-order execution results in higher instruction execution parallelism than otherwise possible.

The Power5 is the next-generation chip in this line. One of our key goals in designing the Power5 was to maintain both binary and structural compatibility with existing Power4 systems, to ensure that binaries continue executing properly and all application optimizations carry forward to newer systems. With that base requirement, we specified increased performance and other functional enhancements of server virtualization, reliability, availability, and serviceability at both chip and system levels. In this article, we describe the approach we used to improve chip-level performance.

Published by the IEEE Computer Society. 0272-1732/04/$20.00 © 2004 IEEE

Multithreading
Conventional processors execute instructions from a single instruction stream. Despite microarchitectural advances, execution unit utilization remains low in today's microprocessors. It is not unusual to see average execution unit utilization rates of approximately 25 percent across a broad spectrum of environments. To increase execution unit utilization, designers use thread-level parallelism, in which the physical processor core executes instructions from more than one instruction stream. To the operating system, the physical processor core appears as if it is a symmetric multiprocessor containing two logical processors. There are at least three different methods for handling multiple threads.

In coarse-grained multithreading, only one thread executes at any instant. When a thread encounters a long-latency event, such as a cache miss, the hardware swaps in a second thread to use the machine's resources, rather than letting the machine remain idle. By allowing other work to use what otherwise would be idle cycles, this scheme increases overall system throughput. To conserve resources, both threads share many system resources, such as architectural registers. Hence, swapping program control from one thread to another requires several cycles. IBM implemented coarse-grained multithreading in the IBM eServer pSeries Model 680 [2].

A variant of coarse-grained multithreading is fine-grained multithreading. Machines of this class execute threads in successive cycles, in round-robin fashion [3]. Accommodating this design requires duplicate hardware facilities. When a thread encounters a long-latency event, its cycles remain unused.

Finally, in simultaneous multithreading (SMT), as in other multithreaded implementations, the processor fetches instructions from more than one thread [4]. What differentiates this implementation is its ability to schedule instructions for execution from all threads concurrently. With SMT, the system dynamically adjusts to the environment, allowing instructions to execute from each
thread if possible and allowing instructions from one thread to utilize all the execution units if the other thread encounters a long-latency event.

The Power5 design implements two-way SMT on each of the chip's two processor cores. Although a higher level of multithreading is possible, our simulations showed that the added complexity was unjustified. As designers add simultaneous threads to a single physical processor, the marginal performance benefit decreases. In fact, additional multithreading might decrease performance because of cache thrashing, as data from one thread displaces data needed by another thread.

Power5 system structure

Figure 1 shows the high-level structures of Power4- and Power5-based systems. The Power4 handles up to a 32-way symmetric multiprocessor. Going beyond 32 processors increases interprocessor communication, resulting in high traffic on the interconnection fabric. This can cause greater contention and negatively affect system scalability.

Figure 1. Power4 (a) and Power5 (b) system structures.

Moving the level-three (L3) cache from the memory side to the processor side of the fabric lets the Power5 more frequently satisfy level-two (L2) cache misses with hits in the 36-Mbyte off-chip L3 cache, avoiding traffic on the interchip fabric. References to data not resident in the on-chip L2 cache cause the system to check the L3 cache before sending requests onto the interconnection fabric. Moving the L3 cache provides significantly more cache on the processor side than previously available, thus reducing traffic on the fabric and allowing Power5-based systems to scale to higher levels of symmetric multiprocessing. Initial Power5 systems support 64
physical processors.

The Power4 includes a 1.41-Mbyte on-chip L2 cache. Power4+ chips are similar in design to the Power4 but are fabricated in 130-nm technology rather than the Power4's 180-nm technology. The Power4+ includes a 1.5-Mbyte on-chip L2 cache, whereas the Power5 supports a

Figure 2. Power5 chip (FXU = fixed-point execution unit, ISU = instruction sequencing unit, IDU = instruction decode unit, LSU = load/store unit, IFU = instruction fetch unit, FPU = floating-point unit, and MC = memory controller).
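
The effect of moving the L3 cache to the processor side of the fabric, as described under "Power5 system structure", can be illustrated with a small Python sketch. It is a toy model and not IBM's implementation: the names L3Placement, resolve, and fabric_requests are hypothetical, caches are modeled as plain sets of addresses, and the memory-side arrangement is assumed to place a request on the fabric for every L2 miss.

```python
from enum import Enum


class L3Placement(Enum):
    MEMORY_SIDE = "memory side"        # Power4-style arrangement (assumed model)
    PROCESSOR_SIDE = "processor side"  # Power5-style arrangement (assumed model)


def resolve(address, l2, l3, placement):
    """Return (level that satisfied the reference, whether the fabric was used)."""
    if address in l2:
        # On-chip L2 hit: nothing leaves the chip.
        return "L2", False
    if placement is L3Placement.PROCESSOR_SIDE:
        # L3 is checked before any request is sent onto the fabric.
        if address in l3:
            return "L3", False
        return "memory", True
    # Memory-side L3 (simplifying assumption): every L2 miss crosses the
    # fabric first, and only then may hit in L3.
    if address in l3:
        return "L3", True
    return "memory", True


def fabric_requests(addresses, l2, l3, placement):
    """Count how many references in the trace put a request on the fabric."""
    return sum(resolve(a, l2, l3, placement)[1] for a in addresses)


if __name__ == "__main__":
    l2 = {0x10}
    l3 = {0x20, 0x30}
    trace = [0x10, 0x20, 0x30, 0x40, 0x20]
    for placement in L3Placement:
        print(placement.name, fabric_requests(trace, l2, l3, placement))
```

For the five-reference trace in the example, the memory-side model places four requests on the fabric and the processor-side model places one, because the three L3 hits no longer leave the processor side. The numbers are only as meaningful as the assumptions above; real fabric traffic also depends on coherence and prefetching, which this sketch ignores.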

