CS 152 Computer Architecture and Engineering
Lecture 21: Advanced Processors II
2004-11-16

Dave Patterson (www.cs.berkeley.edu/~patterson)
John Lazzaro (www.cs.berkeley.edu/~lazzaro)
www-inst.eecs.berkeley.edu/~cs152/

Thanks to Krste Asanovic.

CS 152 L21: Advanced Processors II, UC Regents Fall 2004 © UCB


Last Time: Superpipelining, Superscalar

Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)

Q: Could adding pipeline stages hurt CPI for an application? (Example: ARM XScale, 8 stages.)
A: Yes, due to these problems:

Problem                 Possible solution
Extra branch delays     Branch prediction
Extra load delays       Optimize code
Structural hazards      Optimize code, add hardware

Article excerpt shown on the slide (IEEE Journal of Solid-State Circuits, vol. 36, no. 11, November 2001, p. 1600):

Fig. 1. Process SEM cross section.

The process [...] was raised from [1] to limit standby power. Circuit design and architectural pipelining ensure low-voltage performance and functionality. To further limit standby current in handheld ASSPs, a longer poly target takes advantage of the [...] versus [...] dependence, and source-to-body bias is used to electrically limit transistor [...] in standby mode. All core nMOS and pMOS transistors utilize separate source and bulk connections to support this. The process includes cobalt disilicide gates and diffusions. Low source and drain capacitance, as well as 3-nm gate-oxide thickness, allow high performance and low-voltage operation.

III. ARCHITECTURE

The microprocessor contains 32-kB instruction and data caches as well as an eight-entry coalescing writeback buffer. The instruction and data cache fill buffers have two and four entries, respectively. The data cache supports hit-under-miss operation, and lines may be locked to allow SRAM-like operation. Thirty-two-entry fully associative translation lookaside buffers (TLBs) that support multiple page sizes are provided for both caches. TLB entries may also be locked. A 128-entry branch target buffer improves branch performance in a pipeline deeper than earlier high-performance ARM designs [2], [3].

A. Pipeline Organization

To obtain high performance, the microprocessor core utilizes a simple scalar pipeline and a high-frequency clock. In addition to avoiding the potential power waste of a superscalar approach, functional design and validation complexity is decreased at the expense of circuit design effort. To avoid circuit design issues, the pipeline partitioning balances the workload and ensures that no one pipeline stage is tight. The main integer pipeline is seven stages, memory operations follow an eight-stage pipeline, and when operating in thumb mode an extra pipe stage is inserted after the last fetch stage to convert thumb instructions into ARM instructions. Since thumb mode instructions [11] are 16 b, two instructions are fetched in parallel while executing thumb instructions. A simplified diagram of the processor pipeline is shown in Fig. 2, where the state boundaries are indicated by gray. Features that allow the microarchitecture to achieve high speed are as follows.

Fig. 2. Microprocessor pipeline organization.

The shifter and ALU reside in separate stages. The ARM instruction set allows a shift followed by an ALU operation in a single instruction. Previous implementations limited frequency by having the shift and ALU in a single stage. Splitting this operation reduces the critical ALU bypass path by approximately 1/3. The extra pipeline hazard introduced when an instruction is immediately followed by one requiring that the result be shifted is infrequent.

Decoupled instruction fetch. A two-instruction-deep queue is implemented between the second fetch and instruction decode pipe stages. This allows stalls generated later in the pipe to be deferred by one or more cycles in the earlier pipe stages, thereby allowing instruction fetches to proceed when the pipe is stalled, and also relieves stall speed paths in the instruction fetch and branch prediction units.

Deferred register dependency stalls. While register dependencies are checked in the RF stage, stalls due to these hazards are deferred until the X1 stage. All the necessary operands are then captured from result forwarding busses as the results are returned to the register file.

One of the major goals of the design was to minimize the energy consumed to complete a given task. Conventional wisdom has been that shorter pipelines are more efficient due to re...


Today: Dynamic Scheduling Overview

Goal: Enable out-of-order execution by breaking the pipeline in two: fetch and execution.
Example: IBM Power 5.
Today's focus: the execution unit. I-fetch and decode: like static pipelines.

[Figure 3 diagram. Labels: Branch redirects; Instruction fetch (IF, IC, BP); Group formation and instruction decode (D0, D1, D2, D3, Xfer, GD); Out-of-order processing; Branch pipeline (MP, ISS, RF, EX, WB, Xfer); Load/store pipeline (MP, ISS, RF, EA, DC, Fmt, WB, Xfer); Fixed-point pipeline (MP, ISS, RF, EX, WB, Xfer); Floating-point pipeline (MP, ISS, RF, F6, WB, Xfer); CP; Interrupts and flushes.]

Figure 3. Power5 instruction pipeline (IF = instruction fetch, IC = instruction cache, BP = branch predict, D0 = decode stage 0, Xfer = transfer, GD = group dispatch, MP = mapping, ISS = instruction issue, RF = register file read, EX = execute, EA = compute address, DC = data caches, F6 = six-cycle floating-point execution pipe, Fmt = data format, WB = write back, and CP = group commit).


Dynamic Scheduling: A mix of 3 ideas

Top-down idea: Registers that may be written only once (but may be read many times) eliminate WAW and WAR hazards.

Mid-level idea: An instruction waiting for an operand to execute may trigger on the (single) write to the associated register.

Bottom-up idea: To support "snooping" on register writes, attach all machine elements to a common bus.

Robert Tomasulo, IBM, 1967. FP unit for IBM 360/91.


A common bus means long wires, which are slow.

Pipelines in theory.

Long wires are the price we paid to avoid stalls.

[Figure 4 diagram. Labels: Branch prediction (Program counter, Branch history tables, Return stack, Target cache); Alternate; Instruction cache; Instruction translation; Instruction buffer 0; Instruction buffer 1; Thread priority; Group formation, Instruction decode, Dispatch; Shared-register mappers; Dynamic instruction selection; Shared issue queues; Read shared-register files; Shared execution units (LSU0, FXU0, LSU1, FXU1, FPU0, FPU1, BXU, CRL); Write shared-register files; Group completion; Store queue; Data translation; Data cache; L2 cache. Legend: Shared by two threads; Thread 0 resources; Thread 1 resources.]

Figure 4. Power5 instruction data flow.
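
The top-down idea from the dynamic scheduling slide (registers written only once eliminate WAW and WAR hazards) can be illustrated with a small renaming pass. This is a minimal sketch, not Tomasulo's hardware or the Power5 register mappers: the (dest, src1, src2) tuple format, the function names `rename` and `hazards`, and the physical register names `p0`, `p1`, ... are all assumptions made for the example.

```python
from itertools import count

def rename(program):
    """Give every write a fresh physical register (write-once registers).

    `program` is a list of (dest, src1, src2) register names.
    This tuple format is hypothetical, chosen only for the sketch.
    """
    fresh = (f"p{i}" for i in count())
    mapping = {}    # architectural name -> most recent physical name
    renamed = []
    for dest, *srcs in program:
        phys_srcs = []
        for s in srcs:
            # A source never written so far is a live-in value:
            # allocate a physical register for it on first use.
            if s not in mapping:
                mapping[s] = next(fresh)
            phys_srcs.append(mapping[s])
        # Each destination gets a brand-new name, so no name is written twice.
        mapping[dest] = next(fresh)
        renamed.append((mapping[dest], *phys_srcs))
    return renamed

def hazards(program):
    """Naive pairwise count of (RAW, WAR, WAW) by register name."""
    raw = war = waw = 0
    for i, (d1, *s1) in enumerate(program):
        for d2, *s2 in program[i + 1:]:
            raw += d1 in s2     # later instruction reads an earlier result
            war += d2 in s1     # later instruction overwrites an earlier source
            waw += d1 == d2     # both instructions write the same register
    return raw, war, waw

# Hypothetical three-instruction sequence that reuses r1 and r2.
prog = [("r1", "r2", "r3"),
        ("r2", "r1", "r4"),
        ("r1", "r5", "r6")]

before = hazards(prog)
after = hazards(rename(prog))
```

In this example only the true (RAW) dependence between the first two instructions survives renaming; the name-reuse conflicts disappear because every physical register has exactly one writer.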

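The performance equation from the "Last Time" slide, Seconds/Program = Instructions/Program x Cycles/Instruction x Seconds/Cycle, can be evaluated numerically to see when a deeper pipeline pays off. The sketch below is illustrative only: the instruction count, CPI values, and cycle times are hypothetical numbers, not measurements of the XScale or Power5.

```python
def exec_time(instructions, cpi, cycle_time_ns):
    """Seconds/Program = Instructions/Program * Cycles/Instruction * Seconds/Cycle."""
    return instructions * cpi * cycle_time_ns * 1e-9

# Hypothetical workload and pipeline parameters (assumed for illustration).
N = 1_000_000

# Shorter pipeline: fewer stall cycles per instruction, but a slower clock.
shallow = exec_time(N, cpi=1.2, cycle_time_ns=2.0)

# Deeper pipeline: extra branch and load delays raise CPI, but the clock is faster.
deep = exec_time(N, cpi=1.5, cycle_time_ns=1.25)

# The deeper design wins only if the cycle-time gain outweighs the CPI loss.
speedup = shallow / deep
```

With these assumed numbers the deeper pipeline still comes out ahead, because the cycle time shrinks by more than the CPI grows; a larger CPI penalty would reverse the result.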
