CS 152 Computer Architecture and Engineering
Lecture 20: Advanced Processors I
2004-11-09
Dave Patterson (www.cs.berkeley.edu/~patterson)
John Lazzaro (www.cs.berkeley.edu/~lazzaro)
www-inst.eecs.berkeley.edu/~cs152/
CS 152 L20: Advanced Processors I, UC Regents Fall 2004, UCB

Last Time: Multipliers, Space vs. Time
Five options for multiplier latency.
The 1-cycle option is fully spatial.
The 35-cycle option is mini-Lab 2.
In between: 2, 4, and 5 cycles.

Today: Beyond the 5-stage pipeline
Taxonomy: Introduction to advanced processor techniques.
Superpipelining: Increasing the number of pipeline stages.
Superscalar: Issuing several instructions in a single cycle.

$\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$

5 Stage Pipeline: A point of departure
Graphically Representing MIPS Pipeline (from CS 152 L10: Pipeline Intro)
Pipeline diagram: IM, Reg, ALU, DM, Reg.
Can help with answering questions like:
How many cycles does it take to execute this code?
What is the ALU doing during cycle 4?
Is there a hazard, why does it occur, and how can it be fixed?
At best, the 5-stage pipeline executes one instruction per clock, with a clock period determined by the slowest stage. "At best" assumes:
Filling all delay slots (branch, load).
Perfect caching.
Application does not need multi-cycle instructions (multiply, divide, etc.).

Superpipelining: Add more stages (Today)
Goal: Reduce critical path by adding more pipeline stages.
Example: 8-stage ARM XScale: extra IF, ID, data cache stages.
Difficulties: Added penalties for load delays and branch misses.
Ultimate Limiter: As logic delay goes to 0, FF clk-to-Q and setup.

Excerpt shown on the slide (IEEE Journal of Solid-State Circuits, vol. 36, no. 11, November 2001):
... shown in Fig. 2, where the state boundaries are indicated by gray. Features that allow the microarchitecture to achieve high speed are as follows.
The shifter and ALU reside in separate stages. The ARM instruction set allows a shift followed by an ALU operation in a single instruction. Previous implementations limited frequency by having the shift and ALU in a single stage. Splitting this operation reduces the critical ALU bypass path by approximately 1/3. The extra pipeline hazard introduced when an instruction is immediately followed by one requiring that the result be shifted is infrequent.
Decoupled Instruction Fetch. A two-instruction deep queue is implemented between the second fetch and instruction decode ...
Fig. 2. Microprocessor pipeline organization.

Superscalar: Multiple issues per cycle (Today)
Goal: Improve CPI by issuing several instructions per cycle.
Example: CPU with floating point ALUs: issue 1 FP + 1 integer instruction per cycle.
Difficulties: Load and branch delays affect more instructions.
Ultimate Limiter: Programs may be a poor match to issue rules.

Out of Order: Going around stalls (Next day, Tues)
Goal: Issue instructions out of program order.
Example: MULTD is waiting on F4 to load, so let ADDD go first.
Difficulties: Bookkeeping is highly complex. A poor fit for lockstep instruction scheduling.
Ultimate Limiter: The amount of instruction-level parallelism present in an application.

Dynamic Scheduling: End lockstep (Next day, Tues)
Goal: Enable out-of-order by breaking the pipeline in two: fetch and execution.
Example: IBM Power 5.
Limiters: Design complexity, instruction-level parallelism.

Figure 3. Power5 instruction pipeline (IF = instruction fetch, IC = instruction cache, BP = branch predict, D0 = decode stage 0, Xfer = transfer, GD = group dispatch, MP = mapping, ISS = instruction issue, RF = register file read, EX = execute, EA = compute address, DC = data caches, F6 = six-cycle floating-point execution pipe, Fmt = data format, WB = write back, and CP = group commit).
Diagram labels: out-of-order processing, branch redirects, instruction fetch, group formation and instruction decode, branch pipeline, load/store pipeline, fixed-point pipeline, floating-point pipeline, interrupts and flushes.

Goal (Thu): Use multiple CPUs (real and virtual) to improve (1) throughput of machines that run many programs, (2) execution time of multi-threaded programs.

Administrivia
No class on Thursday.
HW 4 due Weds 11/10, 5 PM, 283 Soda.

Figure 4. Power5 instruction data flow (BXU = branch execution unit and CRL = condition register logical execution unit).
Diagram labels: dynamic instruction selection, branch prediction, branch history tables, program counter, instruction cache, instruction translation, instruction buffer 0, instruction buffer 1, thread priority, group formation, instruction decode, dispatch, shared-register mappers, read shared-register files, LSU0, FXU0, LSU1, FXU1, FPU0, FPU1, BXU, CRL, write shared-register files, data translation, data cache, group completion, store queue, L2 cache.
Legend: shared by two threads, thread 0 resources, thread 1 resources.

The Power5 scans fetched instructions for branches (BP stage), and if it finds a branch, predicts the branch direction using three branch history tables shared by the two threads. Two of the BHTs use bimodal and path-correlated branch prediction mechanisms to predict branch directions [6, 7]. The third BHT predicts which of these prediction mechanisms is more likely to predict the correct direction [7]. If the fetched instructions con