Berkeley COMPSCI 152 - Lecture 21 – Advanced Processors II

School: University of California, Berkeley

Course: COMPSCI 152 - Computer Architecture and Engineering

Pages: 27

CS 152 Computer Architecture and Engineering
Lecture 21: Advanced Processors II

2005-4-7
John Lazzaro (www.cs.berkeley.edu/~lazzaro)
Thanks to Krste Asanovic.
TAs: Ted Hong and David Marquardt
www-inst.eecs.berkeley.edu/~cs152/

CS 152 L21: Advanced Processors II. UC Regents Spring 2005, UCB.


Last Time: Superpipelining, Superscalar

[Figure: branch prediction for the instruction BNEZ R1 Loop. The address of the BNEZ instruction (0b0110 ... 01001000) is shown with fields labeled 28 bits and 2 bits. The Branch Target Buffer (BTB) holds a 28-bit address tag (0b0110 ... 0100) and a target address (PC + 4 + Loop); a tag match signals Hit and supplies the "Taken" address. The Branch History Table (BHT) supplies Taken or Not Taken.]

Update BHT/BTB for next time, once true behavior is known.
Must check prediction, kill instructions if needed.


Today: Dynamic Scheduling Overview

Goal: Enable out-of-order by breaking the pipeline in two: fetch and execution.
Example: IBM Power 5.
I-fetch and decode: like static pipelines.
Today's focus: execution unit.

[Figure 3: Power5 instruction pipeline. Instruction fetch (IF, IC, BP) feeds group formation and instruction decode (D0, D1, D2, D3, Xfer, GD), which feeds out-of-order processing in four pipelines: branch, load/store, fixed-point, and floating-point. Arrows are labeled "Branch redirects" and "Interrupts and flushes".]

Legend: IF = instruction fetch, IC = instruction cache, BP = branch predict, D0 = decode stage 0, Xfer = transfer, GD = group dispatch, MP = mapping, ISS = instruction issue, RF = register file read, EX = execute, EA = compute address, DC = data caches, F6 = six-cycle floating-point execution pipe, Fmt = data format, WB = write back, CP = group commit.


PowerPC 970 FX (slide from CS 152 L14: Cache I)

90 nm, 58 M transistors.
L1: 64K instruction, 32K data.
L2: 512K.


Recall: WAR and WAW hazards

Write After Read (WAR) hazards: Instruction I2 expects to write over a data value after an earlier instruction I1 reads it. But instead, I2 writes too early, and I1 sees the new value.

Write After Write (WAW) hazards: Instruction I2 writes over data an earlier instruction I1 also writes. But instead, I1 writes after I2, and the final data value is incorrect.

Dynamic scheduling eliminates WAR and WAW hazards, making out-of-order execution tractable.


Dynamic Scheduling: A mix of 3 ideas

Imagine an endless supply of registers ...

Top-down idea: Registers that may be written only once (but may be read many times) eliminate WAW and WAR hazards.

Mid-level idea: An instruction waiting for an operand to execute may trigger on the (single) write to the associated register. Eliminates RAW hazards.

Bottom-up idea: To support "snooping" on register writes, attach all machine elements to a common bus.

Robert Tomasulo, IBM, 1967. FP unit for IBM 360/91.


Register Renaming

Imagine an endless supply of registers. How?


Consider this simple loop

FP Loop: Where are the Hazards?

    Loop: LD   F0, 0(R1)     ; F0 = array (vector) element
          ADDD F4, F0, F2    ; add scalar from F2
          SD   F4, 0(R1)     ; store result: 0(R1) = F4
          SUBI R1, R1, 8     ; decrement pointer 8B (DW)
          BNEZ R1, Loop      ; branch R1 != zero
          NOP                ; delayed branch slot

Every pass through the loop introduces the potential for WAW and/or WAR hazards for F0, F4, and R1.

    Instruction producing result   Instruction using result   Latency in clock cycles
    FP ALU op                      Another FP ALU op          3
    FP ALU op                      Store double               2
    Load double                    FP ALU op                  1
    Load double                    Store double               0


Given an endless supply of registers ...

Rename "architected registers" (Ri, Fi) to new "physical registers" (PRi, PFi) on each write.

Original code:

          ADDI R1, R0, 64
    Loop: LD   F0, 0(R1)
          ADDD F4, F0, F2
          SD   F4, 0(R1)
          SUBI R1, R1, 8
          BNEZ R1, Loop
          NOP

Renamed instruction stream (R1 is mapped to PR01 and F0 to PF00 on their first writes):

           ADDI PR01, PR00, 64
           LD   PF00, 0(PR01)
           ADDD PF04, PF00, PF02
           SD   PF04, 0(PR01)
           SUBI PR11, PR01, 8
           BEQZ PR11, ENDLOOP
    ITER2: LD   PF10, 0(PR11)
           ADDD PF14, PF10, PF02
           SD   PF14, 0(PR11)
           SUBI PR21, PR11, 8
           BEQZ PR21, ENDLOOP
    ITER3: LD   PF20, 0(PR21)

What was gained? An instruction may execute once all of its source registers have been written.

Additional latency table row shown on this slide: Integer op, Integer op, 0 clock cycles.


Bus-Based CPUs


A common bus = long wires = slow?

Pipelines in theory: Wires are short, so clock periods can be short.

Pipelines in practice: Long wires are the price we paid to avoid stalls.

Conjecture: If processor speed is limited by long wires, let's do a design that fully uses the semantics of long wires, by using a bus.

[Figure: background image is a page from the IEEE Journal of Solid-State Circuits, vol. 36, no. 11, including "Fig. 1. Process SEM cross section".]


A bus-based multi-cycle computer

[Figure: a Load Unit (from memory), Register File, ALU 1, ALU 2, and a Store Unit (to memory), all attached to a Common Data Bus that carries (data id, data value).]

1. Only one unit writes at a time (one source).
2. All units may read the written values (many destinations), if interested in the id.

If we add too many functional units, one bus is too long and too slow. Solutions: more buses, faster electrical signalling.


Data-Driven Execution (Associative Control)

Caveat: In comparison to static pipelines, there is great diversity in dynamic scheduling implementations. The presentation that follows is a composite, and does not reflect any specific machine.


Recall: IBM Power 5 block diagram

Queues between instruction fetch and execution. Out
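The renaming rule from the FP loop slides (give every register write a fresh physical register) can be sketched in a few lines of Python. This is an illustrative sketch, not the lecture's hardware: the tuple encoding of instructions, the helper names `phys_name`, `rename`, and `name_dependences`, and the trace `loop_trace` are all assumptions made for this example. Immediates and memory are ignored; only register names are tracked.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding: (opcode, destination register or "", source registers).
Instr = Tuple[str, str, List[str]]


def phys_name(arch: str, version: int) -> str:
    # Mimic the slide's naming: R1 version 0 -> PR01, F0 version 1 -> PF10.
    # (Names become ambiguous past version 9; fine for a short sketch.)
    return f"P{arch[0]}{version}{arch[1:]}"


def rename(trace: List[Instr]) -> List[Instr]:
    """Map architected registers to a fresh physical register on each write."""
    version: Dict[str, int] = {}
    renamed: List[Instr] = []
    for op, dest, srcs in trace:
        # Sources are renamed first, so "SUBI R1, R1, 8" reads the old R1.
        new_srcs = []
        for s in srcs:
            version.setdefault(s, 0)  # a never-written register keeps version 0
            new_srcs.append(phys_name(s, version[s]))
        new_dest = ""
        if dest:
            version[dest] = version[dest] + 1 if dest in version else 0
            new_dest = phys_name(dest, version[dest])
        renamed.append((op, new_dest, new_srcs))
    return renamed


def name_dependences(trace: List[Instr]) -> List[Tuple[str, int, int]]:
    """List every WAR/WAW register pair as (kind, earlier index, later index)."""
    found = []
    for j, (_, dest_j, _) in enumerate(trace):
        if not dest_j:
            continue
        for i in range(j):
            _, dest_i, srcs_i = trace[i]
            if dest_j in srcs_i:
                found.append(("WAR", i, j))
            if dest_i == dest_j:
                found.append(("WAW", i, j))
    return found


# Dynamic instruction stream for the setup ADDI plus two loop iterations.
loop_trace: List[Instr] = [
    ("ADDI", "R1", ["R0"]),
    ("LD", "F0", ["R1"]),
    ("ADDD", "F4", ["F0", "F2"]),
    ("SD", "", ["F4", "R1"]),
    ("SUBI", "R1", ["R1"]),
    ("BNEZ", "", ["R1"]),
    ("LD", "F0", ["R1"]),
    ("ADDD", "F4", ["F0", "F2"]),
    ("SD", "", ["F4", "R1"]),
    ("SUBI", "R1", ["R1"]),
]

for op, dest, srcs in rename(loop_trace):
    print(op, dest, srcs)
print("WAR/WAW pairs before:", len(name_dependences(loop_trace)))
print("WAR/WAW pairs after: ", len(name_dependences(rename(loop_trace))))
```

On this trace the destinations come out as PR01, PF00, PF04, PR11, PF10, PF14, PR21, matching the physical names on the slide. No WAR or WAW pair remains after renaming, because each physical register is written exactly once.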
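The two rules of the common data bus (one writer at a time, any number of interested readers) can also be illustrated with a toy Python model. This is a sketch under stated assumptions rather than a description of any real machine: the class names `WaitingInstr` and `CommonDataBus`, the methods `snoop`, `ready`, `start_cycle`, and `write`, and the data values 1.5 and 2.0 are all hypothetical. Arbitration is simplified to "first writer in a cycle wins".

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class WaitingInstr:
    """An instruction parked until all of its source registers are written."""
    name: str
    needs: List[str]                      # ids of the physical registers it reads
    captured: Dict[str, float] = field(default_factory=dict)

    def snoop(self, reg_id: str, value: float) -> None:
        # Capture the value only if this instruction is interested in the id.
        if reg_id in self.needs:
            self.captured[reg_id] = value

    def ready(self) -> bool:
        return all(reg_id in self.captured for reg_id in self.needs)


class CommonDataBus:
    """Broadcasts (data id, data value) pairs; one writer per cycle."""

    def __init__(self, listeners: List[WaitingInstr]) -> None:
        self.listeners = listeners
        self.driver: Optional[str] = None  # unit driving the bus this cycle

    def start_cycle(self) -> None:
        self.driver = None

    def write(self, unit: str, reg_id: str, value: float) -> bool:
        if self.driver is not None:
            return False                   # bus already driven; retry next cycle
        self.driver = unit
        for listener in self.listeners:    # many destinations may read
            listener.snoop(reg_id, value)
        return True


addd = WaitingInstr("ADDD PF04, PF00, PF02", needs=["PF00", "PF02"])
store = WaitingInstr("SD PF04, 0(PR01)", needs=["PF04", "PR01"])
bus = CommonDataBus([addd, store])

bus.start_cycle()
first = bus.write("Load Unit", "PF00", 1.5)
second = bus.write("Register File", "PF02", 2.0)  # same cycle: rejected
bus.start_cycle()
retry = bus.write("Register File", "PF02", 2.0)
print(first, second, retry, addd.ready(), store.ready())
# prints: True False True True False
```

The ADDD becomes ready as soon as both of its source ids have appeared on the bus, while the store keeps waiting for PF04 and PR01. The rejected second write is the cost the slide points to: with many functional units, a single bus becomes a bottleneck.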
