DOC PREVIEW
Berkeley COMPSCI 152 - Lecture Notes

This preview shows page 1-2-3-4-26-27-28-54-55-56-57 out of 57 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 57 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 57 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 57 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 57 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 57 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 57 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 57 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 57 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 57 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 57 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 57 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 57 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

CS152 Computer Architecture and Engineering Lecture 16 Compiler Optimizations (Cont) Dynamic Scheduling with ScoreboardsRecall: Achieving Precise ExceptionsRecall: Can we somehow make CPI closer to 1?Recall: FP Loop Showing StallsRecall: Revised FP Loop Minimizing StallsUnroll Loop Four Times (straightforward way)Unrolled Loop That Minimizes StallsGetting CPI < 1: Issuing Multiple Instructions/CycleSlide 9Loop Unrolling in SuperscalarLimits of SuperscalarLoop Unrolling in VLIWSoftware PipeliningSoftware Pipelining ExampleSoftware Pipelining with Loop Unrolling in VLIWCan we use HW to get CPI closer to 1?Problems?The Big Picture: Where are We Now?AdministriviaScoreboard: a bookkeeping techniqueScoreboard Architecture(CDC 6600)Scoreboard ImplicationsFour Stages of Scoreboard ControlSlide 24Three Parts of the ScoreboardScoreboard ExampleDetailed Scoreboard Pipeline ControlScoreboard Example: Cycle 1Scoreboard Example: Cycle 2Scoreboard Example: Cycle 3Scoreboard Example: Cycle 4Scoreboard Example: Cycle 5Scoreboard Example: Cycle 6Scoreboard Example: Cycle 7Scoreboard Example: Cycle 8a (First half of clock cycle)Scoreboard Example: Cycle 8b (Second half of clock cycle)Scoreboard Example: Cycle 9Scoreboard Example: Cycle 10Scoreboard Example: Cycle 11Scoreboard Example: Cycle 12Scoreboard Example: Cycle 13Scoreboard Example: Cycle 14Scoreboard Example: Cycle 15Scoreboard Example: Cycle 16Scoreboard Example: Cycle 17Scoreboard Example: Cycle 18Scoreboard Example: Cycle 19Scoreboard Example: Cycle 20Scoreboard Example: Cycle 21Scoreboard Example: Cycle 22Faster than light computation (skip a couple of cycles)Scoreboard Example: Cycle 61Scoreboard Example: Cycle 62Review: Scoreboard Example: Cycle 62CDC 6600 ScoreboardSummary #1/2: Compiler techniques for parallelismSummary #2/210/26/01 ©UCB Fall 2001CS152 / Kubiatowicz Lec16.1CS152Computer Architecture and EngineeringLecture 16Compiler Optimizations (Cont) Dynamic Scheduling with ScoreboardsOctober 26, 2001John Kubiatowicz (http.cs.berkeley.edu/~kubitron)lecture slides: http://www-inst.eecs.berkeley.edu/~cs152/10/26/01 ©UCB Fall 2001CS152 / Kubiatowicz Lec16.2Recall: Achieving Precise Exceptions°Use pipeline to sort this out!•Pass exception status along with instruction.•Keep track of PCs for every instruction in pipeline.•Don’t act on exception until it reach WB stage°Handle interrupts through “faulting noop” in IF stage°When instruction reaches end of MEM stage:•Save PC  EPC, Interrupt vector addr  PC•Turn all instructions in earlier stages into noops!Program FlowTimeIFetch Dcd Exec Mem WBIFetch Dcd Exec Mem WBIFetch Dcd Exec Mem WBIFetch Dcd Exec Mem WBData TLBBad InstInst TLB faultOverflow10/26/01 ©UCB Fall 2001CS152 / Kubiatowicz Lec16.3 Recall: Can we somehow make CPI closer to 1?°Let’s assume full pipelining:•If we have a 4-cycle instruction, then we need 3 instructions between a producing instruction and its use:multf $F0,$F2,$F4delay-1delay-2delay-3addf $F6,$F10,$F0Fetch Decode Ex1 Ex2 Ex3 Ex4 WBmultfdelay1delay2delay3addfEarliest forwarding for 4-cycle instructionsEarliest forwarding for1-cycle instructions10/26/01 ©UCB Fall 2001CS152 / Kubiatowicz Lec16.4Recall: FP Loop Showing Stalls°9 clocks: CPI = 9/5 = 1.8Rewrite code to minimize stalls?Instruction Instruction Latency inproducing result using result clock cyclesFP ALU op Another FP ALU op 3FP ALU op Store double 2 Load double FP ALU op 1 1 Loop: LD F0,0(R1) ;F0=vector element 2 stall 3 ADDD F4,F0,F2 ;add scalar in F2 4 stall 5 stall 6 SD 0(R1),F4 ;store result 7 SUBI R1,R1,8 ;decrement pointer 8B (DW) 8 BNEZ R1,Loop ;branch R1!=zero 9 stall ;delayed branch slot10/26/01 ©UCB Fall 2001CS152 / Kubiatowicz Lec16.5Recall: Revised FP Loop Minimizing Stalls6 clocks: CPI = 6/5 = 1.2)Instruction Instruction Latency inproducing result using result clock cyclesFP ALU op Another FP ALU op 3FP ALU op Store double 2 Load double FP ALU op 1 1 Loop: LD F0,0(R1) 2 stall 3 ADDD F4,F0,F2 4 SUBI R1,R1,8 5 BNEZ R1,Loop ;delayed branch 6 SD 8(R1),F4 ;altered when move past SUBISwap BNEZ and SD by changing address of SDUnroll loop 4 times code to make faster?10/26/01 ©UCB Fall 2001CS152 / Kubiatowicz Lec16.61 Loop:LD F0,0(R1)2 ADDD F4,F0,F23 SD 0(R1),F4 ;drop SUBI & BNEZ4 LD F6,-8(R1)5 ADDD F8,F6,F26 SD -8(R1),F8 ;drop SUBI & BNEZ7 LD F10,-16(R1)8 ADDD F12,F10,F29 SD -16(R1),F12 ;drop SUBI & BNEZ10 LD F14,-24(R1)11 ADDD F16,F14,F212 SD -24(R1),F1613 SUBI R1,R1,#32 ;alter to 4*814 BNEZ R1,LOOP15 NOP 15 + 4 x (1+2) = 27 clock cycles, or 6.8 per iteration Assumes R1 is multiple of 4 CPI = 27/15 = 1.8Unroll Loop Four Times (straightforward way)Rewrite loop to minimize stalls?1 cycle stall2 cycles stall10/26/01 ©UCB Fall 2001CS152 / Kubiatowicz Lec16.7°What assumptions made when moved code?•OK to move store past SUBI even though changes register•OK to move loads before stores: get right data?•When is it safe for compiler to do such changes?1 Loop:LD F0,0(R1)2 LD F6,-8(R1)3 LD F10,-16(R1)4 LD F14,-24(R1)5 ADDD F4,F0,F26 ADDD F8,F6,F27 ADDD F12,F10,F28 ADDD F16,F14,F29 SD 0(R1),F410 SD -8(R1),F811 SD -16(R1),F1212 SUBI R1,R1,#3213 BNEZ R1,LOOP14 SD 8(R1),F16 ; 8-32 = -2414 clock cycles, or 3.5 per iterationCPI = 14/14 = 1When safe to move instructions?Unrolled Loop That Minimizes Stalls10/26/01 ©UCB Fall 2001CS152 / Kubiatowicz Lec16.8°Two main variations: Superscalar and VLIW°Superscalar: varying no. instructions/cycle (1 to 6)•Parallelism and dependencies determined/resolved by HW•IBM PowerPC 604, Sun UltraSparc, DEC Alpha 21164, HP 7100°Very Long Instruction Words (VLIW): fixed number of instructions (16) parallelism determined by compiler•Pipeline is exposed; compiler must schedule delays to get right result°Explicit Parallel Instruction Computer (EPIC)/ Intel•128 bit packets containing 3 instructions (can execute sequentially)•Can link 128 bit packets together to allow more parallelism•Compiler determines parallelism, HW checks dependencies and fowards/stallsGetting CPI < 1: Issuing Multiple Instructions/Cycle10/26/01 ©UCB Fall 2001CS152 / Kubiatowicz Lec16.9°Superscalar DLX: 2 instructions, 1 FP & 1 anything else– Fetch 64-bits/clock cycle; Int on left, FP on right– Can only issue 2nd instruction if 1st instruction issues– More ports for FP registers to do FP load & FP op in a pairType PipeStagesInt. instruction IF ID EX MEM WBFP instruction IF ID EX MEM WBInt. instruction IF


View Full Document

Berkeley COMPSCI 152 - Lecture Notes

Documents in this Course
Quiz 5

Quiz 5

9 pages

Memory

Memory

29 pages

Quiz 5

Quiz 5

15 pages

Memory

Memory

29 pages

Memory

Memory

35 pages

Memory

Memory

15 pages

Quiz

Quiz

6 pages

Midterm 1

Midterm 1

20 pages

Quiz

Quiz

12 pages

Memory

Memory

33 pages

Quiz

Quiz

6 pages

Homework

Homework

19 pages

Quiz

Quiz

5 pages

Memory

Memory

15 pages

Load more
Download Lecture Notes
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Lecture Notes and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Lecture Notes 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?