Berkeley COMPSCI 252 - Instruction Level Parallelism 2: Getting the CPI < 1


Lecture 7: Instruction Level Parallelism 2: Getting the CPI < 1
Review: Hardware unrolling, in-order commit, and explicit register renaming
Tomasulo With Reorder buffer:
Slide 4
Memory Disambiguation: Sorting out RAW Hazards in memory
Hardware Support for Memory Disambiguation
Explicit register renaming: Hardware equivalent of static, single-assignment (SSA) compiler form
Slide 8
Slide 9
Slide 10
Slide 11
Slide 12
Instruction Level Parallelism
Getting CPI < 1: Issuing Multiple Instructions/Cycle
Slide 15
Review: Unrolled Loop that Minimizes Stalls for Scalar
Loop Unrolling in Superscalar
Dynamic Scheduling in Superscalar
Multiple Issue Challenges
Loop Unrolling in VLIW
Software Pipelining
Software Pipelining Example
Software Pipelining with Loop Unrolling in VLIW
Trace Scheduling
Advantages of HW (Tomasulo) vs. SW (VLIW) Speculation
Superscalar v. VLIW
Intel/HP “Explicitly Parallel Instruction Computer (EPIC)”
Limits to Multi-Issue Machines
Slide 29
Limits to ILP
Slide 31
Upper Limit to ILP: Ideal Machine (Figure 4.38, page 319)
More Realistic HW: Branch Impact (Figure 4.40, page 323)
More Realistic HW: Register Impact (Figure 4.44, page 328)
More Realistic HW: Alias Impact (Figure 4.46, page 330)
Realistic HW for ‘9X: Window Impact (Figure 4.48, page 332)
Brainiac vs. Speed Demon (1993)
CS 252 Administrivia
Problems with scalar approach to ILP extraction
Alternative Model: Vector Processing
Properties of Vector Processors
Operation & Instruction Count: RISC v. Vector Processor (from F. Quintana, U. Barcelona.)
Styles of Vector Architectures
Components of Vector Processor
“DLXV” Vector Instructions
Memory operations
DAXPY (Y = a * X + Y)
Example Vector Machines
Vector Linpack Performance (MFLOPS)
Vector Surprise
Virtual Processor Vector Model
Vector Architectural State
Vector Implementation
Vector Terminology: 4 lanes, 2 vector functional units
Vector Execution Time
DLXV Start-up Time
Why startup time for each vector instruction?
Vector Load/Store Units & Memories
Vector Length
Strip Mining
Common Vector Metrics
Vector Stride
Compiler Vectorization on Cray XMP
Vector Opt #1: Chaining
Example Execution of Vector Code
Vector Opt #2: Conditional Execution
Vector Opt #3: Sparse Matrices
Sparse Matrix Example
Applications
Vector for Multimedia?
MMX Instructions
Vectors and Variable Data Width
Mediaprocessing: Vectorizable? Vector Lengths?
Vector Pitfalls
Vector Advantages
Summary #1
Summary #2

JDK.F98 Slide 1
Lecture 7: Instruction Level Parallelism 2: Getting the CPI < 1
Prof. John Kubiatowicz
Computer Science 252
Fall 1998

JDK.F98 Slide 2
Review: Hardware unrolling, in-order commit, and explicit register renaming
•Machines that use hardware techniques with register renaming (such as Tomasulo) can unroll loops automatically in hardware
•In-Order-Commit is important because:
  –Allows the generation of precise exceptions
  –Allows speculation across branches
•Use of reorder buffer
  –Commits user-visible state in instruction order
•Explicit register renaming uses a rename table and large bank of physical registers

JDK.F98 Slide 3
Tomasulo With Reorder buffer:
[Figure: block diagram with FP Op Queue, Registers, Reorder Buffer (entries ROB1, oldest, through ROB7, newest, with Dest and Done? columns), Reservation Stations, FP adders, FP multipliers, a path from Memory, and a path To Memory.]
Instructions in the reorder buffer, oldest to newest: LD F0,10(R2); ADDD F10,F4,F0; DIVD F2,F10,F6; BNE F2,<…>; LD F4,0(R3); ADDD F0,F4,F6; ST 0(R3),F0
Reservation station entries: 2 ADDD R(F4),ROB1; 3 DIVD ROB2,R(F6)
Load entry: 1 10+R2

JDK.F98 Slide 4
Tomasulo With Reorder buffer:
[Figure: same diagram and contents as Slide 3, with the added callout below.]
What about memory hazards???

JDK.F98 Slide 5
Memory Disambiguation: Sorting out RAW Hazards in memory
•Question: Given a load that follows a store in program order, are the two related?
  –(Alternatively: is there a RAW hazard between the store and the load?)
   E.g.: st 0(R2),R5
         ld R6,0(R3)
•Can we go ahead and start the load early?
  –Store address could be delayed for a long time by some calculation that leads to R2 (divide?).
  –We might want to issue/begin execution of both operations in same cycle.
  –Today: Answer is that we are not allowed to start load until we know that address 0(R2) ≠ 0(R3)
  –Next Week: We might guess at whether or not they are dependent (called “dependence speculation”) and use reorder buffer to fix up if we are wrong.

JDK.F98 Slide 6
Hardware Support for Memory Disambiguation
•Need buffer to keep track of all outstanding stores to memory, in program order.
  –Keep track of address (when it becomes available) and value (when it becomes available)
  –FIFO ordering: will retire stores from this buffer in program order
•When issuing a load, record current head of store queue (know which stores are ahead of you).
•When have address for load, check store queue:
  –If any store prior to load is waiting for its address, stall load.
  –If load address matches earlier store address (associative lookup), then we have a memory-induced RAW hazard:
    »store value available ⇒ return value
    »store value not available ⇒ return ROB number of source
  –Otherwise, send out request to memory
•Actual stores commit in order, so no worry about WAR/WAW hazards through memory.

JDK.F98 Slide 7
Explicit register renaming: Hardware equivalent of static, single-assignment (SSA) compiler form
[Figure: reorder list (Done? column, ordered Oldest to Newest), Current Map Table, and Freelist.]
Current Map Table: P0 P2 P4 F6 F8 P10 P12 P14 P16 P18 P20 P22 P24 P26 P28 P30
Freelist: P32 P34 P36 P38 P60 P62
•Physical register file larger than ISA register file
•On issue, each instruction that modifies a register is allocated new physical register from freelist
•Used on: R10000, Alpha 21264, HP PA8000

JDK.F98 Slide 8
Explicit register renaming: Hardware equivalent of static, single-assignment (SSA) compiler form
[Figure: same layout as Slide 7, after the first load has been renamed.]
Reorder list entry: F0 P0 LD P32,10(R2) Done? N
Current Map Table: P32 P2 P4 F6 F8 P10 P12 P14 P16 P18 P20 P22 P24 P26 P28 P30
Freelist: P34 P36 P38 P40 P60 P62
•Note that
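
The explicit-renaming step shown on Slides 7 and 8 (allocate a new physical register from the freelist on issue, update the map table, and remember the old mapping) can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name `Renamer`, its methods, and the uniform initial mapping Fk → Pk are hypothetical, and returning the old physical register to the freelist at commit is a common convention assumed here, not something stated in the visible slide text.

```python
from collections import deque


class Renamer:
    """Toy model of a rename map table plus freelist (hypothetical names)."""

    def __init__(self):
        # Assumption: even-numbered FP registers F0..F30 start mapped to P0..P30.
        self.map = {f"F{i}": f"P{i}" for i in range(0, 32, 2)}
        # Assumption: remaining even physical registers P32..P62 start free.
        self.free = deque(f"P{i}" for i in range(32, 64, 2))
        # Reorder list, oldest first: (architectural dest, old physical reg).
        self.rob = []

    def rename(self, dest, srcs):
        """Rename one instruction; returns (new dest, renamed sources) or None."""
        if not self.free:
            return None  # no free physical register: issue would have to stall
        # Sources are looked up before the destination is remapped.
        phys_srcs = [self.map[s] for s in srcs]
        new_reg = self.free.popleft()
        self.rob.append((dest, self.map[dest]))
        self.map[dest] = new_reg
        return new_reg, phys_srcs

    def commit(self):
        """Retire the oldest instruction and recycle its old physical register."""
        dest, old_reg = self.rob.pop(0)
        self.free.append(old_reg)  # assumed policy: old mapping is dead at commit
        return dest, old_reg
```

Renaming `LD F0,10(R2)` with this sketch yields destination P32 and records (F0, P0) in the reorder list, which mirrors the state change between Slide 7 and Slide 8.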
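
The load check from the "Hardware Support for Memory Disambiguation" slide can be sketched in the same way. The names `StoreEntry`, `src_rob`, `check_load`, and `stores_ahead` are hypothetical, and picking the youngest matching earlier store when several match is an assumption; the slide only says that a match is a memory-induced RAW hazard.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class StoreEntry:
    src_rob: int                  # ROB number of the instruction producing the data
    addr: Optional[int] = None    # None until the store address is computed
    value: Optional[int] = None   # None until the store data is available


def check_load(store_queue: List[StoreEntry], stores_ahead: int,
               load_addr: int) -> Tuple[str, Optional[int]]:
    """Decide what a load with a known address may do.

    store_queue is in program order (oldest first); stores_ahead is the
    queue position recorded when the load issued, so only entries before
    it are earlier in program order.
    """
    earlier = store_queue[:stores_ahead]
    # Any earlier store still waiting for its address: stall the load.
    if any(s.addr is None for s in earlier):
        return ("stall", None)
    # Associative lookup; assumption: the youngest earlier match wins.
    for s in reversed(earlier):
        if s.addr == load_addr:
            if s.value is not None:
                return ("forward", s.value)   # store value available
            return ("wait_rob", s.src_rob)    # wait on the producing ROB entry
    # No earlier store to this address: send the request to memory.
    return ("memory", load_addr)
```

This covers only the conservative policy described for "Today" on Slide 5; dependence speculation is not modeled.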

