TAMU CSCE 614 - Lec19-static1

This preview shows page 1-2-3-24-25-26-27-49-50-51 out of 51 pages.


CPSC 614: Graduate Computer Architecture
Intro to Static Pipelining
Prof. Lawrence Rauchwerger
Based on lectures by Prof. David A. Patterson, UC Berkeley

Slide titles:
• Review: Dynamic Examples
• Overview
• Static Branch Prediction
• Running Example
• FP Loop: Where are the Hazards?
• FP Loop Showing Stalls
• Revised FP Loop Minimizing Stalls
• Unroll Loop Four Times (straightforward way)
• Unrolled Loop Detail
• Unrolled Loop That Minimizes Stalls
• Compiler Perspectives on Code Movement
• Where are the name dependencies?
• Slide 14
• Slide 15
• Steps Compiler Performed to Unroll
• Another possibility: Software Pipelining
• Software Pipelining Example
• When Safe to Unroll Loop?
• Does a loop-carried dependence mean there is no parallelism???
• Hardware Support for Exposing More Parallelism at Compile-Time
• Slide 23
• Exception Behavior Support
• Hardware Support for Memory Reference Speculation
• What if Can Change Instruction Set?
• VLIW: Very Large Instruction Word
• Recall: Unrolled Loop that Minimizes Stalls for Scalar
• Loop Unrolling in VLIW
• Recall: Software Pipelining
• Recall: Software Pipelining Example
• Software Pipelining with Loop Unrolling in VLIW
• Trace Scheduling
• Advantages of HW (Tomasulo) vs. SW (VLIW) Speculation
• Superscalar v. VLIW
• Problems with First Generation VLIW
• Intel/HP IA-64 “Explicitly Parallel Instruction Computer (EPIC)”
• IA-64 Registers
• Slide 39
• Slide 40
• 5 Types of Execution in Bundle
• Itanium™ Processor Silicon (Copyright: Intel at Hotchips ’00)
• Itanium™ Machine Characteristics (Copyright: Intel at Hotchips ’00)
• PowerPoint Presentation
• 10 Stage In-Order Core Pipeline (Copyright: Intel at Hotchips ’00)
• Itanium processor 10-stage pipeline
• Slide 47
• Comments on Itanium
• Performance of IA-64 Itanium?
• Summary #1: Hardware versus Software Speculation Mechanisms
• Summary #2: Hardware versus Software Speculation Mechanisms cont’d
• Summary #3: Software Scheduling

Review: Dynamic Examples
• P6 (Pentium Pro, II, III): successful micro-architecture, even with imitator (AMD Athlon)
  – Translate most 80x86 instructions to micro-operations
    » Longer pipeline than RISC instructions
  – Dynamically execute micro-operations
• “Netburst” (Pentium 4, …): success not clear
  – Much longer pipeline, higher clock rate in same technology as P6
  – Trace Cache to capture micro-operations, avoid hardware translation
• Multithreading to increase performance for servers, parallel programs written to use threads
  – Extra copies of PCs, Registers per thread; e.g., IBM AS/400
• Simultaneous Multithreading (SMT): exploit underutilized Dynamic Execution HW to get higher throughput at low extra cost?

Overview
• Last 3 lectures: binary compatibility and exploiting ILP in hardware: BTB, ROB, Reservation Stations, ...
• How far can you go in the compiler?
• What if you can also change the instruction set architecture?
• Will see multi-billion-dollar gamble by two Bay Area firms for the future of computer architecture: HP and Intel to produce IA-64
  – 7 years in the making?

Static Branch Prediction
• Simplest: Predict taken
  – Average misprediction rate = untaken branch frequency, which for the SPEC programs is 34%
  – Unfortunately, the misprediction rate ranges from not very accurate (59%) to highly accurate (9%)
• Predict on the basis of branch direction?
  – Choosing backward-going branches to be taken (loop)
  – Forward-going branches to be not taken (if)
  – In the SPEC programs, however, most forward-going branches are taken => predict taken is better
• Predict branches on the basis of profile information collected from earlier runs
  – Misprediction varies from 5% to 22%

Running Example
• This code adds a scalar to a vector:
    for (i=1000; i>0; i=i-1)
        x[i] = x[i] + s;
• Assume the following latencies for all examples:

  Instruction         Instruction          Execution latency   Latency
  producing result    using result         in cycles           in cycles
  FP ALU op           Another FP ALU op    4                   3
  FP ALU op           Store double         3                   2
  Load double         FP ALU op            1                   1
  Load double         Store double         1                   0
  Integer op          Integer op           1                   0

FP Loop: Where are the Hazards?
• First translate into MIPS code:
  – To simplify, assume 8 is lowest address

  Loop: L.D    F0,0(R1)   ;F0=vector element
        ADD.D  F4,F0,F2   ;add scalar from F2
        S.D    0(R1),F4   ;store result
        DSUBUI R1,R1,8    ;decrement pointer 8B (DW)
        BNEZ   R1,Loop    ;branch R1!=zero
        NOP               ;delayed branch slot

• Where are the stalls?

FP Loop Showing Stalls

  1 Loop: L.D    F0,0(R1)   ;F0=vector element
  2       stall
  3       ADD.D  F4,F0,F2   ;add scalar in F2
  4       stall
  5       stall
  6       S.D    0(R1),F4   ;store result
  7       DSUBUI R1,R1,8    ;decrement pointer 8B (DW)
  8       BNEZ   R1,Loop    ;branch R1!=zero
  9       stall             ;delayed branch slot

  Instruction         Instruction          Latency in
  producing result    using result         clock cycles
  FP ALU op           Another FP ALU op    3
  FP ALU op           Store double         2
  Load double         FP ALU op            1

• 9 clocks: Rewrite code to minimize stalls?

Revised FP Loop Minimizing Stalls

  1 Loop: L.D    F0,0(R1)
  2       stall
  3       ADD.D  F4,F0,F2
  4       DSUBUI R1,R1,8
  5       BNEZ   R1,Loop    ;delayed branch
  6       S.D    8(R1),F4   ;altered when moved past DSUBUI

  Instruction         Instruction          Latency in
  producing result    using result         clock cycles
  FP ALU op           Another FP ALU op    3
  FP ALU op           Store double         2
  Load double         FP ALU op            1

• Swap BNEZ and S.D by changing address of S.D
• 6 clocks, but just 3 for execution, 3 for loop overhead; how to make it faster?

Unroll Loop Four Times (straightforward way)

   1 Loop: L.D    F0,0(R1)
   2       ADD.D  F4,F0,F2
   3       S.D    0(R1),F4      ;drop DSUBUI & BNEZ
   4       L.D    F6,-8(R1)
   5       ADD.D  F8,F6,F2
   6       S.D    -8(R1),F8     ;drop DSUBUI & BNEZ
   7       L.D    F10,-16(R1)
   8       ADD.D  F12,F10,F2
   9       S.D    -16(R1),F12   ;drop DSUBUI & BNEZ
  10       L.D    F14,-24(R1)
  11       ADD.D  F16,F14,F2
  12       S.D    -24(R1),F16
  13       DSUBUI R1,R1,#32     ;alter to 4*8
  14       BNEZ   R1,Loop
  15       NOP

• Each L.D is followed by a 1 cycle stall and each ADD.D by a 2 cycle stall
• 15 + 4 x (1+2) = 27 clock cycles, or 6.8 per iteration
• Assumes R1 is multiple of 4
• Rewrite loop to minimize stalls?

Unrolled Loop Detail
• Do not usually know upper bound of loop
• Suppose it is n, and we would like to unroll the loop to make k copies of the body
• Instead of a single unrolled loop, we generate a pair of consecutive loops:
  – 1st executes (n mod k) times and has a body that is the original loop
  – 2nd is the unrolled body surrounded by an outer loop that iterates (n/k) times
  – For large values of n, most of the execution time will be spent in the unrolled loop

Unrolled Loop That Minimizes Stalls
• What assumptions were made when code was moved?
  – OK to move store past DSUBUI even though it changes the register
  – OK to move loads before stores: get right data?
  – When is it safe for the compiler to make such changes?

  1 Loop: L.D    F0,0(R1)
  2       L.D    F6,-8(R1)
  3       L.D    F10,-16(R1)
  4       L.D    F14,-24(R1)
  5       ADD.D
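
The clock counts quoted on the stall slides (9 clocks for the original loop, 6 for the revised loop, 27 for the straightforward four-way unrolling) can be reproduced with a small stall counter. The sketch below is only an illustration, not part of the lecture. It assumes a single-issue, in-order pipeline, takes its stall values from the latency table above, and assumes zero stalls for any producer/consumer pair the table does not list. The names `STALLS`, `count_clocks`, `ORIGINAL_LOOP`, `REVISED_LOOP`, and `unrolled_loop`, and the register names `L0`, `A0`, ... used for the unrolled copies, are hypothetical.

```python
# Stall cycles between a producer and a dependent consumer, copied from the
# slides' latency table. Pairs not listed there are ASSUMED to need no stall.
STALLS = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "fp_alu"): 1,
    ("load", "store"): 0,
    ("int", "int"): 0,
}


def count_clocks(program):
    """Return the clock in which the last instruction of `program` issues.

    `program` is a list of (kind, dest, sources) tuples in program order.
    The model is single-issue and in-order (an assumption of this sketch).
    """
    produced = {}  # register -> (issue clock, kind) of its latest producer
    clock = 0
    for kind, dest, sources in program:
        issue = clock + 1  # earliest slot after the previous instruction
        for reg in sources:
            if reg in produced:
                prod_clock, prod_kind = produced[reg]
                stall = STALLS.get((prod_kind, kind), 0)
                issue = max(issue, prod_clock + 1 + stall)
        if dest is not None:
            produced[dest] = (issue, kind)
        clock = issue
    return clock


ORIGINAL_LOOP = [
    ("load", "F0", ["R1"]),          # L.D    F0,0(R1)
    ("fp_alu", "F4", ["F0", "F2"]),  # ADD.D  F4,F0,F2
    ("store", None, ["F4", "R1"]),   # S.D    0(R1),F4
    ("int", "R1", ["R1"]),           # DSUBUI R1,R1,8
    ("branch", None, ["R1"]),        # BNEZ   R1,Loop
    ("nop", None, []),               # delayed branch slot
]

REVISED_LOOP = [
    ("load", "F0", ["R1"]),          # L.D    F0,0(R1)
    ("fp_alu", "F4", ["F0", "F2"]),  # ADD.D  F4,F0,F2
    ("int", "R1", ["R1"]),           # DSUBUI R1,R1,8
    ("branch", None, ["R1"]),        # BNEZ   R1,Loop
    ("store", None, ["F4", "R1"]),   # S.D    8(R1),F4 in the delay slot
]


def unrolled_loop(copies):
    """Build the straightforward unrolling: `copies` load/add/store groups,
    each with its own (hypothetical) registers, then one DSUBUI/BNEZ/NOP."""
    program = []
    for i in range(copies):
        loaded, summed = f"L{i}", f"A{i}"
        program.append(("load", loaded, ["R1"]))
        program.append(("fp_alu", summed, [loaded, "F2"]))
        program.append(("store", None, [summed, "R1"]))
    program.append(("int", "R1", ["R1"]))
    program.append(("branch", None, ["R1"]))
    program.append(("nop", None, []))
    return program
```

Under these assumptions, `count_clocks(ORIGINAL_LOOP)` gives 9, `count_clocks(REVISED_LOOP)` gives 6, and `count_clocks(unrolled_loop(4))` gives 27, which is 6.75 clocks per original iteration and matches the roughly 6.8 quoted on the slide.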

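The two-loop scheme from the "Unrolled Loop Detail" slide can be sketched at source level. This is a minimal illustration with k fixed at 4, not the compiler's actual output: the function name `add_scalar_unrolled` is hypothetical, and the sketch walks the vector upward from index 0, whereas the running example counts down from 1000.

```python
def add_scalar_unrolled(x, s):
    """Add s to every element of list x in place and return x.

    Uses a remainder loop followed by a loop unrolled four times (k = 4).
    """
    n = len(x)
    i = 0
    # 1st loop: runs (n mod 4) times with the original, non-unrolled body.
    for _ in range(n % 4):
        x[i] = x[i] + s
        i += 1
    # 2nd loop: runs (n // 4) times; each pass holds four copies of the body
    # and pays the loop overhead (index update and branch) only once.
    for _ in range(n // 4):
        x[i] = x[i] + s
        x[i + 1] = x[i + 1] + s
        x[i + 2] = x[i + 2] + s
        x[i + 3] = x[i + 3] + s
        i += 4
    return x
```

For a 10-element vector the first loop runs twice and the second loop runs twice, so all 10 elements are updated while the loop overhead in the unrolled part is paid only twice.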
