TAMU CSCE 614 - Lec19-static1 - D1942286

School name Texas A&M University

Course Csce 614- Computer Architecture

Pages 51

This preview shows page 1-2-3-24-25-26-27-49-50-51 out of 51 pages.

CPSC 614: Graduate Computer Architecture
Intro to Static Pipelining
Prof. Lawrence Rauchwerger

Review: Dynamic Examples
Overview
Static Branch Prediction
Running Example
FP Loop: Where are the Hazards?
FP Loop Showing Stalls
Revised FP Loop Minimizing Stalls
Unroll Loop Four Times (straightforward way)
Unrolled Loop Detail
Unrolled Loop That Minimizes Stalls
Compiler Perspectives on Code Movement
Where are the name dependencies?
Steps Compiler Performed to Unroll
Another possibility: Software Pipelining
Software Pipelining Example
When Safe to Unroll Loop?
Does a loop-carried dependence mean there is no parallelism???
Hardware Support for Exposing More Parallelism at Compile-Time
Exception Behavior Support
Hardware Support for Memory Reference Speculation
What if Can Change Instruction Set?
VLIW: Very Large Instruction Word
Recall: Unrolled Loop that Minimizes Stalls for Scalar
Loop Unrolling in VLIW
Recall: Software Pipelining
Recall: Software Pipelining Example
Software Pipelining with Loop Unrolling in VLIW
Trace Scheduling
Advantages of HW (Tomasulo) vs. SW (VLIW) Speculation
Superscalar v. VLIW
Problems with First Generation VLIW
Intel/HP IA-64 “Explicitly Parallel Instruction Computer (EPIC)”
IA-64 Registers
5 Types of Execution in Bundle
Itanium™ Processor Silicon (Copyright: Intel at Hotchips ’00)
Itanium™ Machine Characteristics (Copyright: Intel at Hotchips ’00)
10 Stage In-Order Core Pipeline (Copyright: Intel at Hotchips ’00)
Itanium processor 10-stage pipeline
Comments on Itanium
Performance of IA-64 Itanium?
Summary #1: Hardware versus Software Speculation Mechanisms
Summary #2: Hardware versus Software Speculation Mechanisms cont’d
Summary #3: Software Scheduling

Based on lectures by Prof. David A.
Patterson, UC Berkeley

Review: Dynamic Examples
• P6 (Pentium Pro, II, III): successful micro-architecture, even with imitator (AMD Athlon)
  – Translate most 80x86 instructions to micro-operations
    » Longer pipeline than RISC instructions
  – Dynamically execute micro-operations
• “Netburst” (Pentium 4, …): success not clear
  – Much longer pipeline, higher clock rate in same technology as P6
  – Trace Cache to capture micro-operations, avoid hardware translation
• Multithreading to increase performance for servers and for parallel programs written to use threads
  – Extra copies of PCs, registers per thread; e.g., IBM AS/400
• Simultaneous Multithreading (SMT): exploit underutilized Dynamic Execution HW to get higher throughput at low extra cost?

Overview
• Last 3 lectures: binary compatibility and exploiting ILP in hardware: BTB, ROB, Reservation Stations, ...
• How far can you go in the compiler?
• What if you can also change the instruction set architecture?
• Will see multi-billion-dollar gamble by two Bay Area firms for the future of computer architecture: HP and Intel to produce IA-64
  – 7 years in the making?

Static Branch Prediction
• Simplest: predict taken
  – Average misprediction rate = untaken branch frequency, which for the SPEC programs is 34%
  – Unfortunately, the misprediction rate ranges from not very accurate (59%) to highly accurate (9%)
• Predict on the basis of branch direction?
  – Choosing backward-going branches to be taken (loop)
  – Forward-going branches to be not taken (if)
  – In the SPEC programs, however, most forward-going branches are taken => predict taken is better
• Predict branches on the basis of profile information collected from earlier runs
  – Misprediction varies from 5% to 22%

Running Example
• This code adds a scalar to a vector:
      for (i=1000; i>0; i=i-1)
          x[i] = x[i] + s;
• Assume the following latencies for all examples:

      Instruction         Instruction          Execution    Latency
      producing result    using result         in cycles    in cycles
      FP ALU op           Another FP ALU op    4            3
      FP ALU op           Store double         3            2
      Load double         FP ALU op            1            1
      Load double         Store double         1            0
      Integer op          Integer op           1            0

FP Loop: Where are the Hazards?
• First translate into MIPS code:
  - To simplify, assume 8 is lowest address

      Loop:  L.D     F0,0(R1)    ;F0=vector element
             ADD.D   F4,F0,F2    ;add scalar from F2
             S.D     0(R1),F4    ;store result
             DSUBUI  R1,R1,8     ;decrement pointer 8B (DW)
             BNEZ    R1,Loop     ;branch R1!=zero
             NOP                 ;delayed branch slot

• Where are the stalls?

FP Loop Showing Stalls

       1 Loop:  L.D     F0,0(R1)    ;F0=vector element
       2        stall
       3        ADD.D   F4,F0,F2    ;add scalar in F2
       4        stall
       5        stall
       6        S.D     0(R1),F4    ;store result
       7        DSUBUI  R1,R1,8     ;decrement pointer 8B (DW)
       8        BNEZ    R1,Loop     ;branch R1!=zero
       9        stall               ;delayed branch slot

      Instruction         Instruction          Latency in
      producing result    using result         clock cycles
      FP ALU op           Another FP ALU op    3
      FP ALU op           Store double         2
      Load double         FP ALU op            1

• 9 clocks: rewrite code to minimize stalls?

Revised FP Loop Minimizing Stalls

       1 Loop:  L.D     F0,0(R1)
       2        stall
       3        ADD.D   F4,F0,F2
       4        DSUBUI  R1,R1,8
       5        BNEZ    R1,Loop     ;delayed branch
       6        S.D     8(R1),F4    ;altered when moved past DSUBUI

• Swap BNEZ and S.D by changing address of S.D
• 6 clocks, but just 3 for execution and 3 for loop overhead. How to make it faster?

Unroll Loop Four Times (straightforward way)
• Rewrite loop to minimize stalls?

       1 Loop:  L.D     F0,0(R1)
       2        ADD.D   F4,F0,F2
       3        S.D     0(R1),F4
                                    ;drop DSUBUI & BNEZ
       4        L.D     F6,-8(R1)
       5        ADD.D   F8,F6,F2
       6        S.D     -8(R1),F8   ;drop DSUBUI & BNEZ
       7        L.D     F10,-16(R1)
       8        ADD.D   F12,F10,F2
       9        S.D     -16(R1),F12 ;drop DSUBUI & BNEZ
      10        L.D     F14,-24(R1)
      11        ADD.D   F16,F14,F2
      12        S.D     -24(R1),F16
      13        DSUBUI  R1,R1,#32   ;alter to 4*8
      14        BNEZ    R1,Loop
      15        NOP

• Each L.D is followed by a 1-cycle stall, each ADD.D by a 2-cycle stall
• 15 + 4 x (1+2) = 27 clock cycles, or 6.8 per iteration
• Assumes the iteration count is a multiple of 4 (so R1 is a multiple of 32)

Unrolled Loop Detail
• Do not usually know upper bound of loop
• Suppose it is n, and we would like to unroll the loop to make k copies of the body
• Instead of a single unrolled loop, we generate a pair of consecutive loops:
  – 1st executes (n mod k) times and has a body that is the original loop
  – 2nd is the unrolled body surrounded by an outer loop that iterates (n/k) times
  – For large values of n, most of the execution time will be spent in the unrolled loop

Unrolled Loop That Minimizes Stalls
• What assumptions were made when code was moved?
  – OK to move store past DSUBUI even though it changes the register
  – OK to move loads before stores: get right data?
  – When is it safe for the compiler to make such changes?

       1 Loop:  L.D     F0,0(R1)
       2        L.D     F6,-8(R1)
       3        L.D     F10,-16(R1)
       4        L.D     F14,-24(R1)
       5        ADD.D
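
The two-loop scheme from "Unrolled Loop Detail" can be sketched in a few lines. This is only an illustration in Python, not code from the lecture: the function name `add_scalar_unrolled` is hypothetical, the unroll factor is fixed at k = 4, and the list is walked from the highest index downward to mirror the running example.

```python
def add_scalar_unrolled(x, s):
    """Add s to every element of x in place, unrolling the loop body 4 times."""
    k = 4                 # unroll factor (assumed; the slides call it k)
    n = len(x)            # trip count of the original loop (the slides' n)
    i = n - 1             # start at the last element, as the running example does

    # 1st loop: (n mod k) iterations of the original, non-unrolled body.
    for _ in range(n % k):
        x[i] = x[i] + s
        i -= 1

    # 2nd loop: (n / k) iterations, each containing k copies of the body.
    # Only one index update per four elements, which is the overhead saved.
    for _ in range(n // k):
        x[i] = x[i] + s
        x[i - 1] = x[i - 1] + s
        x[i - 2] = x[i - 2] + s
        x[i - 3] = x[i - 3] + s
        i -= k

    return x


print(add_scalar_unrolled([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 10.0))
```

With six elements the first loop handles 6 mod 4 = 2 of them and the unrolled loop runs once for the remaining four. For a large n nearly all the work lands in the second loop, which is the point the slide makes.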

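The clock counts quoted on the stall slides (9, 6, and 27) can be checked with a small issue-time model. The sketch below is an illustration in Python rather than anything from the lecture. It assumes a single-issue, in-order pipeline whose only stalls are those in the latency table, and it assumes that a producer/consumer pair missing from the table (such as an integer op feeding a branch) needs no stall, as the 9-clock listing implies. `STALLS`, `count_clocks`, and the instruction tuples are hypothetical names.

```python
# Stall cycles between a producing and a consuming instruction, from the
# latency table in the slides. Pairs not listed are assumed to need no stall.
STALLS = {
    ("fpalu", "fpalu"): 3,
    ("fpalu", "store"): 2,
    ("load", "fpalu"): 1,
    ("load", "store"): 0,
    ("int", "int"): 0,
}


def count_clocks(instrs):
    """Return the clock cycle in which the last instruction issues.

    Each instruction is (kind, dest, sources). An instruction issues one
    cycle after its predecessor, or later if the producer of one of its
    source registers requires stall cycles.
    """
    last_writer = {}  # register name -> (producer kind, producer issue cycle)
    cycle = 0
    for kind, dest, sources in instrs:
        issue = cycle + 1
        for reg in sources:
            if reg in last_writer:
                prod_kind, prod_issue = last_writer[reg]
                stalls = STALLS.get((prod_kind, kind), 0)
                issue = max(issue, prod_issue + 1 + stalls)
        if dest is not None:
            last_writer[dest] = (kind, issue)
        cycle = issue
    return cycle


# Original loop, with a NOP in the branch delay slot.
ORIGINAL = [
    ("load", "F0", ["R1"]),          # L.D    F0,0(R1)
    ("fpalu", "F4", ["F0", "F2"]),   # ADD.D  F4,F0,F2
    ("store", None, ["R1", "F4"]),   # S.D    0(R1),F4
    ("int", "R1", ["R1"]),           # DSUBUI R1,R1,8
    ("branch", None, ["R1"]),        # BNEZ   R1,Loop
    ("nop", None, []),               # delayed branch slot
]

# Revised loop: S.D moved into the delay slot, DSUBUI moved up.
REVISED = [
    ("load", "F0", ["R1"]),
    ("fpalu", "F4", ["F0", "F2"]),
    ("int", "R1", ["R1"]),
    ("branch", None, ["R1"]),
    ("store", None, ["R1", "F4"]),   # S.D    8(R1),F4
]


def unrolled_by_four():
    """Build the straightforward four-times-unrolled loop from the slides."""
    body = []
    for loaded, summed in [("F0", "F4"), ("F6", "F8"), ("F10", "F12"), ("F14", "F16")]:
        body.append(("load", loaded, ["R1"]))
        body.append(("fpalu", summed, [loaded, "F2"]))
        body.append(("store", None, ["R1", summed]))
    body.append(("int", "R1", ["R1"]))
    body.append(("branch", None, ["R1"]))
    body.append(("nop", None, []))
    return body


print(count_clocks(ORIGINAL))            # 9
print(count_clocks(REVISED))             # 6
print(count_clocks(unrolled_by_four()))  # 27, i.e. 27 / 4 = 6.75 per iteration
```

The model tracks only register dependences and the stall table, so it says nothing about memory aliasing or whether a given reordering is legal. Those are the questions the "What assumptions were made" bullets raise.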