Berkeley COMPSCI 152 - Lecture 16 - VLIW Machines and Statically Scheduled ILP

CS 152 Computer Architecture and Engineering
Lecture 16 - VLIW Machines and Statically Scheduled ILP
Krste Asanovic
Electrical Engineering and Computer Sciences
University of California at Berkeley
http://www.eecs.berkeley.edu/~krste
http://inst.eecs.berkeley.edu/~cs152
4/3/2008, CS152 Spring '08

Last time in Lecture 15
• Unified physical register file machines remove data values from ROB
– All values only read and written during execution
– Only register tags held in ROB
– Allocate resources (ROB slot, destination physical register, memory reorder queue location) during decode
– Free resources on commit
• Speculative store buffer holds store values before commit to allow load-store forwarding
• Can execute later loads past earlier stores when addresses known, or predicted no dependence

Little's Law
Parallelism = Throughput * Latency
or
$N = T \times L$
[Figure labels: Latency in Cycles, Throughput per Cycle, One Operation]

Example Pipelined ILP Machine
• How much instruction-level parallelism (ILP) required to keep machine pipelines busy?
[Figure: pipeline diagram, one box per pipeline stage]
– Two Integer Units, Single Cycle Latency
– Two Load/Store Units, Three Cycle Latency
– Two Floating-Point Units, Four Cycle Latency
– Max Throughput, Six Instructions per Cycle
$T = 6$
$L = \frac{(2 \times 1) + (2 \times 3) + (2 \times 4)}{6} = 2\tfrac{2}{3}$
$N = 6 \times 2\tfrac{2}{3} = 16$

Superscalar Control Logic Scaling
• Each issued instruction must check against W*L instructions, i.e., growth in hardware ∝ W*(W*L)
• For in-order machines, L is related to pipeline latencies
• For out-of-order machines, L also includes time spent in instruction buffers (instruction window or ROB)
• As W increases, larger instruction window is needed to find enough parallelism to keep machine busy => greater L
=> Out-of-order control logic grows faster than W^2 (~W^3)
[Figure labels: Lifetime L, Issue Group, Previously Issued Instructions, Issue Width W]

Out-of-Order Control Complexity: MIPS R10000
[Figure: Control Logic; SGI/MIPS Technologies Inc., 1995]

Sequential ISA Bottleneck
[Figure: flow from source code to execution]
– Sequential source code: a = foo(b); for (i=0, i<
– Superscalar compiler: Find independent operations, Schedule operations
– Sequential machine code
– Superscalar processor: Check instruction dependencies, Schedule execution

VLIW: Very Long Instruction Word
Instruction format: Int Op 1 | Int Op 2 | Mem Op 1 | Mem Op 2 | FP Op 1 | FP Op 2
– Two Integer Units, Single Cycle Latency
– Two Load/Store Units, Three Cycle Latency
– Two Floating-Point Units, Four Cycle Latency
• Multiple operations packed into one instruction
• Each operation slot is for a fixed function
• Constant operation latencies are specified
• Architecture requires guarantee of:
– Parallelism within an instruction => no cross-operation RAW check
– No data use before data ready => no data interlocks

VLIW Compiler Responsibilities
• Schedules to maximize parallel execution
• Guarantees intra-instruction parallelism
• Schedules to avoid data hazards (no interlocks)
– Typically separates operations with explicit NOPs

Early VLIW Machines
• FPS AP120B (1976)
– scientific attached array processor
– first commercial wide instruction machine
– hand-coded vector math libraries using software pipelining and loop unrolling
• Multiflow Trace (1987)
– commercialization of ideas from Fisher's Yale group including "trace scheduling"
– available in configurations with 7, 14, or 28 operations/instruction
– 28 operations packed into a 1024-bit instruction word
• Cydrome Cydra-5 (1987)
– 7 operations encoded in 256-bit instruction word
– rotating register file

Loop Execution
    for (i=0; i<N; i++)
        B[i] = A[i] + C;
Compile:
    loop: ld   f1, 0(r1)
          add  r1, 8
          fadd f2, f0, f1
          sd   f2, 0(r2)
          add  r2, 8
          bne  r1, r3, loop
Schedule (slots Int1, Int2, M1, M2, FP+, FPx):
    loop: ld, add r1, fadd, sd, add r2, bne
How many FP ops/cycle?
1 fadd / 8 cycles = 0.125

Loop Unrolling
    for (i=0; i<N; i++)
        B[i] = A[i] + C;
Unroll inner loop to perform 4 iterations at once:
    for (i=0; i<N; i+=4)
    {
        B[i]   = A[i]   + C;
        B[i+1] = A[i+1] + C;
        B[i+2] = A[i+2] + C;
        B[i+3] = A[i+3] + C;
    }
Need to handle values of N that are not multiples of unrolling factor with final cleanup loop

Scheduling Loop Unrolled Code
Unroll 4 ways:
    loop: ld   f1, 0(r1)
          ld   f2, 8(r1)
          ld   f3, 16(r1)
          ld   f4, 24(r1)
          add  r1, 32
          fadd f5, f0, f1
          fadd f6, f0, f2
          fadd f7, f0, f3
          fadd f8, f0, f4
          sd   f5, 0(r2)
          sd   f6, 8(r2)
          sd   f7, 16(r2)
          sd   f8, 24(r2)
          add  r2, 32
          bne  r1, r3, loop
Schedule (slots Int1, Int2, M1, M2, FP+, FPx):
    loop: ld f1, ld f2, ld f3, ld f4, add r1, fadd f5, fadd f6, fadd f7, fadd f8, sd f5, sd f6, sd f7, sd f8, add r2, bne
How many FLOPS/cycle?
4 fadds / 11 cycles = 0.36

Software Pipelining
Unroll 4 ways first:
    loop: ld   f1, 0(r1)
          ld   f2, 8(r1)
          ld   f3, 16(r1)
          ld   f4, 24(r1)
          add  r1, 32
          fadd f5, f0, f1
          fadd f6, f0, f2
          fadd f7, f0, f3
          fadd f8, f0, f4
          sd   f5, 0(r2)
          sd   f6, 8(r2)
          sd   f7, 16(r2)
          add  r2, 32
          sd   f8, -8(r2)
          bne  r1, r3, loop
Schedule (slots Int1, Int2, M1, M2, FP+, FPx):
[Figure: schedule split into prolog, loop (iterate), and epilog; the loads, fadds, and stores of successive unrolled iterations overlap]
How many FLOPS/cycle?
4 fadds / 4 cycles = 1

Software Pipelining vs. Loop Unrolling
[Figure: two plots of performance against time, labeled "Loop Unrolled" and "Software Pipelined", marking Startup overhead, Wind-down overhead, and Loop Iteration]
Software pipelining pays startup/wind-down costs only once per loop, not once per iteration

What if there are no loops?
• Branches limit basic block size in control-flow intensive irregular code
• Difficult to find ILP in individual basic blocks
[Figure label: Basic block]

Trace Scheduling [Fisher, Ellis]
• Pick string of basic blocks, a trace, that represents most frequent branch path
• Use profiling feedback or compiler heuristics to find common branch paths
• Schedule whole "trace" at once
• Add fixup code to cope with branches jumping out of trace

Problems with "Classic" VLIW
• Object-code compatibility
– have to recompile all code for every machine, even for two machines in same generation
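
The loop unrolling slide notes that values of N that are not multiples of the unrolling factor need a final cleanup loop. A minimal C sketch of that pattern follows. The function name `add_const_unrolled`, the `double` element type, and the `size_t` length are assumptions made for illustration; the lecture shows only the loop bodies.

```c
#include <stddef.h>

/* Hypothetical helper (name and types are assumptions, not from the lecture).
   Computes B[i] = A[i] + C for every i in [0, n), unrolled 4 ways. */
void add_const_unrolled(double *B, const double *A, double C, size_t n)
{
    size_t i = 0;

    /* Main unrolled body: runs only while a full group of 4 remains,
       so it never touches elements past n. The four statements are
       independent of one another, which is what gives a static
       scheduler room to overlap their loads, adds, and stores. */
    for (; i + 4 <= n; i += 4) {
        B[i]     = A[i]     + C;
        B[i + 1] = A[i + 1] + C;
        B[i + 2] = A[i + 2] + C;
        B[i + 3] = A[i + 3] + C;
    }

    /* Cleanup loop: handles the remaining n % 4 elements one at a time. */
    for (; i < n; i++) {
        B[i] = A[i] + C;
    }
}
```

Writing the main loop condition as `i + 4 <= n` rather than `i < n` is one way to keep the unrolled body in bounds when n is not a multiple of 4; the cleanup loop then picks up at most three leftover elements.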

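The Little's Law arithmetic for the example pipelined machine can be checked with a few lines of C. This is a sketch under stated assumptions: the `unit_class` struct and both function names are hypothetical, and each functional unit is assumed to accept one operation per cycle, as on the slide.

```c
#include <stddef.h>

/* Hypothetical description of one class of functional units
   (struct and field names are assumptions, not from the lecture). */
struct unit_class {
    int count;          /* number of units, each accepting one op per cycle */
    int latency_cycles; /* latency of each operation, in cycles */
};

/* Peak throughput T: operations that can be issued per cycle. */
int peak_throughput(const struct unit_class *u, size_t n)
{
    int t = 0;
    for (size_t i = 0; i < n; i++) {
        t += u[i].count;
    }
    return t;
}

/* Required parallelism N = T * L. Since L is the average latency
   weighted by unit count, T * L reduces to the sum of
   count * latency, which avoids any fractional arithmetic. */
int ops_in_flight(const struct unit_class *u, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; i++) {
        total += u[i].count * u[i].latency_cycles;
    }
    return total;
}
```

For the machine on the slide, described as `{2, 1}`, `{2, 3}`, `{2, 4}`, this gives T = 6 and N = 2 + 6 + 8 = 16, so the average latency is L = 16/6 = 2 2/3 cycles: sixteen independent operations must be in flight to keep every pipeline busy.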
