Berkeley COMPSCI 152 - VLIW Machines and Statically Scheduled ILP



CS 152 Computer Architecture and Engineering
Lecture 16 - VLIW Machines and Statically Scheduled ILP

Krste Asanovic
Electrical Engineering and Computer Sciences
University of California at Berkeley
http://www.eecs.berkeley.edu/~krste
http://inst.eecs.berkeley.edu/~cs152

4/3/2008 CS152-Spring’08

Outline
•Last time in Lecture 15
•Little’s Law
•Example Pipelined ILP Machine
•Superscalar Control Logic Scaling
•Out-of-Order Control Complexity: MIPS R10000
•Sequential ISA Bottleneck
•VLIW: Very Long Instruction Word
•VLIW Compiler Responsibilities
•Early VLIW Machines
•Loop Execution
•Loop Unrolling
•Scheduling Loop Unrolled Code
•Software Pipelining
•Software Pipelining vs. Loop Unrolling
•What if there are no loops?
•Trace Scheduling [Fisher, Ellis]
•Problems with “Classic” VLIW
•VLIW Instruction Encoding
•Rotating Register Files
•Rotating Register File
•Rotating Register File (Previous Loop Example)
•Cydra-5: Memory Latency Register (MLR)
•CS152 Administrivia
•Intel EPIC IA-64
•Quad Core Itanium “Tukwila” [Intel 2008]
•IA-64 Instruction Format
•IA-64 Registers
•IA-64 Predicated Execution
•Predicate Software Pipeline Stages
•Fully Bypassed Datapath
•IA-64 Speculative Execution
•IA-64 Data Speculation
•Clustered VLIW
•Limits of Static Scheduling
•Acknowledgements

Last time in Lecture 15
•Unified physical register file machines remove data values from ROB
–All values only read and written during execution
–Only register tags held in ROB
–Allocate resources (ROB slot, destination physical register, memory reorder queue location) during decode
–Free resources on commit
•Speculative store buffer holds store values before commit to allow load-store forwarding
•Can execute later loads past earlier stores when addresses known, or predicted no dependence

Little’s Law
Parallelism = Throughput * Latency
or
N = T × L
[Figure: operations in flight; axes "Latency in Cycles" and "Throughput per Cycle"; each box is "One Operation"]

Example Pipelined ILP Machine
•How much instruction-level parallelism (ILP) required to keep machine pipelines busy?
[Figure: pipeline diagram; each box is "One Pipeline Stage"; horizontal axis "Latency in Cycles"]
–Two Integer Units, Single Cycle Latency
–Two Load/Store Units, Three Cycle Latency
–Two Floating-Point Units, Four Cycle Latency
–Max Throughput, Six Instructions per Cycle

Superscalar Control Logic Scaling
•Each issued instruction must check against W*L instructions, i.e., growth in hardware ∝ W*(W*L)
•For in-order machines, L is related to pipeline latencies
•For out-of-order machines, L also includes time spent in instruction buffers (instruction window or ROB)
•As W increases, larger instruction window is needed to find enough parallelism to keep machine busy => greater L
=> Out-of-order control logic grows faster than W^2 (~W^3)
[Figure: an issue group of Issue Width W checked against Previously Issued Instructions over Lifetime L]

Out-of-Order Control Complexity: MIPS R10000
[Figure: MIPS R10000 die photo with the Control Logic area marked; SGI/MIPS Technologies Inc., 1995]

Sequential ISA Bottleneck
[Figure: Sequential source code (a = foo(b); for (i=0, i<) => Superscalar compiler (Find independent operations, Schedule operations) => Sequential machine code => Superscalar processor (Check instruction dependencies, Schedule execution)]

VLIW: Very Long Instruction Word
[Figure: one instruction with six operation slots: Int Op 1 | Int Op 2 | Mem Op 1 | Mem Op 2 | FP Op 1 | FP Op 2]
–Two Integer Units, Single Cycle Latency
–Two Load/Store Units, Three Cycle Latency
–Two Floating-Point Units, Four Cycle Latency
•Multiple operations packed into one instruction
•Each operation slot is for a fixed function
•Constant operation latencies are specified
•Architecture requires guarantee of:
–Parallelism within an instruction => no cross-operation RAW check
–No data use before data ready => no data interlocks

VLIW Compiler Responsibilities
•Schedules to maximize parallel execution
•Guarantees intra-instruction parallelism
•Schedules to avoid data hazards (no interlocks)
–Typically separates operations with explicit NOPs

Early VLIW Machines
•FPS AP120B (1976)
–scientific attached array processor
–first commercial wide instruction machine
–hand-coded vector math libraries using software pipelining and loop unrolling
•Multiflow Trace (1987)
–commercialization of ideas from Fisher’s Yale group including “trace scheduling”
–available in configurations with 7, 14, or 28 operations/instruction
–28 operations packed into a 1024-bit instruction word
•Cydrome Cydra-5 (1987)
–7 operations encoded in 256-bit instruction word
–rotating register file

Loop Execution
    for (i=0; i<N; i++)
        B[i] = A[i] + C;
Compile:
    loop: ld   f1, 0(r1)
          add  r1, 8
          fadd f2, f0, f1
          sd   f2, 0(r2)
          add  r2, 8
          bne  r1, r3, loop
Schedule:
[Figure: schedule table with columns Int1, Int 2, M1, M2, FP+, FPx; operations placed from loop: are ld, add r1, fadd, sd, add r2, bne]
How many FP ops/cycle?

Loop Unrolling
    for (i=0; i<N; i++)
        B[i] = A[i] + C;
Unroll inner loop to perform 4 iterations at once:
    for (i=0; i<N; i+=4)
    {
        B[i]   = A[i] + C;
        B[i+1] = A[i+1] + C;
        B[i+2] = A[i+2] + C;
        B[i+3] = A[i+3] + C;
    }
Need to handle values of N that are not multiples of unrolling factor with final cleanup loop

Scheduling Loop Unrolled Code
Unroll 4 ways:
    loop: ld   f1, 0(r1)
          ld   f2, 8(r1)
          ld   f3, 16(r1)
          ld   f4, 24(r1)
          add  r1, 32
          fadd f5, f0, f1
          fadd f6, f0, f2
          fadd f7, f0, f3
          fadd f8, f0, f4
          sd   f5, 0(r2)
          sd   f6, 8(r2)
          sd   f7, 16(r2)
          sd   f8, 24(r2)
          add  r2, 32
          bne  r1, r3, loop
Schedule:
[Figure: schedule table with columns Int1, Int 2, M1, M2, FP+, FPx; operations placed from loop: are ld f1, ld f2, ld f3, ld f4, add r1, fadd f5, fadd f6, fadd f7, fadd f8, sd f5, sd f6, sd f7, sd f8, add r2, bne]
How many FLOPS/cycle?

Software Pipelining
Unroll 4 ways first:
    loop: ld   f1, 0(r1)
          ld   f2, 8(r1)
          ld   f3, 16(r1)
          ld   f4, 24(r1)
          add  r1, 32
          fadd f5, f0, f1
          fadd f6, f0, f2
          fadd f7, f0, f3
          fadd f8, f0, f4
          sd   f5, 0(r2)
          sd   f6, 8(r2)
          sd   f7, 16(r2)
          add  r2, 32
          sd   f8, -8(r2)
          bne  r1, r3, loop
[Figure: schedule table with columns Int1, Int 2, M1, M2, FP+, FPx; successive copies of the unrolled body (ld f1 to f4, fadd f5 to f8, sd f5 to f8, add r1, add r2, bne) overlap, with regions labeled prolog, loop: (iterate), and epilog]
How many FLOPS/cycle?

Software Pipelining vs.
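As a worked instance of Little's Law for the example pipelined machine above, the sketch below sums units × latency over the three classes of functional unit. The struct and function names are illustrative assumptions, not part of the lecture; the unit counts and latencies are the ones listed on the slide.

```c
#include <assert.h>
#include <stddef.h>

/* One class of functional unit (hypothetical type, for illustration). */
struct unit_class {
    const char *name;
    int count;    /* number of units; each can start one operation per cycle */
    int latency;  /* cycles an operation stays in flight */
};

/* Little's Law applied per class: operations in flight = throughput * latency.
   The sum is the number of independent operations needed to keep every
   pipeline stage of every unit busy. */
int required_parallelism(const struct unit_class *units, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; i++) {
        assert(units[i].count >= 0 && units[i].latency >= 1);
        total += units[i].count * units[i].latency;
    }
    return total;
}

/* The example machine from the slide. */
static const struct unit_class example_machine[] = {
    { "integer",        2, 1 },
    { "load/store",     2, 3 },
    { "floating-point", 2, 4 },
};
```

Calling required_parallelism(example_machine, 3) evaluates 2*1 + 2*3 + 2*4, so under these assumptions the machine needs 16 independent operations in flight to stay fully busy.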

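The loop unrolling slide notes that values of N that are not multiples of the unrolling factor need a final cleanup loop. A minimal C sketch of that pattern follows; the function name, the use of double, and the parameter order are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Computes b[i] = a[i] + c for i in [0, n), unrolled four ways. */
void add_constant_unrolled(double *b, const double *a, double c, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL));
    size_t i = 0;

    /* Unrolled body: runs only while a full group of four elements remains. */
    for (; i + 4 <= n; i += 4) {
        b[i]     = a[i]     + c;
        b[i + 1] = a[i + 1] + c;
        b[i + 2] = a[i + 2] + c;
        b[i + 3] = a[i + 3] + c;
    }

    /* Cleanup loop: handles the 0 to 3 leftover elements. */
    for (; i < n; i++) {
        b[i] = a[i] + c;
    }
}
```

The guard i + 4 <= n (rather than i < n, as in the slide's simplified form) keeps the unrolled body from reading past the end of the arrays when n is not a multiple of four.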

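The prolog / iterate / epilog structure on the software pipelining slide can be illustrated at the source level. The sketch below is not the slide's schedule (which first unrolls four ways and targets the six-slot VLIW machine); it assumes a simpler three-stage split of load, add, and store, so that each kernel iteration stores element k, adds for element k+1, and loads element k+2. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Computes b[i] = a[i] + c for i in [0, n) with a hand-pipelined loop. */
void add_constant_swp(double *b, const double *a, double c, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL));

    /* Too short to fill the pipeline: fall back to the plain loop. */
    if (n < 2) {
        for (size_t i = 0; i < n; i++) {
            b[i] = a[i] + c;
        }
        return;
    }

    /* Prolog: start iterations 0 and 1 without finishing either. */
    double loaded = a[0];
    double sum = loaded + c;  /* add stage of iteration 0 */
    loaded = a[1];            /* load stage of iteration 1 */

    /* Kernel: one stage from each of three different iterations. */
    size_t k = 0;
    for (; k + 2 < n; k++) {
        b[k] = sum;           /* store stage of iteration k     */
        sum = loaded + c;     /* add stage of iteration k + 1   */
        loaded = a[k + 2];    /* load stage of iteration k + 2  */
    }

    /* Epilog: drain the two iterations still in flight (k == n - 2). */
    b[k] = sum;
    b[k + 1] = loaded + c;
}
```

In the kernel the three statements belong to different iterations and do not depend on each other's results within the same pass, which is what lets a static scheduler place them in the same wide instruction.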