CS 152 Computer Architecture and Engineering Lecture 16 VLIW Machines and Statically Scheduled ILP Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http www eecs berkeley edu krste http inst eecs berkeley edu cs152 Last time in Lecture 15 Unified physical register file machines remove data values from ROB All values only read and written during execution Only register tags held in ROB Allocate resources ROB slot destination physical register memory reorder queue location during decode Free resources on commit Speculative store buffer holds store values before commit to allow load store forwarding Can execute later loads past earlier stores when addresses known or predicted no dependence 4 3 2008 CS152 Spring 08 2 Little s Law Parallelism Throughput Latency or N T L Throughput per Cycle One Operation Latency in Cycles 4 3 2008 CS152 Spring 08 3 Example Pipelined ILP Machine Max Throughput Six Instructions per Cycle One Pipeline Stage Latency Two Integer Single Cycle in Cycles Units Latency Two Load Store Units Three Cycle Latency Two Floating Point Units Four Cycle Latency How much instruction level parallelism ILP required to keep machine pipelines busy 4 3 2008 CS152 Spring 08 4 Superscalar Control Logic Scaling Issue Width W Issue Group Previously Issued Instruction s Lifetime L Each issued instruction must check against W L instructions i e growth in hardware W W L For in order machines L is related to pipeline latencies For out of order machines L also includes time spent in instruction buffers instruction window or ROB As W increases larger instruction window is needed to find enough parallelism to keep machine busy greater L Out of order control logic grows faster than W2 W3 4 3 2008 CS152 Spring 08 5 Out of Order Control Complexity MIPS R10000 Contro l Logic SGI MIPS Technologies Inc 1995 4 3 2008 CS152 Spring 08 6 Sequential ISA Bottleneck Sequential source code a foo b Superscalar compiler Sequential machine code for i 0 i Find independent operations Schedule operations Superscalar processor 4 3 2008 Schedule Check execution instruction dependencies CS152 Spring 08 7 VLIW Very Long Instruction Word Int Op 1 Int Op 2 Mem Op 1 Mem Op 2 FP Op 1 FP Op 2 Two Integer Units Single Cycle Latency Two Load Store Units Three Cycle LatencyTwo Floating Point Units Four Cycle Latency Multiple operations packed into one instruction Each operation slot is for a fixed function Constant operation latencies are specified Architecture requires guarantee of Parallelism within an instruction no cross operation RAW check No data use before data ready no data interlocks 4 3 2008 CS152 Spring 08 8 VLIW Compiler Responsibilities Schedules to maximize parallel execution Guarantees intra instruction parallelism Schedules to avoid data hazards no interlocks Typically separates operations with explicit NOPs 4 3 2008 CS152 Spring 08 9 Early VLIW Machines FPS AP120B 1976 scientific attached array processor first commercial wide instruction machine hand coded vector math libraries using software pipelining and loop unrolling Multiflow Trace 1987 commercialization of ideas from Fisher s Yale group including trace scheduling available in configurations with 7 14 or 28 operations instruction 28 operations packed into a 1024 bit instruction word Cydrome Cydra 5 1987 7 operations encoded in 256 bit instruction word rotating register file 4 3 2008 CS152 Spring 08 10 Loop Execution for i 0 i N i Int1 B i A i C loop ld Compile f1 0 r1 loop Int 2 add r1 M1 M2 FP FPx ld add r1 8 fadd fadd f2 Schedule f0 f1 sd f2 0 r2 add r2 8 add r2 bne sd bne r1 r3 loop How many FP ops cycle 4 3 2008 CS152 Spring 08 11 Loop Unrolling for i 0 i N i B i A i C Unroll inner loop to for i 0 i N i 4 perform 4 iterations at once B i A i C B i 1 A i 1 C B i 2 A i 2 C B i 3 A i 3 C Need to handle values of N that are not multiples of unrolling factor with final cleanup loop 4 3 2008 CS152 Spring 08 12 Scheduling Loop Unrolled Code Unroll 4 ways loop ld f1 0 r1 ld f2 8 r1 ld f3 16 r1 ld f4 24 r1 add r1 32 fadd f5 f0 f1 fadd f6 f0 f2 fadd f7 f0 f3 fadd f8 f0 f4 sd f5 0 r2 sd f6 8 r2 sd f7 16 r2 sd f8 24 r2 add r2 32 bne r1 r3 loop Int1 Int 2 loop Schedul e add r1 M1 ld f1 ld f2 ld f3 ld f4 M2 FP FPx fadd f5 fadd f6 fadd f7 fadd f8 sd f5 sd f6 sd f7 add r2 bne sd f8 How many FLOPS cycle 4 3 2008 CS152 Spring 08 13 Software Pipelining Int1 Unroll 4 ways first loop ld f1 0 r1 ld f2 8 r1 ld f3 16 r1 ld f4 24 r1 add r1 32 fadd f5 f0 f1 fadd f6 f0 f2 fadd f7 f0 f3 fadd f8 f0 f4 sd f5 0 r2 sd f6 8 r2 sd f7 16 r2 add r2 32 sd f8 8 r2 bne r1 r3 loop M1 ld f1 ld f2 ld f3 add r1 ld f4 prolog ld f1 ld f2 ld f3 add r1 ld f4 loop ld f1 iterate ld f2 add r2 ld f3 add r1 bne ld f4 How many FLOPS cycle 4 3 2008 Int 2 epilog add r2 bne CS152 Spring 08 M2 sd f5 sd f6 sd f7 sd f8 sd f5 sd f6 sd f7 sd f8 sd f5 FP FPx fadd f5 fadd f6 fadd f7 fadd f8 fadd f5 fadd f6 fadd f7 fadd f8 fadd f5 fadd f6 fadd f7 fadd f8 14 Software Pipelining vs Loop Unrolling Loop Unrolled Wind down overhead performance Startup overhead Loop Iteration time Software Pipelined performance Loop Iteration time Software pipelining pays startup winddown costs only once per loop not once per iteration 4 3 2008 CS152 Spring 08 15 What if there are no loops Basic block 4 3 2008 Branches limit basic block size in control flow intensive irregular code Difficult to find ILP in individual basic blocks CS152 Spring 08 16 Trace Scheduling Fisher Ellis Pick string of basic blocks a trace that represents most frequent branch path Use profiling feedback or compiler heuristics to find common branch paths Schedule whole trace at once Add fixup code to cope with branches jumping out of trace 4 3 2008 CS152 Spring 08 17 Problems with Classic VLIW Object code compatibility have to recompile all code for every machine even for two machines in same generation Object code size instruction padding wastes instruction memory cache loop unrolling software pipelining replicates code Scheduling variable latency memory operations caches and or memory bank conflicts impose statically unpredictable variability Knowing branch probabilities Profiling requires an significant extra step in build process Scheduling for statically unpredictable branches optimal schedule varies with branch path 4 3 2008 CS152 Spring 08 18 VLIW Instruction Encoding Group 1 Group 2 Group 3 Schemes …
View Full Document
Unlocking...