Berkeley COMPSCI 152 - Lecture 16 - VLIW Machines and Statically Scheduled ILP - D1743719

Home> Schools> University of California, Berkeley> Computer Science (COMPSCI) > COMPSCI 152> Lecture 16 - VLIW Machines and Statically Scheduled ILP

Berkeley COMPSCI 152 - Lecture 16 - VLIW Machines and Statically Scheduled ILP

School name University of California, Berkeley

Course Compsci 152- Computer Architecture and Engineering

Pages 18

Download Save

Unformatted text preview:

CS 152 Computer Architecture and Engineering Lecture 16 VLIW Machines and Statically Scheduled ILP Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http www eecs berkeley edu krste http inst eecs berkeley edu cs152 Last time in Lecture 15 Unified physical register file machines remove data values from ROB All values only read and written during execution Only register tags held in ROB Allocate resources ROB slot destination physical register memory reorder queue location during decode Free resources on commit Speculative store buffer holds store values before commit to allow load store forwarding Can execute later loads past earlier stores when addresses known or predicted no dependence 4 3 2008 CS152 Spring 08 2 Little s Law Parallelism Throughput Latency or N T L Throughput per Cycle One Operation Latency in Cycles 4 3 2008 3 CS152 Spring 08 Example Pipelined ILP Machine Max Throughput Six Instructions per Cycle Latency in Cycles One Pipeline Stage Two Integer Units Single Cycle Latency Two Load Store Units Three Cycle Latency Two Floating Point Units Four Cycle Latency How much instruction level parallelism ILP required to keep machine pipelines busy T 6 4 3 2008 L 2x1 2x3 2x4 2 2 6 3 CS152 Spring 08 N 6 2 2 16 3 4 Superscalar Control Logic Scaling Issue Width W Issue Group Previously Issued Instructions Lifetime L Each issued instruction must check against W L instructions i e growth in hardware W W L For in order machines L is related to pipeline latencies For out of order machines L also includes time spent in instruction buffers instruction window or ROB As W increases larger instruction window is needed to find enough parallelism to keep machine busy greater L Out of order control logic grows faster than W2 W3 4 3 2008 CS152 Spring 08 5 Out of Order Control Complexity MIPS R10000 Control Logic SGI MIPS Technologies Inc 1995 4 3 2008 CS152 Spring 08 6 Sequential ISA Bottleneck Sequential source code Superscalar compiler Sequential machine code a foo b for i 0 i Find independent operations Schedule operations Superscalar processor Check instruction dependencies 4 3 2008 Schedule execution 7 CS152 Spring 08 VLIW Very Long Instruction Word Int Op 1 Int Op 2 Mem Op 1 Mem Op 2 FP Op 1 FP Op 2 Two Integer Units Single Cycle Latency Two Load Store Units Three Cycle Latency Two Floating Point Units Four Cycle Latency Multiple operations packed into one instruction Each operation slot is for a fixed function Constant operation latencies are specified Architecture requires guarantee of Parallelism within an instruction no cross operation RAW check No data use before data ready no data interlocks 4 3 2008 CS152 Spring 08 8 VLIW Compiler Responsibilities Schedules to maximize parallel execution Guarantees intra instruction parallelism Schedules to avoid data hazards no interlocks Typically separates operations with explicit NOPs 4 3 2008 CS152 Spring 08 9 Early VLIW Machines FPS AP120B 1976 scientific attached array processor first commercial wide instruction machine hand coded vector math libraries using software pipelining and loop unrolling Multiflow Trace 1987 commercialization of ideas from Fisher s Yale group including trace scheduling available in configurations with 7 14 or 28 operations instruction 28 operations packed into a 1024 bit instruction word Cydrome Cydra 5 1987 7 operations encoded in 256 bit instruction word rotating register file 4 3 2008 CS152 Spring 08 10 Loop Execution for i 0 i N i B i A i C Compile Int1 loop Int 2 add r1 M1 M2 FP FPx ld loop ld f1 0 r1 fadd add r1 8 fadd f2 f0 f1 Schedule sd f2 0 r2 add r2 bne add r2 8 sd bne r1 r3 loop How many FP ops cycle 1 fadd 8 cycles 0 125 4 3 2008 CS152 Spring 08 11 Loop Unrolling for i 0 i N i B i A i C Unroll inner loop to perform 4 iterations at once for i 0 i N i 4 B i A i C B i 1 A i 1 C B i 2 A i 2 C B i 3 A i 3 C Need to handle values of N that are not multiples of unrolling factor with final cleanup loop 4 3 2008 CS152 Spring 08 12 Scheduling Loop Unrolled Code Unroll 4 ways Int1 loop ld f1 0 r1 ld f2 8 r1 ld f3 16 r1 ld f4 24 r1 add r1 32 fadd f5 f0 f1 fadd f6 f0 f2 fadd f7 f0 f3 fadd f8 f0 f4 sd f5 0 r2 sd f6 8 r2 sd f7 16 r2 sd f8 24 r2 add r2 32 bne r1 r3 loop Int 2 loop M1 M2 ld f1 ld f2 ld f3 ld f4 add r1 FP FPx fadd f5 fadd f6 fadd f7 fadd f8 Schedule sd f5 sd f6 sd f7 add r2 bne sd f8 How many FLOPS cycle 4 3 2008 4 fadds CS152 Spring 08 11 cycles 0 36 13 Software Pipelining Int1 Unroll 4 ways first loop ld f1 0 r1 ld f2 8 r1 ld f3 16 r1 ld f4 24 r1 add r1 32 fadd f5 f0 f1 fadd f6 f0 f2 fadd f7 f0 f3 fadd f8 f0 f4 sd f5 0 r2 sd f6 8 r2 sd f7 16 r2 add r2 32 sd f8 8 r2 bne r1 r3 loop M1 ld f1 ld f2 ld f3 add r1 ld f4 prolog ld f1 ld f2 ld f3 add r1 ld f4 loop ld f1 iterate ld f2 add r2 ld f3 add r1 bne ld f4 epilog How many FLOPS cycle 4 fadds 4 cycles 1 4 3 2008 Int 2 add r2 bne CS152 Spring 08 M2 sd f5 sd f6 sd f7 sd f8 sd f5 sd f6 sd f7 sd f8 sd f5 FP FPx fadd f5 fadd f6 fadd f7 fadd f8 fadd f5 fadd f6 fadd f7 fadd f8 fadd f5 fadd f6 fadd f7 fadd f8 14 Software Pipelining vs Loop Unrolling Loop Unrolled Wind down overhead performance Startup overhead time Loop Iteration Software Pipelined performance time Loop Iteration Software pipelining pays startup wind down costs only once per loop not once per iteration 4 3 2008 CS152 Spring 08 15 What if there are no loops Basic block 4 3 2008 Branches limit basic block size in control flow intensive irregular code Difficult to find ILP in individual basic blocks CS152 Spring 08 16 Trace Scheduling Fisher Ellis Pick string of basic blocks a trace that represents most frequent branch path Use profiling feedback or compiler heuristics to find common branch paths Schedule whole trace at once Add fixup code to cope with branches jumping out of trace 4 3 2008 CS152 Spring 08 17 Problems with Classic VLIW Object code compatibility have to recompile all code for every machine even for two machines in same generation Object code size instruction padding wastes instruction memory cache loop unrolling software pipelining replicates code Scheduling variable latency memory operations caches and or memory bank conflicts impose statically unpredictable variability Knowing branch probabilities Profiling requires an significant extra …

View Full Document

Berkeley COMPSCI 152 - Lecture 16 - VLIW Machines and Statically Scheduled ILP

Sign up for free to view:

This document and 3 million+ documents and flashcards
High quality study guides, lecture notes, practice exams
Course Packets handpicked by editors offering a comprehensive review of your courses
Better Grades Guaranteed


School:
Email:
New Password:
Confirm Password:

Berkeley COMPSCI 152 - Lecture 16 - VLIW Machines and Statically Scheduled ILP

Sign up for free to view:

Please select your school