March 14, 2011 CS152, Spring 2011

CS 152 Computer Architecture and Engineering
Lecture 13 - VLIW Machines and Statically Scheduled ILP

Krste Asanovic
Electrical Engineering and Computer Sciences
University of California at Berkeley
http://www.eecs.berkeley.edu/~krste
http://inst.eecs.berkeley.edu/~cs152

Last time in Lecture 12
• Unified physical register file machines remove data values from ROB
  – All values only read and written during execution
  – Only register tags held in ROB
  – Allocate resources (ROB slot, destination physical register, memory reorder queue location) during decode
  – Issue window can be separated from ROB and made smaller than ROB (allocate in decode, free after instruction completes)
  – Free resources on commit
• Speculative store buffer holds store values before commit to allow load-store forwarding
• Can execute later loads past earlier stores when addresses known, or predicted no dependence

Datapath: Branch Prediction and Speculative Execution
[Figure: pipeline datapath — Fetch, Decode & Rename, Reorder Buffer, Commit; branch prediction updates the PC at fetch, the branch unit resolves branches during execute and kills misspeculated instructions; execution units (ALU, register file, memory) write the D$ through a store buffer]

Instruction Flow in Unified Physical Register File Pipeline
• Fetch
  – Get instruction bits from current guess at PC, place in fetch buffer
  – Update PC using sequential address or branch predictor (BTB)
• Decode/Rename
  – Take instruction from fetch buffer
  – Allocate resources to execute instruction:
    » Destination physical register, if instruction writes a register
    » Entry in reorder buffer to provide in-order commit
    » Entry in issue window to wait for execution
    » Entry in memory buffer, if load or store
  – Decode will stall if resources not available
  – Rename source and destination registers
  – Check source registers for readiness
  – Insert instruction into issue window + reorder buffer + memory buffer

Memory Instructions
• Split store instruction into two pieces during decode:
  – Address calculation: store-address
  – Data movement: store-data
• Allocate space in program order in memory buffers during decode
• Store instructions:
  – Store-address calculates address and places it in store buffer
  – Store-data copies store value into store buffer
  – Store-address and store-data execute independently out of issue window
  – Stores only commit to data cache at the commit point
• Load instructions:
  – Load address calculation executes from window
  – Load with completed effective address searches memory buffer
  – Load instruction may have to wait in memory buffer for earlier store ops to resolve

Issue Stage
• Writebacks from completion phase “wake up” some instructions by causing their source operands to become ready in issue window
  – In more speculative machines, might wake up waiting loads in memory buffer
• Need to “select” some instructions for issue
  – Arbiter picks a subset of ready instructions for execution
  – Example policies: random, lower-first, oldest-first, critical-first
• Instructions read out from issue window and sent to execution
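The wakeup/select behavior just described can be sketched in software. This is a minimal illustrative model, not the lecture's hardware: names like `IssueWindow` and `Entry` are invented here, and the select policy shown is oldest-first.

```python
# Illustrative sketch of issue-window wakeup (tag broadcast at completion)
# and select (oldest-first arbiter). Names are hypothetical, not from CS152.

class Entry:
    def __init__(self, seq, op, src_tags):
        self.seq = seq                # program order, used by oldest-first select
        self.op = op
        self.waiting = set(src_tags)  # physical-register tags not yet ready

class IssueWindow:
    def __init__(self, issue_width):
        self.entries = []
        self.issue_width = issue_width

    def wakeup(self, written_tag):
        # Completion broadcasts the destination tag; every waiting entry
        # compares it against its source tags (the comparator array).
        for e in self.entries:
            e.waiting.discard(written_tag)

    def select(self):
        # Oldest-first policy: pick up to issue_width ready entries.
        ready = sorted((e for e in self.entries if not e.waiting),
                       key=lambda e: e.seq)
        picked = ready[:self.issue_width]
        for e in picked:
            self.entries.remove(e)
        return [e.op for e in picked]

w = IssueWindow(issue_width=2)
w.entries = [Entry(0, "add", ["p1"]), Entry(1, "mul", ["p2"]), Entry(2, "sub", [])]
w.wakeup("p1")        # writeback of p1 makes the add ready
print(w.select())     # oldest-first: ['add', 'sub']; the mul still waits on p2
```

A real machine does the broadcast and arbitration with wired comparators and priority logic in a single cycle; the loop structure above only shows which comparisons happen.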
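The memory-buffer search described under Memory Instructions above can be sketched the same way: a load scans earlier, uncommitted stores for the youngest matching address, and an earlier store whose address has not yet resolved forces the load to wait. The function and data layout below are illustrative assumptions, not the lecture's design.

```python
# Illustrative sketch of speculative store-buffer search for load forwarding.
# store_buffer: program-ordered list of (addr, value) for uncommitted stores;
# addr is None until the store-address piece has executed.

def search_store_buffer(store_buffer, load_addr):
    result = None               # value from the youngest older matching store
    for addr, value in store_buffer:
        if addr is None:
            # Earlier store address unresolved: conservatively make the load
            # wait (a more speculative machine would predict no dependence).
            return ("wait", None)
        if addr == load_addr:
            result = value      # keep scanning: a later store overrides
    if result is not None:
        return ("forward", result)
    return ("cache", None)      # no match: read the data cache

buf = [(0x100, 7), (0x104, 9), (0x100, 8)]
print(search_store_buffer(buf, 0x100))   # ('forward', 8) - youngest match wins
print(search_store_buffer(buf, 0x108))   # ('cache', None)
```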
Execute Stage
• Read operands from physical register file and/or bypass network from other functional units
• Execute on functional unit
• Write result value to physical register file (or store buffer if store)
• Produce exception status, write to reorder buffer
• Free slot in issue window

Commit Stage
• Read completed instructions in order from reorder buffer
  – (may need to wait for next oldest instruction to complete)
• If exception raised:
  – Flush pipeline, jump to exception handler
• Otherwise, release resources:
  – Free physical register used by last writer to same architectural register
  – Free reorder buffer slot
  – Free memory reorder buffer slot

Superscalar Control Logic Scaling
[Figure: an issue group of width W checked against previously issued instructions over lifetime L]
• Each issued instruction must somehow check against W*L instructions, i.e., growth in hardware ∝ W*(W*L)
• For in-order machines, L is related to pipeline latencies and the check is done during issue (interlocks or scoreboard)
• For out-of-order machines, L also includes time spent in instruction buffers (issue window or ROB), and the check is done by broadcasting tags to waiting instructions at writeback (completion)
• As W increases, a larger instruction window is needed to find enough parallelism to keep the machine busy => greater L
=> Out-of-order control logic grows faster than W^2 (~W^3)

Out-of-Order Control Complexity: MIPS R10000 Control Logic
[Die photo: SGI/MIPS Technologies Inc., 1995]

Sequential ISA Bottleneck
[Figure: sequential source code (e.g., "a = foo(b); for (i=0, i<…") flows either through a superscalar compiler that finds independent operations and schedules them into sequential machine code, or into a superscalar processor that must check instruction dependencies and schedule execution at runtime]

VLIW: Very Long Instruction Word
• Multiple operations packed into one instruction
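The W*(W*L) growth claimed under Superscalar Control Logic Scaling can be made concrete with a back-of-the-envelope comparator count. The numbers and the two-source-operand assumption below are illustrative, not from the slides.

```python
# Back-of-the-envelope wakeup comparator count (illustrative numbers).
# Each of the W tags broadcast per cycle is compared against the source
# tags of roughly W*L in-flight instructions, each with ~2 source operands.

def wakeup_comparators(W, L, srcs_per_inst=2):
    return W * (W * L) * srcs_per_inst

# If the window (hence L) must grow with W to expose enough parallelism,
# doubling W roughly octuples the comparator count: ~W^3 growth.
for W, L in [(2, 8), (4, 16), (8, 32)]:
    print(W, L, wakeup_comparators(W, L))   # 64, then 512, then 4096
```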
• Each operation slot is for a fixed function
• Constant operation latencies are specified
• Architecture requires guarantee of:
  – Parallelism within an instruction => no cross-operation RAW check
  – No data use before data ready => no data interlocks
[Figure: example instruction word with slots Int Op 1 | Int Op 2 | Mem Op 1 | Mem Op 2 | FP Op 1 | FP Op 2 — two integer units (single-cycle latency), two load/store units (three-cycle latency), two floating-point units (four-cycle latency)]

Early VLIW Machines
• FPS AP120B (1976)
  – Scientific attached array processor
  – First commercial wide-instruction machine
  – Hand-coded vector math libraries using software pipelining and loop unrolling
• Multiflow Trace (1987)
  – Commercialization of ideas from Fisher’s Yale group, including “trace scheduling”
  – Available in configurations with 7, 14, or 28 operations/instruction
  – 28 operations packed into a 1024-bit instruction word
• Cydrome Cydra-5 (1987) –
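The VLIW contract described above (fixed slots, constant published latencies, no hardware interlocks or cross-operation RAW checks) means the compiler alone must guarantee a legal schedule. A toy checker for that contract, under the slide's example latencies, might look like this; the encoding and function names are invented for illustration.

```python
# Toy static-schedule checker for the VLIW contract (illustrative encoding).
# Latencies match the slide's example machine: int 1, mem 3, fp 4 cycles.
# Each instruction word is a list of ops; each op is (slot_type, dest, srcs).

LATENCY = {"int": 1, "mem": 3, "fp": 4}   # cycles until the result is usable

def check_schedule(words):
    ready_at = {}                         # register -> cycle its value is ready
    for cycle, word in enumerate(words):
        # No hardware interlocks: every source must already be ready,
        # including sources of other ops in the same word (parallel semantics).
        for slot, dest, srcs in word:
            for s in srcs:
                if ready_at.get(s, 0) > cycle:
                    return False          # compiler violated the contract
        # Results become visible only after the slot's fixed latency.
        for slot, dest, srcs in word:
            if dest is not None:
                ready_at[dest] = cycle + LATENCY[slot]
    return True

# A load issued at cycle 0 has 3-cycle latency, so its consumer must be
# scheduled no earlier than cycle 3 (empty words are explicit no-op cycles).
ok  = [[("mem", "r1", ["r2"])], [], [], [("int", "r3", ["r1"])]]
bad = [[("mem", "r1", ["r2"])], [("int", "r3", ["r1"])]]
print(check_schedule(ok), check_schedule(bad))   # True False
```

On a real VLIW, a schedule like `bad` is not detected at run time at all; the consumer simply reads a stale value, which is exactly why the architecture demands the compiler's guarantee.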