WVU CPE 242 - Instruction Level Parallelism
Course: CpE 242
Pages: 46

CpE 242 Computer Architecture and Engineering
Instruction Level Parallelism
cs 152 ilp, DAP SIK 1995


Recap: Interconnection Network Implementation Issues

  Interconnect                   MPP       LAN                         WAN
  Example                        CM-5      Ethernet                    ATM
  Maximum length between nodes   25 m      500 m; 5 repeaters          copper: 100 m; optical: 1000 m
  Number of data lines           4         1                           1
  Clock rate                     40 MHz    10 MHz                      155.5 MHz
  Shared vs. switch              Switch    Shared                      Switch
  Maximum number of nodes        2048      254                         10,000
  Media material                 Copper    Twisted pair copper wire    Twisted pair copper wire
                                           or coaxial cable            or optical fiber


Recap: Implementation Issues

  Advantages of serial vs. parallel lines:
    No synchronizing signals
    Higher clock rate and longer distance than parallel lines
      (e.g., 60 MHz x 256 bits x 0.5 m vs. 155 MHz x 1 bit x 100 m)
    Imperfections in the copper wires or integrated circuit pad drivers can
      cause skew in the arrival of signals, limiting the clock rate and the
      length and number of the parallel lines

  Switched vs. shared media:
    With switched media, pairs communicate at the same time over
      point-to-point connections


Recap: Other Interconnection Network Issues

  Interconnect         MPP                 LAN                    WAN
  Example              CM-5                Ethernet               ATM
  Topology             Fat tree            Line                   Variable, constructed from
                                                                  multistage switches
  Connection based?    No                  No                     Yes
  Data transfer size   Variable:           Variable:              Fixed:
                       4 to 20B            0 to 1500B             48B


Recap: Network Performance Measures

  Overhead: latency of the interface, vs. Latency: latency of the network


Recap: Interconnection Network Summary

  Communication between computers
  Packets for standards, protocols to cover normal and abnormal events
  Implementation issues: length, width, media
  Performance issues: overhead, latency, bisection BW
  Topologies: many to choose from, but (SW) overheads make them look alike;
    cost issues in topologies


Outline of Today's Lecture

  Recap (5 minutes)
  Introduction to Instruction Level Parallelism (15 minutes)
  Superpipeline, superscalar, VLIW
  Register renaming (5 minutes)
  Out-of-order execution (5 minutes)
  Branch prediction (5 minutes)
  Limits to ILP (15 minutes)
  Summary (5 minutes)


Advanced Pipelining and Instruction Level Parallelism

  gcc: 17% control transfer (5 instructions + 1 branch)
  Must go beyond a single block to get more instruction level parallelism
  Loop level parallelism is one opportunity, in SW and HW


What's going on in the loop

  Basic loop:
    load   a, Ai
    load   y, Yi
    mult   m, a, s
    add    r, m, y
    store  Ai, r
    inc    Ai
    inc    Yi
    dec    i
    branch
  About 9 instructions per 2 FP ops.

  Unrolled loop:
    load load mult add store
    load load mult add store
    load load mult add store
    load load mult add store
    inc inc dec branch
  About 6 instructions per 2 FP ops; dependencies between instructions remain.

  Reordered unrolled loop:
    load load load
    mult mult mult mult
    add add add add
    store store store store
    inc inc dec branch
  Schedule the 24-instruction basic block relative to the pipeline:
    delay slots
    function unit stalls
    multiple function units
    pipeline depth


Software Pipelining

  Observation: if iterations from loops are independent, then ILP can be
    obtained by taking instructions from different iterations.
  Software pipelining reorganizes loops so that each iteration is made from
    instructions chosen from different iterations of the original loop
    (Tomasulo in SW).

  [Figure: Iterations 0 through 4 of the original loop overlapped in time; one
  software-pipelined iteration takes one instruction from each.]


SW Pipelining Example

  Before: unrolled 3 times
     1  LD    F0,0(R1)
     2  ADDD  F4,F0,F2
     3  SD    0(R1),F4
     4  LD    F6,-8(R1)
     5  ADDD  F8,F6,F2
     6  SD    -8(R1),F8
     7  LD    F10,-16(R1)
     8  ADDD  F12,F10,F2
     9  SD    -16(R1),F12
    10  SUBI  R1,R1,#24
    11  BNEZ  R1,LOOP

  After: software pipelined
     1  SD    0(R1),F4      ; stores M[i]
     2  ADDD  F4,F0,F2      ; adds to M[i-1]
     3  LD    F0,-16(R1)    ; loads M[i-2]
     4  SUBI  R1,R1,#8
     5  BNEZ  R1,LOOP

  Symbolic loop unrolling:
    Less code space
    Overhead paid only once, vs. each iteration in loop unrolling


How can the machine exploit available ILP?

  Technique        Description                                   Limitation
  Pipelining       Overlapped IF D Ex M W stages                 Issue rate, FU stalls, FU depth
  Super-pipeline   Issue 1 instr. per (fast) cycle;              Clock skew, FU stalls, FU depth
                   IF takes multiple cycles
  Super-scalar     Issue multiple scalar instructions            Hazard resolution
                   per cycle
  VLIW             Each instruction specifies multiple           Packing
                   scalar operations

  [Figure: pipeline timing diagrams (IF D Ex M W) for each technique.]


Case Study: MIPS R4000 (100 MHz to 200 MHz)

  8-stage pipeline:
    IF   first half of instruction fetch; PC selection happens here, as well
         as initiation of the instruction cache access
    IS   second half of access to the instruction cache
    RF   instruction decode and register fetch, hazard checking, and also
         instruction cache hit detection
    EX   execution, which includes effective address calculation, ALU
         operation, and branch target computation and condition evaluation
    DF   data fetch, first half of access to the data cache
    DS   second half of access to the data cache
    TC   tag check, determine whether the data cache access hit
    WB   write back for loads and register-register operations

  8 stages: what is the impact on load delay? On branch delay? Why?


R4000 Performance

  Not the ideal CPI of 1:
    Load stalls (1 or 2 clock cycles)
    Branch stalls (2 cycles + unfilled slots)
    FP result stalls: RAW data hazard (latency)
    FP structural stalls: not enough FP hardware (parallelism)

  [Figure: stacked bar chart of CPI, axis from 0 to 4.5, for eqntott,
  espresso, gcc, li, doduc, nasa7, ora, spice2g6, su2cor, and tomcatv; each bar
  is split into Base, Load stalls, Branch stalls, FP result stalls, and FP
  structural stalls.]


Issues raised by Superscalar execution

  Must look ahead and prefetch instructions
    Available parallelism
    Resources and available bandwidth
  Instruction fetch and decode
    Branch prediction
  Instruction window
    Hazard detection and aggressive resolution
    Out-of-order issue: WAR and WAW
  Execution units
    Issue 0 to N instructions to the Ex. units according to some policy
    Register renaming to avoid false dependencies
    Out-of-order completion
    Exception handling


Hardware Schemes for Instruction Parallelism

  Why in HW at run time?
    Works when dependences cannot be known at compile time
    Compiler is simpler
    Code for one machine runs well on another

  Key idea: allow instructions behind a stall to proceed
    DIVD  F0,F2,F4
    ADDD  F10,F0,F8
    SUBD  F8,F8,F14
  This enables out-of-order execution, and therefore out-of-order completion.

  The ID stage checked both for structural and data hazards.
  Out-of-order execution divides the ID stage:
    1. Issue: decode instructions, check for structural hazards
    2. Read operands: wait until no data hazards, then read operands
  Scoreboards allow an instruction to execute whenever 1 and 2 hold
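The DIVD, ADDD, SUBD sequence above can be checked for register dependences mechanically. The following is a minimal sketch, not the scoreboard itself: the `Instr` tuple and the `hazards` function are hypothetical names introduced here for illustration, and the check is a simple pairwise comparison that ignores intervening redefinitions of a register.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    """Hypothetical three-operand instruction: op dest, src1, src2."""
    op: str
    dest: str
    srcs: tuple


def hazards(program):
    """Return (earlier, later, kind, register) for each pairwise dependence.

    RAW: the later instruction reads what the earlier one writes.
    WAR: the later instruction writes what the earlier one reads.
    WAW: both instructions write the same register.
    """
    found = []
    for j, later in enumerate(program):
        for i in range(j):
            earlier = program[i]
            if earlier.dest in later.srcs:
                found.append((i, j, "RAW", earlier.dest))
            if later.dest in earlier.srcs:
                found.append((i, j, "WAR", later.dest))
            if later.dest == earlier.dest:
                found.append((i, j, "WAW", later.dest))
    return found


# The sequence from the slide, with operands as given there.
program = [
    Instr("DIVD", "F0", ("F2", "F4")),
    Instr("ADDD", "F10", ("F0", "F8")),
    Instr("SUBD", "F8", ("F8", "F14")),
]

for i, j, kind, reg in hazards(program):
    print(f"{program[i].op} -> {program[j].op}: {kind} on {reg}")
```

ADDD must wait for the long-running DIVD because of the RAW dependence on F0. SUBD has no true dependence on either earlier instruction, so it may proceed past the stalled ADDD, but only if it does not overwrite F8 before ADDD has read it (the WAR hazard).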

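The software-pipelined loop in the SW Pipelining Example can be modelled in a few lines to see why it computes the same result as the original loop. This is a sketch under stated assumptions: the function names are hypothetical, the array is walked upward by index instead of counting R1 down, and the prologue and epilogue (which the slide omits) are written out by hand.

```python
def simple_loop(m, f2):
    """Original loop: each iteration does LD, ADDD, SD on one element."""
    out = list(m)
    for i in range(len(out)):
        f0 = out[i]      # LD
        f4 = f0 + f2     # ADDD
        out[i] = f4      # SD
    return out


def software_pipelined(m, f2):
    """Same computation; each kernel pass mixes three original iterations."""
    out = list(m)
    n = len(out)
    if n < 2:
        return simple_loop(out, f2)

    # Prologue: start iterations 0 and 1 so the kernel has values in flight.
    f0 = out[0]          # LD   for element 0
    f4 = f0 + f2         # ADDD for element 0
    f0 = out[1]          # LD   for element 1

    # Kernel: store one element, add for the next, load the one after that.
    for i in range(n - 2):
        out[i] = f4      # SD   for element i
        f4 = f0 + f2     # ADDD for element i + 1
        f0 = out[i + 2]  # LD   for element i + 2

    # Epilogue: drain the two iterations still in flight.
    out[n - 2] = f4      # SD   for element n - 2
    f4 = f0 + f2         # ADDD for element n - 1
    out[n - 1] = f4      # SD   for element n - 1
    return out


data = [1.0, 2.0, 3.0, 4.0, 5.0]
print(software_pipelined(data, 10.0) == simple_loop(data, 10.0))
```

Inside the kernel, no instruction depends on a value produced in the same pass: the store uses the previous pass's sum and the add uses the previous pass's load. That independence is what lets the hardware overlap the three operations.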
