EECS 252 Graduate Computer Architecture
Lec 7 – Instruction Level Parallelism

David Patterson
Electrical Engineering and Computer Sciences
University of California, Berkeley
http://www.eecs.berkeley.edu/~pattrsn
http://vlsi.cs.berkeley.edu/cs252-s06

Review from Last Time
• 4 papers: all about where to draw the line between HW and SW
• IBM set the foundations for ISAs since the 1960s
– 8-bit byte
– Byte-addressable memory (as opposed to word-addressable memory)
– 32-bit words
– Two's complement arithmetic (but not the first processor to use it)
– 32-bit (SP) / 64-bit (DP) floating-point format and registers
– Commercial use of microcoded CPUs
– Binary compatibility / computer family
• B5000 very different model: HLL only, stack, segmented VM
• IBM paper made the case for ISAs good for microcoded processors ⇒ leading to CISC
• Berkeley paper made the case for ISAs good for pipeline + cache microprocessors (VLSI) ⇒ leading to RISC
• Who won RISC vs. CISC? VAX is dead; Intel 80x86 on the desktop, RISC in embedded, servers both x86 and RISC

Outline
• ILP
• Compiler techniques to increase ILP
• Loop Unrolling
• Static Branch Prediction
• Dynamic Branch Prediction
• Overcoming Data Hazards with Dynamic Scheduling
• (Start) Tomasulo Algorithm
• Conclusion

Recall from Pipelining Review
• Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
– Ideal pipeline CPI: measure of the maximum performance attainable by the implementation
– Structural hazards: HW cannot support this combination of instructions
– Data hazards: instruction depends on the result of a prior instruction still in the pipeline
– Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
– Worked example (illustrative stall figures, not from the lecture): 1.0 ideal + 0.1 structural + 0.3 data + 0.2 control = 1.6 CPI

Instruction Level Parallelism
• Instruction-Level Parallelism (ILP): overlap the execution of instructions to improve performance
• 2 approaches to exploit ILP:
1) Rely on hardware to discover and exploit the parallelism dynamically (e.g., Pentium 4, AMD Opteron, IBM Power), and
2) Rely on software technology to find parallelism statically at compile time (e.g., Itanium 2)
• Next 4 lectures on this topic

Instruction-Level Parallelism (ILP)
• Basic Block (BB) ILP is quite small
– BB: a straight-line code sequence with no branches in except to the entry and no branches out except at the exit
– Average dynamic branch frequency of 15% to 25% ⇒ 4 to 7 instructions execute between a pair of branches
– Plus, instructions in a BB are likely to depend on each other
• To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks
• Simplest: loop-level parallelism, exploiting parallelism among iterations of a loop, e.g.,

    for (i = 1; i <= 1000; i = i + 1)
        x[i] = x[i] + y[i];

Loop-Level Parallelism
• Exploit loop-level parallelism by "unrolling" the loop, either
1. dynamically, via branch prediction, or
2. statically, via loop unrolling by the compiler (a sketch follows this slide)
(Another way is vectors, to be covered later)
• Determining instruction dependence is critical to loop-level parallelism
• If 2 instructions are
– parallel, they can execute simultaneously in a pipeline of arbitrary depth without causing any stalls (assuming no structural hazards)
– dependent, they are not parallel and must be executed in order, although they may often be partially overlapped
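To make the static approach concrete, here is a minimal C sketch of 4-way unrolling applied to the example loop above. It is illustrative, not from the lecture, and assumes the trip count (1000) is divisible by the unroll factor, so no cleanup loop is needed; arrays are sized for the slide's 1-based indexing.

    /* Illustrative sketch of 4-way loop unrolling (assumed example,
       not from the lecture). Assumes 1000 % 4 == 0. */
    double x[1001], y[1001];

    void add_arrays_rolled(void) {
        /* Original loop from the slide: one branch per element. */
        for (int i = 1; i <= 1000; i = i + 1)
            x[i] = x[i] + y[i];
    }

    void add_arrays_unrolled(void) {
        /* Unrolled by 4: one branch per four element updates; the
           four updates touch different elements, so they are
           independent and can be overlapped in the pipeline. */
        for (int i = 1; i <= 1000; i = i + 4) {
            x[i]     = x[i]     + y[i];
            x[i + 1] = x[i + 1] + y[i + 1];
            x[i + 2] = x[i + 2] + y[i + 2];
            x[i + 3] = x[i + 3] + y[i + 3];
        }
    }

When the trip count is not a multiple of the unroll factor, a compiler emits a cleanup loop for the leftover iterations, and it typically renames registers reused across the unrolled copies to avoid the name dependences discussed next.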
Data Dependence and Hazards
• InstrJ is data dependent (aka true dependent) on InstrI if:
1. InstrJ tries to read an operand before InstrI writes it, e.g.,
    I: add r1,r2,r3
    J: sub r4,r1,r3
2. or InstrJ is data dependent on InstrK, which is dependent on InstrI
• If two instructions are data dependent, they cannot execute simultaneously or be completely overlapped
• Data dependence in instruction sequence ⇒ data dependence in source code ⇒ effect of original data dependence must be preserved
• If a data dependence causes a hazard in the pipeline, it is called a Read After Write (RAW) hazard

ILP and Data Dependencies, Hazards
• HW/SW must preserve program order: the order instructions would execute in if executed sequentially as determined by the original source program
– Dependences are a property of programs
• Presence of a dependence indicates the potential for a hazard, but the actual hazard and the length of any stall are properties of the pipeline
• Importance of the data dependencies:
1) indicates the possibility of a hazard
2) determines the order in which results must be calculated
3) sets an upper bound on how much parallelism can possibly be exploited
• HW/SW goal: exploit parallelism by preserving program order only where it affects the outcome of the program

Name Dependence #1: Anti-dependence
• Name dependence: when 2 instructions use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name; 2 versions of name dependence
• InstrJ writes an operand before InstrI reads it:
    I: sub r4,r1,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
Called an "anti-dependence" by compiler writers. This results from reuse of the name "r1"
• If an anti-dependence causes a hazard in the pipeline, it is called a Write After Read (WAR) hazard

Name Dependence #2: Output dependence
• InstrJ writes an operand before InstrI writes it:
    I: sub r1,r4,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
• Called an "output dependence" by compiler writers. This also results from the reuse of the name "r1"
• If an output dependence causes a hazard in the pipeline, it is called a Write After Write (WAW) hazard
• Instructions involved in a name dependence can execute simultaneously if the name used in the instructions is changed so the instructions do not conflict
– Register renaming resolves name dependences for registers
– Either by the compiler or by HW (see the C-level sketch at the end of this preview)

Control Dependencies
• Every instruction is control dependent on some set of branches, and, in general, these control dependencies must be preserved to preserve program order

    if p1 {
        S1;
    };
    if p2 {
        S2;
    }

• S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1

Control Dependence Ignored
• Control dependence need not be preserved
– willing to execute instructions that should not have been executed, thereby violating the control dependences, if we can do so without affecting the correctness of the program
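The register-renaming bullet on the Output dependence slide can be made concrete with a short C-level sketch. This is illustrative, not from the lecture: ordinary variables stand in for the registers in the slide examples, and the fresh name r1b plays the role of a renamed physical register.

    /* Illustrative C-level sketch of register renaming (assumed
       example, not from the lecture). */
    #include <stdio.h>

    int main(void) {
        int r1 = 1, r2 = 2, r3 = 3, r7 = 7;
        int r4, r6;

        /* Reusing the name r1 creates an anti-dependence (WAR)
           between I and J, and a true (RAW) dependence of K on J: */
        r4 = r1 - r3;    /* I: reads r1                      */
        r1 = r2 + r3;    /* J: writes r1, WAR hazard with I  */
        r6 = r1 * r7;    /* K: reads r1, RAW on J            */

        /* Renaming J's destination to a fresh name r1b removes the
           WAR constraint, so I and J no longer conflict; only the
           true RAW dependence of K on J must still be respected. */
        int r1a = 1;     /* the original value of r1 */
        int r1b;
        r4 = r1a - r3;   /* I: reads the old name    */
        r1b = r2 + r3;   /* J: writes a fresh name   */
        r6 = r1b * r7;   /* K: true dependence on J  */

        printf("r4=%d r6=%d\n", r4, r6);
        return 0;
    }

Hardware renaming, as in the Tomasulo algorithm previewed in the outline, does the same thing dynamically by mapping architectural registers to a larger pool of physical registers or reservation-station tags.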

