DOC PREVIEW
CMU CS 15745 - lecture

This preview shows page 1-2-3-4-5-6 out of 17 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 17 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 17 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 17 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 17 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 17 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 17 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 17 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

15-745 © Seth Copen Goldstein 2000-5 115-745Instruction SchedulingCopyright © Seth Copen Goldstein 2000-5(some slides borrowed from M. Voss)15-745 © Seth Copen Goldstein 2000-5 2Instruction-level Parallelism• Most modern processors have the ability to execute several adjacent instructions simultaneously.– Pipelined machines.– Very-long-instruction-word machines (VLIW).– Superscalar machines.– Dynamic scheduling/out-of-order machines.• ILP is limited by several kinds of executionconstraints:– Data dependence constraints.– Resource constraints (“hazards”)– Control hazards15-745 © Seth Copen Goldstein 2000-5 3Execution Constraints• Data-dependence constraints:– If instruction A computes a value that is read by instruction B, then B cannot execute before A is completed.• Resource hazards:– Limited # of functional units.• If there are n functional units of a particular kind (e.g., n multipliers), then only n instructions that require that kind of unit can execute at once.– Limited instruction issue.• If the instruction-issue unit can issue only n instructions at a time, then this limits ILP.– Limited register set.• Any schedule of instructions must have a valid register allocation.For example: ld [%fp-28], %o1add %o1, %l2, %l315-745 © Seth Copen Goldstein 2000-5 4Instruction Scheduling• The purpose of instruction scheduling (IS) is to order the instructions for maximum ILP.– Keep all resources busy every cycle.– If necessary, eliminate data dependences and resource hazards to accomplish this.• The IS problem is NP-complete (and bad in practice).– So heuristic methods are necessary.How can you tell this is an old slide?15-745 © Seth Copen Goldstein 2000-5 5Instruction Scheduling•There are many different techniques for IS.– Still an open area of research.• Most optimizing compilers perform good local IS, and only simple global IS.• The biggest opportunities are in scheduling the code for loops.15-745 © Seth Copen Goldstein 2000-5 6Should the Compiler Do IS?• Many modern machines perform dynamic reordering of instructions.– Also called “out-of-order execution” (OOOE).– Not yet clear whether this is a good idea.– Pro:• OOOE can use additional registers and register renaming to eliminate data dependences that no amount of static IS can accomplish.• No need to recompile programs when hardware changes.–Con:• OOOE means more complex hardware (and thus longer cycle times and more wattage).• And can’t be optimal since IS is NP-complete.15-745 © Seth Copen Goldstein 2000-5 7What we will cover• Scheduling basic blocks– List scheduling– Long-latency operations–Delay slots• Scheduling for clusters architectures• Software Pipelining (next week)• What we need to know– pipeline structure– data dependencies– register renaming15-745 © Seth Copen Goldstein 2000-5 8• In the von Neumann model of execution an instruction starts only after its predecessor completes.• This is not a very efficient model of execution.–von Neumann bottleneck or the memory wall.timeinstr 1 instr 2Instruction Scheduling15-745 © Seth Copen Goldstein 2000-5 9Instruction Pipelines• Almost all processors today use instructions pipelines to allow overlap of instructions (Pentium 4 has a 20 stage pipeline!!!).• The execution of an instruction is divided into stages; each stage is performed by a separate part of the processor.• Each of these stages completes its operation in one cycle (shorter the the cycle in the von Neumann model).• An instruction still takes the same time to execute.F D E M WF: Fetch instruction from cache or memory.D: Decode instruction.E: Execute. ALU operation or address calculation.M: Memory access.W: Write back result into register.timeinstr15-745 © Seth Copen Goldstein 2000-5 10Instruction Pipelines• However, we overlap these stages in time to complete an instruction every cycle.F D E M Wtimeinstr 1F D E M WF D E M WF D E M WF D E M Winstr 2instr 3instr 4instr 5F D E M WF D E M Winstr 6instr 7Filling thepipelineDraining thepipelineSteady state415-745 © Seth Copen Goldstein 2000-5 11Pipeline Hazards• Structural Hazards– two instructions need the same resource at the same time– memory or functional units in a superscalar.• Data Hazards– an instructions needs the results of a previous instructionr1 = r2 + r3r4 = r1 + r1r1 = [r2]r4 = r1 + r1– solved by forwarding and/or stalling–cache miss?• Control Hazards– jump & branch address not known until later in pipeline– solved by delay slot and/or prediction15-745 © Seth Copen Goldstein 2000-5 12Jump/Branch Delay Slot(s)• Control hazards, i.e. jump/branch instructions.F D E M WF D E M WF D E M WF D E M Wjump/branchinstr 2instr 3instr 4unconditional jump address available only after Decode.conditional branch address available only after Execute.15-745 © Seth Copen Goldstein 2000-5 13Jump/Branch Delay Slot(s)• One option is to stall the pipeline (hardware solution). • Another option is to insert a no-op instructions (software).• Both degrade performance!F D E M WF D E M Wjumpinstr 2F D E M WF D E M WF D E M Wjumpinstr 2nop15-745 © Seth Copen Goldstein 2000-5 14Jump/Branch Delay Slot(s)• another option is for the branch take effect afterthe delay slots.• I.e., some instructions always get executed after the branch but before the branching takes effect.F D E M WF D E M WF D E M WF D E M Wbrainstr 2F D E M Winstr 3instr xinstr y15-745 © Seth Copen Goldstein 2000-5 15Jump/Branch Delay Slots• In other words, the instruction(s) in the delay slots of the jump/branch instruction always get(s) executed when the branch is executed (regardless of the branch result).• Fetching from the branch target begins only after these instructions complete.• What instruction(s) to use?bgt r3, L1::L1:15-745 © Seth Copen Goldstein 2000-5 16Branch Prediction• Current processors will speculatively execute at conditional branches– if a branch direction is correctly guessed, great!– if not, the pipeline is flushed before instructions commit (WB).• Why not just let compiler schedule?– The average number of instructions per basic block in typical C code is about 5 instructions.– branches are not statically predictable– What happens if you have a 20 stage pipeline?15-745 © Seth Copen Goldstein 2000-5 17Data Hazardsr1 = r2 + r3r4 = r1 + r1r1 = [r2]r4 = r1 + r1F D E M WF D E M Wr2 + r3 available here[r2] available


View Full Document

CMU CS 15745 - lecture

Documents in this Course
Lecture

Lecture

14 pages

Lecture

Lecture

19 pages

Lecture

Lecture

8 pages

Lecture

Lecture

5 pages

Lecture

Lecture

6 pages

Lecture 3

Lecture 3

12 pages

Lecture

Lecture

17 pages

Lecture

Lecture

18 pages

lecture

lecture

14 pages

lecture

lecture

8 pages

lecture

lecture

5 pages

Lecture

Lecture

19 pages

lecture

lecture

10 pages

Lecture

Lecture

20 pages

Lecture

Lecture

8 pages

Lecture

Lecture

7 pages

lecture

lecture

59 pages

Lecture

Lecture

10 pages

Task 2

Task 2

2 pages

Handout

Handout

18 pages

Load more
Download lecture
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view lecture and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view lecture 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?