DOC PREVIEW
Berkeley COMPSCI 152 - Quiz 4 Solutions

This preview shows page 1-2 out of 7 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 7 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 7 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 7 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

Problem Q4.1: Out-of-Order SchedulingTable Q4.1-1Common mistakes: Forgetting in-order commit, not reusing T0 and T1 appropriatelyProblem Q4.2: Fetch PipelinesBen should add a cache of the most recently encountered return instruction addresses. During F1, the contents of the cache are looked up to see if any entries match the current program counter. If so, then by the end of F1 (instead of D2) we know that we have a return instruction. We can then use the return stack to supply the return address.Computer Architecture and Engineering CS152 Quiz #4 SolutionsProblem Q4.1: Out-of-Order Scheduling Problem Q4.1.ATimeOP Dest Src1 Src2Decode ROBIssued WB CommittedI1-1 0 1 2L.DT0 R2 -I20 2 12 13MUL.DT1 T0 F0I31 13 15 16ADD.DT2 T1 F0I42 3 5 17ADDIT3 R2 -I53 4 6 18L.DT4 T3 -I64 7 17 19MUL.DT5 T4 T4I75 18 20 21ADD.DT6 T5 T2Table Q4.1-1Common mistakes: Forgetting in-order commit, integer result bypassing, single-ported ROB/register files, issue dependent on writeback of sources and completed decode Problem Q4.1.BTimeOP Dest Src1 Src2Decode ROBIssued WB CommittedI1-1 0 1 2L.DT0 R2 -I20 2 12 13MUL.DT1 T0 F0I33 13 15 16ADD.DT0 T1 F0I414 15 17 18ADDIT1 R2 -I517 18 19 20 L.DT0 T1 -I619 20 30 31 MUL.DT1 T0 T0I721 31 33 34 ADD.DT0 T1 F3Table Q4.1-2Common mistakes: Forgetting in-order commit, not reusing T0 and T1 appropriatelyProblem Q4.2: Fetch Pipelines PC PC GenerationF1ICache AccessF2D1Instruction DecodeD2RN Rename/ReorderRF Register File ReadEX Integer ExecuteProblem Q4.2.A Pipelining Subroutine ReturnsImmediately after what pipeline stage does the processor know that it is executing a subroutine return instruction? D2Immediately after what pipeline stage does the processor know the subroutine return address? RFHow many pipeline bubbles are required when executing a subroutine return?6Problem Q4.2.B Adding a BTBA subroutine can be called from many different locations and thus a single subroutine return can return to different locations. A BTB holds only the address of the last caller.Problem Q4.2.C Adding a Return StackNormally, instruction fetch needs to wait until the return instruction finishes the RF stage before the return address is known. With the return stack, as soon as the return instruction is decoded in D2, instruction fetch can begin fetching from the return address. This saves 2 cycles.A return address is pushed after a JAL/JALR instruction is decoded in D2. A return address is popped after a JR r31 instruction is decoded in D2.Problem Q4.2.D Return Stack OperationA: JAL BA+1:A+2:…B: JR r31B+1:B+2:…instructiontimeA PC F1 F2 D1 D2 RN RF EXA+1 PC F1 F2 D1 D2 RN RF EXA+2 PC F1 F2 D1 D2 RN RF EXA+3 PC F1 F2 D1 D2 RN RF EXA+4 PC F1 F2 D1 D2 RN RF EXB PC F1 F2 D1 D2 RN RF EXB+1 PC F1 F2 D1 D2 RN RF EXB+2 PC F1 F2 D1 D2 RN RF EXB+3 PC F1 F2 D1 D2 RN RF EXB+4 PC F1 F2 D1 D2 RN RF EXA+1 PC F1 F2 D1 D2 RN RF EXProblem Q4.2.E Handling Return Address MispredictsWhen a value is popped off the return stack after D2, it is saved for two cycles as part of the pipeline state. After the RF stage of the return instruction, the actual r31 is compared against the predicted return address. If the addresses match, then we are done. Otherwisewe mux in the correct program counter at the PC stage and kill the instructions in F1 and F2. Depending on how fast the address comparison is assumed to be, you might also kill the instruction in D1. So there is an additional 2 or 3 cycles on a return mispredict.Problem Q4.2.F Further Improving PerformanceBen should add a cache of the most recently encountered return instruction addresses.During F1, the contents of the cache are looked up to see if any entries match the currentprogram counter. If so, then by the end of F1 (instead of D2) we know that we have areturn instruction. We can then use the return stack to supply the return address.Problem Q4.3: Bounds on ILPProblem Q4.3.A Maximum ILPThe solution to this question comes from Little's Law. Every instruction in flight needs a separate destination physical register, unless it does not produce a result. We also need enough physical registers to map the committed architectural registers.We want to find the minimum number of physical registers for some piece of code to saturate the machine pipelines. The machine cannot fetch and commit more than four instructions per cycle, so we pick the four lowest latency functional units (i.e., the two integer and the two memory) out of the six available, as these will require the least number of instructions in flight to saturate the machine's pipelines.Instructions allocate a physical register in the decode stage, then need to hold it for at least the issue stage, plus the execute stage(s), plus the writeback stage, plus the commit stage. This is at least 3 cycles in addition to the number of execute stages.An acceptable answer was: minimum physical = 2 * (3 + 1) // Two integer instructions 2 * (3 + 2) // Two memory instructions + # architectural registers = 18 + # architectural registersAnswers based on this Little's argument received 6/7 points, minus 1 point if the # architectural registers was not included.We obtain a better solution if we realize that we only need to find one piece of code that saturates the pipelines. Stores and branches do not need to allocate a destination register. Consider the following unrolled loop which zeros a region of memory, written in MIPS assembly code: loop: sw r0, 0(r1) sw r0, 4(r1) bne r1, r2, loop addu r1, r1, 8 # in delay slotThis loop will reach a steady state of one iteration per cycle, and will saturate the machine(four instructions per cycle) while allocating only one physical register per cycle (also freeing one physical register per cycle). The number of physical registers neededfor this loop is only: minimum physical = 4 // Addu instruction. + # architectural registersThis possibility got full credit (7/7).Another possible reading of the question was to find a piece of code that had a different performance bottleneck, e.g., a data dependency, and find the number of physical registersrequired for it. (This wouldn't actually saturate the machines pipelines). These answers received almost full credit depending on quality of answer.Common errors:Many students incorrectly thought that each instruction needs to allocate three physical registers, one for each source


View Full Document

Berkeley COMPSCI 152 - Quiz 4 Solutions

Documents in this Course
Quiz 5

Quiz 5

9 pages

Memory

Memory

29 pages

Quiz 5

Quiz 5

15 pages

Memory

Memory

29 pages

Memory

Memory

35 pages

Memory

Memory

15 pages

Quiz

Quiz

6 pages

Midterm 1

Midterm 1

20 pages

Quiz

Quiz

12 pages

Memory

Memory

33 pages

Quiz

Quiz

6 pages

Homework

Homework

19 pages

Quiz

Quiz

5 pages

Memory

Memory

15 pages

Load more
Download Quiz 4 Solutions
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Quiz 4 Solutions and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Quiz 4 Solutions 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?