DOC PREVIEW
Berkeley COMPSCI 152 - Quiz 5

This preview shows page 1-2-3 out of 9 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 9 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 9 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 9 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 9 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

Computer Architecture and Engineering CS152 Quiz #5April 23rd, 2009Professor Krste AsanovicName:___ Answer Key ____This is a closed book, closed notes exam.80 Minutes8 PagesNotes:- Not all questions are of equal difficulty, so look over the entire exam and budget your time carefully.- Please carefully state any assumptions you make.- Please write your name on every page in the quiz.- You must not discuss a quiz's contents with students who have not yet taken the quiz. If you have inadvertently been exposed to the quiz prior to taking it, you must tell the instructor or TA.- You will get no credit for selecting multiple-choice answers without giving explanations.Writing name on each sheet ________ 1 PointQuestion 1 ________ 20 PointsQuestion 2 ________ 18 PointsQuestion 3 ________ 17 PointsQuestion 4 ________ 24 PointsTOTAL ________ 80 PointsNAME:_________________________Problem Q5.1: VLIW 20 pointsIn this problem, you will port code to a simple VLIW machine, and modify it to improveperformance. Details about the VLIW machine:- Three fully pipelined functional units (Integer ALU, Memory, and Floating Point)- Integer ALU has a 1 cycle latency- Memory Unit has a 2 cycle latency- FPU has a 3 cycle latency and can complete one add or one multiply (but notboth) per clock cycle- No interlocksC Code: Assembly Code:for(int i=0; i<N; i++)C[i] = A[i]*A[i] + B[i];loop: ld f1, 0(r1) ld f2, 0(r2) fmul f1, f1, f1 fadd f1, f1, f2 st f1, 0(r3) addi r1, r1, 4 addi r2, r2, 4 addi r3, r3, 4 bne r3, r4, loopProblem Q5.1.A 7 pointsSchedule operations into the VLIW instructions in the following table. Show only oneiteration of the loop. Make the code efficient, but do not use software pipelining or loopunrolling. You do not need to write in NOPs (can leave blank).ALU Memory Unit FPUaddi r1, r1, 4 ld f1, 0(r1)addi r2, r2, 4 ld f2, 0(r2)fmul f1, f1, f1fadd f1, f1, f2addi r3, r3, 4bne r3, r4, loop st f1, -4(r3)What performance did you achieve? (in FLOPS per cycle): 2/9_________NAME:_________________________Problem Q5.1.B 7 pointsUnroll the loop by one iteration (so two iterations of the original loop are performed forevery branch in the new assembly code). You only need to worry about the steady-statecode in the core of the loop (no epilogue or prologue). Make the code efficient, but do notuse software pipelining. You do not need to write in NOPs (can leave blank).ALU Memory Unit FPUld f1, 0(r1)addi r1, r1, 8 ld f3, 4(r1)ld f2, 0(r2) fmul f1, f1, f1addi r2, r2, 8 ld f4, 4(r2) fmul f3, f3, f3fadd f1, f1, f2fadd f3, f3, f4addi r3, r3, 8 st f1, 0(r3)bne r3, r4, loop st f3, -4(r3)What performance is achieved now? (in FLOPS per cycle): 4/10_________Problem Q5.1.C 6 pointsWith unlimited registers, if the loop was fully optimized (loop unrolling and softwarepipelining), how many FLOPS per cycle could it achieve? What is the bottleneck?(Hint: You should not have to write out the assembly code.)It could achieve 2/3 flops per cycle. It will be bottlenecked by memory accesses, sinceeach iteration has 3 memory ops (2 loads and 1 store) and only 2 floating point ops, andthere is only one functional unit for each.Many people picked the FPU because it has the longest latency. In steady state, it is amatter of throughput rather than latency.NAME:_________________________Problem Q5.2: Vector 18 pointsIn this problem we will examine how vector architecture implementations could affectthe performance of various codes. As a baseline implementation, assume:- 64 elements per vector register- 8 lanes- One ALU per lane; 2 cycle latency- One load/store unit per lane; 8 cycle latency- No dead time- No support for chaining- Scalar instructions execute on a separate five-stage pipelineBetween two given alternatives, pick the modification that will yield the greatestperformance improvement and explain why (assuming everything else is held constant).Be sure to explain why the other choice will not help as much.Problem Q5.2.A 6 points# Vector AssemblyLV V0, R1ADDV V1, V1, V2MULV V2, V2, V2ADDV V3, V3, V4ADDV V1, V1, V0There will be no gain from chainingbecause there aren’t any stalls caused bydependencies. The instructions aren’t allentirely independent since the last twoinstructions use previously computedvectors. If you work out the latencies,you will see that V3 and V1 will beready before they are needed so therewill be no stalls.Doubling the number of lanes willimprove performance because 16 lanesis still less than the vector length.Circle one:- Double number of lanes- Add support for chainingNAME:_________________________Problem Q5.2.B 6 points# C Code# Vector reduction from lecture# VL is vector length (power of 2)do {VL = VL/2;sum[0:VL-1] += sum[VL:2*VL-1];} while (VL>1);In this vector reduction code, thevector length keeps gettinghalved. This means for asignificant portion of itsexecution it will be using shortvectors. Additional lanes can’thelp with short vectors.Doubling the clock rate will stilloffer the theoretical doubling ofthroughput, but it will be able toachieve that even with shortvectors. Circle one:- Double number of lanes- Double vector unit clock frequencyProblem Q5.2.C 6 points# Vector AssemblyLV V0, R1MULV V0, V0, V0LV V1, R2ADDV V1, V1, V0SV V1, R2 Many of these instructions aredependent, so even with morelanes, the system will need tostall for dependencies. Chainingwill allow for the biggestperformance improvement. Circle one:- Double number of lanes- Add support for chainingNAME:_________________________Problem Q5.3: Multithreading 17 pointsConsider the following code on a multithreaded architecture. You can assume each threadis running the same program. You can assume:- Single-issue and in-order machine pipeline- ALU is fully pipelined with a latency of 2- Branches (conditional and unconditional) take 2 cycles to complete (if branch isstarted on cycle 1, the following instruction can’t start until cycle 3)- Memory Unit is fully pipelined with a latency of 16Code:loop: beq r1, r0, end 1 36lw r2, 4(r1) 3add r3, r3, r2 19lw r1, 0(r1) 20j loop 21end:Problem Q5.3.A 5 pointsHow many cycles does it take for one iteration of the loop to complete if the system issingle threaded?36 – 1 = 35 cyclesThe jump executes while the last lw is in progress, but the next iteration can’t start


View Full Document

Berkeley COMPSCI 152 - Quiz 5

Documents in this Course
Memory

Memory

29 pages

Quiz 5

Quiz 5

15 pages

Memory

Memory

29 pages

Memory

Memory

35 pages

Memory

Memory

15 pages

Quiz

Quiz

6 pages

Midterm 1

Midterm 1

20 pages

Quiz

Quiz

12 pages

Memory

Memory

33 pages

Quiz

Quiz

6 pages

Homework

Homework

19 pages

Quiz

Quiz

5 pages

Memory

Memory

15 pages

Load more
Download Quiz 5
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Quiz 5 and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Quiz 5 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?