Berkeley COMPSCI 152 - Quiz 5 - D2991012

School name University of California, Berkeley

Course Compsci 152- Computer Architecture and Engineering

Pages 9

Computer Architecture and Engineering
CS152 Quiz 5
April 23rd, 2009
Professor Krste Asanovic

Name: Answer Key

This is a closed book, closed notes exam.
80 Minutes, 8 Pages

Notes:
  Not all questions are of equal difficulty, so look over the entire exam and budget your time carefully.
  Please carefully state any assumptions you make.
  Please write your name on every page in the quiz.
  You must not discuss a quiz's contents with students who have not yet taken the quiz. If you have inadvertently been exposed to the quiz prior to taking it, you must tell the instructor or TA.
  You will get no credit for selecting multiple choice answers without giving explanations.

Writing name on each sheet: 1 Point
Question 1: 20 Points
Question 2: 18 Points
Question 3: 17 Points
Question 4: 24 Points
TOTAL: 80 Points


Problem Q5.1: VLIW (20 points)

In this problem, you will port code to a simple VLIW machine and modify it to improve performance.

Details about the VLIW machine:
  Three fully pipelined functional units: Integer ALU, Memory, and Floating Point
  Integer ALU has a 1-cycle latency
  Memory Unit has a 2-cycle latency
  FPU has a 3-cycle latency and can complete one add or one multiply (but not both) per clock cycle
  No interlocks

C Code:

  for (int i = 0; i < N; i++)
      C[i] = A[i] * A[i] + B[i];

Assembly Code:

  loop: ld    f1, 0(r1)
        ld    f2, 0(r2)
        fmul  f1, f1, f1
        fadd  f1, f1, f2
        st    f1, 0(r3)
        addi  r1, r1, 4
        addi  r2, r2, 4
        addi  r3, r3, 4
        bne   r3, r4, loop

Problem Q5.1.A (7 points)

Schedule operations into the VLIW instructions in the following table. Show only one iteration of the loop. Make the code efficient, but do not use software pipelining or loop unrolling. You do not need to write in NOPs (can leave blank).

Operations per functional unit, in issue order:

  ALU:          addi r1, r1, 4
                addi r2, r2, 4
                addi r3, r3, 4
                bne  r3, r4, loop
  Memory Unit:  ld   f1, 0(r1)
                ld   f2, 0(r2)
                st   f1, 4(r3)
  FPU:          fmul f1, f1, f1
                fadd f1, f1, f2

What performance did you achieve, in FLOPS per cycle?

Answer: 2/9

Problem Q5.1.B (7 points)

Unroll the loop by one iteration, so two iterations of the original loop are performed for every branch in the new assembly code. You only need to worry about the steady-state code in the core of the loop (no epilogue or prologue). Make the code efficient, but do not use software pipelining. You do not need to write in NOPs (can leave blank).

Operations per functional unit, in issue order:

  ALU:          addi r1, r1, 8
                addi r2, r2, 8
                addi r3, r3, 8
                bne  r3, r4, loop
  Memory Unit:  ld   f1, 0(r1)
                ld   f3, 4(r1)
                ld   f2, 0(r2)
                ld   f4, 4(r2)
                st   f1, 0(r3)
                st   f3, 4(r3)
  FPU:          fmul f1, f1, f1
                fmul f3, f3, f3
                fadd f1, f1, f2
                fadd f3, f3, f4

What performance is achieved now, in FLOPS per cycle?

Answer: 4/10

Problem Q5.1.C (6 points)

With unlimited registers, if the loop was fully optimized (loop unrolling and software pipelining), how many FLOPS per cycle could it achieve? What is the bottleneck? Hint: You should not have to write out the assembly code.

Answer: It could achieve 2/3 FLOPS per cycle. It will be bottlenecked by memory accesses, since each iteration has 3 memory ops (2 loads and 1 store) and only 2 floating-point ops, and there is only one functional unit for each. Many people picked the FPU because it has the longest latency. In steady state, it is a matter of throughput rather than latency.


Problem Q5.2: Vector (18 points)

In this problem, we will examine how vector architecture implementations could affect the performance of various codes. As a baseline implementation, assume:
  64 elements per vector register
  8 lanes
  One ALU per lane: 2-cycle latency
  One load/store unit per lane: 8-cycle latency
  No dead time
  No support for chaining
  Scalar instructions execute on a separate five-stage pipeline

Between two given alternatives, pick the modification that will yield the greatest performance improvement and explain why (assuming everything else is held constant). Be sure to explain why the other choice will not help as much.

Problem Q5.2.A (6 points)

Vector Assembly: LV ADDV MULV ADDV ADDV V0 V1 V2 V3 V1 R1 V1 V2 V3 V1 V2 V2 V4 V0

Circle one:
  Double number of lanes
  Add support for chaining

Answer: There will be no gain from chaining because there aren't any stalls caused by dependencies. The instructions aren't all entirely independent, since the last two instructions use previously computed vectors. If you work out the latencies, you will see that V3 and V1 will be ready before they are needed, so there will be no stalls. Doubling the number of lanes will improve performance because 16 lanes is still less than the vector length.

Problem Q5.2.B (6 points)

C Code (vector reduction from lecture; VL is the vector length, a power of 2):

  do {
      VL = VL / 2;
      sum[0:VL-1] += sum[VL:2*VL-1];
  } while (VL > 1);

Circle one:
  Double number of lanes
  Double vector unit clock frequency

Answer: In this vector reduction code, the vector length keeps getting halved. This means that for a significant portion of its execution, it will be using short vectors. Additional lanes can't help with short vectors. Doubling the clock rate will still offer the theoretical doubling of throughput, but it will be able to achieve that even with short vectors.

Problem Q5.2.C (6 points)

Vector Assembly:

  LV    V0, R1
  MULV  V0, V0, V0
  LV    V1, R2
  ADDV  V1, V1, V0
  SV    V1, R2

Circle one:
  Double number of lanes
  Add support for chaining

Answer: Many of these instructions are dependent, so even with more lanes the system will need to stall for dependencies. Chaining will allow for the biggest performance improvement.


Problem Q5.3: Multithreading (17 points)

Consider the following code on a multithreaded architecture. You can assume each thread is running the same program. You can assume:
  Single-issue, in-order machine pipeline
  ALU is fully pipelined with a latency of 2
  Branches (conditional and unconditional) take 2 cycles to complete (if a branch is started on cycle 1, the following instruction can't start until cycle 3)
  Memory Unit is fully pipelined with a latency of 16

Code (with the cycle numbers marked on the answer key):

  loop: beq  r1, r0, end       1
        lw   r2, 4(r1)         3
        add  r3, r3, r2        19
        lw   r1, 0(r1)         20
        j    loop              21
                               36
  end:

Problem Q5.3.A (5 points)

How many cycles does it take for one iteration of the loop to complete if the system is single threaded?

Answer: 36 - 1 = 35 cycles. The jump executes while the last lw is in progress, but the next iteration can't start until the load is done.

Problem Q5.3.B (6 points)

If the system is multithreaded with fixed round-robin scheduling, how many threads are needed to fully utilize the pipeline?

Answer: It needs to cover 15 cycles of latency between lw and …
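
The FLOPS-per-cycle answers in Problem Q5.1 can be checked with a few lines of arithmetic. The sketch below is only an illustration, not part of the answer key: the helper names (flops_per_cycle, steady_state_bound) are hypothetical, and the steady-state bound assumes that full unrolling and software pipelining make the ALU's index updates and branch negligible per original iteration, leaving only the Memory Unit and the FPU as candidate bottlenecks.

```python
from fractions import Fraction


def flops_per_cycle(flops, cycles):
    """Floating-point operations completed per cycle for one scheduled loop body."""
    return Fraction(flops, cycles)


def steady_state_bound(ops_per_iteration, flops_per_iteration):
    """Throughput bound when each op class has exactly one fully pipelined unit.

    The busiest unit issues one op per cycle, so it sets the minimum number of
    cycles per original loop iteration, regardless of latency.
    """
    cycles_per_iteration = max(ops_per_iteration.values())
    return Fraction(flops_per_iteration, cycles_per_iteration)


# Q5.1.A: one iteration (1 fmul + 1 fadd) takes 9 cycles.
part_a = flops_per_cycle(2, 9)

# Q5.1.B: two unrolled iterations (4 FLOPs) take 10 cycles.
part_b = flops_per_cycle(4, 10)

# Q5.1.C: per original iteration, 2 loads + 1 store and 1 fmul + 1 fadd.
# Assumption: ALU overhead (addi, bne) is amortized away by unrolling.
ops = {"memory": 3, "fpu": 2}
part_c = steady_state_bound(ops, flops_per_iteration=2)
bottleneck = max(ops, key=ops.get)

print(part_a, part_b, part_c, bottleneck)
```

Fraction reduces 4/10 to 2/5, which is the same rate. The bound depends only on operation counts per unit, which is the point of the answer to Q5.1.C: in steady state the limit is throughput, not latency.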
