Assigned April 7 CS152 Computer Architecture and Engineering VLIW Vector and Multithreaded Machines Problem Set 5 Due April 21 http inst eecs berkeley edu cs152 sp09 The problem sets are intended to help you learn the material and we encourage you to collaborate with other students and to ask questions in discussion sections and office hours to understand the problems However each student must turn in their own solutions to the problems The problem sets also provide essential background material for the quizzes The problem sets will be graded primarily on an effort basis but if you do not work through the problem sets you are unlikely to succeed at the quizzes We will distribute solutions to the problem sets on the day the problem sets are due to give you feedback Homework assignments are due at the beginning of class on the due date Homework will not be accepted once solutions are handed out Problem P5 1 Trace Scheduling Trace scheduling is a compiler technique that increases ILP by removing control dependencies allowing operations following branches to be moved up and speculatively executed in parallel with operations before the branch It was originally developed for statically scheduled VLIW machines but it is a general technique that can be used in different types of machines and in this question we apply it to a single issue MIPS processor Consider the following piece of C code is modulus with basic blocks labeled A B C D E F G if data X V0 else X V2 if data Y V0 else Y V2 8 0 V1 V3 4 0 V1 V3 Assume that data is a uniformly distributed integer random variable that is set sometime before executing this code The program s control flow graph is The decision tree is A A B C B C D D D E F G E F E F Path G probabilities for 5 A G G G A control flow graph and the decision tree both show the possible flow of execution through basic blocks However the control flow graph captures the static structure of the program while the decision tree captures the dynamic execution history of the program Page 2 of 18 Problem P5 1 A On the decision tree label each path with the probability of traversing that path For example the leftmost block will be labeled with the total probability of executing the path ABDEG Hint you might want to write out the cases Circle the path that is most likely to be executed Problem P5 1 B This is the MIPS code no delay slots A B C D E lw andi bnez div j div andi bnez mul j mul r1 r2 r2 r3 D r3 r2 r2 r8 G r8 data r1 7 r2 r1 8 C r4 r5 X V0 V1 r6 r7 X V2 V3 r1 3 r2 r1 4 F r4 r5 Y V0 V1 F r6 r7 Y V2 V3 G This code is to be executed on a single issue processor without branch speculation Assume that the memory divider and multiplier are all separate long latency unpipelined units that can be run in parallel Rewrite the above code using trace scheduling Optimize only for the most common path Just get the other paths to work Don t spend your time performing any other optimizations Ignore the possibility of exceptions Hint Write the most common path first then add fix up code Problem P5 1 C Assume that the load takes x cycles divide takes y cycles and multiply takes z cycles Approximately how many cycles does the original code take ignore small constants Approximately how many cycles does the new code take in the best case Page 3 of 18 Problem P5 2 VLIW machines The program we will use for this problem is listed below In all questions you should assume that arrays A B and C do not overlap in memory C code for i 0 i 328 i A i A i B i C i C i A i In this problem we will deal with the code sample on a VLIW machine Our machine will have six execution units two ALU units latency one cycle also used for branch operations two memory units latency three cycles fully pipelined each unit can perform either a store or a load two FPU units latency four cycles fully pipelined one unit can perform fadd operations the other fmul operations Our machine has no interlocks The result of an operation is written to the register file immediately after it has gone through the corresponding execution unit one cycle after issue for ALU operations three cycles for memory operations and four cycles for FPU operations The old values can be read from the registers until they have been overwritten Below is a diagram of our VLIW machine Int Op 1 Int Op 2 Mem Op 1 Mem Op 2 FP ADD FP MULT Two Integer Units Single Cycle Latency Two Load Store Units Three Cycle Latency Page 4 of 18 Two Floating Point Units Four Cycle Latency The program for this problem translates to the following VLIW operations loop 1 2 3 4 5 6 7 8 9 10 11 12 ld f1 0 r1 ld f2 0 r2 fmul f4 f2 f1 st f4 0 r1 ld f3 0 r3 fadd f5 f4 f3 st f5 0 r3 add r1 r1 4 add r2 r2 4 add r3 r3 4 add r4 r4 1 bnez r4 loop f1 f2 f4 A i f3 f5 C i i A i B i f1 f2 f4 C i f4 f3 f5 loop Problem P5 2 A Table P5 2 1 on the next page shows our program rewritten for our VLIW machine with some operations missing instructions 2 6 and 7 We have rearranged the instructions to execute as soon as they possibly can but ensuring program correctness Please fill in missing operations Note you may not need all the rows Problem P5 2 B How many cycles are required to complete one iteration of the loop in steady state What is the performance flops cycle of the program Problem P5 2 C How many VLIW instructions would the smallest software pipelined loop require Explain briefly Ignore the prologue and the epilogue Note You do not need to write the software pipelined version You may consult Table P5 2 1 for help Problem P5 2 D Page 5 of 18 What would be the performance flops cycle of the program How many iterations of the loop would we have executing at the same time Page 6 of 18 ALU1 ALU2 MU1 Add r1 r1 4 add r2 r2 4 ld f1 0 r1 Add r3 r3 4 add r4 r4 1 ld f3 0 r3 MU2 FADD FMUL fmul f4 f2 f1 st f4 4 r1 bnez r4 loop Table P5 2 1 VLIW Program Page 7 of 18 Problem P5 2 E If we unrolled the loop once would that give us better performance How many VLIW instructions would we need for optimal performance How many flops cycle would we get Explain …
View Full Document
Unlocking...