University of Southern California
Department of Electrical Engineering
EE557 Spring 2K14
Instructors: Michel Dubois and Murali Annavaram
Sections: 30667, 30820D, 30716D, 30455D and 30823D

Homework #1
Due: Tuesday, February 11, 5pm
TOTAL SCORE: 150

Problem 1 (20pts)
Problem 1.1 in the book with the following modifications: The time taken by each floating-point instruction can be reduced by a factor of 15 with the new hardware. The loads and stores can be sped up by a factor of 3 over the base machine.
a. Unchanged
b. Unchanged
c. The speedup is 30% or 1.3
d. In the original workload, fractions Ffp and Fls are 20% and 40%, respectively.

Problem 2 (20pts)
Problem 1.3 in the book, but using the following tables.

Execution times of three programs
Machines           Program1   Program2   Program3
Base machine       17sec      10msec     5sec
Base + FP units    16sec      2msec      2sec
Base + cache       10sec      9msec      3sec

Normalized execution times of three programs
Machines           Program1   Program2   Program3
Base machine       1          1          1
Base + FP units    0.94       0.2        0.4
Base + cache       0.59       0.9        0.6

Problem 3 (20pts)
Problem 1.5 in the book, but using the following table.

Instructions          Frequency   Cycles
Arithmetic/logic      25%         1
Loads                 40%         2
Stores                5%          1
Branches (untaken)    15%         2
Branches (taken)      5%          4
Miscellaneous         10%         1

Problem 4 (30pts)
Problem 2.1 in the book with the following modifications: Consider the following designs for the problem.
A. 6-stage pipeline clocked at 2f
B. Single-cycle CPU clocked at f
C. 5-way multiprocessor in which each processor is the single CPU clocked at 2f
D. 5-way multiprocessor in which each processor is a 6-stage pipeline clocked at 2f
Compare these four designs with the base machine, which is the 5-stage pipeline clocked at f.

Problem 5 (20pts)
Problem 3.1 in the book, considering that all memory addresses are 16 bits.

Problem 6 (20pts)
Problem 3.3 in the book, but compile the following code. A[0] is stored at memory address 1000. The code computes the first 100 elements of the Fibonacci series.

    A[0] = 0;
    A[1] = 1;
    for (i = 2; i < 100; i++)
        A[i] = A[i-1] + A[i-2];

Problem 7 (20pts)
A combination of two enhancements is considered to boost the performance of a chip multiprocessor. The enhancements are (1) adding more cores or (2) adding more shared level-2 (L2) cache. The base chip has 2 cores and 8 L2 cache banks. L2 cache can be added by adding cache banks, and each cache bank uses three times the area of a core. Here is what we also know from all kinds of sources:
1. 70% of the workload can be fully parallelized; the rest cannot.
2. The core stall time due to L2 misses accounts for 10% of each core's execution time in the base configuration.
3. It is suspected that the amount of shared L2 cache per core should remain constant in order to keep the same miss rate.
4. Simulations have also determined that the miss rate of L2 decreases as the square root of its size per core. A conjecture is that the stall time in each core will also decrease as the square root of the L2 size per core.

The company that pays your paycheck has acquired a new technology to build large microchips, so that the next-generation chips will have four times the area of current chips to dedicate to cores and L2 caches. Given what you know, what kind of best first-cut design would you propose? A design is characterized by (# of cores, # of L2 cache banks). These numbers can be any integers. The design must fit within the new chip. Estimate the speedup of your best design that takes advantage of the new chip real estate.
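One way to explore the design space in Problem 7 is a brute-force search over (cores, banks) pairs. The sketch below is only a first-cut model under stated assumptions, not the intended solution: area is measured in units of one core, only cores and L2 banks consume area, the 70% parallel fraction is taken relative to single-core execution, and the square-root stall conjecture is applied to both the serial and the parallel phases. All function and constant names are hypothetical.

```python
import math

BANK_AREA = 3                 # one L2 bank uses three times the area of a core
BASE_CORES, BASE_BANKS = 2, 8
AREA_SCALE = 4                # the new chip has four times the area
PARALLEL_FRACTION = 0.70      # share of the workload that is fully parallelizable
BASE_STALL_FRACTION = 0.10    # L2-miss stall share of core time in the base chip


def chip_area(cores, banks):
    """Area in core units (assumption: only cores and L2 banks count)."""
    return cores + banks * BANK_AREA


def relative_time(cores, banks):
    """Execution time relative to one base core running the whole workload."""
    base_l2_per_core = BASE_BANKS / BASE_CORES
    l2_per_core = banks / cores
    # Conjecture from the problem: stall time shrinks as sqrt(L2 size per core).
    stall = BASE_STALL_FRACTION * math.sqrt(base_l2_per_core / l2_per_core)
    per_core_factor = (1 - BASE_STALL_FRACTION) + stall
    # Amdahl's law: the serial part runs on one core, the rest on all cores.
    amdahl = (1 - PARALLEL_FRACTION) + PARALLEL_FRACTION / cores
    return amdahl * per_core_factor


def speedup(cores, banks):
    """Speedup of a design over the base chip (2 cores, 8 banks)."""
    return relative_time(BASE_CORES, BASE_BANKS) / relative_time(cores, banks)


def best_design():
    """Exhaustively search all integer designs that fit in the new area."""
    budget = AREA_SCALE * chip_area(BASE_CORES, BASE_BANKS)
    candidates = [
        (cores, banks)
        for cores in range(1, budget + 1)
        for banks in range(1, budget // BANK_AREA + 1)
        if chip_area(cores, banks) <= budget
    ]
    return max(candidates, key=lambda design: speedup(*design))


if __name__ == "__main__":
    cores, banks = best_design()
    print(cores, banks, round(speedup(cores, banks), 3))
```

A useful sanity check is the design that simply scales the base chip by four (8 cores, 32 banks): it keeps the L2 size per core constant, so under this model its speedup comes from Amdahl's law alone. Changing the assumptions, for example letting the serial phase use the whole L2, would change the ranking of designs.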