Computer Systems Architecture CMSC 411 Unit 4b Software instructionlevel parallelism Alan Sussman December 7 2004 Administrivia Project questions Quiz 3 today questions Practice final posted soon Homework 6 posted due Thursday Read Chapter 4 except 4 8 and global code scheduling in 4 4 Online course evaluation available at https www courses umd edu online evaluation CMSC 411 Alan Sussman CMSC 411 A Sussman from D O Leary 2 1 Last time RAID RAID 3 bit interleaved parity RAID 4 block interleaved parity like RAID 3 but faster reads and writes RAID 5 RAID 4 but stripe parity blocks across disks RAID 6 use another disk to allow recovery from 2 failures I O performance measures and system design bandwidth vs latency performance analysis shows that CPU is usually not the bottleneck most of the time is spent accessing the disk Stale data attach I O bus to cache vs to memory DMA virtual vs physical addresses CMSC 411 Alan Sussman 3 Loop unrolling To improve performance of pipelines and simple multiple issue processors for static issue and for dynamic issue with static scheduling To fill pipeline stalls can be done dynamically in hardware or statically in compiler Look again at an example we used before CMSC 411 Alan Sussman CMSC 411 A Sussman from D O Leary 4 2 Example loop unrolling Loop L D F0 0 R1 x i x i s ADD D F4 F0 F2 uses F0 and F4 S D F4 0 R1 L D F6 8 R1 x i 1 x i 1 s ADD D F8 F6 F2 uses F6 and F8 unrolled to a depth of 4 S D F8 8 R1 for i 1000 996 992 4 L D F10 16 R1 x i 2 x i 2 s ADD D F12 F10 F2 uses F10 and F12 x i x i s S D F12 16 R1 x i 1 x i 1 s L D F14 24 R1 x i 3 x i 3 s x i 2 x i 2 s ADD D F16 F14 F2 uses F10 and F12 x i 3 x i 3 s S D F16 24 R1 end for DSUBI R1 R1 32 point to next element BNE R1 R2 Loop original loop for i 1000 999 1 x i x i s CMSC 411 Alan Sussman 5 And reschedule the loop Loop L D F0 0 R1 L D F6 8 R1 L D F10 16 R1 L D F14 24 R1 ADD D F4 F0 F2 ADD D F8 F6 F2 ADD D F12 F10 F2 ADD D F16 F14 F2 S D F4 0 R1 S D F8 8 R1 DSUBI R1 R1 32 S D F12 16 R1 BNE R1 R2 Loop S D F16 8 R1 CMSC 411 Alan Sussman CMSC 411 A Sussman from D O Leary 6 3 Example cont Note if 1000 were not divisible by 4 we would have a loop like this plus added code to take care of the last few elements How well does the unrolled code pipeline uses 14 cycles 4 elements instead of original code that used 6 cycles per element assuming issue 1 instruction per cycle and standard MIPS pipeline organization load delays functional unit latencies CMSC 411 Alan Sussman 7 Loop unrolling cont Limited only by number of available registers size of instruction cache want the unrolled loop to fit What is gained fewer pipeline stalls bubbles less loop overhead fewer DSUBIs and BNEs What is lost longer code many possibilities to introduce errors slower compilation more work for either the programmer or for the compiler writer CMSC 411 Alan Sussman CMSC 411 A Sussman from D O Leary 8 4 What did the compiler have to do Determine that it was legal to move S D after DSUBI and BNE and adjust S D offsets Determine that loop unrolling would be useful improve performance Use different registers to avoid name dependences Eliminate extra test and branch instructions and adjust loop termination and counter code Determine that loads and stores could be interchanged the ones from different iterations are independent requires memory address analysis Schedule the code preserving true dependences CMSC 411 Alan Sussman 9 Computer Systems Architecture CMSC 411 Unit 4b Software instructionlevel parallelism Alan Sussman December 9 2004 CMSC 411 A Sussman from D O Leary 5 Administrivia Practice final posted answers soon Homework 6 due today answers posted today Read Chapter 4 except 4 8 and global code scheduling in 4 4 practice HW questions w answers posted today Extra office hours today tomorrow 4 5PM Online course evaluation available at https www courses umd edu online evaluation CMSC 411 Alan Sussman 11 Last time Loop unrolling to fill pipeline stalls for static issue and dynamic issue with static scheduling limited by registers instruction cache removes stalls and loop overhead counters branches increases code size so makes more work for compiler and or programmer compiler has to check legality of reordering instructions from multiple loop iterations rename registers eliminate extra branches counter increments adjust loop bounds and step size memory address analysis to reorder loads and stores schedule the code and preserve flow dependences RAW CMSC 411 Alan Sussman CMSC 411 A Sussman from D O Leary 12 6 Dependences limit loop unrolling Loop L D F0 0 R1 ADD D F4 F0 F2 S D F4 0 R1 DSUBI R1 R1 8 BNEZ R1 R2 Loop In unrolling removed intermediate DSUBI instructions to reduce the data dependence for the L D and the control dependence for the BNEZ There are also antidependences so also made sure that later L D and BNEZ depend copies of the unrolled code used on result of DSUBI registers other than F0 F4 eliminated name dependences CMSC 411 Alan Sussman 13 True data dependences also limit unrolling for i 1 1000 x i 1 x i c i Uses value from previous iteration b i 1 d i x i 1 Uses the value just computed end for First assignment statement is an example of a loop carried dependence Second assignment statement doesn t limit unrolling but makes scheduling trickier CMSC 411 Alan Sussman CMSC 411 A Sussman from D O Leary 14 7 One problem for the programmer Precise exception handling becomes impossible if the compiler unrolls loops The order of operations that the user assumes is completely violated so exceptions occur at unrecognizable locations Rather than precise exception handling settle for the property that the unrolled code produces no new exceptions over the original code Also possible to provide software to simulate the code s behavior around the exception to help user diagnose the problem CMSC 411 Alan Sussman 15 Compilers need to detect data dependences for i 1 2 100 a i b i c i d i a i e i end for Can unroll this loop but must be careful not to interchange the order of the two statements Note that a i never needs to be loaded CMSC 411 Alan Sussman CMSC 411 A Sussman from D O Leary 16 8 Second example Quiz Unroll the loop to a depth of 4 …
View Full Document