UMD CMSC 411 - Unit 4b – Software instruction- level parallelism - D953317

Home> Schools> University of Maryland, College Park> Computer Science (CMSC) > CMSC 411> Unit 4b – Software instruction- level parallelism

DOC PREVIEW

UMD CMSC 411 - Unit 4b – Software instruction- level parallelism

School name University of Maryland, College Park

Course Cmsc 411- Computer Systems Architecture

Pages 18

This preview shows page 1-2-3-4-5-6 out of 18 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 18 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 18 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 18 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 18 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 18 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 18 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 18 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

CMSC 411 - A. Sussman (from D. O'Leary) 1Computer Systems ArchitectureCMSC 411Unit 4b – Software instruction-level parallelismAlan SussmanDecember 7, 2004CMSC 411 - Alan Sussman 2Administrivia• Project questions?• Quiz 3 today– questions• Practice final posted soon• Homework 6 posted, due Thursday• Read Chapter 4, except 4.8 and global code scheduling in 4.4• Online course evaluation available at https://www.courses.umd.edu/online_evaluationCMSC 411 - A. Sussman (from D. O'Leary) 2CMSC 411 - Alan Sussman 3Last time•RAID– RAID 3 – bit-interleaved parity– RAID 4 – block-interleaved parity – like RAID 3, but faster reads and writes– RAID 5 – RAID 4, but stripe parity blocks across disks– RAID 6 – use another disk to allow recovery from 2 failures• I/O performance measures and system design– bandwidth vs. latency– performance analysis shows that CPU is usually not the bottleneck (most of the time is spent accessing the disk)• Stale data– attach I/O bus to cache vs. to memory•DMA– virtual vs. physical addressesCMSC 411 - Alan Sussman 4Loop unrolling• To improve performance of pipelines and simple multiple issue processors– for static issue, and for dynamic issue with static scheduling• To fill pipeline stalls– can be done dynamically in hardware, or statically in compiler• Look again at an example we used beforeCMSC 411 - A. Sussman (from D. O'Leary) 3CMSC 411 - Alan Sussman 5Example - loop unrollingoriginal loop:for i=1000, 999, ..., 1x[i] = x[i] + s;unrolled to a depth of 4:for i=1000, 996, 992,..., 4x[i] = x[i] + s;x[i-1] = x[i-1] + s;x[i-2] = x[i-2] + s;x[i-3] = x[i-3] + s;end forLoop:L.D F0,0(R1) x[i] = x[i] + sADD.D F4,F0,F2 uses F0 and F4S.D F4,0(R1)L.D F6,-8(R1) x[i-1] = x[i-1] + sADD.D F8,F6,F2 uses F6 and F8S.D F8,-8(R1)L.D F10,-16(R1) x[i-2] = x[i-2] + sADD.D F12,F10,F2 uses F10 and F12S.D F12,-16(R1)L.D F14,-24(R1) x[i-3] = x[i-3] + sADD.D F16,F14,F2 uses F10 and F12S.D F16,-24(R1)DSUBI R1,R1,#32 point to next elementBNE R1,R2,LoopCMSC 411 - Alan Sussman 6And reschedule the loopLoop:L.D F0,0(R1)L.D F6,-8(R1) L.D F10,-16(R1)L.D F14,-24(R1) ADD.D F4, F0,F2ADD.D F8,F6,F2ADD.D F12,F10,F2ADD.D F16,F14,F2S.D F4,0(R1)S.D F8,-8(R1)DSUBI R1,R1,#32 S.D F12,16(R1)BNE R1,R2,LoopS.D F16, 8(R1)CMSC 411 - A. Sussman (from D. O'Leary) 4CMSC 411 - Alan Sussman 7Example (cont.)•Note: if 1000 were not divisible by 4, we would have a loop like this plus added code to take care of the last few elements• How well does the unrolled code pipeline?– uses (14 cycles)/(4 elements), instead of original code that used 6 cycles per element• assuming issue 1 instruction per cycle, and standard MIPS pipeline organization (load delays, functional unit latencies)CMSC 411 - Alan Sussman 8Loop unrolling (cont.)• Limited only by:– number of available registers– size of instruction cache - want the unrolled loop to fit• What is gained– fewer pipeline stalls/bubbles– less loop overhead - fewer DSUBIs and BNEs• What is lost– longer code– many possibilities to introduce errors– slower compilation– more work for either the programmer or for the compiler writerCMSC 411 - A. Sussman (from D. O'Leary) 5CMSC 411 - Alan Sussman 9What did the compiler have to do?• Determine that it was legal to move S.D after DSUBI and BNE, and adjust S.D offsets• Determine that loop unrolling would be useful –improve performance• Use different registers to avoid name dependences• Eliminate extra test and branch instructions, and adjust loop termination and counter code• Determine that loads and stores could be interchanged – the ones from different iterations are independent – requires memory address analysis• Schedule the code, preserving true dependencesComputer Systems ArchitectureCMSC 411Unit 4b – Software instruction-level parallelismAlan SussmanDecember 9, 2004CMSC 411 - A. Sussman (from D. O'Leary) 6CMSC 411 - Alan Sussman 11Administrivia• Practice final posted, answers soon• Homework 6 due today, answers posted today• Read Chapter 4, except 4.8 and global code scheduling in 4.4– practice HW questions w/answers posted today• Extra office hours today & tomorrow, 4-5PM• Online course evaluation available at https://www.courses.umd.edu/online_evaluationCMSC 411 - Alan Sussman 12Last time• Loop unrolling– to fill pipeline stalls, for static issue and dynamic issue with static scheduling– limited by registers, instruction cache– removes stalls, and loop overhead (counters, branches)– increases code size, so makes more work for compiler and/or programmer– compiler has to• check legality of reordering instructions from multiple loop iterations• rename registers• eliminate extra branches, counter increments• adjust loop bounds and step size• memory address analysis – to reorder loads and stores• schedule the code, and preserve flow dependences (RAW)CMSC 411 - A. Sussman (from D. O'Leary) 7CMSC 411 - Alan Sussman 13Dependences limit loop unrolling• In unrolling, removed intermediate DSUBI instructions to reduce the data dependencefor the L.D and the control dependence for the BNEZ• There are also antidependences, so also made sure that later copies of the unrolled code used registers other than F0 & F4 -eliminated name dependencesLoop: L.D F0,0(R1)ADD.D F4,F0,F2S.D F4,0(R1)DSUBI R1,R1,#8BNEZ R1,R2,Loop L.D and BNEZ depend on result of DSUBICMSC 411 - Alan Sussman 14True data dependences also limit unrolling• First assignment statement is an example of a loop-carried dependence• Second assignment statement doesn't limit unrolling, but makes scheduling trickierfor i=1,...,1000x[i+1] = x[i] + c[i] ;Uses value from previous iterationb[i+1] = d[i] + x[i+1] ;Uses the value just computedend forCMSC 411 - A. Sussman (from D. O'Leary) 8CMSC 411 - Alan Sussman 15One problem for the programmer• Precise exception handling becomes impossible if the compiler unrolls loops• The order of operations that the user assumes is completely violated, so exceptions occur at unrecognizable locations• Rather than precise exception handling, settle for the property that the unrolled code produces no new exceptions over the original code• Also possible to provide software to simulate the code's behavior around the exception to help user diagnose the problemCMSC 411 - Alan

View Full Document