Lecture 6: Static ILP

• Topics: loop analysis, SW pipelining, predication, speculation (Section 2.2, Appendix G)
• Please hand in Assignment 1 now
• Assignment 2 posted; due in a week

Smart Schedule

Original schedule (10 cycles per iteration):

    Loop: L.D    F0, 0(R1)
          stall
          ADD.D  F4, F0, F2
          stall
          stall
          S.D    F4, 0(R1)
          DADDUI R1, R1, #-8
          stall
          BNE    R1, R2, Loop
          stall

Re-ordered schedule (6 cycles per iteration):

    Loop: L.D    F0, 0(R1)
          DADDUI R1, R1, #-8
          ADD.D  F4, F0, F2
          stall
          BNE    R1, R2, Loop
          S.D    F4, 8(R1)

• By re-ordering instructions, the loop takes 6 cycles per iteration instead of 10
• We were able to violate an anti-dependence easily because an immediate was involved: the S.D now follows the DADDUI that overwrites R1, so its offset changes from 0 to 8
• Loop overhead (instrs that do book-keeping for the loop): 2 instrs
• Actual work (the L.D, ADD.D, and S.D): 3 instrs
• Can we somehow get execution time down to 3 cycles per iteration?

Loop Unrolling

    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          L.D    F6, -8(R1)
          ADD.D  F8, F6, F2
          S.D    F8, -8(R1)
          L.D    F10, -16(R1)
          ADD.D  F12, F10, F2
          S.D    F12, -16(R1)
          L.D    F14, -24(R1)
          ADD.D  F16, F14, F2
          S.D    F16, -24(R1)
          DADDUI R1, R1, #-32
          BNE    R1, R2, Loop

• Loop overhead: 2 instrs; Work: 12 instrs
• How long will the above schedule take to complete?

Scheduled and Unrolled Loop

    Loop: L.D    F0, 0(R1)
          L.D    F6, -8(R1)
          L.D    F10, -16(R1)
          L.D    F14, -24(R1)
          ADD.D  F4, F0, F2
          ADD.D  F8, F6, F2
          ADD.D  F12, F10, F2
          ADD.D  F16, F14, F2
          S.D    F4, 0(R1)
          S.D    F8, -8(R1)
          DADDUI R1, R1, #-32
          S.D    F12, 16(R1)
          BNE    R1, R2, Loop
          S.D    F16, 8(R1)

• Execution time: 14 cycles, or 3.5 cycles per original iteration

Loop Unrolling

• Increases program size
• Requires more registers
• To unroll an n-iteration loop by degree k, we will need floor(n/k) iterations of the larger loop, followed by (n mod k) iterations of the original loop

Automating Loop Unrolling

• Determine the dependences across iterations: in the example, we knew that loads and stores in different iterations did not conflict and could be re-ordered
• Determine if unrolling will help (possible only if iterations are independent)
• Determine address offsets for different loads/stores
• Dependency analysis to schedule code without introducing hazards; eliminate name dependences by using additional registers

Superscalar Pipelines

    Integer pipeline                      FP pipeline
    Handles L.D, S.D, DADDUI, BNE         Handles ADD.D

• What is the schedule with an unroll degree of 4?

Superscalar Pipelines

          Integer pipeline            FP pipeline
    Loop: L.D    F0, 0(R1)
          L.D    F6, -8(R1)
          L.D    F10, -16(R1)         ADD.D  F4, F0, F2
          L.D    F14, -24(R1)         ADD.D  F8, F6, F2
          L.D    F18, -32(R1)         ADD.D  F12, F10, F2
          S.D    F4, 0(R1)            ADD.D  F16, F14, F2
          S.D    F8, -8(R1)           ADD.D  F20, F18, F2
          S.D    F12, -16(R1)
          DADDUI R1, R1, #-40
          S.D    F16, 16(R1)
          BNE    R1, R2, Loop
          S.D    F20, 8(R1)

• Need to unroll by degree 5 to eliminate stalls
• The compiler may specify instructions that can be issued as one packet
• The compiler may specify a fixed number of instructions in each packet: Very Large Instruction Word (VLIW)

Loop Dependences

• If a loop only has dependences within an iteration, the loop is considered parallel → multiple iterations can be executed together so long as order within an iteration is preserved
• If a loop has dependences across iterations, it is not parallel and these dependences are referred to as "loop-carried"
• Not all loop-carried dependences imply lack of parallelism

Examples

    for (i=1000; i>0; i=i-1)
        x[i] = x[i] + s;

    for (i=1; i<=100; i=i+1) {
        A[i+1] = A[i] + C[i];        S1
        B[i+1] = B[i] + A[i+1];      S2
    }

    for (i=1; i<=100; i=i+1) {
        A[i] = A[i] + B[i];          S1
        B[i+1] = C[i] + D[i];        S2
    }

    for (i=1000; i>0; i=i-1)
        x[i] = x[i-3] + s;           S1

Examples

    for (i=1000; i>0; i=i-1)
        x[i] = x[i] + s;

  No dependences

    for (i=1; i<=100; i=i+1) {
        A[i+1] = A[i] + C[i];        S1
        B[i+1] = B[i] + A[i+1];      S2
    }

  S2 depends on S1 in the same iteration
  S1 depends on S1 from prev iteration
  S2 depends on S2 from prev iteration

    for (i=1; i<=100; i=i+1) {
        A[i] = A[i] + B[i];          S1
        B[i+1] = C[i] + D[i];        S2
    }

  S1 depends on S2 from prev iteration

    for (i=1000; i>0; i=i-1)
        x[i] = x[i-3] + s;           S1

  S1 depends on S1 from 3 prev iterations
  Referred to as a recursion
  Dependence distance 3; limited parallelism

Constructing Parallel Loops

If loop-carried dependences are not cyclic (S1 depending on S1 is cyclic), loops can be restructured to be parallel

    for (i=1; i<=100; i=i+1) {
        A[i] = A[i] + B[i];          S1
        B[i+1] = C[i] + D[i];        S2
    }

  S1 depends on S2 from prev iteration

    A[1] = A[1] + B[1];
    for (i=1; i<=99; i=i+1) {
        B[i+1] = C[i] + D[i];        S3
        A[i+1] = A[i+1] + B[i+1];    S4
    }
    B[101] = C[100] + D[100];

  S4 depends on S3 of same iteration

Finding Dependences – the GCD Test

• Do A[ai + b] and A[ci + d] refer to the same element?
• Restrict ourselves to affine array indices (expressible as ai + b, where i is the loop index, a and b are constants); example of a non-affine index: x[y[i]]
• For a dependence to exist, there must be two indices j and k within the loop bounds such that
      aj + b = ck + d
      aj - ck = d - b
      G = GCD(a, c)
      (aj/G - ck/G) = (d - b)/G
• If (d - b)/G is not an integer, the initial equality cannot be true

Software Pipeline?!

[Figure: the instruction sequence L.D, ADD.D, S.D, DADDUI, BNE repeated for successive loop iterations]

    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
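As an illustration of the rule on the Loop Unrolling slide (n/k iterations of the larger loop, then n mod k iterations of the original loop), here is a minimal Python sketch of the x[i] = x[i] + s loop unrolled by degree 4. The function name add_scalar_unrolled and the returned iteration counts are assumptions made for this example; they do not come from the lecture.

```python
def add_scalar_unrolled(x, s):
    """Add s to every element of x in place, unrolling the loop body 4 times.

    Returns (iterations of the unrolled loop, iterations of the cleanup loop).
    Hypothetical helper for illustration only.
    """
    k = 4                      # unroll degree, hard-coded in this sketch
    n = len(x)
    main_iters = n // k        # iterations of the larger (unrolled) loop
    i = 0
    for _ in range(main_iters):
        # Four copies of the original body; the iterations are independent,
        # so a scheduler would be free to interleave them.
        x[i] += s
        x[i + 1] += s
        x[i + 2] += s
        x[i + 3] += s
        i += k                 # single loop-overhead update per 4 elements
    cleanup_iters = 0
    for j in range(i, n):      # the remaining n mod k original iterations
        x[j] += s
        cleanup_iters += 1
    return main_iters, cleanup_iters


data = list(range(10))
print(add_scalar_unrolled(data, 3), data)
```

With 10 elements the unrolled loop runs twice and the cleanup loop handles the last two elements, so the loop overhead is paid once per four elements in the main loop.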

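The restructuring on the Constructing Parallel Loops slide can be checked by running both versions on the same data. The sketch below is a direct Python transcription of the two loops, assuming 1-based arrays stored in lists of length 102 (index 0 unused); the function names and the test data are invented for this example.

```python
def original_loop(A, B, C, D):
    """Loop with a loop-carried dependence: S1 reads B[i] written by S2
    in the previous iteration."""
    A, B = A[:], B[:]                     # work on copies
    for i in range(1, 101):
        A[i] = A[i] + B[i]                # S1
        B[i + 1] = C[i] + D[i]            # S2
    return A, B


def restructured_loop(A, B, C, D):
    """Same computation with the dependence moved inside one iteration:
    S4 reads B[i+1] written by S3 in the same iteration."""
    A, B = A[:], B[:]
    A[1] = A[1] + B[1]                    # peeled first S1
    for i in range(1, 100):
        B[i + 1] = C[i] + D[i]            # S3
        A[i + 1] = A[i + 1] + B[i + 1]    # S4
    B[101] = C[100] + D[100]              # peeled last S2
    return A, B


# Arbitrary integer test data, indices 0..101 (index 0 is unused).
A = [i for i in range(102)]
B = [2 * i + 1 for i in range(102)]
C = [3 * i for i in range(102)]
D = [i * i for i in range(102)]
print(original_loop(A, B, C, D) == restructured_loop(A, B, C, D))
```

In the restructured version no iteration reads a value produced by another iteration, which is what makes the loop parallel.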

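The GCD test described above can be sketched in a few lines of Python. The function name gcd_test and its argument order are assumptions for this example. Note that the test ignores the loop bounds, so it is conservative: a True result means a dependence may exist, while a False result rules one out.

```python
from math import gcd


def gcd_test(a, b, c, d):
    """Return True if A[a*i + b] and A[c*i + d] may refer to the same element.

    Implements the check that (d - b) must be divisible by G = GCD(a, c).
    Loop bounds are not considered, so True only means "possibly dependent".
    """
    g = gcd(a, c)
    if g == 0:
        # a == c == 0: both indices are constants and match only if b == d.
        return b == d
    return (d - b) % g == 0


# x[2*i + 3] written, x[2*i] read: GCD(2, 2) = 2 does not divide -3.
print(gcd_test(2, 3, 2, 0))    # False, so no dependence is possible
# x[i] written, x[i - 3] read (the recursion example): GCD(1, 1) = 1.
print(gcd_test(1, 0, 1, -3))   # True, a dependence may exist
```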
U of U CS 6810 - Static ILP
