CS433: Computer System OrganizationDependencesSlide 3Elements Required for DependenceStatic Exploitation of ILPMultiple-Issue versus VLIWMultiple Issue versus VLIWBasic Pipeline SchedulingSlide 9Loop UnrollingLoop Unrolling: Register Renaming And Normalization of Induction VariablesLoop Unrolling: Scheduling After Register RenamingLoop Unrolling: CleanupRegister Allocation and ILPSo-Called “Phase Ordering” Problem: Schedule First, Using Virtual RegistersSo-Called “Phase Ordering” Problem: Allocate FirstQuestionLoops and Dependence DistanceExamples of DependenceSlide 20Slide 21Slide 22Slide 23Slide 24Slide 25What if we write it this way:Software PipeliningAnti Dependences and VLIW / SuperscalarSlide 29Software Pipelining and Loop ReversalLoop ReversalSlide 32SW Pipelining and Flow and Anti DependencesSW Pipelining Changes Flow Dependences into Anti DependencesFlow and Anti DependencesPrologue and EpilogueSlide 37Dependence ConstraintsDependence Constraints for Software PipeliningSlide 40Slide 41CS433: Computer System OrganizationLuddy HarrisonStatic Exploitation ofInstruction Level ParallelismDependencesFrom an earlier statement to a later statement (in program execution order).This is a time ordering, not a visual or textual ordering.Three types:Flow:A == AAnti:= AA =Output:A =A =DependencesFlow:Earlier statement writes same storage location that later statement readsStorage location may be register, memory, anything that holds state.Anti:Earlier statement reads same storage location that later statement writesOutput:Earlier statement writes same storage location that later statement writesElements Required for DependenceStatement instancesTwo instructions / statements or two instances of a single instruction / statementDynamic instances thereof – statements and instructions may be executed many timesTimeearlier statement and later statementordered by program execution orderLocationthe two statements must access the same locationthis may be a register or memory or any other kind of stateAccess (Read and/or Write)one of the accesses must be a write (modification)Static Exploitation of ILPStatic exploitation requires eitherA programmer who writes assembly language, orA compiler organize the instructions of the program such that there is sufficient parallelism between themWe will consider compiler techniques for the most partI will mention when an assembly programmer might do better or differentlyMultiple-Issue versus VLIWWhat’s the difference?Instructions in a VLIW “line” must not depend on one anotherIt is permitted however to use as a destination a register than another instruction in the same line is using as a source. The source is read before it is overwritten as a destinationThis is a very important source of additional parallelism in VLIW machines!In a multiple-issue-capable CPU (but not a VLIW), if adjacent instructions are dependent, a stall occurs between the instructions, but the program still works.Multiple Issue versus VLIW add R0, R1, R2 add R3, R4, R5 add R1, R0, R6 add R4, R3, R7What does this program do if run on a multiple issue processor that can run up to four instructions in parallel?What does it do if run as a single instruction line on a 4-wide VLIW?Basic Pipeline Schedulingload R3, [R1 + 12]add R3, R3, 4load R0, [R1 + 16]add R0, R0, 4load R3, [R1 + 12]load R0, [R1 + 16]add R3, R3, 4add R0, R0, 4The colors represent dependent instructions.Assume there is a one-cycle stall between a load and an instruction that uses the load’s result. We reorder the instructions to eliminate these stalls.Here we don’t need to change the instructions, but that is unusual. Usually we are not so lucky.Basic Pipeline Schedulingload R3, [R1 + 12]add R0, R0, R3load R3, [R1 + 16]add R0, R0, R3load R3, [R1 + 12]add R0, R0, R3load R4, [R1 + 16]add R0, R0, R4load R3, [R1 + 12]load R4, [R1 + 16]add R0, R0, R3add R0, R0, R4We would like to do the same transformation as before, but R3 is used as the destination for both loads.We rename R3 to R4 in the second two instructions. This breaks the anti-dependence.renamereorderLoop Unrolling move R2, 0 move R3, 100L: load R0, [R1 + 16] add R2, R2, R0 add R1, R1, 24 sub R3, R3, 1 bge L move R2, 0 move R3, 100L: load R0, [R1 + 16] add R2, R2, R0 add R1, R1, 24 sub R3, R3, 1 load R0, [R1 + 16] add R2, R2, R0 add R1, R1, 24 sub R3, R3, 1 bge LcopycopyIs this transformation always legal?What are the required conditions?Loop Unrolling: Register Renaming And Normalization of Induction Variables move R2, 0 move R3, 100L: load R0, [R1 + 16] add R2, R2, R0 add R1, R1, 24 sub R3, R3, 1 load R0, [R1 + 16] add R2, R2, R0 add R1, R1, 24 sub R3, R3, 1 bge L move R2, 0 move R3, 100L: load R0, [R1 + 16] add R2, R2, R0 add R5, R1, 24 sub R3, R3, 1 load R4, [R1 + 40] add R2, R2, R4 add R1, R5, 24 sub R3, R3, 1 bge LLoop Unrolling: Scheduling After Register Renaming move R2, 0 move R3, 100L: load R0, [R1 + 16] add R2, R2, R0 add R5, R1, 24 sub R3, R3, 1 load R4, [R1 + 40] add R2, R2, R4 add R1, R5, 24 sub R3, R3, 1 bge L move R2, 0 move R3, 100L: load R0, [R1 + 16] load R4, [R1 + 40] add R2, R2, R0 add R2, R2, R4 add R5, R1, 24 add R1, R5, 24 sub R3, R3, 1 sub R3, R3, 1 bge LLoop Unrolling: Cleanup move R2, 0 move R3, 100L: load R0, [R1 + 16] load R4, [R1 + 40] add R2, R2, R0 add R2, R2, R4 add R1, R1, 24 add R1, R1, 24 sub R3, R3, 1 sub R3, R3, 1 bge L move R2, 0 move R3, 100L: load R0, [R1 + 16] load R4, [R1 + 40] add R2, R2, R0 add R2, R2, R4 add R1, R1, 48 sub R3, R3, 2 bge LRegister Allocation and ILPCan you imagine a world in which assigning seats on an airplane and scheduling people on flights were considered two different activities?“First we schedule everybody who wants to fly to Memphis on March 19th on flight 459 at 10AM. Then, we assign them seats.”Or, “First we assign everybody a seat. Next we look to see what day and time they want to fly.”This what we do with registers and instruction schedules (roughly speaking)So-Called “Phase Ordering” Problem: Schedule First, Using Virtual Registersload V0add V1, V0load V2add V1, V2load V3add V1, V4load V5add V1, V6load V7add V1, V7load V8add V1, V8load V0load V2load V3load V4load V5load V6add V1,
View Full Document