
Spring 2010
Prof. Hyesoon Kim
Thanks to Prof. Loh & Prof. Prvulovic

[Figure: C/C++ program → Compiler → Assembly code (binary) → Processor]

[Figure: computer organization, from Introduction to Computing Systems (Patt & Patel). Labels: Memory (MAR, MDR), Processing Unit (ALU, TEMP), Control Unit (PC, IR), INPUT, OUTPUT]

• http://www.youtube.com/watch?v=_Lm7Acr5ysY&feature=related

[Figure: pipeline example. Labels: PC, I-$, DEC, RF, D-$, WB. Instructions: LD R1, MEM[R0]; ADD R2, R2, #1; BRZERO 0x0001. Non-pipelined: 15 cycles. Pipelined: 7 cycles.]

• Data Dependencies
– RAW: Read-After-Write (True Dependence)
– WAR: Write-After-Read (Anti-Dependence)
– WAW: Write-After-Write (Output Dependence)
• Control Dependence
– When following instructions depend on the outcome of a previous branch/jump

[Figure: pipeline timing diagram (Ifetch, Reg, ALU, DMem, Reg) in instruction order for the three instructions below]
    add r1,r2,r3
    sub r4,r1,r3
    and r6,r2,r7

All sources are ready? Why not execute them?

Read-After-Write
    A: R1 = R2 + R3
    B: R4 = R1 * R4

    Program order (A, then B):
                 R1   R2   R3   R4
    initial       5   -2    9    3
    after A       7   -2    9    3
    after B       7   -2    9   21

    Reordered (B, then A): R4 ends as 15 instead of 21
                 R1   R2   R3   R4
    initial       5   -2    9    3
    after B       5   -2    9   15
    after A       7   -2    9   15

Write-After-Read
    A: R1 = R3 / R4
    B: R3 = R2 * R4

    Program order (A, then B):
                 R1   R2   R3   R4
    initial       5   -2    9    3
    after A       3   -2    9    3
    after B       3   -2   -6    3

    Reordered (B, then A): R1 ends as -2 instead of 3
                 R1   R2   R3   R4
    initial       5   -2    9    3
    after B       5   -2   -6    3
    after A      -2   -2   -6    3

Write-After-Write
    A: R1 = R2 + R3
    B: R1 = R3 * R4

    Program order (A, then B):
                 R1   R2   R3   R4
    initial       5   -2    9    3
    after A       7   -2    9    3
    after B      27   -2    9    3

    Reordered (B, then A): R1 ends as 7 instead of 27
                 R1   R2   R3   R4
    initial       5   -2    9    3
    after B      27   -2    9    3
    after A       7   -2    9    3

• WAR dependencies are from reusing registers
    Original:  A: R1 = R3 / R4    B: R3 = R2 * R4
    Rewritten: A: R1 = R3 / R4    B: R5 = R2 * R4

    Rewritten code, reordered (B, then A):
                 R1   R2   R3   R4   R5
    initial       5   -2    9    3    4
    after B       5   -2    9    3   -6
    after A       3   -2    9    3   -6

    With no dependencies, reordering still produces the correct results

• WAW dependencies are also from reusing registers
    Original:  A: R1 = R2 + R3    B: R1 = R3 * R4
    Rewritten: A: R5 = R2 + R3    B: R1 = R3 * R4

    Rewritten code, reordered (B, then A):
                 R1   R2   R3   R4   R5
    initial       5   -2    9    3    4
    after B      27   -2    9    3    4
    after A      27   -2    9    3    7

    Same solution works

• Give processor more registers than specified by the ISA
– temporarily map ISA registers (“logical” or “architected” registers) to the physical registers to avoid overwrites
• Components:
– mapping mechanism
– physical registers
    • allocated vs. free registers
    • allocation/deallocation mechanism

Program code:
    I1: ADD R1, R2, R3
    I2: SUB R2, R1, R5
    I3: AND R5, R11, R7
    I4: OR  R8, R5, R2
    I5: XOR R2, R4, R11

• Example
– I3 cannot execute before I2 because I3 will overwrite R5
– I5 cannot go before I2 because I2, when it goes, will overwrite R2 with a stale value
[Figure: dependence arrows on the program code, labeled RAW, WAR, WAW]

• Solution: Let’s give I3 a temporary name/location (e.g., S) for the value it produces.
• But I4 uses that value, so we must also change that to S…
• In fact, all uses of R5 from I3 to the next instruction that writes to R5 again must now be changed to S!
• We remove WAW deps in the same way: change R2 in I5 (and subsequent instrs) to T.

Before renaming:
    I1: ADD R1, R2, R3
    I2: SUB R2, R1, R5
    I3: AND R5, R11, R7
    I4: OR  R8, R5, R2
    I5: XOR R2, R4, R11

After renaming:
    I1: ADD R1, R2, R3
    I2: SUB R2, R1, R5
    I3: AND S, R11, R7
    I4: OR  R8, S, R2
    I5: XOR T, R4, R11

• Implementation
– Space for S, T, etc.
– How do we know when to rename a register?
• Simple Solution
– Do renaming for every instruction
– Change the name of a register each time we decode an instruction that will write to it.
– Remember what name we gave it

• We need some physical structure to store the register values
[Figure: RAT (Register Alias Table), PRF (Physical Register File), ARF (Architected Register File)]
– One PREG per instruction in-flight
– “Outside” world sees the ARF

• Separates architected vs. physical registers
• Tracks program order of all in-flight insts
– Enables in-order completion or “commit”

[Figure: Instruction Buffers; Reservation Stations and ALUs (Add and Mult stations, each entry with fields op, Qj, Qk, Vj, Vk); Architected Register File; RAT; ROB (fields type, dest, value, fin) with a “head” pointer]

• Read inst from inst buffer
• Check if resources available:
– Appropriate RS entry
– ROB entry
• Read RAT, read (available) sources, update RAT
• Write to RS and ROB

• Same as before
– Wait for all operands to arrive
– Compete to use functional unit
– Execute!

• Broadcast result on CDB
– (any dependents will grab the value)
• Write result back to your ROB entry
– The ARF holds the “official” register state, which we will only update in program order
– Mark ready/finished bit in ROB (note that this inst has completed execution)
• Reservation station can be freed.

• When an inst is the oldest in the ROB
– i.e., ROB-head points to it
• Write result (if ready/finished bit is set)
– If register producing instruction: write to architected register file
– If store: write to memory
• Q: What about load?
• Advance ROB-head to next instruction
• This is what the outside world sees
– And it’s all in-order

• Make instruction execution “visible” to the outside world
– “Commit” the changes to the architected state
[Figure: ROB holding instructions A through K; WB result goes to the ARF. Outside world “sees”: A executed, B executed, C executed, D executed, E executed]
Instructions execute out of program order, but outside world still “believes” it’s in-order

• Single thread in superscalar execution: dependences cause most of the stalls
• Idea: when one thread is stalled, another can go
• Different granularities of multithreading
– Coarse MT: can change thread every few cycles
– Fine MT: can change thread every cycle
– Simultaneous Multithreading (SMT)
    • Instrs from different threads even in the same cycle
    • AKA Hyperthreading

• Uni-Processor: 4-6 wide, lucky if you get 1-2 IPC
– poor utilization
• SMP: 2-4 CPUs, but need independent tasks
– else poor utilization as well
• SMT: Idea is to use a single large uni-processor as a multi-processor

[Figure: Regular CPU; CMP (2x HW cost); SMT with 4 threads (approx. 1x HW cost)]

• For an N-way (N threads) SMT, we need:
– Ability to fetch from N threads
– N sets of architectural registers (including PCs)
– N rename tables (RATs)
– N virtual memory spaces
– Front-end: branch predictor? No. RAS? Yes.
• But we don’t need to replicate the entire OOO execution engine (schedulers, execution units, bypass networks, ROBs, etc.)

• Multiplex the Fetch Logic
[Figure: PC0, PC1, PC2 selected by cycle % N; I$; fetch; Decode, etc.; RS]
Can do simple round-robin between active threads, or favor some over the others based on how much each is stalling relative to the others

• Thread #1’s R12 != Thread #2’s R12
– separate name spaces
– need to disambiguate
[Figure: RAT0 indexed by Thread0 register #, RAT1 indexed by Thread1 register #, both pointing into the PRF]

• No change needed
Thread
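
The "simple solution" for register renaming described earlier (rename the destination of every instruction at decode, and remember the name) can be sketched in a few lines of Python using the I1 to I5 example. This is only an illustrative model, not the hardware design from the slides: the names `Instr`, `hazards`, and `rename`, the 32 architected registers, and the `P`-numbered physical registers are all assumptions made for the example. Physical registers are never freed here, whereas a real design would reclaim them (for example at commit).

```python
from collections import namedtuple

# Hypothetical instruction record: one destination, a tuple of sources.
Instr = namedtuple("Instr", "name op dest srcs")

# The five-instruction example from the slides.
PROGRAM = [
    Instr("I1", "ADD", "R1", ("R2", "R3")),
    Instr("I2", "SUB", "R2", ("R1", "R5")),
    Instr("I3", "AND", "R5", ("R11", "R7")),
    Instr("I4", "OR",  "R8", ("R5", "R2")),
    Instr("I5", "XOR", "R2", ("R4", "R11")),
]


def hazards(program):
    """Return a set of (kind, earlier, later) register dependences.

    Simplification: every pair is compared directly, so writes that
    happen between the two instructions are not taken into account.
    """
    found = set()
    for i, a in enumerate(program):
        for b in program[i + 1:]:
            if a.dest in b.srcs:
                found.add(("RAW", a.name, b.name))  # b reads what a wrote
            if b.dest in a.srcs:
                found.add(("WAR", a.name, b.name))  # b overwrites what a reads
            if a.dest == b.dest:
                found.add(("WAW", a.name, b.name))  # both write the same reg
    return found


def rename(program, num_arch=32):
    """Rename every destination to a fresh physical register.

    Assumption: architected register Rn starts out mapped to Pn, and
    unused physical registers are numbered from P<num_arch> upward.
    """
    rat = {f"R{i}": f"P{i}" for i in range(num_arch)}  # register alias table
    next_free = num_arch
    renamed = []
    for ins in program:
        # Sources are looked up BEFORE the RAT is updated, so an
        # instruction like ADD R2, R2, #1 reads the old mapping of R2.
        srcs = tuple(rat[s] for s in ins.srcs)
        dest = f"P{next_free}"
        next_free += 1
        rat[ins.dest] = dest  # remember the name we gave it
        renamed.append(Instr(ins.name, ins.op, dest, srcs))
    return renamed, rat


before = hazards(PROGRAM)
renamed, rat = rename(PROGRAM)
after = hazards(renamed)

print("kinds before renaming:", sorted({kind for kind, _, _ in before}))
print("kinds after renaming: ", sorted({kind for kind, _, _ in after}))
for ins in renamed:
    print(ins.name, ins.op, ins.dest, ", ".join(ins.srcs))
```

In this model only the RAW (true) dependences survive renaming, which matches the point of the S and T example: WAR and WAW dependences come from reusing register names, so giving each result its own name removes them.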



GT CS 4803 - LECTURE NOTES
