GT CS 4803 - LECTURE NOTES - D839380

Home> Schools> Georgia Tech> Computer Science (CS) > CS 4803> LECTURE NOTES

GT CS 4803 - LECTURE NOTES

Pages 36

Download Save

Unformatted text preview:

Spring 2011 Prof. Hyesoon Kim2TTT TTTT TSIMD Execution UnitTTT T Warp is the basic unit of execution A group of threads (e.g. 32 threads for the Tesla GPU architecture)Warp ExecutionOne warpSources ready3One warpOne warpInst 1Inst 2Inst 3Sources ready Sources readyTTT T TTT TFinite number of streaming processors• Threads are assigned to SMs in Block granularity– Up to 8 Blocks to each SM as resource allows (# of blocks is dependent on the architecture)– SM in G80 can take up to 768 threads• Could be 256 (threads/block) * 3 blocks • Or 128 (threads/block) * 6 blocks, etc.• Threads run concurrently– SM assigns/maintains thread id #s– SM manages/schedules thread executiont0 t1 t2 … tmBlocksTexture L1SPSharedMemoryMT IUSPSharedMemoryMT IUTFL2Memoryt0 t1 t2 … tmBlocksSM 1SM 0© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, UIUCPCPCPCPCPCPCPCMuxI-cachewarp #idstall• Fetch– One instruction for each warp (could be further optimizations) – Round Robin, Greedy-fetch (switch when stall events such as branch, I-cache misses, buffer full)• Thread scheduling polices – Execute when all sources are ready – In-order execution within warps– Scheduling polices: Greedy-execution, round-robin TB1W1TB = Thread Block, W = WarpTB2W1TB3W1TB2W1TB1W1TB3W2TB1W2TB1W3TB3W2TimeTB1, W1 stallTB3, W2 stallTB2, W1 stallInstruction:1 2 3 4 5 6 1 2 1 2 3 41 2 7 8 1 2 1 2 3 4Kirk & Hwu• Enough parallelism– Switch to another thread– Speculative execution is • Branch predictor could be expensive – Per thread predictor• Branch elimination techniques • Pipeline flush is too costly• Basic Block – Def: a sequence of consecutive operations in which flow of control enters at the beginning and leaves at the end without halt or possibility of branching except at the end– Single entry, single exit Add r1, r2, r3 Br.cond target Mov r3, r4 Br jmp joinTarget add r1, r2, r3Join mov r4 r5 AABCDB CDControl-flow graphhttp://www.eecs.umich.edu/~mahlke/583w04/• Defn: Dominator – Given a CFG, a node x dominates a node y, if every path from the Entry block to y contains x– Given some BB, which blocks are guaranteed to have executed prior to executing the BB• Defn: Post dominator: Given a CFG, a node x post dominates a node y, if every path from y to the Exit contains x • Given some BB, which blocks are guaranteed to have executed after executing the BB– reverse of dominatorBB1BB2BB4BB3BB5 BB6BB7http://www.eecs.umich.edu/~mahlke/583w04/• Immediate post dominator is the reconvergence point of divergent branch• Defn: Immediate post dominator (ipdom) – Each node n has a unique immediate post dominator m that is the first post dominator of n on any path from n to the Exit– Closest node that post dominates– First breadth-first successor that post dominates a nodeBB1BB2BB4BB3BB5 BB6BB7EntryExithttp://www.eecs.umich.edu/~mahlke/583w04/• Recap:– 32 threads in a warm are executed in SIMD (share one instruction sequencer)– Threads within a warp can be disabled (masked)• For example, handling bank conflicts– Threads contain arbitrary code including conditional branches• How do we handle different conditions in different threads?– No problem if the threads are in different warps– Control dive rgence– Predicationhttps://users.ece.utexas.edu/~merez/new/pmwiki.php/EE382VFa07/Schedule?action=download&upname=EE382V_Fa07_Lect14_G80Control.pdf• Predication• Loop unrollingConvert control flow dependency to data dependencyPro: Eliminate hard-to-predict branches (in traditional architecture)Eliminate branch divergence (in CUDA)Cons: Extra instructions (normal branch code)C BDATNp1 = (cond)branch p1, TARGETmov b, 1 jmp JOINTARGET:mov b, 0ABCBCDA(predicated code) ABCif (cond) {b = 0;}else {b = 1;}p1 = (cond)(!p1)mov b, 1(p1) mov b, 0• Comparison instructions set condition codes (CC)• Instructions can be predicated to write results only when CC meets criterion (CC != 0, CC >= 0, etc.)• Compiler tries to predict if a branch condition is likely to produce many divergent warps– If guaranteed not to diverge: only predicates if < 4 instructions– If not guaranteed: only predicates if < 7 instructions• May replace branches with instruction predication• ALL predicated instructions take execution cycles– Those with false conditions don’t write their output• Or invoke memory loads and stores– Saves branch instructions, so can be cheaper than serializing divergent paths© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, UIUC• Transforms an M-iteration loop intoa loop with M/N iterations– We say that the loop has been unrolled N timesfor(i=0;i< 100;i+=4){a[i]*=2;a[i+1]*=2;a[i+2]*=2;a[i+3]*=2;}for(i=0;i<100;i++)a[i]*=2;http://www.cc.gatech.edu/~milos/CS6290F07/• Sum { 1- 100}, How to calculate?• Reduction example• What about other threads? • What about different paths? 0 1 2 3 4 5 6 70 2 4 60 40If (threadId.x%==2) If (threadId.x%==4) If (threadId.x%==8) ABCDIf (threadid.x>2) {do work B}else { do work C} From Fung et al. MICRO ‘07Divergent branch!• All branch conditions are serialized and will be executed– Parallel code  sequential code• Divergence occurs within a warp granularity. • It’s a performance issue– Degree of nested branches • Depending on memory instructions, (cache hits or misses), divergent warps can occur – Dynamic warp subdivision [Meng’10]• Hardware solutions to reduce divergent branches – Dynamic warp formation [Fung’07]• Register File (RF)– 32 KB– Provides 4 operands/clock• TEX pipe can also read/write RF– 2 SMs share 1 TEX• Load/Store pipe can also read/write RFI$L1MultithreadedInstruction BufferRFC$L1SharedMemOperand SelectMAD SFU© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, UIUC• Multiple read ports • BanksRead1Read2write1ReadwriteRead3Read4ReadwriteReadwriteReadwriteR1 R2R3 R4R1R2R3R4• Threads in a Block share data & results– In Memory and Shared Memory– Synchronize at barrier instruction• Per-Block Shared Memory Allocation– Keeps data close to processor– Minimize trips to global Memory– SM Shared Memory dynamically allocated to Blocks, one of the limiting resourcest0 t1 t2 … tmBlocksTexture L1SPSharedMemoryMT IUSPSharedMemoryMT IUTFL2Memoryt0 t1 t2 … tmBlocksSM 1SM 0Courtesy: John Nicols, NVIDIA© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, UIUC• Immediate address constants• Indexed

View Full Document


School:
Email:
New Password:
Confirm Password:

GT CS 4803 - LECTURE NOTES

Sign up for free to view:

Please select your school