Unformatted text preview:

1CMSC 838T – Lecture 13CMSC 838T – Lecture 13X Program analysis & transformation 0 Data-flow & data dependence analyses0 Guide program transformations0 Improve parallelism & localitylocalityX = … … = YparallelismCMSC 838T – Lecture 13Program Analysis & TransformationX Motivation0 Map high-level algorithm to low-level architecture0 Improve performanceAlgorithmArchitectureinstructions, registers, cache, TLB, memory, network, I/Ocomputation, data, values, codeprogram analysis & transformationefficiencymoreless2CMSC 838T – Lecture 13Analysis & Transformation – Approaches X Automatic0 Compiler directed0 Static / run-time analysis & transformation0 Low user effort, limited effectivenessX Interactive0 Programming environment / tool based0 Display static analysis, apply transformations as directed0 Moderate user effort, moderately effectiveX Manual0 Limited analyses from programming / profiling tools0 Apply transformations by hand0 High user effort & effectivenessCMSC 838T – Lecture 13Analysis – Dataflow & DependenceX Dataflow analysis0 Examine flow of values at compile-time0 Determines control flow & possible variable values0 ExampleO if (…) { X = 1 } else { X = 2 } Y = XO Value of Y is either 1 or 2X Data dependence analysis0 Examine memory accesses at compile-time0 Determines locality & constraints on execution order0 ExampleO for ( K ) { A[ K ] = A[ K – 2 ] }O Accesses same memory location as 2 iterations earlier3CMSC 838T – Lecture 13Dependence – Data & ControlX Data dependences0 True / flow x = … ; … = x ; read after write0 Anti … = x ; x = … ; write after read0 Output x = … ; x = … ; write after write0 Input … = x ; … = x ; read after readX Control dependence0 if (A) { B; } whether B executes depends on result of ifX Dependence A → B0 Dependence from A to B0 B depends on A0 A must be executed before BBACMSC 838T – Lecture 13Dependence – Loop Carried & IndependentX Loop-carried dependences0 Dependence crosses loop iterations0 ExampleO for ( K ) { A[ K ] = A[ K – 2 ] }O Dependence occurs across 2 loop iterationsX Loop-independent dependences0 Dependence occurs only on same loop iteration0 ExampleO for ( K ) { A[ K ] = A[ K ] + 1 }O Dependence occurs in same loop iteration4CMSC 838T – Lecture 13Dependence – ParallelismX Parallelism & dependence0 Computations may be executed in parallel if no dependencesO Loops may be parallelized if no loop-carried dependences0 Else data race (result depends on order) may cause errorO Some exceptions (e.g., input dependence, reduction)X = … … = YX = …… = XparallelsequentialCMSC 838T – Lecture 13Program TransformationsX Transformations0 Change structure of program0 Improve program in some manner (computation, data)0 Preserve program outputX Loop transformations0 Change loop structure, iteration orderLoop interchange Loop fission / fusionfor ( X )for ( Y )…for ( Y )for ( X )…for ( X )A ; Bfor ( X )Afor ( X )B5CMSC 838T – Lecture 13Program TransformationsX Transformations & dependence0 Computations may be reordered if dependences preservedO Directly (e.g., instruction scheduling)O Indirectly (e.g., program transformations)0 Computations may be eliminated if results unused0 Memory storage can be rearranged X Applying program transformations0 Ensure output preservedO Preserve dependences (rough approximation)O Preserve dataflow (more precise constraints)0 Use dependences to guide transformationsCMSC 838T – Lecture 13Program TransformationsX Motivation for transformations 0 Directly improve performanceO Increase localityO Exploit parallelismO Etc…0 Indirectly increase parallelism, enable other transformationsO PrivatizationO ExpansionO Reductions O Auxiliary induction variable substitutionO Etc…6CMSC 838T – Lecture 13Privatization & ExpansionX Memory-related dependences0 Caused by reusing memory 0 Can be eliminated by using new memory insteadO Anti dependence … = x ; x = … vs. … = x ; y = …O Output dependence x = … ; x = … vs. x = … ; x = …X Approaches0 Privatization – new memory per processor0 Expansion – new memory per loop iterationOriginaldo i =int kk = ……= kPrivatizationdo i =private int kk = ……= kExpansiondo i =int k[ n ]k[ i ] = …… = k [ i ] CMSC 838T – Lecture 13Reductions & Induction VariablesX Reductions 0 Associative & commutative operationsO Sum, multiply, maximum, minimum, etc…O Example: S = S + A[ i ]0 Can be executed in any order1. Perform reduction on private variable2. Combine results to global variable0 May affect numerical stability for floating point operationsX Auxiliary induction variable substitution0 Variables incremented by fixed amount each loop iterationO Example: for ( i ) { k = k + 1; p = p + 4; }0 May calculate directly from loop index & eliminate dependenceO Example: for ( i ) { k = i + c ; p = i * 4; }7CMSC 838T – Lecture 13Parallelism Optimizations – SynchronizationX Approach0 Increase size of parallel regions0 Reduce synchronization overhead / load imbalance0 Parallelize outer loops in loop nest0 Merge nearby parallel loopsmergeloop 1loop 2fused loopCMSC 838T – Lecture 13Parallelism Optimizations – CommunicationsX Approach0 Merge smaller messages into large message0 Reduce communication overhead0 Can move communication to deepest loop-carried dependenceDO j = m-1,1,-1DO i= 1,nRX(i,j) = … RX(i,j+1)DO j = m-1,1,-1comm (RX[1:n,j])DO i= 1,nRX(i,j) = … RX(i,j+1)δi8CMSC 838T – Lecture 13Locality OptimizationsX Locality0 Multiple references to same / nearby locationsX Types of locality0 Temporal (reuse data)0 Spatial (reuse nearby data)MemoryCacheProgram dataCMSC 838T – Lecture 130500100015002000250030003500198819891990199119921993199419951996199719981999200020012002YearSpeed (Mhz)Processor Clock Memory Bus ClockProcessor vs. Memory Speed (Latency)x386x486x486 DX2Pentiumx486 DX4Pentium ProPentium IIP II MMXP IIIP 4P 4FPM DRAM (420 ns)EDO DRAM (300 ns)SDRAM (200 ns)DDR-DRAM (200 ns)9CMSC 838T – Lecture 13Regular Memory Access PatternsX Characteristics0 Multidimensional arrays0 Multiple loop nests0 Also image processing, database scansX Goal0 Unit-stride access → exploit spatial localityRegular codesdo i = 1, Ndo j = 1, N… = node[ j, i ]j →i →CMSC 838T – Lecture 13X Approach0 Move reuses closer in time0 Better use of processor cache0 Tile data should now fit in cacheProgram Transformations – Tilingdo J=1,Ndo K=1,Ndo I=1,NC[I,J] = C[I,J] +A[I,K] * B[K,J]do KK=1,N,TKdo


View Full Document

UMD CMSC 838T - CMSC 838T Lecture 13

Documents in this Course
Load more
Download CMSC 838T Lecture 13
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view CMSC 838T Lecture 13 and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view CMSC 838T Lecture 13 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?