Berkeley COMPSCI C267 - Lecture 3: Introduction to Parallel Architectures and Programming Models

CS 267: Applications of Parallel Computers
Lecture 3: Introduction to Parallel Architectures and Programming Models
Kathy Yelick
http://www-inst.eecs.berkeley.edu/~cs267

Slide titles:
Recap of Last Lecture
Outline
Search Over Block Sizes
What the Search Space Looks Like
Strassen's Matrix Multiply
Strassen (continued)
Removing False Dependencies
Exploit Multiple Registers
Minimize Pointer Updates
Loop Unrolling
Expose Independent Operations
Copy optimization
A generic parallel architecture
Parallel Programming Models
Simple Example
Programming Model 1: Shared Memory
Machine Model 1a: Shared Memory
Machine Model 1b: Distributed Shared Memory
Shared Memory Code for Computing a Sum
Pitfalls and Solution via Synchronization
Programming Model 2: Message Passing
Machine Model 2: Distributed Memory
Computing s = x(1)+x(2) on each processor
Programming Model 2b: Global Addr Space
Programming Model 3: Data Parallel
Machine Model 3a: SIMD System
Model 3B: Vector Machines
Machine Model 4: Clusters of SMPs
Top 500 Supercomputers
Top 10 of the Fastest Computers in the World
Performance Development
Summary

Recap of Last Lecture
• Memory systems on modern processors are complicated.
• The performance of a simple program can depend on the details of the micro-architecture.
• Simple performance models can aid in understanding.
• Two ratios are key to efficiency:
  • algorithmic: q = f/m = # floating point opns / # slow memory opns
  • tm/tf = time for slow memory operation / time for floating point operation
• A common technique for improving cache performance (raising q) is called blocking.
  • Applied to matrix multiplication.

Outline
• Lecture 2 follow-up:
  • Use of search in blocking matrix multiply
  • Strassen's matrix multiply algorithm
  • Bag of tricks for optimizing serial code
• Overview of parallel machines and programming models:
  • Shared memory
  • Shared address space
  • Message passing
  • Data parallel
  • Clusters of SMPs
• Trends in real machines

Search Over Block Sizes
• Performance models are useful for high-level algorithms.
  • They help in developing a blocked algorithm.
• Models have not proven very useful for block size selection:
  • too complicated to be useful
    – see work by Sid Chatterjee for a detailed model
  • too simple to be accurate
    – multiple multidimensional arrays, virtual memory, etc.
• Some systems use search (see the sketch below):
  • ATLAS – being incorporated into Matlab
  • BeBOP – http://www.cs.berkeley.edu/~richie/bebop
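To make the search idea concrete, the following is a minimal sketch, not taken from the slides, in the spirit of ATLAS/BeBOP-style autotuning: it times a naively blocked matrix multiply for a handful of candidate block sizes and reports the fastest. The matrix size, the candidate list, and the use of clock() for timing are illustrative assumptions.

/* Hypothetical autotuning sketch: search over block sizes by timing.   */
/* Not from the lecture; sizes and candidates are illustrative.         */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 512                       /* assumed matrix dimension */

/* C += A*B using square blocks of size b (plain three-loop kernel inside each block). */
static void blocked_matmul(int n, int b, const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < n; ii += b)
        for (int jj = 0; jj < n; jj += b)
            for (int kk = 0; kk < n; kk += b)
                for (int i = ii; i < ii + b && i < n; i++)
                    for (int j = jj; j < jj + b && j < n; j++) {
                        double cij = C[i*n + j];
                        for (int k = kk; k < kk + b && k < n; k++)
                            cij += A[i*n + k] * B[k*n + j];
                        C[i*n + j] = cij;
                    }
}

int main(void)
{
    int candidates[] = { 8, 16, 32, 64, 128 };      /* illustrative search space */
    double *A = malloc(N * N * sizeof *A);
    double *B = malloc(N * N * sizeof *B);
    double *C = malloc(N * N * sizeof *C);
    for (int i = 0; i < N * N; i++) { A[i] = 1.0; B[i] = 2.0; }

    int best_b = candidates[0];
    double best_t = 1e30;
    for (size_t c = 0; c < sizeof candidates / sizeof candidates[0]; c++) {
        int b = candidates[c];
        for (int i = 0; i < N * N; i++) C[i] = 0.0;   /* reset the result matrix */
        clock_t t0 = clock();
        blocked_matmul(N, b, A, B, C);
        double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("block size %3d: %.3f s\n", b, t);
        if (t < best_t) { best_t = t; best_b = b; }
    }
    printf("best block size: %d\n", best_b);
    free(A); free(B); free(C);
    return 0;
}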
What the Search Space Looks Like
[Figure: a 2-D slice of a 3-D register-tile search space, plotted as number of rows vs. number of columns in the register block. The dark blue region was pruned. Platform: Sun Ultra-IIi, 333 MHz, 667 Mflop/s peak, Sun cc v5.0 compiler.]

Strassen's Matrix Multiply
• The traditional algorithm (with or without tiling) has O(n^3) flops.
• Strassen discovered an algorithm with asymptotically lower flops: O(n^2.81).
• Consider a 2x2 matrix multiply: normally 8 multiplies; Strassen does it with 7 multiplies (but many more adds).

Let M = [m11 m12] = [a11 a12] * [b11 b12]
        [m21 m22]   [a21 a22]   [b21 b22]

Let p1 = (a12 - a22) * (b21 + b22)      p5 = a11 * (b12 - b22)
    p2 = (a11 + a22) * (b11 + b22)      p6 = a22 * (b21 - b11)
    p3 = (a11 - a21) * (b11 + b12)      p7 = (a21 + a22) * b11
    p4 = (a11 + a12) * b22

Then m11 = p1 + p2 - p4 + p6
     m12 = p4 + p5
     m21 = p6 + p7
     m22 = p2 - p3 + p5 - p7

• Extends to n x n matrices by divide & conquer.

Strassen (continued)
T(n) = cost of multiplying n x n matrices
     = 7*T(n/2) + 18*(n/2)^2
     = O(n^(log2 7)) = O(n^2.81)
• Asymptotically faster:
  • several times faster for large n in practice
  • cross-over point depends on the machine
  • available in several libraries
• Caveats:
  • needs more memory than the standard algorithm
  • can be less accurate because of roundoff error
• The current world record is O(n^2.376...).
• Why does the Hong/Kung theorem not apply?

Removing False Dependencies
• Using local variables, reorder operations to remove false dependencies.

a[i] = b[i] + c;
a[i+1] = b[i+1] * d;
(false read-after-write hazard between a[i] and b[i+1])

becomes

float f1 = b[i];
float f2 = b[i+1];
a[i] = f1 + c;
a[i+1] = f2 * d;

• With some compilers, you can declare a and b unaliased.
  • Done via "restrict" pointers, a compiler flag, or a pragma.

Exploit Multiple Registers
• Reduce demands on memory bandwidth by pre-loading values into local variables.

while( … ) {
    *res++ = filter[0]*signal[0]
           + filter[1]*signal[1]
           + filter[2]*signal[2];
    signal++;
}

becomes

float f0 = filter[0];
float f1 = filter[1];
float f2 = filter[2];
while( … ) {
    *res++ = f0*signal[0] + f1*signal[1] + f2*signal[2];
    signal++;
}

also: register float f0 = …;
(The example is a convolution.)

Minimize Pointer Updates
• Replace pointer updates for strided memory addressing with constant array offsets.

f0 = *r8; r8 += 4;
f1 = *r8; r8 += 4;
f2 = *r8; r8 += 4;

becomes

f0 = r8[0];
f1 = r8[4];
f2 = r8[8];
r8 += 12;

• Pointer and array expression costs may differ.
  • Some compilers do a better job of analyzing one than the other.

Loop Unrolling
• Expose instruction-level parallelism.

float f0 = filter[0], f1 = filter[1], f2 = filter[2];
float s0 = signal[0], s1 = signal[1], s2 = signal[2];
*res++ = f0*s0 + f1*s1 + f2*s2;
do {
    signal += 3;
    s0 = signal[0];
    res[0] = f0*s1 + f1*s2 + f2*s0;
    s1 = signal[1];
    res[1] = f0*s2 + f1*s0 + f2*s1;
    s2 = signal[2];
    res[2] = f0*s0 + f1*s1 + f2*s2;
    res += 3;
} while( … );

Expose Independent Operations
• Hide instruction latency.
• Use local variables to expose independent operations that can execute in parallel or in a pipelined fashion.
• Balance the instruction mix (what functional units are available?).

f1 = f5 * f9;
f2 = f6 + f10;
f3 = f7 * f11;
f4 = f8 + f12;

Copy optimization
• Copy …
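Pulling the preceding serial-optimization slides together, here is a minimal, self-contained sketch of the 3-tap convolution that combines restrict-qualified pointers (unaliasing), preloaded filter coefficients, constant array offsets, and 3-way unrolling. It is a variation written for this summary, not the slides' exact code; the function name, signal length, and test values are illustrative.

/* conv3: res[i] = f[0]*sig[i] + f[1]*sig[i+1] + f[2]*sig[i+2], i = 0..n-1.  */
/* The caller must supply n+2 valid elements of sig.                         */
#include <stdio.h>

static void conv3(const float *restrict f, const float *restrict sig,
                  float *restrict res, int n)
{
    const float f0 = f[0], f1 = f[1], f2 = f[2];   /* exploit registers */
    int i = 0;
    for (; i + 3 <= n; i += 3) {                   /* unrolled by 3, constant offsets */
        res[i]     = f0 * sig[i]     + f1 * sig[i + 1] + f2 * sig[i + 2];
        res[i + 1] = f0 * sig[i + 1] + f1 * sig[i + 2] + f2 * sig[i + 3];
        res[i + 2] = f0 * sig[i + 2] + f1 * sig[i + 3] + f2 * sig[i + 4];
    }
    for (; i < n; i++)                             /* cleanup iterations */
        res[i] = f0 * sig[i] + f1 * sig[i + 1] + f2 * sig[i + 2];
}

int main(void)
{
    float filter[3] = { 0.25f, 0.5f, 0.25f };
    float signal[10] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
    float out[8];
    conv3(filter, signal, out, 8);
    for (int i = 0; i < 8; i++)
        printf("%.2f ", out[i]);
    printf("\n");
    return 0;
}

The three statements in the unrolled body write to distinct outputs and share no false dependencies, so the compiler is free to schedule their multiplies and adds in a pipelined fashion; the restrict qualifiers tell it that res does not alias sig or f.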

