Berkeley COMPSCI C267 - Lecture 3: Introduction to Parallel Architectures and Programming Models

CS 267: Applications of Parallel Computers
Lecture 3: Introduction to Parallel Architectures and Programming Models
Kathy Yelick
http://www-inst.eecs.berkeley.edu/~cs267

Slide titles:
Recap of Last Lecture
Outline
Search Over Block Sizes
What the Search Space Looks Like
Strassen's Matrix Multiply
Strassen (continued)
Removing False Dependencies
Exploit Multiple Registers
Minimize Pointer Updates
Loop Unrolling
Expose Independent Operations
Copy optimization
A generic parallel architecture
Parallel Programming Models
Simple Example
Programming Model 1: Shared Memory
Machine Model 1a: Shared Memory
Machine Model 1b: Distributed Shared Memory
Shared Memory Code for Computing a Sum
Pitfalls and Solution via Synchronization
Programming Model 2: Message Passing
Machine Model 2: Distributed Memory
Computing s = x(1)+x(2) on each processor
Programming Model 2b: Global Addr Space
Programming Model 3: Data Parallel
Machine Model 3a: SIMD System
Model 3B: Vector Machines
Machine Model 4: Clusters of SMPs
Top 500 Supercomputers
Top 10 of the Fastest Computers in the World
Performance Development
Summary

Recap of Last Lecture
• Memory systems on modern processors are complicated.
• The performance of a simple program can depend on the details of the micro-architecture.
• Simple performance models can aid in understanding.
• Two ratios are key to efficiency:
  • algorithmic: q = f/m = # floating point opns / # slow memory opns
  • tm/tf = time for slow memory operation / time for floating point operation
• A common technique for improving cache performance (raising q) is called blocking.
  • Applied to matrix multiplication.

Outline
• Lecture 2 follow-up:
  • Use of search in blocking matrix multiply
  • Strassen's matrix multiply algorithm
  • Bag of tricks for optimizing serial code
• Overview of parallel machines and programming models:
  • Shared memory
  • Shared address space
  • Message passing
  • Data parallel
  • Clusters of SMPs
• Trends in real machines

Search Over Block Sizes
• Performance models are useful for high-level algorithms.
  • They help in developing a blocked algorithm.
• Models have not proven very useful for block size selection:
  • too complicated to be useful
    – see work by Sid Chatterjee for a detailed model
  • too simple to be accurate
    – multiple multidimensional arrays, virtual memory, etc.
• Some systems use search (see the sketch below):
  • ATLAS – being incorporated into Matlab
  • BeBOP – http://www.cs.berkeley.edu/~richie/bebop
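To make the search idea concrete, the following is a minimal sketch, not taken from the slides, in the spirit of ATLAS/BeBOP-style autotuning: it times a naively blocked matrix multiply for a handful of candidate block sizes and reports the fastest. The matrix size, the candidate list, and the use of clock() for timing are illustrative assumptions.

/* Hypothetical autotuning sketch: search over block sizes by timing.   */
/* Not from the lecture; sizes and candidates are illustrative.         */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 512                       /* assumed matrix dimension */

/* C += A*B using square blocks of size b (plain three-loop kernel inside each block). */
static void blocked_matmul(int n, int b, const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < n; ii += b)
        for (int jj = 0; jj < n; jj += b)
            for (int kk = 0; kk < n; kk += b)
                for (int i = ii; i < ii + b && i < n; i++)
                    for (int j = jj; j < jj + b && j < n; j++) {
                        double cij = C[i*n + j];
                        for (int k = kk; k < kk + b && k < n; k++)
                            cij += A[i*n + k] * B[k*n + j];
                        C[i*n + j] = cij;
                    }
}

int main(void)
{
    int candidates[] = { 8, 16, 32, 64, 128 };      /* illustrative search space */
    double *A = malloc(N * N * sizeof *A);
    double *B = malloc(N * N * sizeof *B);
    double *C = malloc(N * N * sizeof *C);
    for (int i = 0; i < N * N; i++) { A[i] = 1.0; B[i] = 2.0; }

    int best_b = candidates[0];
    double best_t = 1e30;
    for (size_t c = 0; c < sizeof candidates / sizeof candidates[0]; c++) {
        int b = candidates[c];
        for (int i = 0; i < N * N; i++) C[i] = 0.0;   /* reset the result matrix */
        clock_t t0 = clock();
        blocked_matmul(N, b, A, B, C);
        double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("block size %3d: %.3f s\n", b, t);
        if (t < best_t) { best_t = t; best_b = b; }
    }
    printf("best block size: %d\n", best_b);
    free(A); free(B); free(C);
    return 0;
}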
What the Search Space Looks Like
[Figure: a 2-D slice of a 3-D register-tile search space, plotted as number of rows vs. number of columns in the register block. The dark blue region was pruned. Platform: Sun Ultra-IIi, 333 MHz, 667 Mflop/s peak, Sun cc v5.0 compiler.]

Strassen's Matrix Multiply
• The traditional algorithm (with or without tiling) has O(n^3) flops.
• Strassen discovered an algorithm with asymptotically lower flops: O(n^2.81).
• Consider a 2x2 matrix multiply: normally 8 multiplies; Strassen does it with 7 multiplies (but many more adds).

Let M = [m11 m12] = [a11 a12] * [b11 b12]
        [m21 m22]   [a21 a22]   [b21 b22]

Let p1 = (a12 - a22) * (b21 + b22)      p5 = a11 * (b12 - b22)
    p2 = (a11 + a22) * (b11 + b22)      p6 = a22 * (b21 - b11)
    p3 = (a11 - a21) * (b11 + b12)      p7 = (a21 + a22) * b11
    p4 = (a11 + a12) * b22

Then m11 = p1 + p2 - p4 + p6
     m12 = p4 + p5
     m21 = p6 + p7
     m22 = p2 - p3 + p5 - p7

• Extends to n x n matrices by divide & conquer.

Strassen (continued)
T(n) = cost of multiplying n x n matrices
     = 7*T(n/2) + 18*(n/2)^2
     = O(n^(log2 7)) = O(n^2.81)
• Asymptotically faster:
  • several times faster for large n in practice
  • cross-over point depends on the machine
  • available in several libraries
• Caveats:
  • needs more memory than the standard algorithm
  • can be less accurate because of roundoff error
• The current world record is O(n^2.376...).
• Why does the Hong/Kung theorem not apply?

Removing False Dependencies
• Using local variables, reorder operations to remove false dependencies.

a[i] = b[i] + c;
a[i+1] = b[i+1] * d;
(false read-after-write hazard between a[i] and b[i+1])

becomes

float f1 = b[i];
float f2 = b[i+1];
a[i] = f1 + c;
a[i+1] = f2 * d;

• With some compilers, you can declare a and b unaliased.
  • Done via "restrict" pointers, a compiler flag, or a pragma.

Exploit Multiple Registers
• Reduce demands on memory bandwidth by pre-loading values into local variables.

while( … ) {
    *res++ = filter[0]*signal[0]
           + filter[1]*signal[1]
           + filter[2]*signal[2];
    signal++;
}

becomes

float f0 = filter[0];
float f1 = filter[1];
float f2 = filter[2];
while( … ) {
    *res++ = f0*signal[0] + f1*signal[1] + f2*signal[2];
    signal++;
}

also: register float f0 = …;
(The example is a convolution.)

Minimize Pointer Updates
• Replace pointer updates for strided memory addressing with constant array offsets.

f0 = *r8; r8 += 4;
f1 = *r8; r8 += 4;
f2 = *r8; r8 += 4;

becomes

f0 = r8[0];
f1 = r8[4];
f2 = r8[8];
r8 += 12;

• Pointer and array expression costs may differ.
  • Some compilers do a better job of analyzing one than the other.

Loop Unrolling
• Expose instruction-level parallelism.

float f0 = filter[0], f1 = filter[1], f2 = filter[2];
float s0 = signal[0], s1 = signal[1], s2 = signal[2];
*res++ = f0*s0 + f1*s1 + f2*s2;
do {
    signal += 3;
    s0 = signal[0];
    res[0] = f0*s1 + f1*s2 + f2*s0;
    s1 = signal[1];
    res[1] = f0*s2 + f1*s0 + f2*s1;
    s2 = signal[2];
    res[2] = f0*s0 + f1*s1 + f2*s2;
    res += 3;
} while( … );

Expose Independent Operations
• Hide instruction latency.
• Use local variables to expose independent operations that can execute in parallel or in a pipelined fashion.
• Balance the instruction mix (what functional units are available?).

f1 = f5 * f9;
f2 = f6 + f10;
f3 = f7 * f11;
f4 = f8 + f12;

Copy optimization
• Copy …
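Pulling the preceding serial-optimization slides together, here is a minimal, self-contained sketch of the 3-tap convolution that combines restrict-qualified pointers (unaliasing), preloaded filter coefficients, constant array offsets, and 3-way unrolling. It is a variation written for this summary, not the slides' exact code; the function name, signal length, and test values are illustrative.

/* conv3: res[i] = f[0]*sig[i] + f[1]*sig[i+1] + f[2]*sig[i+2], i = 0..n-1.  */
/* The caller must supply n+2 valid elements of sig.                         */
#include <stdio.h>

static void conv3(const float *restrict f, const float *restrict sig,
                  float *restrict res, int n)
{
    const float f0 = f[0], f1 = f[1], f2 = f[2];   /* exploit registers */
    int i = 0;
    for (; i + 3 <= n; i += 3) {                   /* unrolled by 3, constant offsets */
        res[i]     = f0 * sig[i]     + f1 * sig[i + 1] + f2 * sig[i + 2];
        res[i + 1] = f0 * sig[i + 1] + f1 * sig[i + 2] + f2 * sig[i + 3];
        res[i + 2] = f0 * sig[i + 2] + f1 * sig[i + 3] + f2 * sig[i + 4];
    }
    for (; i < n; i++)                             /* cleanup iterations */
        res[i] = f0 * sig[i] + f1 * sig[i + 1] + f2 * sig[i + 2];
}

int main(void)
{
    float filter[3] = { 0.25f, 0.5f, 0.25f };
    float signal[10] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
    float out[8];
    conv3(filter, signal, out, 8);
    for (int i = 0; i < 8; i++)
        printf("%.2f ", out[i]);
    printf("\n");
    return 0;
}

The three statements in the unrolled body write to distinct outputs and share no false dependencies, so the compiler is free to schedule their multiplies and adds in a pipelined fashion; the restrict qualifiers tell it that res does not alias sig or f.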

