Berkeley COMPSCI C267 - Single Processor Optimizations: Matrix Multiplication Case Study


Single Processor Optimizations: Matrix Multiplication Case Study
Kathy Yelick, [email protected]
www.cs.berkeley.edu/~yelick/cs267_spr07
CS267 Lecture 3, 1/24/2007

Outline
• Recap from Lecture 2
  • Memory hierarchy is important to performance
  • Use of simple performance models to understand performance
• Case Study: Matrix Multiplication
  • Blocking algorithms
  • Other tuning techniques
  • Alternate algorithms
• Automatic Performance Tuning

Memory Hierarchy
• Most programs have a high degree of locality in their accesses
  • Spatial locality: accessing things near previous accesses
  • Temporal locality: reusing an item that was previously accessed
• The memory hierarchy tries to exploit locality

[Figure: the memory hierarchy, from small and fast to large and slow. Approximate sizes and speeds recovered from the figure:]
  registers (on processor)       ~ bytes   ~1 ns
  on-chip cache                  ~ KB      ~10 ns
  second-level cache (SRAM)      ~ MB      ~10 ns
  main memory (DRAM)             ~ GB      ~100 ns
  secondary storage (disk)       ~ TB      ~10 ms
  tertiary storage (disk/tape)   > TB      ~10 s

Using a Simple Model of Memory to Optimize
• Assume just 2 levels in the hierarchy, fast and slow
• All data is initially in slow memory
  • m   = number of memory elements (words) moved between fast and slow memory
  • t_m = time per slow memory operation
  • f   = number of arithmetic operations
  • t_f = time per arithmetic operation, with t_f << t_m
  • q   = f / m = average number of flops per slow memory access
    (q is the computational intensity: the key to algorithm efficiency)
• Minimum possible time = f * t_f, when all data is in fast memory
• Actual time
  • f * t_f + m * t_m = f * t_f * (1 + (t_m / t_f) * (1 / q))
• Larger q means time closer to the minimum f * t_f
  • q ≥ t_m / t_f is needed to get at least half of peak speed
    (t_m / t_f is the machine balance: the key to machine efficiency)

Warm-up: Matrix-Vector Multiplication
  {read x(1:n) into fast memory}
  {read y(1:n) into fast memory}
  for i = 1:n
    {read row i of A into fast memory}
    for j = 1:n
      y(i) = y(i) + A(i,j)*x(j)
  {write y(1:n) back to slow memory}
• m = number of slow memory references = 3n + n^2
• f = number of arithmetic operations = 2n^2
• q = f / m ≈ 2
• Matrix-vector multiplication is limited by slow memory speed
  (a runnable C version of this loop nest is sketched below)
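
As a concrete version of the pseudocode above, here is a minimal C sketch of the same y = y + A*x loop nest. It assumes a row-major n-by-n matrix; the function name and plain-array interface are illustrative, not from the lecture.

  /* y = y + A*x for an n-by-n matrix A stored row-major in a[i*n + j].
     y(i) stays in a register across the inner loop (temporal locality),
     and row i of A is read with unit stride (spatial locality), so the
     2n^2 flops need roughly n^2 + 3n slow-memory accesses: q ~ 2. */
  void matvec(int n, const double *a, const double *x, double *y)
  {
      for (int i = 0; i < n; i++) {
          double yi = y[i];
          for (int j = 0; j < n; j++)
              yi += a[i*n + j] * x[j];
          y[i] = yi;
      }
  }

No rearrangement of this loop nest changes its q: every element of A must be read, and each participates in only two flops, which is why the analysis above concludes that matrix-vector multiply is bound by slow memory speed.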
Modeling Matrix-Vector Multiplication
• Compute the time for an n x n = 1000 x 1000 matrix
• Time = f * t_f + m * t_m = f * t_f * (1 + (t_m / t_f) * (1 / q))
       = 2*n^2 * t_f * (1 + (t_m / t_f) * (1 / 2))
• For t_f and t_m, using data from R. Vuduc's PhD dissertation (pp. 351-353)
  • http://bebop.cs.berkeley.edu/pubs/vuduc2003-dissertation.pdf
• For t_m, use minimum memory latency / words per cache line

  Machine     Clock (MHz)  Peak (Mflop/s)  Mem Lat (min, max cycles)  Line size (bytes)  t_m/t_f
  Ultra 2i    333          667             38, 66                     16                 24.8
  Ultra 3     900          1800            28, 200                    32                 14.0
  Pentium 3   500          500             25, 60                     32                 6.3
  Pentium3M   800          800             40, 60                     32                 10.0
  Power3      375          1500            35, 139                    128                8.8
  Power4      1300         5200            60, 10000                  128                15.0
  Itanium1    800          3200            36, 85                     32                 36.0
  Itanium2    900          3600            11, 60                     64                 5.5

  (The t_m/t_f column is the machine balance: q must be at least this large to reach half of peak speed.)

Simplifying Assumptions
• What simplifying assumptions did we make in this analysis?
• Ignored parallelism between memory operations and arithmetic within the processor
  • Sometimes the arithmetic term is dropped entirely in this type of analysis
• Assumed fast memory was large enough to hold the three vectors
  • Reasonable if we are talking about any level of cache
  • Not if we are talking about registers (~32 words)
• Assumed the cost of a fast memory access is 0
  • Reasonable if we are talking about registers
  • Not necessarily if we are talking about cache (1-2 cycles for L1)
• Assumed memory latency is constant
• Could simplify even further by ignoring the memory operations on the x and y vectors
  • Mflop rate = flops/sec = 2 / (2 * t_f + t_m)
    (this simplified rate is evaluated numerically below)

Validating the Model
• How well does the model predict actual performance?
  • Actual DGEMV: the most highly optimized code for each platform
• The model is sufficient to compare across machines
• But it under-predicts on the most recent machines, due to the latency estimate

[Figure: bar chart, 0-1400 MFlop/s, for Ultra 2i, Ultra 3, Pentium 3, Pentium3M, Power3, Power4, Itanium1, Itanium2; three bars per machine: predicted MFlop/s ignoring x and y, predicted DGEMV MFlop/s with x and y, and actual DGEMV MFlop/s.]
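
To make the simplified rate formula concrete, here is a small sketch that plugs peak speed and machine balance from the table above into Mflop rate = 2 / (2*t_f + t_m). The helper name and the unit choices are assumptions for illustration, not from the lecture.

  #include <stdio.h>

  /* Predicted matrix-vector Mflop rate from the two-level model,
     ignoring the x and y vectors.  With peak in Mflop/s, t_f = 1/peak
     is in microseconds per flop, and t_m = balance * t_f, where
     balance is the t_m/t_f column of the table above. */
  static double predicted_mflops(double peak_mflops, double balance)
  {
      double tf = 1.0 / peak_mflops;   /* us per flop               */
      double tm = balance * tf;        /* us per slow-memory access */
      return 2.0 / (2.0 * tf + tm);    /* Mflop/s                   */
  }

  int main(void)
  {
      printf("Ultra 2i: %6.1f of 667 Mflop/s peak\n",
             predicted_mflops(667.0, 24.8));
      printf("Itanium2: %6.1f of 3600 Mflop/s peak\n",
             predicted_mflops(3600.0, 5.5));
      return 0;
  }

With these inputs the model predicts roughly 50 Mflop/s on the Ultra 2i and roughly 960 Mflop/s on the Itanium2, a small fraction of peak in both cases, which is the quantitative content of the q ≥ t_m/t_f half-of-peak criterion.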
Case Study: Matrix Multiplication

Phillip Colella's "Seven Dwarfs"
• High-end simulation in the physical sciences = 7 numerical methods:
  1. Structured grids (including adaptive block-structured)
  2. Unstructured grids
  3. Spectral methods
  4. Dense linear algebra
  5. Sparse linear algebra
  6. Particle methods
  7. Monte Carlo
• A dwarf is a pattern of computation and communication
• Dwarfs are targets from the algorithmic, software, and architecture standpoints
• Dense linear algebra arises in some of the largest computations
  • It achieves high machine efficiency
  • There are important subcategories, as we will see
  (Slide from "Defining Software Requirements for Scientific Computing", Phillip Colella, 2004, and Dave Patterson, 2006)

Naïve Matrix Multiply
  {implements C = C + A*B}
  for i = 1 to n
    for j = 1 to n
      for k = 1 to n
        C(i,j) = C(i,j) + A(i,k) * B(k,j)
• Each C(i,j) is updated by the inner product of row A(i,:) with column B(:,j)
• The algorithm has 2*n^3 = O(n^3) flops and operates on 3*n^2 words of memory
• q is potentially as large as 2*n^3 / (3*n^2) = O(n)

Naïve Matrix Multiply, counting slow memory operations
  {implements C = C + A*B}
  for i = 1 to n
    {read row i of A into fast memory}
    for j = 1 to n
      {read C(i,j) into fast memory}
      {read column j of B into fast memory}
      for k = 1 to n
        C(i,j) = C(i,j) + A(i,k) * B(k,j)
      {write C(i,j) back to slow memory}

Number of slow memory references in the unblocked matrix multiply:
  m = n^3      to read each column of B n times
    + n^2      to read each row of A once
    + 2*n^2    to read and write each element of C once
    = n^3 + 3*n^2
So q = f / m = 2*n^3 / (n^3 + 3*n^2) ≈ 2 for large n: no improvement over matrix-vector multiply. (A C sketch of this loop nest follows after the next figure.)

Matrix Multiply, Optimized Several Ways
[Figure: speed of n-by-n matrix multiply on a Sun Ultra-1/170, peak = 330 Mflop/s, comparing the naïve code with several optimized versions.]
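
For reference, a minimal C version of the naïve loop nest above, assuming row-major n-by-n arrays; the function name and signature are illustrative, not from the lecture. The comments tie the loads back to the m = n^3 + 3n^2 count, which assumes fast memory is too small to keep a whole matrix resident.

  /* C = C + A*B for n-by-n matrices stored row-major.  When the cache
     cannot hold a full matrix, column j of B is re-read from slow
     memory for every (i,j) pair: n^3 accesses for B, n^2 for A, and
     2n^2 for C, so q = 2n^3 / (n^3 + 3n^2) ~ 2. */
  void matmul_naive(int n, const double *a, const double *b, double *c)
  {
      for (int i = 0; i < n; i++)              /* row i of A reused over j */
          for (int j = 0; j < n; j++) {
              double cij = c[i*n + j];         /* read C(i,j) once         */
              for (int k = 0; k < n; k++)
                  cij += a[i*n + k] * b[k*n + j];  /* B: stride-n reads    */
              c[i*n + j] = cij;                /* write C(i,j) once        */
          }
  }

The blocking algorithms listed in the outline attack exactly this bound: by working on submatrices small enough to fit in fast memory, they raise q well above 2, which is what the optimized versions in the figure above exploit.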

