Berkeley COMPSCI C267 - High Performance Programming on a Single Processor

This preview shows pages 1-4, 28-31, and 58-61 of the 61-page document.

High Performance Programming on a Single Processor: Memory Hierarchies, Matrix Multiplication, Automatic Performance Tuning

Contents: Outline; Idealized Uniprocessor Model; Uniprocessors in the Real World; What is Pipelining?; Example: 5 Steps of MIPS Datapath (Figure 3.4, page 134, CA:AQA 2e by Patterson and Hennessy); Memory Hierarchy; Processor-DRAM Gap (latency); Approaches to Handling Memory Latency; Cache Basics; Why Have Multiple Levels of Cache?; Experimental Study of Memory (Membench); Membench: What to Expect; Memory Hierarchy on a Sun Ultra-2i; Memory Hierarchy on a Pentium III; Memory Hierarchy on a Power3 (Seaborg); Memory Performance on Itanium 2 (CITRIS); Lessons; Why Matrix Multiplication?; Matrix-multiply, optimized several ways; Note on Matrix Storage; Using a Simple Model of Memory to Optimize; Warm up: Matrix-vector multiplication; Modeling Matrix-Vector Multiplication; Simplifying Assumptions; Validating the Model; Naïve Matrix Multiply; Naïve Matrix Multiply on RS/6000; Blocked (Tiled) Matrix Multiply; Using Analysis to Understand Machines; Limits to Optimizing Matrix Multiply; Basic Linear Algebra Subroutines (BLAS); BLAS speeds on an IBM RS6000/590; Strassen's Matrix Multiply; Strassen (continued); Recursive Data Layouts; Search Over Block Sizes; What the Search Space Looks Like; ATLAS (DGEMM n = 500); Tiling Alone Might Not Be Enough; Optimizing in Practice; Removing False Dependencies; Exploit Multiple Registers; Loop Unrolling; Expose Independent Operations; Copy optimization; Locality in Other Algorithms; Summary; Reading for Today; Questions You Should Be Able to Answer

Slide 1 (CS267 Lecture 2, 01/24/2005)
High Performance Programming on a Single Processor:
Memory Hierarchies
Matrix Multiplication
Automatic Performance Tuning
James Demmel
www.cs.berkeley.edu/~demmel/cs267_Spr05

Slide 2: Outline
• Idealized and actual costs in modern processors
• Memory hierarchies
• Case Study: Matrix Multiplication
• Automatic Performance Tuning
Slide 4: Idealized Uniprocessor Model
• Processor names bytes, words, etc. in its address space
  • These represent integers, floats, pointers, arrays, etc.
• Operations include
  • Read and write (given an address/pointer)
  • Arithmetic and other logical operations
• Order specified by program
  • Read returns the most recently written data
  • Compiler and architecture translate high-level expressions into "obvious" lower-level instructions
  • Hardware executes instructions in the order specified by the compiler
• Cost
  • Each operation has roughly the same cost (read, write, add, multiply, etc.)

Slide 5: Uniprocessors in the Real World
• Real processors have
  • registers and caches
    • small amounts of fast memory
    • store values of recently used or nearby data
    • different memory ops can have very different costs
  • parallelism
    • multiple "functional units" that can run in parallel
    • different orders and instruction mixes have different costs
  • pipelining
    • a form of parallelism, like an assembly line in a factory
• Why is this your problem? In theory, compilers understand all of this and can optimize your program; in practice, they don't.

Slide 6: What is Pipelining?
Dave Patterson's laundry example: 4 people doing laundry; wash (30 min) + dry (40 min) + fold (20 min) = 90 min latency per load.
• Sequential execution takes 4 * 90 min = 6 hours
• Pipelined execution takes 30 + 4*40 + 20 min = 3.5 hours
• Bandwidth = loads/hour
  • BW = 4/6 loads/hour without pipelining
  • BW = 4/3.5 loads/hour with pipelining
  • BW <= 1.5 loads/hour with pipelining as the number of loads grows
• Pipelining helps bandwidth but not latency (still 90 min per load)
• Bandwidth is limited by the slowest pipeline stage
• Potential speedup = number of pipeline stages

Slide 7: Example: 5 Steps of MIPS Datapath (Figure 3.4, page 134, CA:AQA 2e by Patterson and Hennessy) [figure of the five-stage pipeline: Instruction Fetch, Instruction Decode, Execute, Memory Access, Write Back]
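A minimal timing model reproduces the laundry arithmetic above (a sketch; the function names and the single-ideal-pipeline assumption are mine, not the slides'):

```python
def sequential_time(n_loads, stages):
    """Each load finishes all stages before the next load starts."""
    return n_loads * sum(stages)

def pipelined_time(n_loads, stages):
    """Ideal pipeline: the first load takes the full sum of stage times,
    then each later load finishes one bottleneck interval after the
    previous one (the slowest stage sets the pace)."""
    return sum(stages) + (n_loads - 1) * max(stages)

stages = [30, 40, 20]  # wash, dry, fold (minutes)
print(sequential_time(4, stages))  # 360 minutes = 6 hours
print(pipelined_time(4, stages))   # 210 minutes = 3.5 hours
# Asymptotic bandwidth is set by the slowest stage: 60/40 = 1.5 loads/hour
print(60 / max(stages))            # 1.5
```

For 4 loads the slide's formula 30 + 4*40 + 20 and the model's 90 + 3*40 agree at 210 minutes; both show pipelining improving bandwidth (4/3.5 vs. 4/6 loads/hour) while the latency of each load stays 90 minutes.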
• Pipelining is also used within arithmetic units
  • a floating-point multiply may have a latency of 10 cycles but a throughput of 1 result/cycle

Slide 9: Memory Hierarchy
• Most programs have a high degree of locality in their accesses
  • spatial locality: accessing things nearby previous accesses
  • temporal locality: reusing an item that was previously accessed
• The memory hierarchy tries to exploit locality:
  processor (datapath, control, registers, on-chip cache): ~1 ns, bytes
  second-level cache (SRAM): ~10 ns, KB
  main memory (DRAM): ~100 ns, MB
  secondary storage (disk): ~10 ms, GB
  tertiary storage (disk/tape): ~10 s, TB

Slide 10: Processor-DRAM Gap (latency)
[Figure: performance vs. time, 1980-2000 ("Moore's Law"): processor performance improves ~60%/yr while DRAM latency improves ~7%/yr, so the processor-memory performance gap grows ~50%/yr]
• Memory hierarchies are getting deeper
• Processors get faster more quickly than memory

Slide 11: Approaches to Handling Memory Latency
• Bandwidth has improved more than latency
• Approaches to the memory latency problem:
  • Eliminate memory operations by saving values in small, fast memory (cache) and reusing them
    • needs temporal locality in the program
  • Take advantage of better bandwidth by fetching a chunk of memory, saving it in small fast memory (cache), and using the whole chunk
    • needs spatial locality in the program
  • Take advantage of better bandwidth by allowing the processor to issue multiple reads to the memory system at once
    • concurrency in the instruction stream, e.g. load a whole array, as in vector processors

Slide 12: Cache Basics
• Cache hit: in-cache memory access (cheap)
• Cache miss: non-cached
memory access (expensive)
  • need to access the next, slower level of cache
• Consider a tiny cache (for illustration only)
• Cache line length: the number of bytes loaded together in one entry
  • 2 in the example above
• Associativity
  • direct-mapped: only 1 address (line) in a given range can be in the cache at a time
  • n-way: n >= 2 lines with different addresses can be stored in the cache simultaneously
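The hit/miss distinction and line length above can be made concrete with a toy direct-mapped cache, in the spirit of the slide's tiny illustrative cache (a sketch; the class name and sizes are mine):

```python
class DirectMappedCache:
    """Toy direct-mapped cache: each memory line maps to exactly one slot."""

    def __init__(self, num_lines=4, line_bytes=2):
        self.num_lines = num_lines
        self.line_bytes = line_bytes
        self.slots = [None] * num_lines  # one tag per slot
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        line = addr // self.line_bytes  # which memory line holds addr
        slot = line % self.num_lines    # direct-mapped: one possible slot
        if self.slots[slot] == line:
            self.hits += 1
        else:
            self.misses += 1            # go to the next, slower level...
            self.slots[slot] = line     # ...and fill the slot

c1 = DirectMappedCache()
for addr in range(8):                  # sequential scan: spatial locality
    c1.access(addr)
print(c1.hits, c1.misses)              # 4 4: line length 2, every second byte hits

c2 = DirectMappedCache()
for _ in range(4):                     # repeated scans: temporal locality
    for addr in range(8):
        c2.access(addr)
print(c2.hits, c2.misses)              # 28 4: after pass 1 the array stays resident
```

With a direct-mapped cache, two hot lines that happen to map to the same slot evict each other on every access; n-way associativity (n candidate slots per line) is the standard remedy.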

