Berkeley COMPSCI C267 - Lecture Notes

Contents (slide titles for the full deck): Single Processor Machines: Memory Hierarchies and Processor Features; Motivation; Outline; Slide 4; Idealized Uniprocessor Model; Uniprocessors in the Real World; Slide 7; What is Pipelining?; Example: 5 Steps of MIPS Datapath (Figure 3.4, Page 134, CA:AQA 2e by Patterson and Hennessy); SIMD: Single Instruction, Multiple Data; SSE / SSE2 SIMD on Intel; What does this mean to you?; Slide 13; Memory Hierarchy; Processor-DRAM Gap (latency); Approaches to Handling Memory Latency; Cache Basics; Why Have Multiple Levels of Cache?; Experimental Study of Memory (Membench); Membench: What to Expect; Memory Hierarchy on a Sun Ultra-2i; Memory Hierarchy on a Pentium III; Memory Hierarchy on a Power3 (Seaborg); Memory Performance on Itanium 2 (CITRIS); Stanza Triad; Stanza Triad Results; Lessons; Slide 28; Why Matrix Multiplication?; Matrix-multiply, optimized several ways; Note on Matrix Storage; Using a Simple Model of Memory to Optimize; Warm up: Matrix-vector multiplication; Slide 34; Modeling Matrix-Vector Multiplication; Simplifying Assumptions; Validating the Model; Summary; Extra Slides; Naïve Matrix Multiply; Slide 41; Slide 42; Slide 43; Naïve Matrix Multiply on RS/6000; Slide 45; Blocked (Tiled) Matrix Multiply; Slide 47; Using Analysis to Understand Machines; Limits to Optimizing Matrix Multiply; Basic Linear Algebra Subroutines (BLAS); BLAS speeds on an IBM RS6000/590; Strassen's Matrix Multiply; Strassen (continued); Other Fast Matrix Multiplication Algorithms; Recursive Data Layouts; Slide 56; Search Over Block Sizes; What the Search Space Looks Like; ATLAS (DGEMM n = 500); Tiling Alone Might Not Be Enough; Optimizing in Practice; Removing False Dependencies; Exploit Multiple Registers; Loop Unrolling; Expose Independent Operations; Copy optimization; Locality in Other Algorithms; Slide 69; Reading for Today; Questions You Should Be Able to Answer

Slide 1: Single Processor Machines: Memory Hierarchies and Processor Features
CS267 Lecture 2, 1/19/2007
Kathy Yelick, [email protected], www.cs.berkeley.edu/~yelick/cs267_spr07

Slide 2: Motivation
• Most applications run at < 10% of the "peak" performance of a system
• Much of this performance is lost on a single processor, i.e., the code running on one processor often runs at only 10-20% of the processor peak
• Peak is the maximum the hardware can physically execute
• Most of the single processor performance loss is in the memory system
• To understand this, we need to look under the hood of modern processors
• For today, we will look at only a single "core" processor
  • These issues will still exist on multicore
• Recall: all programmers will soon be performance programmers

Slide 3: Outline
• Idealized and actual costs in modern processors
• Parallelism within single processors
• Memory hierarchies
• Use of microbenchmarks to characterize performance
• Case study: Matrix Multiplication
• Use of performance models to understand performance

Slide 4: Outline (repeats the outline above)

Slide 5: Idealized Uniprocessor Model
• Processor names bytes, words, etc. in its address space
  • These represent integers, floats, pointers, arrays, etc.
• Operations include
  • Read and write into very fast memory called registers
  • Arithmetic and other logical operations on registers
• Order specified by program
  • Read returns the most recently written data
  • Compiler and architecture translate high level expressions into "obvious" lower level instructions
  • Hardware executes instructions in the order specified by the compiler
• Idealized cost
  • Each operation has roughly the same cost (read, write, add, multiply, etc.)
• Example: A = B + C translates to
  Read address(B) to R1
  Read address(C) to R2
  R3 = R1 + R2
  Write R3 to address(A)
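As a concrete illustration of the idealized cost model above, here is a minimal C sketch (not from the slides) of the A = B + C example; the comments mirror the register-level steps from slide 5, and the 4-operation count is the slide's idealized cost, not a real machine measurement.

    #include <stdio.h>

    /* Minimal sketch of the idealized uniprocessor model: every operation
     * (read, write, add, multiply, ...) is assumed to cost about the same. */
    int main(void) {
        double A, B = 2.0, C = 3.0;

        /* Under the idealized model, A = B + C becomes:
         *   Read address(B) to R1
         *   Read address(C) to R2
         *   R3 = R1 + R2
         *   Write R3 to address(A)
         * i.e. 2 reads + 1 add + 1 write = 4 unit-cost operations. */
        A = B + C;

        printf("A = %g (idealized cost: 4 operations)\n", A);
        return 0;
    }

On a real machine the two reads can cost far more than the add, which is exactly the gap the rest of the lecture explores.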
Slide 6: Uniprocessors in the Real World
• Real processors have
  • registers and caches
    • small amounts of fast memory
    • store values of recently used or nearby data
    • different memory ops can have very different costs
  • parallelism
    • multiple "functional units" that can run in parallel
    • different orders and instruction mixes have different costs
  • pipelining
    • a form of parallelism, like an assembly line in a factory
• Why is this your problem?
  • In theory, compilers understand all of this and can optimize your program; in practice they don't.
  • Even if they could optimize one algorithm, they won't know about a different algorithm that might be a much better "match" to the processor

Slide 7: Outline
• Idealized and actual costs in modern processors
• Parallelism within single processors
  • Hidden from software (sort of)
  • Pipelining
  • SIMD units
• Memory hierarchies
• Use of microbenchmarks to characterize performance
• Case study: Matrix Multiplication
• Use of performance models to understand performance

Slide 8: What is Pipelining?
Dave Patterson's laundry example: 4 people doing laundry; wash (30 min) + dry (40 min) + fold (20 min) = 90 min latency per load.
• In this example:
  • Sequential execution takes 4 * 90 min = 6 hours
  • Pipelined execution takes 30 + 4*40 + 20 min = 3.5 hours
• Bandwidth = loads/hour
  • BW = 4/6 loads/hour without pipelining
  • BW = 4/3.5 loads/hour with pipelining
  • BW <= 1.5 loads/hour with pipelining as more loads are added (limited by the 40-minute dryer stage)
• Pipelining helps bandwidth but not latency (each load still takes 90 min)
• Bandwidth is limited by the slowest pipeline stage
• Potential speedup = number of pipe stages
[Figure: loads A-D staggered over 6 PM to 9 PM; stage times 30, 40, 40, 40, 40, 20 minutes]

Slide 9: Example: 5 Steps of MIPS Datapath (Figure 3.4, Page 134, CA:AQA 2e by Patterson and Hennessy)
[Figure: the 5-stage MIPS pipeline: Instruction Fetch, Instruction Decode / Register Fetch, Execute / Address Calc (ALU), Memory Access, Write Back, separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers]
• Pipelining is also used within arithmetic units
  • a floating-point multiply may have a latency of 10 cycles, but a throughput of 1 result per cycle

Slide 10: SIMD: Single Instruction, Multiple Data
• Scalar processing
  • traditional mode
  • one operation produces one result
• SIMD processing
  • with SSE / SSE2
  • one operation produces multiple results
[Figure: a scalar add of X and Y gives one result X + Y; a SIMD add of (x3, x2, x1, x0) and (y3, y2, y1, y0) gives (x3+y3, x2+y2, x1+y1, x0+y0) in one operation]
Slide source: Alex Klimovitski & Dean Macri, Intel Corporation

Slide 11: SSE / SSE2 SIMD on Intel
• SSE2 data types: anything that fits into 16 bytes, e.g.
  • 16x bytes
  • 4x floats
  • 2x doubles
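To make the scalar-versus-SIMD contrast on slides 10 and 11 concrete, here is a minimal C sketch (not from the slides) using the standard SSE intrinsics from <xmmintrin.h>. It assumes an x86 compiler with SSE enabled (e.g. gcc -msse); one packed add fills a 16-byte register with four float results at once.

    #include <stdio.h>
    #include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */

    int main(void) {
        float x[4] = {1.0f, 2.0f, 3.0f, 4.0f};
        float y[4] = {10.0f, 20.0f, 30.0f, 40.0f};
        float scalar_sum[4], simd_sum[4];
        int i;

        /* Scalar processing: one operation produces one result. */
        for (i = 0; i < 4; i++)
            scalar_sum[i] = x[i] + y[i];

        /* SIMD processing: one operation produces four results.
         * A __m128 register holds 16 bytes = 4 floats. */
        __m128 vx = _mm_loadu_ps(x);        /* load x0..x3 */
        __m128 vy = _mm_loadu_ps(y);        /* load y0..y3 */
        __m128 vsum = _mm_add_ps(vx, vy);   /* x0+y0 .. x3+y3 in one add */
        _mm_storeu_ps(simd_sum, vsum);      /* store the four sums */

        for (i = 0; i < 4; i++)
            printf("%g + %g = %g (scalar) / %g (SIMD)\n",
                   x[i], y[i], scalar_sum[i], simd_sum[i]);
        return 0;
    }

The same pattern covers SSE2's other 16-byte types, for example 2x doubles via _mm_add_pd in <emmintrin.h>, which is what the "anything that fits into 16 bytes" bullet refers to.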

