Berkeley COMPSCI C267 - Lecture Notes

Contents (slide titles for the full deck): Single Processor Machines: Memory Hierarchies and Processor Features; Motivation; Outline; Slide 4; Idealized Uniprocessor Model; Uniprocessors in the Real World; Slide 7; What is Pipelining?; Example: 5 Steps of MIPS Datapath (Figure 3.4, Page 134, CA:AQA 2e by Patterson and Hennessy); SIMD: Single Instruction, Multiple Data; SSE / SSE2 SIMD on Intel; What does this mean to you?; Slide 13; Memory Hierarchy; Processor-DRAM Gap (latency); Approaches to Handling Memory Latency; Cache Basics; Why Have Multiple Levels of Cache?; Experimental Study of Memory (Membench); Membench: What to Expect; Memory Hierarchy on a Sun Ultra-2i; Memory Hierarchy on a Pentium III; Memory Hierarchy on a Power3 (Seaborg); Memory Performance on Itanium 2 (CITRIS); Stanza Triad; Stanza Triad Results; Lessons; Slide 28; Why Matrix Multiplication?; Matrix-multiply, optimized several ways; Note on Matrix Storage; Using a Simple Model of Memory to Optimize; Warm up: Matrix-vector multiplication; Slide 34; Modeling Matrix-Vector Multiplication; Simplifying Assumptions; Validating the Model; Summary; Extra Slides; Naïve Matrix Multiply; Slide 41; Slide 42; Slide 43; Naïve Matrix Multiply on RS/6000; Slide 45; Blocked (Tiled) Matrix Multiply; Slide 47; Using Analysis to Understand Machines; Limits to Optimizing Matrix Multiply; Basic Linear Algebra Subroutines (BLAS); BLAS speeds on an IBM RS6000/590; Strassen's Matrix Multiply; Strassen (continued); Other Fast Matrix Multiplication Algorithms; Recursive Data Layouts; Slide 56; Search Over Block Sizes; What the Search Space Looks Like; ATLAS (DGEMM n = 500); Tiling Alone Might Not Be Enough; Optimizing in Practice; Removing False Dependencies; Exploit Multiple Registers; Loop Unrolling; Expose Independent Operations; Copy optimization; Locality in Other Algorithms; Slide 69; Reading for Today; Questions You Should Be Able to Answer

Slide 1: Single Processor Machines: Memory Hierarchies and Processor Features
CS267 Lecture 2, 1/19/2007
Kathy Yelick, [email protected], www.cs.berkeley.edu/~yelick/cs267_spr07

Slide 2: Motivation
• Most applications run at < 10% of the "peak" performance of a system
• Much of this performance is lost on a single processor, i.e., the code running on one processor often runs at only 10-20% of the processor peak
• Peak is the maximum the hardware can physically execute
• Most of the single processor performance loss is in the memory system
• To understand this, we need to look under the hood of modern processors
• For today, we will look at only a single "core" processor
  • These issues will still exist on multicore
• Recall: all programmers will soon be performance programmers

Slide 3: Outline
• Idealized and actual costs in modern processors
• Parallelism within single processors
• Memory hierarchies
• Use of microbenchmarks to characterize performance
• Case study: Matrix Multiplication
• Use of performance models to understand performance

Slide 4: Outline (repeats the outline above)

Slide 5: Idealized Uniprocessor Model
• Processor names bytes, words, etc. in its address space
  • These represent integers, floats, pointers, arrays, etc.
• Operations include
  • Read and write into very fast memory called registers
  • Arithmetic and other logical operations on registers
• Order specified by program
  • Read returns the most recently written data
  • Compiler and architecture translate high level expressions into "obvious" lower level instructions
  • Hardware executes instructions in the order specified by the compiler
• Idealized cost
  • Each operation has roughly the same cost (read, write, add, multiply, etc.)
• Example: A = B + C translates to
  Read address(B) to R1
  Read address(C) to R2
  R3 = R1 + R2
  Write R3 to address(A)
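As a concrete illustration of the idealized cost model above, here is a minimal C sketch (not from the slides) of the A = B + C example; the comments mirror the register-level steps from slide 5, and the 4-operation count is the slide's idealized cost, not a real machine measurement.

    #include <stdio.h>

    /* Minimal sketch of the idealized uniprocessor model: every operation
     * (read, write, add, multiply, ...) is assumed to cost about the same. */
    int main(void) {
        double A, B = 2.0, C = 3.0;

        /* Under the idealized model, A = B + C becomes:
         *   Read address(B) to R1
         *   Read address(C) to R2
         *   R3 = R1 + R2
         *   Write R3 to address(A)
         * i.e. 2 reads + 1 add + 1 write = 4 unit-cost operations. */
        A = B + C;

        printf("A = %g (idealized cost: 4 operations)\n", A);
        return 0;
    }

On a real machine the two reads can cost far more than the add, which is exactly the gap the rest of the lecture explores.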
Slide 6: Uniprocessors in the Real World
• Real processors have
  • registers and caches
    • small amounts of fast memory
    • store values of recently used or nearby data
    • different memory ops can have very different costs
  • parallelism
    • multiple "functional units" that can run in parallel
    • different orders and instruction mixes have different costs
  • pipelining
    • a form of parallelism, like an assembly line in a factory
• Why is this your problem?
  • In theory, compilers understand all of this and can optimize your program; in practice they don't.
  • Even if they could optimize one algorithm, they won't know about a different algorithm that might be a much better "match" to the processor

Slide 7: Outline
• Idealized and actual costs in modern processors
• Parallelism within single processors
  • Hidden from software (sort of)
  • Pipelining
  • SIMD units
• Memory hierarchies
• Use of microbenchmarks to characterize performance
• Case study: Matrix Multiplication
• Use of performance models to understand performance

Slide 8: What is Pipelining?
Dave Patterson's laundry example: 4 people doing laundry; wash (30 min) + dry (40 min) + fold (20 min) = 90 min latency per load.
• In this example:
  • Sequential execution takes 4 * 90 min = 6 hours
  • Pipelined execution takes 30 + 4*40 + 20 min = 3.5 hours
• Bandwidth = loads/hour
  • BW = 4/6 loads/hour without pipelining
  • BW = 4/3.5 loads/hour with pipelining
  • BW <= 1.5 loads/hour with pipelining as more loads are added (limited by the 40-minute dryer stage)
• Pipelining helps bandwidth but not latency (each load still takes 90 min)
• Bandwidth is limited by the slowest pipeline stage
• Potential speedup = number of pipe stages
[Figure: loads A-D staggered over 6 PM to 9 PM; stage times 30, 40, 40, 40, 40, 20 minutes]

Slide 9: Example: 5 Steps of MIPS Datapath (Figure 3.4, Page 134, CA:AQA 2e by Patterson and Hennessy)
[Figure: the 5-stage MIPS pipeline: Instruction Fetch, Instruction Decode / Register Fetch, Execute / Address Calc (ALU), Memory Access, Write Back, separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers]
• Pipelining is also used within arithmetic units
  • a floating-point multiply may have a latency of 10 cycles, but a throughput of 1 result per cycle

Slide 10: SIMD: Single Instruction, Multiple Data
• Scalar processing
  • traditional mode
  • one operation produces one result
• SIMD processing
  • with SSE / SSE2
  • one operation produces multiple results
[Figure: a scalar add of X and Y gives one result X + Y; a SIMD add of (x3, x2, x1, x0) and (y3, y2, y1, y0) gives (x3+y3, x2+y2, x1+y1, x0+y0) in one operation]
Slide source: Alex Klimovitski & Dean Macri, Intel Corporation

Slide 11: SSE / SSE2 SIMD on Intel
• SSE2 data types: anything that fits into 16 bytes, e.g.
  • 16x bytes
  • 4x floats
  • 2x doubles
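To make the scalar-versus-SIMD contrast on slides 10 and 11 concrete, here is a minimal C sketch (not from the slides) using the standard SSE intrinsics from <xmmintrin.h>. It assumes an x86 compiler with SSE enabled (e.g. gcc -msse); one packed add fills a 16-byte register with four float results at once.

    #include <stdio.h>
    #include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */

    int main(void) {
        float x[4] = {1.0f, 2.0f, 3.0f, 4.0f};
        float y[4] = {10.0f, 20.0f, 30.0f, 40.0f};
        float scalar_sum[4], simd_sum[4];
        int i;

        /* Scalar processing: one operation produces one result. */
        for (i = 0; i < 4; i++)
            scalar_sum[i] = x[i] + y[i];

        /* SIMD processing: one operation produces four results.
         * A __m128 register holds 16 bytes = 4 floats. */
        __m128 vx = _mm_loadu_ps(x);        /* load x0..x3 */
        __m128 vy = _mm_loadu_ps(y);        /* load y0..y3 */
        __m128 vsum = _mm_add_ps(vx, vy);   /* x0+y0 .. x3+y3 in one add */
        _mm_storeu_ps(simd_sum, vsum);      /* store the four sums */

        for (i = 0; i < 4; i++)
            printf("%g + %g = %g (scalar) / %g (SIMD)\n",
                   x[i], y[i], scalar_sum[i], simd_sum[i]);
        return 0;
    }

The same pattern covers SSE2's other 16-byte types, for example 2x doubles via _mm_add_pd in <emmintrin.h>, which is what the "anything that fits into 16 bytes" bullet refers to.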

