Berkeley COMPSCI C267 - Uniprocessor Optimizations and Matrix Multiplication - D1644244

Home> Schools> University of California, Berkeley> Computer Science (COMPSCI) > COMPSCI C267> Uniprocessor Optimizations and Matrix Multiplication

DOC PREVIEW

Berkeley COMPSCI C267 - Uniprocessor Optimizations and Matrix Multiplication

School name University of California, Berkeley

Course Compsci C267- Applications of Parallel Computers

Pages 38

This preview shows page 1-2-3-18-19-36-37-38 out of 38 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 38 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 38 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 38 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 38 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 38 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 38 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 38 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 38 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 38 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

Uniprocessor Optimizations and Matrix MultiplicationOutlineIdealized Uniprocessor ModelUniprocessors in the Real WorldWhat is Pipelining?Example: 5 Steps of MIPS Datapath Figure 3.4, Page 134 , CA:AQA 2e by Patterson and HennessyLimits to Instruction Level Parallelism (ILP)Dependences (Data Hazards) Limit ParallelismSlide 9Memory HierarchyProcessor-DRAM Gap (latency)Cache BasicsExperimental Study of MemoryMemory Hierarchy on a Sun Ultra-IIiMemory Hierarchy on a Pentium IIILessonsSlide 17Note on Matrix StorageUsing a Simple Model of Memory to OptimizeWarm up: Matrix-vector multiplicationSlide 21“Naïve” Matrix MultiplySlide 23Slide 24Blocked (Tiled) Matrix MultiplySlide 26Limits to Optimizing Matrix MultiplyBasic Linear Algebra SubroutinesBLAS speeds on an IBM RS6000/590Locality in Other AlgorithmsSlide 31Tiling Alone Might Not Be EnoughOptimizing in PracticePHiPAC: Portable High Performance ANSI CATLAS (DGEMM n = 500)SummaryPowerPoint PresentationObserving a Memory Hierarchy01/14/19 CS267 Lecure 2 1Uniprocessor Optimizations and Matrix Multiplication Kathy Yelickhttp://inst.eecs.berkeley.edu/~cs26701/14/19 CS267 Lecure 2 2Outline•Parallelism in Modern Processors•Memory Hierarchies•Matrix Multiply Cache Optimizations•Bag of Tricks01/14/19 CS267 Lecure 2 3Idealized Uniprocessor Model•Processor names bytes, words, etc. in its address space•These represent integers, floats, pointers, arrays, etc.•Exist in the program stack, static region, or heap•Operations include•Read and write (given an address/pointer)•Arithmetic and other logical operations•Order specified by program•Read returns the most recently written data•Compiler and architecture translate high level expressions into “obvious” lower level instructions•Hardware executes instructions in order specified by compiler•Cost•Each operations has roughly the same cost(read, write, add, multiply, etc.)01/14/19 CS267 Lecure 2 4Uniprocessors in the Real World•Real processors have•registers and caches•small amounts of fast memory•store values of recently used or nearby data•different memory ops can have very different costs•parallelism•multiple “functional units” that can run in parallel•different orders, instruction mixes have different costs•pipelining•a form of parallelism, like an assembly line in a factory•Why is this your problem?In theory, compilers understand all of this and can optimize your program; in practice they don’t.5CS267 Lecure 201/14/19What is Pipelining? •In this example:•Sequential execution takes 4 * 90min = 6 hours•Pipelined execution takes 30+4*40+20 = 3.3 hours•Pipelining helps throughput, but not latency •Pipeline rate limited by slowest pipeline stage•Potential speedup = Number pipe stages•Time to “fill” pipeline and time to “drain” it reduces speedupABCD6 PM7 8 9TaskOrderTime30 40 40 40 40 20Dave Patterson’s Laundry example: 4 people doing laundrywash (30 min) + dry (40 min) + fold (20 min)01/14/19 CS267 Lecure 2 6Example: 5 Steps of MIPS DatapathFigure 3.4, Page 134 , CA:AQA 2e by Patterson and HennessyMemoryAccessWriteBackInstructionFetchInstr. DecodeReg. FetchExecuteAddr. CalcALUMemoryReg FileMUX MUXDataMemoryMUXSignExtendZero?IF/IDID/EXMEM/WBEX/MEM4AdderNext SEQ PCNext SEQ PCRD RD RDWB Data• Pipelining is also used within arithmetic units– a fp multiply may have latency 10 cycles, but throughput of 1/cycleNext PCAddressRS1RS2ImmMUX7CS267 Lecure 201/14/19Limits to Instruction Level Parallelism (ILP)•Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle•Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away)•Data hazards: Instruction depends on result of prior instruction still in the pipeline (missing sock)•Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).•The hardware and compiler will try to reduce these:•Reordering instructions, multiple issue, dynamic branch prediction, speculative execution…•You can also enable parallelism by careful coding01/14/19 CS267 Lecure 2 8Dependences (Data Hazards) Limit Parallelism•A dependence or data hazard is one of the following:•true of flow dependence:•a writes a location that b later reads•(read-after write or RAW hazard)•anti-dependence•a reads a location that b later writes•(write-after-read or WAR hazard)•output dependence•a writes a location that b later writes•(write-after-write or WAW hazard)true anti outputa = = a a = a = = a a =01/14/19 CS267 Lecure 2 9Outline•Parallelism in Modern Processors•Memory Hierarchies•Matrix Multiply Cache Optimizations•Bag of Tricks01/14/19 CS267 Lecure 2 10Memory Hierarchy•Most programs have a high degree of locality in their accesses•spatial locality: accessing things nearby previous accesses•temporal locality: reusing an item that was previously accessed•Memory hierarchy tries to exploit localityon-chip cacheregistersdatapathcontrolprocessorSecond level cache (SRAM)Main memory(DRAM)Secondary storage (Disk)Tertiary storage(Disk/Tape)Speed (ns): 1 10 100 10 ms 10 secSize (bytes): 100s Ks Ms Gs Ts01/14/19 CS267 Lecure 2 11Processor-DRAM Gap (latency)µProc60%/yr.DRAM7%/yr.110100100019801981198319841985198619871988198919901991199219931994199519961997199819992000DRAMCPU1982Processor-MemoryPerformance Gap:(grows 50% / year)PerformanceTime“Moore’s Law”•Memory hierarchies are getting deeper•Processors get faster more quickly than memory01/14/19 CS267 Lecure 2 12Cache BasicsX000 X001X010 X011X100 X101X110 X111•Cache hit: in-cache memory access—cheap•Cache miss: non-cached memory access—expensive•Consider a tiny cache (for illustration only)•Cache line length: # of bytes loaded together in one entry•Associativity•direct-mapped: only one address (line) in a given range in cache•n-way: 2 or more lines with different addresses exist01/14/19 CS267 Lecure 2 13Experimental Study of Memory•Microbenchmark for memory system performance• time the following program for each size(A) and stride s (repeat to obtain confidence and mitigate timer resolution) for array A of size from 4KB to 8MB by 2x for stride s from 8 Bytes (1 word) to size(A)/2 by 2x

View Full Document

Berkeley COMPSCI C267 - Uniprocessor Optimizations and Matrix Multiplication

Sign up for free to view:

This document and 3 million+ documents and flashcards
High quality study guides, lecture notes, practice exams
Course Packets handpicked by editors offering a comprehensive review of your courses
Better Grades Guaranteed


School:
Email:
New Password:
Confirm Password:

This preview shows page 1-2-3-18-19-36-37-38 out of 38 pages.

Berkeley COMPSCI C267 - Uniprocessor Optimizations and Matrix Multiplication

Sign up for free to view:

Please select your school