Berkeley COMPSCI C267 - Single Processor Machines: Memory Hierarchies and Processor Features


1/19/2007 — CS267 Lecture 2
Single Processor Machines: Memory Hierarchies and Processor Features
Kathy Yelick
[email protected]/~yelick/cs267_spr07

Motivation
• Most applications run at < 10% of the “peak” performance of a system
• Much of this performance is lost on a single processor: the code running on one processor often runs at only 10–20% of the processor peak
• Peak is the maximum the hardware can physically execute
• Most of the single-processor performance loss is in the memory system
• To understand this, we need to look under the hood of modern processors
• For today, we will look at only a single “core” processor; these issues will still exist on multicore
• Recall: all programmers will soon be performance programmers

Outline
• Idealized and actual costs in modern processors
• Parallelism within single processors
• Memory hierarchies
• Use of microbenchmarks to characterize performance
• Case study: matrix multiplication
• Use of performance models to understand performance

Idealized Uniprocessor Model
• Processor names bytes, words, etc. in its address space
  • These represent integers, floats, pointers, arrays, etc.
• Operations include
  • Reads and writes to very fast memory called registers
  • Arithmetic and other logical operations on registers
• Order is specified by the program
  • A read returns the most recently written data
  • Compiler and architecture translate high-level expressions into “obvious” lower-level instructions
  • Hardware executes instructions in the order specified by the compiler
• Idealized cost
  • Each operation has roughly the same cost (read, write, add, multiply, etc.)
• Example: A = B + C becomes
  Read address(B) to R1
  Read address(C) to R2
  R3 = R1 + R2
  Write R3 to address(A)

Uniprocessors in the Real World
• Real processors have
  • registers and caches: small amounts of fast memory that store values of recently used or nearby data; different memory operations can have very different costs
  • parallelism: multiple “functional units” that can run in parallel; different instruction orders and mixes have different costs
  • pipelining: a form of parallelism, like an assembly line in a factory
• Why is this your problem?
  • In theory, compilers understand all of this and can optimize your program; in practice they don’t.
  • Even if they could optimize one algorithm, they won’t know about a different algorithm that might be a much better “match” to the processor

Outline
• Idealized and actual costs in modern processors
• Parallelism within single processors
  • Hidden from software (sort of)
  • Pipelining
  • SIMD units
• Memory hierarchies
• Use of microbenchmarks to characterize performance
• Case study: matrix multiplication
• Use of performance models to understand performance

What is Pipelining?
Dave Patterson’s laundry example: 4 people doing laundry; wash (30 min) + dry (40 min) + fold (20 min) = 90 min latency per load. (Figure: pipeline timing diagram for tasks A–D, stage times 30, 40, 40, 40, 40, 20.)
• In this example:
  • Sequential execution takes 4 × 90 min = 6 hours
  • Pipelined execution takes 30 + 4×40 + 20 = 210 min = 3.5 hours
• Bandwidth = loads/hour
  • BW = 4/6 loads/hour without pipelining
  • BW = 4/3.5 loads/hour with pipelining
  • BW ≤ 1.5 loads/hour with pipelining, given more total loads
• Pipelining helps bandwidth but not latency (still 90 min per load)
• Bandwidth is limited by the slowest pipeline stage
• Potential speedup = number of pipeline stages

Example: 5 Steps of the MIPS Datapath
(Figure 3.4, page 134, CA:AQA 2e by Patterson and Hennessy: Instruction Fetch → Instruction Decode / Register Fetch → Execute / Address Calc → Memory Access → Write Back, with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB.)
• Pipelining is also used within arithmetic units: a floating-point multiply may have a latency of 10 cycles but a throughput of 1 per cycle

SIMD: Single Instruction, Multiple Data
• Scalar processing (traditional mode): one operation produces one result, X + Y
• SIMD processing (with SSE / SSE2): one operation produces multiple results, e.g., (x3, x2, x1, x0) + (y3, y2, y1, y0) = (x3+y3, x2+y2, x1+y1, x0+y0)
Slide source: Alex Klimovitski & Dean Macri, Intel Corporation

SSE / SSE2 SIMD on Intel
• SSE2 data types: anything that fits into 16 bytes, e.g., 16 × bytes, 4 × floats, 2 × doubles
• Instructions perform add, multiply, etc. on all the data in this 16-byte register in parallel
• Challenges:
  • Data needs to be contiguous in memory and aligned
  • Some instructions move data around from one part of a register to another

What does this mean to you?
• In addition to SIMD extensions, the processor may have other special instructions
  • Fused multiply-add (FMA): x = y + c * z is so common that some processors execute the multiply and add as a single instruction, at the same rate (bandwidth) as + or * alone
• In theory, the compiler understands all of this
  • When compiling, it will rearrange instructions to get a good “schedule” that maximizes pipelining and uses FMAs and SIMD
  • It works with the mix of instructions inside an inner loop or other block of code
• But in practice the compiler may need your help
  • Choose a different compiler, optimization flags, etc.
  • Rearrange your code to make things more obvious
  • Use special functions (“intrinsics”) or write in assembly

Outline
• Idealized and actual costs in modern processors
• Parallelism within single processors
• Memory hierarchies
  • Temporal and spatial locality
  • Basics of caches
• Use of microbenchmarks to characterize performance
• Case study: matrix multiplication
• Use of performance models to understand performance

Memory Hierarchy
• Most programs have a high degree of locality in their accesses
  • spatial locality: accessing things nearby previous accesses
  • temporal locality: reusing an item that was previously accessed
• Memory hierarchy tries to exploit […]
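The pipelining arithmetic from the laundry slide can be checked with a short sketch (a minimal illustration; the function names are my own, not from the lecture):

```python
def sequential_time(n_loads, stage_times):
    # Each load runs wash -> dry -> fold to completion before the next starts.
    return n_loads * sum(stage_times)

def pipelined_time(n_loads, stage_times):
    # The first load passes through every stage; each later load finishes one
    # bottleneck-stage interval after the previous one (BW limited by the
    # slowest stage, as the slide says).
    return sum(stage_times) + (n_loads - 1) * max(stage_times)

stages = [30, 40, 20]  # wash, dry, fold (minutes)
print(sequential_time(4, stages))  # 4 * 90 = 360 min = 6 hours
print(pipelined_time(4, stages))   # 90 + 3 * 40 = 210 min = 3.5 hours
```

Note that `pipelined_time(1, stages)` is still 90 minutes: pipelining improves bandwidth, not the latency of a single load.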

