Berkeley COMPSCI C267 - Lecture Notes - D932397

Home> Schools> University of California, Berkeley> Computer Science (COMPSCI) > COMPSCI C267> Lecture Notes

DOC PREVIEW

Berkeley COMPSCI C267 - Lecture Notes

School name University of California, Berkeley

Course Compsci C267- Applications of Parallel Computers

Pages 60

This preview shows page 1-2-3-4-28-29-30-31-57-58-59-60 out of 60 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 60 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 60 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 60 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 60 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 60 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 60 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 60 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 60 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 60 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 60 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 60 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 60 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 60 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

4/20/2007 CS267, Yelick 1CS 267: Applications of Parallel ComputersPerformance Modeling and AnalysisKathy Yelickhttp://www.cs.berkeley.edu/~yelick/cs2674/20/2007 CS267, Yelick 2Outline• Some Performance Laws• Performance analysis• Performance modeling• Parallel Sorting: combining models with measurments• Reading: • Chapter 3 of Foster’s “Designing and Building Parallel Programs”online text• http://www-unix.mcs.anl.gov/dbpp/text/node26.html• Abbreviated as DBPP in this lecture• David Bailey’s “Twelve Ways to Fool the Masses”4/20/2007 CS267, Yelick 3Measuring Performance• Performance criterion may vary with domain• There may be limits on acceptable running time• E.g., a climate model must run 1000x faster than real time.• Any performance improvement may be acceptable• E.g., faster on 4 cores than on 1. • Throughout may be more critical than latency• E.g., number of images processed per minute (throughput) vs. total delay for one image (latency) in a pipelined system.• Execution time per unit cost• E.g., GFlop/sec, GFlop/s/$ or GFlop/s/Watt• Parallel scalability (speedup or parallel efficiency)• Percent relative to best possible (some kind of peak)4/20/2007 CS267, Yelick 4Amdahl’s Law (review)• Suppose only part of an application seems parallel• Amdahl’s law• let s be the fraction of work done sequentially, so (1-s) is fraction parallelizable• P = number of processorsSpeedup(P) = Time(1)/Time(P)<= 1/(s + (1-s)/P) <= 1/s• Even if the parallel part speeds up perfectly performance is limited by the sequential part4/20/2007 CS267, Yelick 5Amdahl’s Law (for 1024 processors)Does this mean parallel computing is a hopeless enterprise?Source: Gustafson, Montry, Benner4/20/2007 CS267, Yelick 6Scaled SpeedupSee: Gustafson, Montry, Benner, “Development of Parallel Methods for a 1024 Processor Hypercube”, SIAM J. Sci. Stat. Comp. 9, No. 4, 1988, pp.609.4/20/2007 CS267, Yelick 7Scaled Speedup (background)4/20/2007 CS267, Yelick 8Little’s Law• Latency vs. Bandwidth• Latency is physics (wire length)• e.g., the network latency on the Earth Simulation is only about 2x times the speed of light across the machine room• Bandwidth is cost: • add more cables to increase bandwidth (over-simplification)• Principle (Little's Law): the relationship of a production system in steady state is: Inventory = Throughput × Flow Time• For parallel computing, Little’s Law is about the required concurrency to be limited by bandwidth rather than latency• Required concurrency = Bandwidth * Latencybandwidth-delay product• For parallel computing, this means:Concurrency = bandwidth x latency4/20/2007 CS267, Yelick 9Little’s Law• Example 1: a single processor:• If the latency is to memory is 50ns, and the bandwidth is 5 GB/s(.2ns / Bytes = 12.8 ns / 64-byte cache line)• The system must support 50/12.8 ~= 4 outstanding cache line misses to keep things balanced (run at bandwidth speed)• An application must be able to prefetch 4 cache line misses in parallel (without dependencies between them)• Example 2: 1000 processor system• 1 GHz clock, 100 ns memory latency, 100 words of memory in data paths between CPU and memory.• Main memory bandwidth is:~ 1000 x 100 words x 109/s = 1014words/sec.• To achieve full performance, an application needs:~ 10-7x 1014= 107way concurrency(some of that may be hidden in the instruction stream)4/20/2007 CS267, Yelick 10In Performance Analysis: Use more Data• Whenever possible, use a large set of data rather than one or a few isolated points. A single point has little information.• E.g., from DBPP: • Serial algorithm scales as N + N2• Speedup of 10.8 on 12 processors with problem size N=100• Case 1: T = N + N2/P• Case 2: T = (N + N2)/P + 100• Case 2: T = (N + N2)/P + 0.6*P2• All have speedup ~10.8 on 12 procs• Performance graphs (n = 100, 1000) show differences in scaling4/20/2007 CS267, Yelick 11Example: Immersed Boundary SimulationJoint work with Ed Givelberg, Armando Solar-LezamaTime per timestep 0204060801001248163264128# procstime (secs)Pow3/SP 256^3Pow3/SP 512^3P4/Myr 512^2x256• Using Seaborg (Power3) at NERSC and DataStar (Power4) at SDSC• How useful is this data? What are ways to make is more useful/interesting?4/20/2007 CS267, Yelick 12Performance Analysistime breakdown0%10%20%30%40%50%60%70%80%90%100%256 on 2256 on 4256 on 8256 on 16256 on 32256 on 64512 on 32512 on 64512 on 128move Unpack USend velocitiesPack UCopy fluid UInverse FFTsSolveXformXEqnsForward FFTsUpwindExchange ghostUnpack FSet F = 0Send FPack FSpread FCompute F4/20/2007 CS267, Yelick 13Building a Performance Model• Based on measurements/scaling of components• FFT is time is: • 5*nlogn flops * flops/sec (measured for FFT)• Other costs are linear in either material or fluid points• Measure constants: a) # flops/point (independent machine or problem size)b) Flops/sec (measured per machine, per phase)• Time is: a * b * #points• Communication done similarly• Find formula for message size as function of problem size• Check the formula using tracing of some kind• Use α/β model to predict running time: α + β * size4/20/2007 CS267, Yelick 14A Performance Model• 5123in < 1 second per timestep not possible • Primarily limited by bisection bandwidthPerformance Model Validation0.1110100100024816326412825651210242048# procstime (secs)Total time (256 model)Total time (256 actual)Total time (512 model)Total time (512 actual)4/20/2007 CS267, Yelick 15Model Success and Failure4/20/2007 CS267, Yelick 16OSKI SPMV: What We Expect• Assume• Cost(SpMV) = time to read matrix• 1 double-word = 2 integers• r, c in {1, 2, 4, 8}• CSR: 1 int / non-zero• BCSR(r x c): 1 int / (r*c non-zeros)• As r*c increases, speedup should• Increase smoothly• Approach 1.5 (eliminate all index overhead)5.1115.1),(,⎯⎯→⎯+≈=∞=crBCSRCSRrccrTTSpeedup4/20/2007 CS267, Yelick 17What We Get (The Need for Search)ReferenceBest: 4x2Mflop/sMflop/s4/20/2007 CS267, Yelick 18Using Multiple Models4/20/2007 CS267, Yelick 19Multiple Models4/20/2007 CS267, Yelick 20Multiple ModelsExtended ExampleUsing Performance Modeling (LogP)To Explain DataApplication to SortingDeriving the LogP Model° Processing– powerful microprocessor, large DRAM, cache => P° Communication+ significant latency (100's –1000’s of cycles) => L+ limited bandwidth (1 – 5% of memory bw) => g+

View Full Document

Berkeley COMPSCI C267 - Lecture Notes

Sign up for free to view:

This document and 3 million+ documents and flashcards
High quality study guides, lecture notes, practice exams
Course Packets handpicked by editors offering a comprehensive review of your courses
Better Grades Guaranteed


School:
Email:
New Password:
Confirm Password:

This preview shows page 1-2-3-4-28-29-30-31-57-58-59-60 out of 60 pages.

Berkeley COMPSCI C267 - Lecture Notes

Sign up for free to view:

Please select your school