DOC PREVIEW
Berkeley COMPSCI C267 - LogP and the Implementation and Modeling of Parallel Sorts

This preview shows page 1-2-3-20-21-22-41-42-43 out of 43 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 43 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 43 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 43 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 43 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 43 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 43 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 43 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 43 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 43 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 43 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

CS 267 Applications of Parallel Computers Lecture 28: LogP and the Implementation and Modeling of Parallel SortsPractical Performance Target (circa 1992)Studies on Parallel SortingThe StudyLogPDeriving the LogP ModelSlide 7Using the ModelUse of the Model (cont)LogP "philosophy"Typical SortSplit-CBasic Costs of operations in Split-CLogP modelSortingLocal Sort Performance (11 bit radix sort of 32 bits numbers)Local Computation Parameters - EmpiricalBottom Line (Preview)Odd-Even Merge - classic parallel sortWhere’s the Parallelism?Mapping to a Butterfly (or Hypercube)Bitonic Sort with N/P per nodeBitonic SortAnalysis of BitonicBitonic Sort: time per keyBitonic: BreakdownBitonic: Effect of Key DistributionsColumn SortColumn Sort: TimesColumn: BreakdownColumn: Key distributionsHisto-radix sortHisto-Radix Sort (again)Radix Sort: TimesRadix: BreakdownRadix: Key distributionRadix: Stream Broadcast ProblemWhat’s the right communication mechanism?Sample SortSample Sort: TimesSample BreakdownComparisonConclusionsCuller 1997CS267 L28 Sort.1CS 267 Applications of Parallel ComputersLecture 28: LogP and the Implementation and Modeling of Parallel SortsJames Demmel(taken from David Culler,Lecture 18, CS267, 1997)http://www.cs.berkeley.edu/~demmel/cs267_Spr99Culler 1997CS267 L28 Sort.2Practical Performance Target (circa 1992)•Sort one billion large keys in one minute on one thousand processors.•Good sort on a workstation can do 1 million keys in about 10 seconds–just fits in memory–16 bit Radix Sort•Performance unit: µs per key per processor– s ~ 10 for single Sparc 2Culler 1997CS267 L28 Sort.3Studies on Parallel SortingSorting NetworksPRAM SortsMEMp p p°°°Sorting on Network YPMnetworkPMPM°°°LogP SortsSorting onMachine XCuller 1997CS267 L28 Sort.4The StudyInteresting ParallelSorting AlgorithmsAnalyze under LogPParametersfor CM-5Estimate ExecutionTimeImplement in Split-CExecute on CM-5Compare??(Bitonic, Column, Histo- radix, Sample)Culler 1997CS267 L28 Sort.5LogPCuller 1997CS267 L28 Sort.6 Deriving the LogP Model° Processing– powerful microprocessor, large DRAM, cache => P° Communication+ significant latency (100's of cycles) => L+ limited bandwidth (1 – 5% of memory bw) => g+ significant overhead (10's – 100's of cycles) => o- on both ends– no consensus on topology=> should not exploit structure+ limited capacity– no consensus on programming model=> should not enforce oneCuller 1997CS267 L28 Sort.7LogPInterconnection NetworkMPMPMP° ° °P ( processors )Limited Volume( L/ g to or from a proc)o (overhead)L (latency)og (gap)•Latency in sending a (small) mesage between modules•overhead felt by the processor on sending or receiving msg•gap between successive sends or receives (1/BW)•ProcessorsCuller 1997CS267 L28 Sort.8 Using the Model° Send n messages from proc to proc in time 2o + L + g(n-1)– each processor does o n cycles of overhead– has (g-o)(n-1) + L available compute cycles° Send n messages from one to many in same time° Send n messages from many to one in same time– all but L/g processors block so fewer available cycleso LooogLtimePPCuller 1997CS267 L28 Sort.9Use of the Model (cont)° Two processors sending n words to each other (i.e., exchange) in time2o + L + max(g,2o) (n-1)  max(g,2o) + L° P processors each sending n words to all processors (n/P each) in a static, balanced pattern without conflicts , e.g., transpose, fft, cyclic-to-block, block-to-cyclicsameexercise: what’s wrong with the formula above? Assumes optimal pattern of send/receive, so could underestimate timeCuller 1997CS267 L28 Sort.10LogP "philosophy"•Think about:•– mapping of N words onto P processors•– computation within a processor, its cost, and balance •– communication between processors, its cost, and balance•given a charaterization of processor and network performance•Do not think about what happens within the networkThis should be good enough!Culler 1997CS267 L28 Sort.11Typical SortExploits the n = N/P grouping° Significant local computation° Very general global communication / transformation° Computation of the transformationCuller 1997CS267 L28 Sort.12Split-CGlobal Address SpaceP0Pprocs-1P1local•Explicitly parallel C•2D global address space–linear ordering on local spaces•Local and Global pointers –spread arrays too•Read/Write•Get/Put (overap compute and comm)–x := G; . . .–sync();•Signaling store (one-way)–G :– x; . . .–store_sync(); or all_store_sync();•Bulk transfer •Global comm.Culler 1997CS267 L28 Sort.13Basic Costs of operations in Split-C•Read, Write x = *G, *G = x 2 (L + 2o)•Store *G :– x L + 2o•Get x := *G o .... 2L + 2o sync(); o–with interval g•Bulk store (n words with  words/message) 2o + (n-1)g + L•Exchange 2o + 2L + (n  L/g) max(g,2o)•One to many•Many to oneCuller 1997CS267 L28 Sort.14LogP model•CM5:–L = 6 µs–o = 2.2 µs–g = 4 µs–P varies from 32 to 1024•NOW–L = 8.9–o = 3.8–g = 12.8–P varies up to 100 •What is the processor performance?Culler 1997CS267 L28 Sort.15SortingCuller 1997CS267 L28 Sort.16Local Sort Performance (11 bit radix sort of 32 bits numbers)Log N/Pµs / Key0123456789100 5 1015 203125.116.910.46.2Entropy inKey ValuesEntropy = -i pi log pi , pi = Probability of key i<--------- TLB misses ---------->Culler 1997CS267 L28 Sort.17Local Computation Parameters - EmpiricalParameter Operation µs per key SortSwap Simulate cycle butterfly per key 0.025 lg N Bitonicmergesort Sort bitonic sequence 1.0scatter Move key for Cyclic-to-block 0.46gather Move key for Block-to-cyclic 0.52 if n<=64k or P<=64 Bitonic & Column 1.1 otherwiselocal sort Local radix sort (11 bit) 4.5 if n < 64K 9.0 - (281000/n)merge Merge sorted lists 1.5 Columncopy Shift Key 0.5zero Clear histogram bin 0.2 Radixhist produce histogram 1.2add produce scan value 1.0bsum adjust scan of bins 2.5address determine desitination 4.7compare compare key to splitter 0.9 Samplelocalsort8 local radix sort of samples 5.0Culler 1997CS267 L28 Sort.18Bottom Line (Preview)N/Pus/key0.0020.0040.0060.0080.00100.00120.00140.001638432768655361310722621445242881048576Bitonic 1024Bitonic 32Column 1024Column 32Radix 1024Radix 32Sample 1024Sample 32•Good fit between predicted and measured


View Full Document

Berkeley COMPSCI C267 - LogP and the Implementation and Modeling of Parallel Sorts

Documents in this Course
Lecture 4

Lecture 4

52 pages

Split-C

Split-C

5 pages

Lecture 5

Lecture 5

40 pages

Load more
Download LogP and the Implementation and Modeling of Parallel Sorts
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view LogP and the Implementation and Modeling of Parallel Sorts and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view LogP and the Implementation and Modeling of Parallel Sorts 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?