Berkeley COMPSCI C267 - LogP and the Implementation and Modeling of Parallel Sorts - D3047529

Home> Schools> University of California, Berkeley> Computer Science (COMPSCI) > COMPSCI C267> LogP and the Implementation and Modeling of Parallel Sorts

DOC PREVIEW

Berkeley COMPSCI C267 - LogP and the Implementation and Modeling of Parallel Sorts

School name University of California, Berkeley

Course Compsci C267- Applications of Parallel Computers

Pages 43

This preview shows page 1-2-3-20-21-22-41-42-43 out of 43 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 43 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 43 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 43 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 43 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 43 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 43 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 43 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 43 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 43 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 43 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

CS 267 Applications of Parallel Computers Lecture 28: LogP and the Implementation and Modeling of Parallel SortsPractical Performance Target (circa 1992)Studies on Parallel SortingThe StudyLogPDeriving the LogP ModelSlide 7Using the ModelUse of the Model (cont)LogP "philosophy"Typical SortSplit-CBasic Costs of operations in Split-CLogP modelSortingLocal Sort Performance (11 bit radix sort of 32 bits numbers)Local Computation Parameters - EmpiricalBottom Line (Preview)Odd-Even Merge - classic parallel sortWhere’s the Parallelism?Mapping to a Butterfly (or Hypercube)Bitonic Sort with N/P per nodeBitonic SortAnalysis of BitonicBitonic Sort: time per keyBitonic: BreakdownBitonic: Effect of Key DistributionsColumn SortColumn Sort: TimesColumn: BreakdownColumn: Key distributionsHisto-radix sortHisto-Radix Sort (again)Radix Sort: TimesRadix: BreakdownRadix: Key distributionRadix: Stream Broadcast ProblemWhat’s the right communication mechanism?Sample SortSample Sort: TimesSample BreakdownComparisonConclusionsCuller 1997CS267 L28 Sort.1CS 267 Applications of Parallel ComputersLecture 28: LogP and the Implementation and Modeling of Parallel SortsJames Demmel(taken from David Culler,Lecture 18, CS267, 1997)http://www.cs.berkeley.edu/~demmel/cs267_Spr99Culler 1997CS267 L28 Sort.2Practical Performance Target (circa 1992)•Sort one billion large keys in one minute on one thousand processors.•Good sort on a workstation can do 1 million keys in about 10 seconds–just fits in memory–16 bit Radix Sort•Performance unit: µs per key per processor– s ~ 10 for single Sparc 2Culler 1997CS267 L28 Sort.3Studies on Parallel SortingSorting NetworksPRAM SortsMEMp p p°°°Sorting on Network YPMnetworkPMPM°°°LogP SortsSorting onMachine XCuller 1997CS267 L28 Sort.4The StudyInteresting ParallelSorting AlgorithmsAnalyze under LogPParametersfor CM-5Estimate ExecutionTimeImplement in Split-CExecute on CM-5Compare??(Bitonic, Column, Histo- radix, Sample)Culler 1997CS267 L28 Sort.5LogPCuller 1997CS267 L28 Sort.6 Deriving the LogP Model° Processing– powerful microprocessor, large DRAM, cache => P° Communication+ significant latency (100's of cycles) => L+ limited bandwidth (1 – 5% of memory bw) => g+ significant overhead (10's – 100's of cycles) => o- on both ends– no consensus on topology=> should not exploit structure+ limited capacity– no consensus on programming model=> should not enforce oneCuller 1997CS267 L28 Sort.7LogPInterconnection NetworkMPMPMP° ° °P ( processors )Limited Volume( L/ g to or from a proc)o (overhead)L (latency)og (gap)•Latency in sending a (small) mesage between modules•overhead felt by the processor on sending or receiving msg•gap between successive sends or receives (1/BW)•ProcessorsCuller 1997CS267 L28 Sort.8 Using the Model° Send n messages from proc to proc in time 2o + L + g(n-1)– each processor does o n cycles of overhead– has (g-o)(n-1) + L available compute cycles° Send n messages from one to many in same time° Send n messages from many to one in same time– all but L/g processors block so fewer available cycleso LooogLtimePPCuller 1997CS267 L28 Sort.9Use of the Model (cont)° Two processors sending n words to each other (i.e., exchange) in time2o + L + max(g,2o) (n-1)  max(g,2o) + L° P processors each sending n words to all processors (n/P each) in a static, balanced pattern without conflicts , e.g., transpose, fft, cyclic-to-block, block-to-cyclicsameexercise: what’s wrong with the formula above? Assumes optimal pattern of send/receive, so could underestimate timeCuller 1997CS267 L28 Sort.10LogP "philosophy"•Think about:•– mapping of N words onto P processors•– computation within a processor, its cost, and balance •– communication between processors, its cost, and balance•given a charaterization of processor and network performance•Do not think about what happens within the networkThis should be good enough!Culler 1997CS267 L28 Sort.11Typical SortExploits the n = N/P grouping° Significant local computation° Very general global communication / transformation° Computation of the transformationCuller 1997CS267 L28 Sort.12Split-CGlobal Address SpaceP0Pprocs-1P1local•Explicitly parallel C•2D global address space–linear ordering on local spaces•Local and Global pointers –spread arrays too•Read/Write•Get/Put (overap compute and comm)–x := G; . . .–sync();•Signaling store (one-way)–G :– x; . . .–store_sync(); or all_store_sync();•Bulk transfer •Global comm.Culler 1997CS267 L28 Sort.13Basic Costs of operations in Split-C•Read, Write x = *G, *G = x 2 (L + 2o)•Store *G :– x L + 2o•Get x := *G o .... 2L + 2o sync(); o–with interval g•Bulk store (n words with  words/message) 2o + (n-1)g + L•Exchange 2o + 2L + (n  L/g) max(g,2o)•One to many•Many to oneCuller 1997CS267 L28 Sort.14LogP model•CM5:–L = 6 µs–o = 2.2 µs–g = 4 µs–P varies from 32 to 1024•NOW–L = 8.9–o = 3.8–g = 12.8–P varies up to 100 •What is the processor performance?Culler 1997CS267 L28 Sort.15SortingCuller 1997CS267 L28 Sort.16Local Sort Performance (11 bit radix sort of 32 bits numbers)Log N/Pµs / Key0123456789100 5 1015 203125.116.910.46.2Entropy inKey ValuesEntropy = -i pi log pi , pi = Probability of key i<--------- TLB misses ---------->Culler 1997CS267 L28 Sort.17Local Computation Parameters - EmpiricalParameter Operation µs per key SortSwap Simulate cycle butterfly per key 0.025 lg N Bitonicmergesort Sort bitonic sequence 1.0scatter Move key for Cyclic-to-block 0.46gather Move key for Block-to-cyclic 0.52 if n<=64k or P<=64 Bitonic & Column 1.1 otherwiselocal sort Local radix sort (11 bit) 4.5 if n < 64K 9.0 - (281000/n)merge Merge sorted lists 1.5 Columncopy Shift Key 0.5zero Clear histogram bin 0.2 Radixhist produce histogram 1.2add produce scan value 1.0bsum adjust scan of bins 2.5address determine desitination 4.7compare compare key to splitter 0.9 Samplelocalsort8 local radix sort of samples 5.0Culler 1997CS267 L28 Sort.18Bottom Line (Preview)N/Pus/key0.0020.0040.0060.0080.00100.00120.00140.001638432768655361310722621445242881048576Bitonic 1024Bitonic 32Column 1024Column 32Radix 1024Radix 32Sample 1024Sample 32•Good fit between predicted and measured

View Full Document

Berkeley COMPSCI C267 - LogP and the Implementation and Modeling of Parallel Sorts

Sign up for free to view:

This document and 3 million+ documents and flashcards
High quality study guides, lecture notes, practice exams
Course Packets handpicked by editors offering a comprehensive review of your courses
Better Grades Guaranteed


School:
Email:
New Password:
Confirm Password:

This preview shows page 1-2-3-20-21-22-41-42-43 out of 43 pages.

Berkeley COMPSCI C267 - LogP and the Implementation and Modeling of Parallel Sorts

Sign up for free to view:

Please select your school