MIT 6.854 - Parallel Algorithms

Parallel Algorithms

Two closely related models of parallel computation.

Circuits
• logic gates (AND/OR/NOT) connected by wires
• important measures
  – number of gates
  – depth (clock cycles in a synchronous circuit)

PRAM
• P processors, each with a RAM and local registers
• global memory of M locations
• each processor can in one step do a RAM op or read/write to one global memory location
• synchronous parallel steps
• not realistic, but explores the "degree of parallelism"

Essentially the same models, but they let us focus on different things.

Circuits
• bounded vs. unbounded fan-in/out
• AC(k) and NC(k): unbounded and bounded fan-in, respectively, with depth O(log^k n) for problems of size n
• NC(k) ⊂ AC(k) ⊂ NC(k+1): an unbounded fan-in gate can be replaced by a full binary tree of depth O(log n)
• NC = ∪NC(k) = ∪AC(k)

Addition
• consider adding a_i and b_i with carry c_{i-1} to produce output s_i and next carry c_i
• ripple carry: O(n) gates, O(n) time
• carry lookahead: O(n) gates, O(log n) time
• trick: preplan for the late arrival of the incoming carry
• given a_i and b_i, there are three possible cases for c_i:
  – if a_i = b_i, then c_i = a_i, determined without c_{i-1}: generate (c_i = 1) or kill (c_i = 0)
  – otherwise, propagate: c_i = c_{i-1}
  – write x_i = k, g, or p accordingly
• consider the 3 × 3 "multiplication table" for the effect of two adders in a row.
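The reduction can be sanity-checked sequentially. Below is a minimal Python sketch (my own illustrative code, not from the notes): it classifies each bit position as generate/kill/propagate, folds the labels left to right using the composition rule worked out next (a later label wins unless it propagates), and compares the result with ripple carry.

```python
def label(a, b):
    """Per-position label x_i: 'g'enerate, 'k'ill, or 'p'ropagate."""
    if a == b:
        return 'g' if a == 1 else 'k'
    return 'p'

def compose(y_prev, x):
    """Compose y_{i-1} with x_i: the newer label wins unless it propagates."""
    return y_prev if x == 'p' else x

def carries(a_bits, b_bits):
    """All carries c_i (y_i = 'k' means c_i = 0, y_i = 'g' means c_i = 1)."""
    y = 'k'                      # y_0 = k: no carry into position 0
    out = []
    for a, b in zip(a_bits, b_bits):
        y = compose(y, label(a, b))
        out.append(1 if y == 'g' else 0)
    return out

# cross-check against ripple carry (bits are little-endian)
a, b = [1, 1, 0, 1], [1, 0, 1, 1]
c, ripple = 0, []
for x, y in zip(a, b):
    c = 1 if x + y + c >= 2 else 0
    ripple.append(c)
assert carries(a, b) == ripple
```

In the circuit, the fold over `compose` is not done sequentially; it is exactly the parallel-prefix computation described below.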
• a pair of adders propagates the previous carry only if both propagate:

  x_{i-1} \ x_i |  k  p  g
  --------------+----------
        k       |  k  k  g
        p       |  k  p  g
        g       |  k  g  g

• now let y_0 = k and y_i = y_{i-1} × x_i
• by induction, the "multiplication table" for y_i is constrained (y_{i-1} = p never occurs):

  y_{i-1} \ x_i |  k    p     g
  --------------+----------------
        k       |  k    k     g
        p       |  k  never   g
        g       |  k    g     g

• conclude: y_i = k means c_i = 0, y_i = g means c_i = 1, and y_i = p never happens
• so the problem is reduced to computing all the y_i in parallel

Parallel prefix
• build a full binary tree over the x_i
• two gates at each node
• pass up the product of all children
• pass down the product of all x_i preceding the leftmost child
• works for any associative function

PRAM

Various conflict-resolution rules (CREW, EREW, CRCW).
• CRCW(k) ⊂ EREW(k+1)
• NC = ∪CRCW(k)
• PRAMs simulate circuits, and vice versa
• so NC is well-defined

Differences in practice:
• EREW needs log n time to find the max (information-theoretic lower bound)
• CRCW finds the max in constant time with n^2 processors:
  – compare every pair
  – if an item loses, write "not max" in its entry
  – check all entries
  – if an item is the max (not overwritten), write its value in the answer
• in O(log log n) time with n processors:
  – suppose k items remain
  – make k^2/n blocks of n/k items each
  – quadratic-time max for each block: (k^2/n)(n/k)^2 = n processors total
  – recurrence: T(k) = 1 + T(k^2/n)
  – so T(n/2^{2^i}) = 1 + T(n/2^{2^{i+1}})
  – hence log log n iterations

Parallel prefix
• using n processors

List ranking (EREW)
• next pointers n(x)
• repeat: d(x) += d(n(x)); n(x) = n(n(x))
• by induction, the sum of the values on the path to the end doesn't change

0.1 Work-Efficient Algorithms

Idea:
• we've seen parallel algorithms that are somewhat "inefficient"
• they do more total work (processors × time) than the sequential algorithm
• ideal solution: arrange for the total work to be proportional to the best sequential work (a work-efficient algorithm)
• then a small number of processors (or even one) can "simulate" many processors in a fully efficient way
• parallel analogue of a "cache-oblivious algorithm": you write the algorithm once for many processors and lose nothing when it gets simulated on fewer

Brent's theorem
• different perspective on work: count the number of processors actually working in each time step
• if the algorithm does x total work and has
critical path t:
• then p processors take x/p + t time
• so, using p = x/t processors, we finish in O(t) time with an efficient algorithm

Work-efficient parallel prefix
• linear sequential work
• going to need log n time
• so, aim to get by with n/log n processors
• give each processor a block of log n items to add up
• this reduces the problem to n/log n values
• use the old algorithm on those
• each processor then fixes up the prefixes for its block

Work-efficient list ranking
• harder: we can't easily give contiguous "blocks" of log n items to each processor (that itself requires list ranking)
• however, the items sit in arbitrary order in an array of n structs, so we can give log n distinct items to each processor
• use random coin flips to knock out "alternate" items
• shortcut any item that is heads and has a tails successor
• requires at most one shortcut per item
• every other item is shortcut with constant probability, and these events are (sufficiently) independent
• so by Chernoff, at least 1/16 of the items are shortcut out (w.h.p.)
• "compact" the remaining items into a smaller array using parallel prefix on the array of pointers (ignoring the list structure) to collect only the "marked" nodes and update their pointers
• let each processor handle log n (arbitrary) items
• O(n/log n) processors, O(log n) time per round
• after O(log log n) rounds, the number of items is reduced to n/log n
• use the old algorithm
• result: O(log n log log n) time, n/log n processors
• to improve, use a faster "compaction" algorithm to collect the marked nodes: O(log log n) time randomized, or O(log n/log log n) deterministic; this gives an optimal algorithm
• how about a deterministic algorithm?
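The three steps of the work-efficient parallel prefix above can be simulated sequentially. A hedged sketch (block size and function names are my own choices, not from the notes):

```python
import math

def blocked_prefix(xs):
    """Prefix sums via: block sums -> prefix over block sums -> local fix-up."""
    n = len(xs)
    if n == 0:
        return []
    b = max(1, round(math.log2(n)))              # block size ~ log n
    blocks = [xs[i:i + b] for i in range(0, n, b)]
    # step 1: each "processor" sums its own block of ~log n items
    sums = [sum(blk) for blk in blocks]
    # step 2: prefix over the ~n/log n block sums
    # (the old tree algorithm, simulated sequentially here)
    pre, t = [], 0
    for s in sums:
        t += s
        pre.append(t)
    # step 3: each processor fixes up the prefixes inside its block
    out = []
    for j, blk in enumerate(blocks):
        acc = 0 if j == 0 else pre[j - 1]
        for x in blk:
            acc += x
            out.append(acc)
    return out

# in the PRAM model: O(log n) time with n/log n processors, linear work
assert blocked_prefix(list(range(10))) == [0, 1, 3, 6, 10, 15, 21, 28, 36, 45]
```

Steps 1 and 3 cost O(log n) time on n/log n processors, and step 2 runs the old O(log n)-time algorithm on only n/log n values, so the total work stays linear.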
Deterministic coin tossing
• take all local maxima as part of the ruling set

Euler tour
• reduces computing depth in a tree to parallel prefix
• work efficient

Expression Tree Evaluation

Plus and times nodes. Generalize the problem:
• each tree edge has a label (a, b)
• meaning: if the subtree below evaluates to y, then the value ay + b should be passed up the edge

Idea: pointer jumping
• prune a leaf
• now we can pointer-jump its parent
• problem: we don't want to disconnect the tree (we need to feed all the data to the root!)
• solution: number the leaves in-order
• three-step process:
  – shunt odd-numbered left-child leaves
  – shunt odd-numbered right-child leaves
  – divide the leaf numbers by 2

Consider a tree fragment:
• method for eliminating all left-child leaves
• root q with left child p (a product node) on an edge labeled (a_3, b_3)
• p has a left child edge (a_1, b_1) to leaf ℓ with value v
• and a right child edge to s with label (a_2, b_2)
• fold out p and ℓ, making s a child of q
• what is the label of the new edge?
• prepare for the s subtree to evaluate to y
• choose a, b such that ay + b = a_3 · [(a_1 v + b_1) · (a_2 y + b_2)] + b_3

0.2 Sorting

CREW merge sort:
• merge two length-k sequences using n processors
• each element of the first sequence uses binary search to find its place in the second
• so it knows how many items are smaller
• so it knows its rank in the merged sequence: go there
• then do the same for the second list
• O(log k) time with n processors
• total time O(∑_{i ≤ lg n} log 2^i) = O(log^2 n)
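The merge step can be sketched directly: each element independently binary-searches the other list for its rank, which is what gives O(log k) time with one processor per element. An illustrative sequential simulation (`crew_merge` is my name, not from the notes):

```python
import bisect

def crew_merge(xs, ys):
    """Merge two sorted lists; every element's output slot is computed
    independently, as one 'processor' per element would on a CREW PRAM."""
    out = [None] * (len(xs) + len(ys))
    for i, x in enumerate(xs):
        # rank = own index + number of ys-items strictly smaller than x
        out[i + bisect.bisect_left(ys, x)] = x
    for j, y in enumerate(ys):
        # bisect_right breaks ties, so equal elements get distinct slots
        out[j + bisect.bisect_right(xs, y)] = y
    return out
```

Reads of the other list may overlap (concurrent read), but every element writes a distinct slot (exclusive write), which is exactly the CREW regime.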

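Returning to the expression-tree shunt: expanding ay + b = a_3 · [(a_1 v + b_1) · (a_2 y + b_2)] + b_3 with w = a_1 v + b_1 gives a = a_3 a_2 w and b = a_3 b_2 w + b_3. A small numeric check (all names illustrative):

```python
def shunt_label(a1, b1, a2, b2, a3, b3, v):
    """New label (a, b) after folding leaf value v through product node p."""
    w = a1 * v + b1                 # value contributed by the pruned leaf edge
    return (a3 * a2 * w, a3 * b2 * w + b3)

# the shortcut edge must agree with the original fragment for every y
a1, b1, a2, b2, a3, b3, v = 2, 1, 3, -1, 5, 4, 7
a, b = shunt_label(a1, b1, a2, b2, a3, b3, v)
for y in range(-3, 4):
    assert a * y + b == a3 * ((a1 * v + b1) * (a2 * y + b2)) + b3
```

A sum node at p works the same way with the product inside the brackets replaced by a sum, which again stays linear in y; closure under these (a, b) labels is what lets the shunt proceed round after round.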
