15-499: Parallel Algorithms
Lecturer: Guy Blelloch
Topic: Introduction
Date: Jan 13, 2009
Scribe: Kanat, edited by Harsha Simhadri

1.1 Parallelism is ubiquitous

Parallelism exists at most levels in computers, from logic gates up to the level of the internet. Parallelism can be found in:

• Logic gates: executing any processor instruction involves simultaneous operations at millions of logic gates.
• Pipelining: a processor typically executes several instructions at a time to make efficient use of all the functional units on the chip.
• MMX/vector instructions: some chips, especially those built for graphics processing, are designed to do vector arithmetic in parallel.
• Hyperthreading: running multiple threads on the same core to use multiple functional units and hide latency.
• Multi-cores.
• Shared memory systems.
• Clusters.
• The internet.

This course deals primarily with parallelism at the shared memory systems level and above, and places its emphasis on parallelism in multi-core systems.

1.2 The future of the chip

Processor design has already hit the limit in terms of clock speed, and increasing the number of processing units is the only evident way to increase processor performance. The number of cores per processor might soon go into the hundreds, which will require that programs exhibit a high degree of parallelism.

Parallelism also helps increase the performance-to-power ratio. Decreasing the clock speed and increasing the number of cores in current processors beats increasing clock speeds in terms of performance-power ratio by a significant margin. This makes the case for using parallelism as a tool to build chips for low-power portable devices, in addition to the high-performance computing settings where it is currently used.

1.3 The sequential machine model

When thinking about sequential algorithms, we normally use the RAM model. In this model, we assume that the machine executes one instruction per time step and can access any memory location within that time step. While this does not model real-world systems, which have large, non-uniform memory access latencies, it simplifies algorithm design and analysis. In the next lecture, we will look at possible choices for parallel machine models.

1.4 Finding Parallelism

Let us look at how two common algorithmic problems can be parallelized.

1.4.1 Matrix Multiply

To multiply two n × n matrices A and B, a normal sequential program would do:

  for i = 1 to n do
    for j = 1 to n do
      C_ij = 0
      for k = 1 to n do
        C_ij = C_ij + A_ik · B_kj
      end for
    end for
  end for

This algorithm requires O(n^3) time, because each of the three loops is executed sequentially. However, we note that the two outer loops can be completely parallelized, since the computations for different C_ij are completely independent. We can thus run n^2 independent parallel loops, one for each C_ij, 1 ≤ i, j ≤ n. This would still mean that we need at least O(n) steps to multiply A and B, even if we have arbitrarily large parallel computing power.

We now note that C_ij is the dot product of row A_i and column B_j, and that the innermost loop of the above algorithm is just one way to implement a dot product. However, we can make use of the associativity of addition to implement the dot product much faster. To compute a_i1·b_1j + a_i2·b_2j + ... + a_in·b_nj, instead of using the order (((a_i1·b_1j + a_i2·b_2j) + a_i3·b_3j) + ... + a_in·b_nj), we can use the recursive addition order indicated in Figure 1.4.1. The products a_ik·b_kj, 1 ≤ k ≤ n, form the leaves of the tree, and the value of each internal node is computed recursively by summing the values of its children. This recursive computation takes only O(log n) steps given enough computing power. Thus, we have a parallel algorithm that can multiply two matrices in logarithmic time.

[Figure 1.4.1: A perfect binary tree for summing n = 8 numbers; the leaves are a_1, ..., a_8 and each internal node is a + combining its two children.]
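To make these two ideas concrete (independent C_ij computations, plus a tree-shaped sum for each dot product), here is a minimal sketch in Go. It is only an illustration, not code from the course: the names treeDot and parMatMul are invented for this example, and goroutines stand in for the "enough processors" assumption made above.

package main

import (
    "fmt"
    "sync"
)

// treeDot computes the dot product of row and col over indices [lo, hi)
// using the balanced, tree-shaped addition order of Figure 1.4.1. The two
// halves are independent, so one is evaluated in a separate goroutine;
// with enough processors the recursion is only O(log n) deep.
func treeDot(row, col []float64, lo, hi int) float64 {
    if hi-lo == 1 {
        return row[lo] * col[lo] // a leaf of the tree: a_ik · b_kj
    }
    mid := (lo + hi) / 2
    var left, right float64
    var wg sync.WaitGroup
    wg.Add(1)
    go func() { // left subtree runs in parallel with the right subtree
        defer wg.Done()
        left = treeDot(row, col, lo, mid)
    }()
    right = treeDot(row, col, mid, hi)
    wg.Wait()
    return left + right // an internal "+" node of the tree
}

// parMatMul multiplies two n x n matrices. The n^2 entries C_ij are
// independent, so each one is computed in its own goroutine.
func parMatMul(a, b [][]float64) [][]float64 {
    n := len(a)
    // Extract the columns of b so each dot product works on a plain slice.
    cols := make([][]float64, n)
    for j := 0; j < n; j++ {
        cols[j] = make([]float64, n)
        for k := 0; k < n; k++ {
            cols[j][k] = b[k][j]
        }
    }
    c := make([][]float64, n)
    var wg sync.WaitGroup
    for i := 0; i < n; i++ {
        c[i] = make([]float64, n)
        for j := 0; j < n; j++ {
            wg.Add(1)
            go func(i, j int) {
                defer wg.Done()
                c[i][j] = treeDot(a[i], cols[j], 0, n)
            }(i, j)
        }
    }
    wg.Wait()
    return c
}

func main() {
    a := [][]float64{{1, 2}, {3, 4}}
    b := [][]float64{{5, 6}, {7, 8}}
    fmt.Println(parMatMul(a, b)) // prints [[19 22] [43 50]]
}

A real implementation would not spawn a goroutine per addition; it would stop the recursion at some grain size and fall back to a sequential loop. The point here is only that the recursion has exactly the shape of the binary tree in Figure 1.4.1.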
1.4.2 Quick sort

Quick sort is a popular divide-and-conquer, comparison-based sorting algorithm. When the pivots are chosen uniformly at random, we know that quick sort makes O(n log n) comparisons with high probability [1]. Below is pseudo-code for sequential quick sort:

  QSort(A) =
    if |A| ≤ 1 return A
    else
      p = A[rand(|A|)]
      return QSort({x ∈ A : x < p}) ++ {p} ++ QSort({x ∈ A : x > p})

We see that the two recursive calls made at each level of the recursion are independent and can be done in parallel. Assuming that quick sort splits the array in half every time, this gives us a parallel algorithm that takes O(n) steps, obtained by solving the recurrence T(n) = T(n/2) + n, where the n steps at each level of the recursion are needed to partition the elements larger than p from those smaller than p. With a more careful analysis, we can show that this version of randomized quick sort takes O(n) time to complete with high probability.

After all this is not so impressive: having infinitely many processors yields only a log n speedup. When we examine this code further, we see that we could also parallelize the partitioning routine. As we will see later in the course, partitioning (i.e., constructing the array {x ∈ A : x < p}) can be accomplished in parallel in time O(log n), given enough computing power. This gives us a version of parallel quick sort which completes in O(log^2 n) time steps with high probability.

[1] An event E occurs with high probability if Pr[E] ≥ 1 − O(n^(-c)) for some constant c ≥ 1.
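The parallel structure of QSort can be sketched in Go in the same way. Again, this is only an illustration and not code from the course: qsort is an invented name, the partition is still a sequential loop over A (so the span of this version is O(n), not O(log^2 n)), and duplicate copies of the pivot are collected in a separate bucket, a detail the set-notation pseudocode glosses over.

package main

import (
    "fmt"
    "math/rand"
    "sync"
)

// qsort mirrors the QSort pseudocode: pick a random pivot, partition,
// and sort the two sides. The two recursive calls are independent, so
// one of them runs in its own goroutine.
func qsort(a []int) []int {
    if len(a) <= 1 {
        return a
    }
    p := a[rand.Intn(len(a))] // random pivot, as in the pseudocode
    var less, equal, greater []int
    for _, x := range a { // sequential partition: the O(n) bottleneck
        switch {
        case x < p:
            less = append(less, x)
        case x > p:
            greater = append(greater, x)
        default:
            equal = append(equal, x) // keep duplicates of the pivot
        }
    }
    var left, right []int
    var wg sync.WaitGroup
    wg.Add(1)
    go func() { // sort {x < p} in parallel with {x > p}
        defer wg.Done()
        left = qsort(less)
    }()
    right = qsort(greater)
    wg.Wait()
    return append(append(left, equal...), right...) // left ++ equal ++ right
}

func main() {
    fmt.Println(qsort([]int{5, 3, 8, 1, 9, 2, 7, 4, 6})) // [1 2 3 4 5 6 7 8 9]
}

Getting from here to the O(log^2 n) version mentioned above requires replacing the sequential partition loop with a parallel one, which is the subject of a later lecture.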

