U of I CS 232 - Lecture notes

Contents:
  1. Pipelining vs. Parallel processing
  2. Exploiting Parallelism
  3-4. Exploiting Parallelism at the Instruction level
  5. Exploiting Parallelism at the Instruction level (SIMD)
  6. Intel SSE/SSE2 as an example of SIMD
  7. Is it always that easy?
  8. We first need to restructure the code
  9. Then we can write SIMD code for the hot part
  10. Thread level parallelism: Multi-Core Processors
  11. Multi-Cores are Everywhere
  12. Why Multi-cores Now?
  13. … and performance growing too…
  14. As programmers, do we care?
  15. What if we want a program to run on both processors?
  16. Fork/Join Logical Example
  17. How does this help performance?
  18-19. Reason #1: Amdahl's Law
  20. Reason #2: Overhead
  21. Programming Explicit Thread-level Parallelism
  22. Performance Optimization
  23. Summary

Slide 1: Pipelining vs. Parallel processing
In both cases, multiple "things" are processed by multiple "functional units."
— Pipelining: each thing is broken into a sequence of pieces, where each piece is handled by a different (specialized) functional unit.
— Parallel processing: each thing is processed entirely by a single functional unit.
We will briefly introduce the key ideas behind parallel processing:
— instruction-level parallelism
— thread-level parallelism
After Thanksgiving, we will look at how the hardware handles this.

Slide 2: Exploiting Parallelism
Of the computing problems for which performance is important, many have inherent parallelism.
Best example: computer games
— Graphics, physics, sound, AI, etc. can be done separately.
— Furthermore, there is often parallelism within each of these:
  • The color of each pixel on the screen can be computed independently.
  • Non-contacting objects can be updated/simulated independently.
  • Artificial intelligence of non-human entities is done independently.
Another example: Google queries
— Every query is independent.
— Google is read-only!!

Slides 3-4: Exploiting Parallelism at the Instruction level
Consider adding together two arrays:

    void array_add(int A[], int B[], int C[], int length) {
      int i;
      for (i = 0; i < length; ++i) {
        C[i] = A[i] + B[i];
      }
    }

[Figure: a single adder.] Operating on one element at a time.

Slide 5: Exploiting Parallelism at the Instruction level (SIMD)
The same array_add code as above, but now: operate on MULTIPLE elements at once.
[Figure: several additions performed in parallel.]
Single Instruction, Multiple Data (SIMD).

Slide 6: Intel SSE/SSE2 as an example of SIMD
• Added new 128-bit registers (XMM0 – XMM7), each of which can store:
— 4 single-precision FP values (SSE): 4 * 32b
— 2 double-precision FP values (SSE2): 2 * 64b
— 16 byte values (SSE2): 16 * 8b
— 8 word values (SSE2): 8 * 16b
— 4 double-word values (SSE2): 4 * 32b
— 1 128-bit integer value (SSE2): 1 * 128b
[Figure: a packed single-precision add, where two XMM registers holding four 32-bit floats each are added element-wise in one instruction.]

Slide 7: Is it always that easy?
Not always… a more challenging example:

    unsigned sum_array(unsigned *array, int length) {
      int total = 0;
      for (int i = 0; i < length; ++i) {
        total += array[i];
      }
      return total;
    }

Is there parallelism here?

Slide 8: We first need to restructure the code

    unsigned sum_array2(unsigned *array, int length) {
      unsigned total, i;
      unsigned temp[4] = {0, 0, 0, 0};
      /* main loop: four independent partial sums;
         length & ~0x3 rounds length down to a multiple of 4 */
      for (i = 0; i < (length & ~0x3); i += 4) {
        temp[0] += array[i];
        temp[1] += array[i+1];
        temp[2] += array[i+2];
        temp[3] += array[i+3];
      }
      total = temp[0] + temp[1] + temp[2] + temp[3];
      /* handle any leftover elements */
      for ( ; i < length; ++i) {
        total += array[i];
      }
      return total;
    }

Slide 9: Then we can write SIMD code for the hot part
The restructured code from Slide 8 is shown again; the four independent partial-sum updates in the main loop are the hot part that SIMD instructions can perform in a single operation.
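The SSE code itself does not appear in this text preview. As an illustrative sketch only, not the code from the slides (the name sum_array_sse2 and all details are assumptions), the hot loop above could be written with SSE2 integer intrinsics roughly as follows:

    #include <emmintrin.h>   /* SSE2 intrinsics */

    unsigned sum_array_sse2(unsigned *array, int length) {
      __m128i vsum = _mm_setzero_si128();   /* four packed 32-bit partial sums, all zero */
      unsigned temp[4];
      unsigned total;
      int i;
      for (i = 0; i < (length & ~0x3); i += 4) {
        /* load four 32-bit elements (unaligned) and add them lane by lane */
        __m128i v = _mm_loadu_si128((const __m128i *)(array + i));
        vsum = _mm_add_epi32(vsum, v);
      }
      _mm_storeu_si128((__m128i *)temp, vsum);   /* spill the four partial sums */
      total = temp[0] + temp[1] + temp[2] + temp[3];
      for ( ; i < length; ++i) {                 /* leftover elements, as before */
        total += array[i];
      }
      return total;
    }

Each _mm_add_epi32 does the same work as the four scalar additions in sum_array2's main loop, one 128-bit operation per iteration.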
Slide 10: Thread level parallelism: Multi-Core Processors
Two (or more) complete processors, fabricated on the same silicon chip.
Execute instructions from two (or more) programs/threads at the same time.
[Figure: IBM Power5, with two processor cores labeled #1 and #2.]

Slide 11: Multi-Cores are Everywhere
— Intel Core Duo in new Macs: 2 x86 processors on the same chip.
— XBox360: 3 PowerPC cores.
— Sony PlayStation 3: Cell processor, an asymmetric multi-core with 9 cores (1 general-purpose, 8 special-purpose SIMD processors).

Slide 12: Why Multi-cores Now?
The number of transistors we can put on a chip is growing exponentially…

Slide 13: … and performance growing too…
But power is growing even faster!!
— Power has become the limiting factor in current chips.

Slide 14: As programmers, do we care?
What happens if we run this program on a multi-core?

    void array_add(int A[], int B[], int C[], int length) {
      int i;
      for (i = 0; i < length; ++i) {
        C[i] = A[i] + B[i];
      }
    }

[Figure: two processor cores, #1 and #2.]

Slide 15: What if we want a program to run on both processors?
We have to explicitly tell the machine exactly how to do this.
— This is called parallel programming or concurrent programming.
There are many parallel/concurrent programming models:
— We will look at a relatively simple one: fork/join parallelism.
— In CS 241, you learn about POSIX threads and explicit synchronization.

Slide 16: Fork/Join Logical Example
1. Fork N-1 threads.
2. Break the work into N pieces (and do it).
3. Join the (N-1) threads.

    void array_add(int A[], int B[], int C[], int length) {
      cpu_num = fork(N-1);          /* pseudocode: spawn N-1 worker threads */
      int i;
      for (i = cpu_num; i < length; i += N) {   /* each thread takes every Nth element */
        C[i] = A[i] + B[i];
      }
      join();                       /* pseudocode: wait for the other threads */
    }

How good is this with caches?
(A POSIX threads sketch of this pattern appears at the end of these notes.)

Slide 17: How does this help performance?
Parallel speedup measures the improvement from parallelization:

    speedup(p) = time for best serial version / time for version with p processors

What can we realistically expect?

Slide 18: Reason #1: Amdahl's Law
In general, the whole computation is not (easily) parallelizable.
[Figure: execution timeline with serial regions separating the parallelizable regions.]

Slide 19: Reason #1: Amdahl's Law (continued)
Suppose a program takes 1 unit of time to execute serially.
A fraction of the program, s, is inherently serial (unparallelizable).

    New execution time = (1 - s)/P + s

For example, consider a program that, when executing on one processor, spends 10% of its time in a non-parallelizable region. How much faster will this program run on a 3-processor system?

    New execution time = 0.9T/3 + 0.1T = ?      Speedup = ?

What is the maximum speedup from parallelization?
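Working the example through:

    New execution time = 0.9T/3 + 0.1T = 0.3T + 0.1T = 0.4T
    Speedup = 1T / 0.4T = 2.5

As P grows, the new execution time approaches the serial fraction s = 0.1T, so the maximum speedup from parallelization is 1/s = 10.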
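The fork()/join() calls on Slide 16 are pseudocode rather than real C. As an illustrative sketch only (the names array_add_parallel, worker, thread_arg, and NUM_THREADS are assumptions, not from the slides), the same interleaved decomposition can be expressed with the POSIX threads API mentioned on Slide 15:

    #include <pthread.h>

    #define NUM_THREADS 2    /* assumes at least 2 */

    struct thread_arg {
      int *A, *B, *C;
      int length;
      int id;                /* which interleaved slice this thread handles */
    };

    static void *worker(void *p) {
      struct thread_arg *arg = p;
      for (int i = arg->id; i < arg->length; i += NUM_THREADS)
        arg->C[i] = arg->A[i] + arg->B[i];
      return NULL;
    }

    void array_add_parallel(int A[], int B[], int C[], int length) {
      pthread_t threads[NUM_THREADS - 1];
      struct thread_arg args[NUM_THREADS];

      /* "Fork" N-1 threads; the calling thread handles slice 0 itself. */
      for (int t = 1; t < NUM_THREADS; ++t) {
        args[t] = (struct thread_arg){A, B, C, length, t};
        pthread_create(&threads[t - 1], NULL, worker, &args[t]);
      }
      args[0] = (struct thread_arg){A, B, C, length, 0};
      worker(&args[0]);

      /* "Join" the N-1 threads. */
      for (int t = 1; t < NUM_THREADS; ++t)
        pthread_join(threads[t - 1], NULL);
    }

Each thread handles the elements whose index is congruent to its id modulo NUM_THREADS, exactly as in the pseudocode; pthread_create plays the role of fork and pthread_join the role of join.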