Contents
1. Pipelining vs. Parallel processing
2. Exploiting Parallelism
3-4. Exploiting Parallelism at the Instruction level
5. Exploiting Parallelism at the Instruction level (SIMD)
6. Intel SSE/SSE2 as an example of SIMD
7. Is it always that easy?
8. We first need to restructure the code
9. Then we can write SIMD code for the hot part
10. Thread level parallelism: Multi-Core Processors
11. Multi-Cores are Everywhere
12. Why Multi-cores Now?
13. ... and performance growing too ...
14. As programmers, do we care?
15. What if we want a program to run on both processors?
16. Fork/Join Logical Example
17. How does this help performance?
18-19. Reason #1: Amdahl's Law
20. Reason #2: Overhead
21. Programming Explicit Thread-level Parallelism
22. Performance Optimization
23. Summary

Slide 1: Pipelining vs. Parallel processing
- In both cases, multiple "things" are processed by multiple "functional units".
- Pipelining: each thing is broken into a sequence of pieces, where each piece is handled by a different (specialized) functional unit.
- Parallel processing: each thing is processed entirely by a single functional unit.
- We will briefly introduce the key ideas behind parallel processing:
  - instruction-level parallelism
  - thread-level parallelism
- After Thanksgiving, we will look at how the hardware handles this.

Slide 2: Exploiting Parallelism
- Of the computing problems for which performance is important, many have inherent parallelism.
- Best example: computer games.
  - Graphics, physics, sound, AI, etc. can be done separately.
  - Furthermore, there is often parallelism within each of these:
    - The color of each pixel on the screen can be computed independently.
    - Non-contacting objects can be updated/simulated independently.
    - The artificial intelligence of non-human entities can be computed independently.
- Another example: Google queries.
  - Every query is independent.
  - Google is read-only!!

Slides 3-4: Exploiting Parallelism at the Instruction level
Consider adding together two arrays:

    void array_add(int A[], int B[], int C[], int length) {
      int i;
      for (i = 0; i < length; ++i) {
        C[i] = A[i] + B[i];
      }
    }

This operates on one element at a time.

Slide 5: Exploiting Parallelism at the Instruction level (SIMD)
Same array_add code, but instead of operating on one element at a time, operate on MULTIPLE elements with each instruction: Single Instruction, Multiple Data (SIMD).

Slide 6: Intel SSE/SSE2 as an example of SIMD
- Added new 128-bit registers (XMM0-XMM7), each of which can store:
  - 4 single-precision FP values (SSE): 4 * 32b
  - 2 double-precision FP values (SSE2): 2 * 64b
  - 16 byte values (SSE2): 16 * 8b
  - 8 word values (SSE2): 8 * 16b
  - 4 double-word values (SSE2): 4 * 32b
  - 1 128-bit integer value (SSE2): 1 * 128b
- (Figure: a packed single-precision add; e.g. the four 32-bit lanes (4.0, 4.0, 3.5, 2.3) plus (-1.5, 2.0, 1.7, -2.0) give (2.5, 6.0, 5.2, 0.3), all computed by one instruction.)

Slide 7: Is it always that easy?
Not always ... a more challenging example:

    unsigned sum_array(unsigned *array, int length) {
      unsigned total = 0;
      for (int i = 0; i < length; ++i) {
        total += array[i];
      }
      return total;
    }

Is there parallelism here? (Each iteration adds into the same running total, so the iterations are not independent.)

Slide 8: We first need to restructure the code

    unsigned sum_array2(unsigned *array, int length) {
      unsigned total, i;
      unsigned temp[4] = {0, 0, 0, 0};
      /* Note the parentheses: in C, "i < length & ~0x3" would parse as
         "(i < length) & ~0x3", which is always 0. */
      for (i = 0; i < (length & ~0x3); i += 4) {
        temp[0] += array[i];
        temp[1] += array[i+1];
        temp[2] += array[i+2];
        temp[3] += array[i+3];
      }
      total = temp[0] + temp[1] + temp[2] + temp[3];
      for ( ; i < length; ++i) {
        total += array[i];
      }
      return total;
    }

Slide 9: Then we can write SIMD code for the hot part
(The same sum_array2 code, with the first loop as the hot part: the four temp[] accumulators fit the four 32-bit lanes of one XMM register, so the unrolled loop body becomes a single packed add.)

Slide 10: Thread level parallelism: Multi-Core Processors
- Two (or more) complete processors, fabricated on the same silicon chip.
- Execute instructions from two (or more) programs/threads at the same time.
- (Pictured: cores #1 and #2 on an IBM Power5 die.)

Slide 11: Multi-Cores are Everywhere
- Intel Core Duo in new Macs: 2 x86 processors on the same chip.
- Xbox 360: 3 PowerPC cores.
- Sony PlayStation 3: Cell processor, an asymmetric multi-core with 9 cores (1 general-purpose, 8 special-purpose SIMD processors).

Slide 12: Why Multi-cores Now?
The number of transistors we can put on a chip is growing exponentially ...

Slide 13: ... and performance is growing too ...
But power is growing even faster!!
- Power has become the limiting factor in current chips.

Slide 14: As programmers, do we care?
What happens if we run a program on a multi-core?

    void array_add(int A[], int B[], int C[], int length) {
      int i;
      for (i = 0; i < length; ++i) {
        C[i] = A[i] + B[i];
      }
    }

Slide 15: What if we want a program to run on both processors?
- We have to explicitly tell the machine exactly how to do this.
  - This is called parallel programming or concurrent programming.
- There are many parallel/concurrent programming models.
  - We will look at a relatively simple one: fork/join parallelism.
  - In CS241, you learn about POSIX threads and explicit synchronization.

Slide 16: Fork/Join Logical Example
1. Fork N-1 threads.
2. Break the work into N pieces (and do it).
3. Join the (N-1) threads.

    /* Pseudocode: fork(N-1) spawns N-1 extra threads and returns each
       thread's id (0..N-1); join() waits for the other threads. */
    void array_add(int A[], int B[], int C[], int length) {
      int cpu_num = fork(N-1);
      int i;
      for (i = cpu_num; i < length; i += N) {
        C[i] = A[i] + B[i];
      }
      join();
    }

How good is this with caches?

Slide 17: How does this help performance?
Parallel speedup measures the improvement from parallelization:

    speedup(p) = time for best serial version / time for version with p processors

What can we realistically expect?

Slide 18: Reason #1: Amdahl's Law
In general, the whole computation is not (easily) parallelizable.
(Figure: an execution timeline with serial regions separating the parallelizable regions.)

Slide 19: Reason #1: Amdahl's Law (continued)
- Suppose a program takes 1 unit of time to execute serially.
- A fraction of the program, s, is inherently serial (unparallelizable), so on P processors:

    New Execution Time = (1 - s)/P + s

- For example, consider a program that, when executing on one processor, spends 10% of its time in a non-parallelizable region. How much faster will this program run on a 3-processor system?

    New Execution Time = 0.9T/3 + 0.1T = 0.4T, so Speedup = 2.5

- What is the maximum speedup from parallelization? (Even with unlimited processors, the speedup cannot exceed 1/s.)