U of I CS 232 - Performance Optimization, cont.

November 30, 2007

Performance Optimization, cont.

How do we fix performance problems?

How do we improve performance?

Imagine you want to build a house. How long would it take you? What could you do to build that house faster?

Exploiting Parallelism

Of the computing problems for which performance is important, many have inherent parallelism. E.g., computer games:
- Graphics, physics, sound, A.I., etc. can be done separately.
- Furthermore, there is often parallelism within each of these:
  - The color of each pixel on the screen can be computed independently.
  - Non-contacting objects can be updated/simulated independently.
  - The artificial intelligence of non-human entities can be done independently.
E.g., Google queries:
- Every query is independent.
- Google searches are read-only!

Exploiting Parallelism at the Instruction Level (SIMD)

Consider adding together two arrays:

    void array_add(int A[], int B[], int C[], int length) {
      int i;
      for (i = 0; i < length; ++i) {
        C[i] = A[i] + B[i];
      }
    }

You could write assembly for this, something like:

    lw   $t0, 0($a0)    # load A[i]
    lw   $t1, 0($a1)    # load B[i]
    add  $t2, $t0, $t1  # C[i] = A[i] + B[i]
    sw   $t2, 0($a2)    # store C[i]

(plus all of the address arithmetic, plus the loop control)

This scalar code operates on one element at a time. With Single Instruction, Multiple Data (SIMD) hardware, a single instruction instead operates on multiple elements at once.

Intel SSE/SSE2 as an example of SIMD

SSE added new 128-bit registers (XMM0 - XMM7), each of which can store:
- 4 single-precision FP values (SSE): 4 x 32b
- 2 double-precision FP values (SSE2): 2 x 64b
- 16 byte values (SSE2): 16 x 8b
- 8 word values (SSE2): 8 x 16b
- 4 double-word values (SSE2): 4 x 32b
- 1 128-bit integer value (SSE2): 1 x 128b

[Figure: packed single-precision addition; two XMM registers each holding four 32-bit floats are summed lane by lane by one instruction.]

SIMD Extensions

More than 70 instructions. Arithmetic operations supported: addition, subtraction, multiplication, division, square root, maximum, minimum. Can operate on floating-point or integer data.

Annotated SSE code for the array-add loop

    movdqa (%eax,%edx,4), %xmm0   # load A[i] through A[i+3]
    movdqa (%ebx,%edx,4), %xmm1   # load B[i] through B[i+3]
    paddd  %xmm0, %xmm1           # CCCC = AAAA + BBBB
    movdqa %xmm1, (%ecx,%edx,4)   # store C[i] through C[i+3]
    addl   $4, %edx               # i += 4 (why 4? each instruction handles four elements)
    (loop control code)

Decoding the mnemonics: mov = data movement, dq = double-quad (128b), a = aligned; p = packed, add = add, d = double word (i.e., 32-bit integer). Register roles: %eax = A, %ebx = B, %ecx = C, %edx = i, so (%eax,%edx,4) is the address A + 4*i.
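For readers who prefer C to assembly, the same four-at-a-time loop can be written with compiler intrinsics. This is a minimal sketch, not code from the original slides; the name array_add_simd is hypothetical, and it assumes SSE2, 16-byte-aligned arrays, and a length divisible by 4 (matching the aligned movdqa loads above).

    #include <emmintrin.h>  /* SSE2 intrinsics */

    /* Sketch: C = A + B, four 32-bit ints per iteration.
     * Assumes SSE2, 16-byte-aligned arrays, length % 4 == 0. */
    void array_add_simd(int A[], int B[], int C[], int length) {
      int i;
      for (i = 0; i < length; i += 4) {
        __m128i a = _mm_load_si128((const __m128i *)&A[i]); /* load A[i..i+3], like movdqa */
        __m128i b = _mm_load_si128((const __m128i *)&B[i]); /* load B[i..i+3] */
        __m128i c = _mm_add_epi32(a, b);                    /* packed 32-bit add, like paddd */
        _mm_store_si128((__m128i *)&C[i], c);               /* store C[i..i+3] */
      }
    }

In practice, a modern compiler at -O2/-O3 will often auto-vectorize the original scalar loop into essentially this shape on its own.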
Is it always that easy?

No, not always. Let's look at a slightly more challenging one:

    unsigned sum_array(unsigned *array, int length) {
      unsigned total = 0;
      for (int i = 0; i < length; ++i) {
        total += array[i];
      }
      return total;
    }

Is there parallelism here?

Exposing the parallelism

Not as written: every iteration depends on the previous one through total, so the additions cannot be done in parallel directly. We first need to restructure the code into four independent partial sums:

    unsigned sum_array2(unsigned *array, int length) {
      unsigned total, i;
      unsigned temp[4] = {0, 0, 0, 0};
      for (i = 0; i < (length & ~0x3); i += 4) {
        temp[0] += array[i];
        temp[1] += array[i+1];
        temp[2] += array[i+2];
        temp[3] += array[i+3];
      }
      total = temp[0] + temp[1] + temp[2] + temp[3];
      for ( ; i < length; ++i) {
        total += array[i];
      }
      return total;
    }

(Note the parentheses in (length & ~0x3): in C, < binds tighter than &, so the unparenthesized i < length & ~0x3 would compare first and then mask the boolean result. The mask rounds length down to a multiple of 4, and the second loop handles the 0-3 leftover elements.)

Then we can write SIMD code for the hot part

The four accumulators in temp[] map directly onto the four 32-bit lanes of a single XMM register, so the main loop becomes one packed add per four elements; a hedged intrinsics sketch appears after the summary below.

Summary

- Performance is of primary concern in some applications: games, servers, mobile devices, supercomputers.
- Many important applications have parallelism; exploiting it is a good way to speed up programs.
- Single Instruction, Multiple Data (SIMD) does this at the ISA level: registers hold multiple data items, and instructions operate on all of them at once.
- SIMD can achieve factor-of-2, 4, or 8 speedups on kernels, but may require some restructuring of the code to expose the parallelism.
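The preview cuts off before the SIMD listing that the "hot part" slide promises, so here is a hedged sketch of what it might look like using SSE2 intrinsics. The name sum_array_simd, the unaligned loads, and the final horizontal sum are additions for illustration, not the slide's own code; only SSE2 is assumed.

    #include <emmintrin.h>  /* SSE2 intrinsics */

    /* Sketch (not the slide's listing): vectorized sum_array2. One XMM
     * register holds the four partial sums that temp[0..3] held in the
     * scalar version. Unaligned loads let the array start anywhere. */
    unsigned sum_array_simd(unsigned *array, int length) {
      __m128i sums = _mm_setzero_si128();  /* temp[0..3] = {0, 0, 0, 0} */
      int i;
      for (i = 0; i < (length & ~0x3); i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)&array[i]); /* load array[i..i+3] */
        sums = _mm_add_epi32(sums, v);     /* four independent partial sums */
      }
      /* Horizontal step: total = temp[0] + temp[1] + temp[2] + temp[3]. */
      unsigned temp[4];
      _mm_storeu_si128((__m128i *)temp, sums);
      unsigned total = temp[0] + temp[1] + temp[2] + temp[3];
      /* Cleanup loop for the 0-3 leftover elements. */
      for ( ; i < length; ++i) {
        total += array[i];
      }
      return total;
    }

Because unsigned addition wraps modulo 2^32 and is associative, reordering the additions into four lanes gives bit-identical results to the scalar version.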