U of I CS 232 - Performance Optimization, cont. - D1703391

Home> Schools> University of Illinois> Computer Science (CS) > CS 232> Performance Optimization, cont.

DOC PREVIEW

U of I CS 232 - Performance Optimization, cont.

School name University of Illinois

Course Cs 232- Computer Architecture II

Pages 16

This preview shows page 1-2-3-4-5 out of 16 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 16 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 16 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 16 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 16 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 16 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 16 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

November 30, 2007 ISA's, Compilers, and Assembly 1Performance Optimization, cont.How do we fix performance problems?November 30, 2007 ISA's, Compilers, and Assembly 2How do we improve performance? Imagine you want to build a house. How long would it take you? What could you do to build that house faster?November 30, 2007 ISA's, Compilers, and Assembly 3Exploiting Parallelism Of the computing problems for which performance is important, manyhave inherent parallelism. E.g., computer games:— graphics, physics, sound, A.I. etc. can be done separately— Furthermore, there is often parallelism within each of these:• Each pixel on the screen’s color can be computed independently• Non-contacting objects can be updated/simulated independently• Artificial intelligence of non-human entities done independently E.g., Google queries:— Every query is independent• Google searches are read-only!!November 30, 2007 ISA's, Compilers, and Assembly 4Exploiting Parallelism at the Instruction level (SIMD) Consider adding together two arrays:voidarray_add(int A[], int B[], int C[], int length) { int i; for (i = 0 ; i < length ; ++ i) { C[i] = A[i] + B[i]; }} You could write assembly for this, something like:lw $t0, 0($a0)lw $t1, 0($a1)add $t0, $t1, $t2sw $t2, 0($a2)(plus all of the address arithmetic, plus the loop control)November 30, 2007 ISA's, Compilers, and Assembly 5Exploiting Parallelism at the Instruction level (SIMD) Consider adding together two arrays:voidarray_add(int A[], int B[], int C[], int length) { int i; for (i = 0 ; i < length ; ++ i) { C[i] = A[i] + B[i]; }}+Operating on one element at a timeNovember 30, 2007 ISA's, Compilers, and Assembly 6Exploiting Parallelism at the Instruction level (SIMD) Consider adding together two arrays:voidarray_add(int A[], int B[], int C[], int length) { int i; for (i = 0 ; i < length ; ++ i) { C[i] = A[i] + B[i]; }}+Operating on one element at a timeNovember 30, 2007 ISA's, Compilers, and Assembly 7 Consider adding together two arrays:voidarray_add(int A[], int B[], int C[], int length) { int i; for (i = 0 ; i < length ; ++ i) { C[i] = A[i] + B[i]; }}+Exploiting Parallelism at the Instruction level (SIMD)+Operate on MULTIPLE elements+ +Single Instruction,Multiple Data (SIMD)November 30, 2007 ISA's, Compilers, and Assembly 8Intel SSE/SSE2 as an example of SIMD• Added new 128 bit registers (XMM0 – XMM7), each can store— 4 single precision FP values (SSE) 4 * 32b— 2 double precision FP values (SSE2) 2 * 64b— 16 byte values (SSE2) 16 * 8b— 8 word values (SSE2) 8 * 16b— 4 double word values (SSE2) 4 * 32b— 1 128-bit integer value (SSE2) 1 * 128b 4.0 (32 bits)+ 4.0 (32 bits) 3.5 (32 bits) -2.0 (32 bits) 2.3 (32 bits) 1.7 (32 bits) 2.0 (32 bits)-1.5 (32 bits) 0.3 (32 bits) 5.2 (32 bits) 6.0 (32 bits)2.5 (32 bits)November 30, 2007 ISA's, Compilers, and Assembly 9SIMD ExtensionsMore than 70 instructions. Arithmetic Operations supported:Addition, Subtraction, Mult, Division, Square Root, Maximum,Minimum. Can operate on Floating point or Integer data.November 30, 2007 ISA's, Compilers, and Assembly 10Annotated SSE code for summing an arraymovdqa (%eax,%edx,4), %xmm0 # load A[i] to A[i+3]movdqa (%ebx,%edx,4), %xmm1 # load B[i] to B[i+3]paddd %xmm0, %xmm1 # CCCC = AAAA + BBBBmovdqa %xmm1, (%ecx,%edx,4) # store C[i] to C[i+3]addl $4, %edx # i += 4(loop control code)mov = data movementdq = double-quad (128b)a = aligned%eax = A%ebx = B%ecx = C%edx = iA + 4*ip = packedadd = addd = double (i.e., 32-bit integer)why?November 30, 2007 ISA's, Compilers, and Assembly 11November 30, 2007 ISA's, Compilers, and Assembly 12Is it always that easy? No. Not always. Let’s look at a little more challenging one.unsignedsum_array(unsigned *array, int length) { int total = 0; for (int i = 0 ; i < length ; ++ i) { total += array[i]; } return total;} Is there parallelism here?November 30, 2007 ISA's, Compilers, and Assembly 13Exposing the parallelismunsignedsum_array(unsigned *array, int length) { int total = 0; for (int i = 0 ; i < length ; ++ i) { total += array[i]; } return total;}November 30, 2007 ISA's, Compilers, and Assembly 14We first need to restructure the codeunsignedsum_array2(unsigned *array, int length) { unsigned total, i; unsigned temp[4] = {0, 0, 0, 0}; for (i = 0 ; i < length & ~0x3 ; i += 4) { temp[0] += array[i]; temp[1] += array[i+1]; temp[2] += array[i+2]; temp[3] += array[i+3]; } total = temp[0] + temp[1] + temp[2] + temp[3]; for ( ; i < length ; ++ i) { total += array[i]; } return total;}November 30, 2007 ISA's, Compilers, and Assembly 15Then we can write SIMD code for the hot partunsignedsum_array2(unsigned *array, int length) { unsigned total, i; unsigned temp[4] = {0, 0, 0, 0}; for (i = 0 ; i < length & ~0x3 ; i += 4) { temp[0] += array[i]; temp[1] += array[i+1]; temp[2] += array[i+2]; temp[3] += array[i+3]; } total = temp[0] + temp[1] + temp[2] + temp[3]; for ( ; i < length ; ++ i) { total += array[i]; } return total;}November 30, 2007 ISA's, Compilers, and Assembly 16Summary Performance is of primary concern in some applications— Games, servers, mobile devices, super computers Many important applications have parallelism— Exploiting it is a good way to speed up programs. Single Instruction Multiple Data (SIMD) does this at ISA level— Registers hold multiple data items, instruction operate on them— Can achieve factor or 2, 4, 8 speedups on kernels— May require some restructuring of code to expose

View Full Document