U of I CS 232 - Introduction to SSE

CS232 Spring 2007 Discussion 13: SSE

Introduction to SSE

A wide range of software applications, including graphics, MPEG video, music synthesis and more, show many common fundamental characteristics:

• Small data types (e.g., 8-bit pixels, 16-bit audio samples)
• Small, highly repetitive loops
• Computation-intensive algorithms
• Frequent multiplies and accumulates
• Highly parallel operations

The Intel Streaming SIMD Extensions (SSE) comprise a set of extensions to the Intel x86 architecture that is designed to greatly enhance the performance of advanced media and communication applications.

In class, you saw actual Intel SSE assembly instructions. However, these are not easy to program with (in general, assembly is not the language of choice for larger programs). Fortunately, some compilers have built-in functions that provide a one-to-one mapping to SSE assembly.

Example: Inner Product

Recall that the inner product of two vectors $x = (x_1, x_2, \dots, x_k)$ and $y = (y_1, y_2, \dots, y_k)$ is defined as follows: $x \cdot y = x_1 y_1 + x_2 y_2 + \dots + x_k y_k$.
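As a quick check of the definition, take k = 3 with x = (1, 2, 3) and y = (4, 5, 6). These values are illustrative only and do not come from the handout:

```latex
x \cdot y = 1 \cdot 4 + 2 \cdot 5 + 3 \cdot 6 = 4 + 10 + 18 = 32
```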
Normally, we could compute the inner product as follows:

    float x[k]; float y[k];            // operand vectors of length k
    float inner_product = 0.0;         // accumulator
    for (int i = 0; i < k; i++)
        inner_product += x[i] * y[i];

To take advantage of SSE operations, we can rewrite this code using the SSE intrinsics:

    typedef float v4sf __attribute__ ((mode(V4SF)));  // floating point vector type

    float x[k]; float y[k];                 // operand vectors of length k
    float inner_product = 0.0, temp[4];
    v4sf acc, X, Y;                         // 4x32-bit float registers
    int i;                                  // declared outside the loop so the cleanup loop can reuse it

    acc = __builtin_ia32_xorps(acc, acc);   // zero the accumulator
    for (i = 0; i < (k - 3); i += 4) {
        X = __builtin_ia32_loadups(&x[i]);  // load groups of four floats
        Y = __builtin_ia32_loadups(&y[i]);
        acc = __builtin_ia32_addps(acc, __builtin_ia32_mulps(X, Y));
    }
    __builtin_ia32_storeups(temp, acc);     // add the accumulated values
    inner_product = temp[0] + temp[1] + temp[2] + temp[3];
    for (; i < k; i++)                      // add up the remaining floats
        inner_product += x[i] * y[i];

Problems

1. Matrix-Vector Multiplication

   Write a function mv_multiply that multiplies a matrix and a vector. Recall that if $A$ is a $k \times k$ matrix, $B$ is a $k$-vector, and $A * B = C$, then $C$ is a $k$-vector where $C_i = \sum_{j=1}^{k} A_{i,j} * B_j$.

   Code without SSE intrinsics is on the web site (under section 13). The original should have a run-time of around 9 seconds, while a basic solution using the SSE intrinsics should run in about 4 seconds.

2. Mandelbrot Sets

   Write a function that determines whether a series of points in a complex plane are inside the Mandelbrot set. Let $f_c(z) = z^2 + c$. Let $f_c^n(z)$ be the result of composing $f_c(z)$ with itself $n$ times. (So $f_c^n(z) = f_c^{n-1}(f_c(z))$ and $f_c^1(z) = f_c(z)$.) Then, a point $(x, y)$ is considered to be in the Mandelbrot set if, for the complex number $c = x + yi$, $f_c^n(0)$ does not diverge to infinity as $n$ approaches infinity.

   Code without SSE intrinsics is on the web site (under section 13).
   The original should have a run-time of around 10 seconds, while a basic solution using the SSE intrinsics should run in about 5 seconds.

   The code makes a simplifying assumption that if $|f_c^{200}(0)| < 4$, then it does not diverge to infinity.

Useful Intrinsics

    v4sf __builtin_ia32_loadups(float *)
    void __builtin_ia32_storeups(float *, v4sf)
    v4sf __builtin_ia32_addps(v4sf, v4sf)   // parallel arithmetic ops
    v4sf __builtin_ia32_subps(v4sf, v4sf)
    v4sf __builtin_ia32_mulps(v4sf, v4sf)
    v4sf __builtin_ia32_divps(v4sf, v4sf)
    v4sf __builtin_ia32_xorps(v4sf, v4sf)   // 128-bit XOR
                                            // Can be used to quickly generate 0s

Notes

1. Compile your code with the command: gcc -Wall -o execfile -msse filename.c
2. Run your program using: ./execfile
3. For further details about SSE intrinsics and the gcc compiler’s implementation of
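For Problem 1 (Matrix-Vector Multiplication), one possible shape of a four-wide solution is sketched below. It is not the course's reference code. It assumes a row-major k×k matrix and a hypothetical signature for mv_multiply, and it swaps the `__builtin_ia32_*` calls for GCC's generic vector extension (`vector_size(16)`) so that the sketch also builds with compilers that lack those builtins. The helper `load4` is likewise hypothetical. The structure mirrors the inner product example: accumulate four lanes at a time, sum the lanes, then finish the leftover elements one by one.

```c
#include <stddef.h>
#include <string.h>

/* Four packed floats. This uses the GCC/Clang generic vector extension,
   an assumption standing in for the handout's __builtin_ia32_* calls. */
typedef float v4sf __attribute__((vector_size(16)));

/* Hypothetical helper: load four floats from a possibly unaligned address. */
static v4sf load4(const float *p)
{
    v4sf v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Computes c = A * b, where A is a k-by-k matrix stored row-major in a[].
   The signature and storage layout are assumptions, not from the handout. */
void mv_multiply(const float *a, const float *b, float *c, size_t k)
{
    for (size_t i = 0; i < k; i++) {
        const float *row = a + i * k;
        v4sf acc = {0.0f, 0.0f, 0.0f, 0.0f};  /* four partial sums */
        size_t j = 0;

        /* Multiply and accumulate four elements per iteration. */
        for (; j + 4 <= k; j += 4)
            acc += load4(row + j) * load4(b + j);

        /* Add the four partial sums together. */
        float lanes[4];
        memcpy(lanes, &acc, sizeof lanes);
        float sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];

        /* Handle the remaining (k mod 4) elements. */
        for (; j < k; j++)
            sum += row[j] * b[j];

        c[i] = sum;
    }
}
```

Calling `mv_multiply(a, b, c, k)` with `a` holding k*k floats and `b`, `c` holding k floats each fills `c` with the product. Because the lanes are summed in a different order than a plain scalar loop, results can differ from the scalar version in the last few bits for general floating-point inputs.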

