CS232 Spring 2007 Discussion 13: SSE

Introduction to SSE

A wide range of software applications, including graphics, MPEG video, music synthesis and more, show many common fundamental characteristics:

• Small data types (e.g., 8-bit pixels, 16-bit audio samples)
• Small, highly repetitive loops
• Computation-intensive algorithms
• Frequent multiplies and accumulates
• Highly parallel operations

The Intel Streaming SIMD Extensions (SSE) comprise a set of extensions to the Intel x86 architecture that is designed to greatly enhance the performance of advanced media and communication applications.

In class, you saw actual Intel SSE assembly instructions - however, these are not easy to program with (in general, assembly is not the language of choice for larger programs). Fortunately, some compilers will have built-in functions that provide a one-to-one mapping to SSE assembly.

Example: Inner Product

Recall that the inner product of two vectors $x = (x_1, x_2, \ldots, x_k)$ and $y = (y_1, y_2, \ldots, y_k)$ is defined as follows: $x \cdot y = x_1 y_1 + x_2 y_2 + \cdots + x_k y_k$.
Normally, we could compute the inner product as follows:

    float x[k]; float y[k];          // operand vectors of length k
    float inner_product = 0.0;       // accumulator
    for (int i = 0; i < k; i++)
        inner_product += x[i] * y[i];

To take advantage of SSE operations, we can rewrite this code using the SSE intrinsics:

    typedef float v4sf __attribute__ ((mode(V4SF)));  // floating point vector type
    float x[k]; float y[k];                           // operand vectors of length k
    float inner_product = 0.0, temp[4];
    v4sf acc, X, Y;                                   // 4x32-bit float registers
    int i;                                            // declared outside the loop so the cleanup loop can reuse it
    acc = __builtin_ia32_xorps(acc, acc);             // zero the accumulator
    for (i = 0; i < (k - 3); i += 4) {
        X = __builtin_ia32_loadups(&x[i]);            // load groups of four floats
        Y = __builtin_ia32_loadups(&y[i]);
        acc = __builtin_ia32_addps(acc, __builtin_ia32_mulps(X, Y));
    }
    __builtin_ia32_storeups(temp, acc);               // add the accumulated values
    inner_product = temp[0] + temp[1] + temp[2] + temp[3];
    for (; i < k; i++)                                // add up the remaining floats
        inner_product += x[i] * y[i];

Problems

1. Matrix-Vector Multiplication
   Write a function mv_multiply that multiplies a matrix and a vector. Recall that if A is a k×k matrix, B is a k-vector, and A∗B = C, then C is a k-vector where $C_i = \sum_{j=1}^{k} A_{i,j} B_j$.
   Code without SSE intrinsics is on the web site (under section 13). The original should have a run-time of around 9 seconds, while a basic solution using the SSE intrinsics should run in about 4 seconds.

2. Mandelbrot Sets
   Write a function that determines whether a series of points in the complex plane are inside the Mandelbrot set. Let $f_c(z) = z^2 + c$. Let $f_c^n(z)$ be the result of composing $f_c(z)$ with itself n times. (So $f_c^n(z) = f_c^{n-1}(f_c(z))$ and $f_c^1(z) = f_c(z)$.) Then a point (x, y) is considered to be in the Mandelbrot set if, for the complex number c = x + yi, $f_c^n(0)$ does not diverge to infinity as n approaches infinity.
   Code without SSE intrinsics is on the web site (under section 13).
   The original should have a run-time of around 10 seconds, while a basic solution using the SSE intrinsics should run in about 5 seconds.
   The code makes a simplifying assumption that if $|f_c^{200}(0)| < 4$, then it does not diverge to infinity.

Useful Intrinsics

    v4sf __builtin_ia32_loadups(float *)
    void __builtin_ia32_storeups(float *, v4sf)
    v4sf __builtin_ia32_addps(v4sf, v4sf)   // parallel arithmetic ops
    v4sf __builtin_ia32_subps(v4sf, v4sf)
    v4sf __builtin_ia32_mulps(v4sf, v4sf)
    v4sf __builtin_ia32_divps(v4sf, v4sf)
    v4sf __builtin_ia32_xorps(v4sf, v4sf)   // 128-bit XOR
                                            // Can be used to quickly generate 0s

Notes

1. Compile your code with the command: gcc -Wall -o execfile -msse filename.c
2. Run your program using: ./execfile
3. For further details about SSE intrinsics and the gcc compiler’s implementation of
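
The inner product example can also be written with the intrinsics declared in the <xmmintrin.h> header, which some compilers other than gcc accept as well. The following is a minimal sketch, not the course's reference code: the function name inner_product_sse and its signature are assumptions made for illustration, and the _mm_* calls are swapped in for the __builtin_ia32_* functions listed above (they perform the same load, multiply, add, and store operations).

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_loadu_ps, ... */

/* Hypothetical helper: inner product of x and y, both of length k,
 * processing four floats per loop iteration. */
float inner_product_sse(const float *x, const float *y, int k)
{
    float temp[4];
    __m128 acc = _mm_setzero_ps();           /* four running partial sums */
    int i;

    for (i = 0; i + 3 < k; i += 4) {
        __m128 X = _mm_loadu_ps(&x[i]);      /* unaligned load of four floats */
        __m128 Y = _mm_loadu_ps(&y[i]);
        acc = _mm_add_ps(acc, _mm_mul_ps(X, Y));
    }

    _mm_storeu_ps(temp, acc);                /* copy the partial sums to memory */
    float sum = temp[0] + temp[1] + temp[2] + temp[3];

    for (; i < k; i++)                       /* handle the leftover 0 to 3 elements */
        sum += x[i] * y[i];

    return sum;
}
```

Because the vector loop adds the products in a different order than the scalar loop, the two versions can differ slightly in the last bits of the result for general floating point inputs.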
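
The membership test described in the Mandelbrot problem can be sketched in scalar C as follows. This is an illustrative sketch only and is not the code from the web site: the names in_mandelbrot and MAX_ITER are assumptions, and the early return is an added shortcut (once |z| reaches 4 the iterates only grow, so the outcome matches checking after all 200 steps).

```c
#define MAX_ITER 200   /* iteration count from the simplifying assumption */

/* Hypothetical helper: returns 1 if c = x + yi is treated as inside the
 * Mandelbrot set, i.e. |z| stays below 4 for MAX_ITER steps of z = z^2 + c. */
int in_mandelbrot(float x, float y)
{
    float zr = 0.0f, zi = 0.0f;                  /* z starts at 0 */

    for (int n = 0; n < MAX_ITER; n++) {
        float next_r = zr * zr - zi * zi + x;    /* real part of z^2 + c */
        float next_i = 2.0f * zr * zi + y;       /* imaginary part of z^2 + c */
        zr = next_r;
        zi = next_i;
        if (zr * zr + zi * zi >= 16.0f)          /* |z| >= 4, compared as |z|^2 >= 16 */
            return 0;
    }
    return 1;
}
```

An SSE version would apply the same arithmetic to four points at once, which is why the loop body is written using only multiplies, adds, and subtracts.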