CS232 Fall 2006 VTune Lab November 13, 2006GoalThe goal of this section is to get you familiar with the VTune analyzer for Performance Optimization.Recall from lecture that the most efficient methodology for optimizing programs is as follows:1. Write a working version of the program.2. Create suitable benchmarks and collect data for them.3. Identify “hotspots” and fine-tune those areas of the code.4. If the performance objectives have not been met, go back to Step 3.In this section, you will implement this methodology. A working version of the code (p1.c) and a suitablebenchmark (datafile.txt) will be provided. Your task is to use VTune to identify hotspots and determinehow to optimize the code. Do not attempt to read and understand the whole code. Instead, focus onlyon the slow segments (or functions) identified by VTune.Getting started1. Please seat yourself at one of the machines csil-core1 through csil-core25. If you can’t find afree machine, work in groups of two.2. Open a terminal and type the following commands:mkdir vtunecd vtunecp ~kumar/vtune/* .//opt/intel/vtune/bin/vtlec &The last command will start the VTune analyzer in the background. Make sure your current directorycontains the following files: p1.c, p2.c, p3.c, datafile.txt and Makefile.3. If something isn’t working, let us know. Otherwise, you can start working on Problem 1.Problem 1a: Identifying the “Hotspot”Compile the file p1.c with the command make p1. This will create an executable file which you can runwith the command p1. The program takes about 4 seconds to run.Using VTune:1. Click “Tuning”→“New Activity” and seclect the First Use Wizard. Click “Next”.2. For setting the Application to launch, click “Browse” and select the vtune/p1 program.3. Set the Working di rectory by clicking “Browse” and selecting Home followed by vtune.4. Click “Finish” to run the First Use Activity. Do not click anything until the “Most ActiveFunctions in Your Application” tab is displayed.As you can see, the SortArray function takes up almost all the time. Click on the “SortArray” link toview the source code. Scroll down until you have identified the slowest code-segment within this function(refer to the “Clockticks” column to see where the code is spending most of its time).1CS232 Fall 2006 VTune Lab November 13, 2006Problem 1b: The Call GraphA quick check should convince you that p1.c spends mos t of its time in the inner loop of the bubble-sort.As you know, there are faster sorting algorithms (e.g. QuickSort) – should we implement one of these?Before jumping to this conclusion, it is worth realizing that there could be two reasons why SortArray takesso long: (a) the function itself is slow (in which case switching to QuickSort could help), or (b) SortArrayis being called a large number of times (in which case, the caller must be examined more carefully).To determine which is the case, analyze p1 with the Call Graph Wizard. This produces a graphical viewof where the code is spending most of its time. In particular, it identifies the sequence of function calls,known as the critical path (shown in red), along which the bulk of the time is spent.First examine the critical path to determine which calls to SortArray are expensive. Now examine thecode of the caller. You will see that it is using a very poor algorithm to achieve its task. Suggest a simple,faster algorithm (no need to implement it).Problem 2: Better Data StructureThe code in p2.c implements a simple algorithmic fix for the hotspot in Problem 1.1Our goal in thisproblem is to compare two data structures used to represent sorted data. Compile p2.c with the commandmake p2 and run the First Use Wizard again (this time, select p2 as the Application to launch). Explainyour results. (Hint: Caches, locality.)Problem 3: Better InstructionsFinally, we will rewrite the code by identifying slow instructions and replacing them with faster ones.Compile the code in p3.c2, and this time run the Sampling Wizard. When the activity is complete, clickon “Clockticks” to examine the slowest function by clockticks. Click on the name of the slowest function toexamine its source code. To view the corrsponding x86 assembly instructions, click the “Mixed by Source”button above the “Line Number” column.Because modern processors reorder assembly instructions in order to achieve high performance, theclock-ticks per instruction may not be accurate. However, VTune is able to correctly identify the slowestblock of code. Rewrite this block to achieve improved performance. Explain w hy you made the changesyou did:1This speeds the code so dramatically, in fact, that we must run it several time in order to get representative results! Forp2.c, the constant ITERS has been increased from 2 to 64.2For p3.c, we have removed the expensive calls to AddToArray and AddToLinkedList. Take CS 225/CS 473 to learn efficientways of maintaining sorted
View Full Document