6.189 IAP 2007, MIT
Lecture 10: Performance Monitoring and Optimizations
Dr. Rodric Rabbah, IBM

Review: Keys to Parallel Performance
- Coverage, or extent of parallelism in the algorithm
  - Amdahl's Law
- Granularity of partitioning among processors
  - Communication cost and load balancing
- Locality of computation and communication
  - Communication between processors, or between processors and their memories

Communication Cost Model

$$C = f \times \left( o + l + \frac{n/m}{B} + t - \mathrm{overlap} \right)$$

- f: frequency of messages
- o: overhead per message (at both ends)
- l: network delay per message
- n: total data sent
- m: number of messages
- B: bandwidth along path (determined by network)
- t: cost induced by contention per message
- overlap: amount of latency hidden by concurrency with computation

Overlapping Communication with Computation
[Timeline figure: Get Data and Compute phases. Labels: memory is idle, CPU is idle, synchronization point.]

Limits in Pipelining Communication
- Computation-to-communication ratio limits the performance gains from pipelining
[Timeline figure: Get Data and Compute phases.]
- Where else to look for performance?

Artifactual Communication
- Determined by program implementation and interactions with the architecture
- Examples:
  - Poor distribution of data across distributed memories
  - Unnecessarily fetching data that is not used
  - Redundant data fetches

Lessons From Uniprocessors
- In uniprocessors, the CPU communicates with memory
  - Loads and stores are to uniprocessors as "get" and "put" are to distributed memory multiprocessors
- How is communication overlap enhanced in uniprocessors?
  - Spatial locality
  - Temporal locality

Spatial Locality
- CPU asks for data at address 1000
- Memory sends data at addresses 1000 ... 1064
  - Amount of data sent depends on architecture parameters such as the cache block size
- Works well if the CPU actually ends up using data from 1001, 1002, ..., 1064
- Otherwise, wasted bandwidth and cache capacity

Temporal Locality
- Main memory access is expensive
- Memory hierarchy adds small but fast memories (caches) near the CPU
  - Memories get bigger as distance from the CPU increases
- CPU asks for data at address 1000
- Memory hierarchy anticipates more accesses to the same address and stores a local copy
- Works well if the CPU actually ends up using data from 1000 over and over and over
- Otherwise, wasted cache capacity
[Figure: memory hierarchy showing main memory, cache (level 2), cache (level 1).]

Reducing Artifactual Costs in Distributed Memory Architectures
- Data is transferred in chunks to amortize communication cost
  - Cell: DMA gets up to 16K
  - Usually get a contiguous chunk of memory
- Spatial locality
  - Computation should exhibit good spatial locality characteristics
- Temporal locality
  - Reorder computation to maximize use of data fetched

Single Thread Performance: the last frontier in the search for performance

Single Thread Performance
- Tasks mapped to execution units (threads)
- Threads run on individual processors (cores)
[Figure: sequential and parallel phases. Finish line: sequential time + longest parallel time.]
- Two keys to faster execution
  - Load balance the work among the processors
  - Make execution on each processor faster

Understanding Performance
- Need some way of measuring performance
  - Coarse-grained measurements

    % gcc sample.c; time a.out
    2.312u 0.062s 0:02.50 94.8%
    % gcc sample.c -O3; time a.out
    1.921u 0.093s 0:02.03 99.0%

- ... but did we learn much about what's going on?

    #define N (1 << 23)
    #define T (10)
    #include <stdio.h>
    #include <string.h>
    double a[N], b[N];

    /* Clear the array one element at a time. */
    void cleara(double a[N]) {
        int i;
        for (i = 0; i < N; i++) {
            a[i] = 0;
        }
    }

    int main() {
        double s = 0, s2 = 0;
        int i, j;
        for (j = 0; j < T; j++) {
            for (i = 0; i < N; i++) {
                b[i] = 0;
            }
            cleara(a);
            memset(a, 0, sizeof(a));

            /* record start time */
            for (i = 0; i < N; i++) {
                s += a[i] * b[i];
                s2 += a[i] * a[i] + b[i] * b[i];
            }
            /* record stop time */
        }
        printf("s %f s2 %f\n", s, s2);
        return 0;
    }
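The "record start time" and "record stop time" markers in the sample can be realized in several ways. The following is a minimal sketch, assuming a POSIX system where `clock_gettime` with `CLOCK_MONOTONIC` is available. The helper names `elapsed_seconds` and `timed_dot` are made up for this illustration and are not part of the lecture's sample.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* expose clock_gettime under strict -std modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Seconds elapsed between two timespec samples (hypothetical helper). */
static double elapsed_seconds(struct timespec start, struct timespec stop)
{
    return (double)(stop.tv_sec - start.tv_sec)
         + (double)(stop.tv_nsec - start.tv_nsec) / 1e9;
}

/* Dot product of a and b over n elements (hypothetical helper).
   The wall-clock time spent in the loop is stored in *seconds. */
static double timed_dot(const double *a, const double *b, size_t n,
                        double *seconds)
{
    struct timespec start, stop;
    double s = 0.0;
    size_t i;

    assert(seconds != NULL);

    clock_gettime(CLOCK_MONOTONIC, &start);   /* record start time */
    for (i = 0; i < n; i++) {
        s += a[i] * b[i];
    }
    clock_gettime(CLOCK_MONOTONIC, &stop);    /* record stop time */

    *seconds = elapsed_seconds(start, stop);
    return s;
}
```

Calling `timed_dot(a, b, N, &secs)` inside the outer loop of the sample would time only the dot-product loop, which is finer grained than running the whole program under `time`. On very short loops the timer calls themselves can dominate, so the measured region should be large enough to amortize them.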
Measurements Using Counters
- Increasingly possible to get accurate measurements using performance counters
  - Special registers in the hardware to measure events
- Insert code to start, read, and stop the counter
  - Measure exactly what you want, anywhere you want
  - Can measure communication and computation duration
  - But requires manual changes
  - Monitoring nested scopes is an issue
  - Heisenberg effect: counters can perturb execution time
[Figure: timeline with clear/start and stop markers.]

Dynamic Profiling
- Event-based profiling
  - Interrupt execution when an event counter reaches a threshold
- Time-based profiling
  - Interrupt execution every t seconds
- Works without modifying your code
  - Does not require that you know where the problem might be
  - Supports multiple languages and programming models
  - Quite efficient for appropriate sampling frequencies

Counter Examples
- Cycles (clock ticks)
- Pipeline stalls
- Cache hits
- Cache misses
- Number of instructions
- Number of loads
- Number of stores
- Number of floating point operations

Useful Derived Measurements
- Processor utilization: Cycles / Wall Clock Time
- Instructions per cycle: Instructions / Cycles
- Instructions per memory operation: Instructions / (Loads + Stores)
- Average number of instructions per load miss: Instructions / L1 Load Misses
- Memory traffic: (Loads + Stores) x Lk Cache Line Size
- Bandwidth consumed: (Loads + Stores) x Lk Cache Line Size / Wall Clock Time
- Many others
  - Cache miss rate
  - Branch misprediction rate

Common Profiling Workflow
[Workflow diagram. Stages shown: application source, compiler, binary object code, binary analysis, run (profile execution), performance profile, source correlation, interpret profile.]

Popular Runtime Profiling Tools
- GNU gprof
  - Widely available with UNIX/Linux distributions

    gcc -O2 -pg foo.c -o foo
    ./foo
    gprof foo

- HPC Toolkit: http://www.hipersoft.rice.edu/hpctoolkit/
- PAPI: http://icl.cs.utk.edu/papi/
- VTune: http://www.intel.com/cd/software/products/asmo-na/eng/vtune/
- Many others

GNU gprof
MPEG-2 decoder (reference implementation)

    % ./mpeg2decode -b mei16v2.m2v -f -r

-r uses double precision inverse DCT

      %   cumulative     self                self      total
     time    seconds   seconds     calls  ns/call    ns/call
    90.48       0.19      0.19      7920  23989.90  23989.90
     4.76       0.20      0.01      2148   4655.49   4655.49
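The formulas under Useful Derived Measurements, and the per-call columns of the gprof flat profile, are simple ratios of raw values. The sketch below writes them out in C. The `struct counters` layout and all function names are hypothetical, chosen only for this illustration; they do not correspond to PAPI, VTune, or any other real counter API.

```c
#include <assert.h>

/* Raw values for one measured region (hypothetical layout). */
struct counters {
    double cycles;
    double instructions;
    double loads;
    double stores;
    double l1_load_misses;
    double wall_seconds;
};

/* Processor utilization: Cycles / Wall Clock Time. */
static double cycles_per_second(const struct counters *c)
{
    assert(c->wall_seconds > 0.0);
    return c->cycles / c->wall_seconds;
}

/* Instructions per cycle: Instructions / Cycles. */
static double instructions_per_cycle(const struct counters *c)
{
    return c->instructions / c->cycles;
}

/* Instructions per memory operation: Instructions / (Loads + Stores). */
static double instructions_per_memory_op(const struct counters *c)
{
    return c->instructions / (c->loads + c->stores);
}

/* Average instructions per load miss: Instructions / L1 Load Misses. */
static double instructions_per_load_miss(const struct counters *c)
{
    return c->instructions / c->l1_load_misses;
}

/* Memory traffic in bytes: (Loads + Stores) x cache line size. */
static double memory_traffic_bytes(const struct counters *c, double line_size)
{
    return (c->loads + c->stores) * line_size;
}

/* Bandwidth consumed: memory traffic / Wall Clock Time. */
static double bandwidth_bytes_per_second(const struct counters *c,
                                         double line_size)
{
    return memory_traffic_bytes(c, line_size) / c->wall_seconds;
}

/* gprof "self ns/call": self seconds divided by number of calls. */
static double ns_per_call(double self_seconds, double calls)
{
    return self_seconds * 1e9 / calls;
}
```

As a consistency check on the flat profile, `ns_per_call(0.19, 7920)` comes out at roughly 23990 ns, in line with the 23989.90 ns/call reported in the first row. The small difference is expected because gprof prints the seconds column rounded to two decimals.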

