MIT OpenCourseWare
http://ocw.mit.edu

6.189 Multicore Programming Primer, January (IAP) 2007

Please use the following citation format:

Rodric Rabbah, 6.189 Multicore Programming Primer, January (IAP) 2007. (Massachusetts Institute of Technology: MIT OpenCourseWare). http://ocw.mit.edu (accessed MM DD, YYYY). License: Creative Commons Attribution-Noncommercial-Share Alike.

Note: Please use the actual date you accessed this material in your citation.

For more information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms

6.189 IAP 2007
Lecture 10
Performance Monitoring and Optimizations
Dr. Rodric Rabbah
© Copyrights by IBM Corp. and by other(s) 2007

Review: Keys to Parallel Performance
● Coverage or extent of parallelism in algorithm
    Amdahl’s Law
● Granularity of partitioning among processors
    Communication cost and load balancing
● Locality of computation and communication
    Communication between processors or between processors and their memories

Communication Cost Model

$$C = f \cdot \left( o + l + \frac{n/m}{B} + t - \text{overlap} \right)$$

    f: frequency of messages
    o: overhead per message (at both ends)
    l: network delay per message
    n: total data sent
    m: number of messages
    B: bandwidth along path (determined by network)
    t: cost induced by contention per message
    overlap: amount of latency hidden by concurrency with computation

Overlapping Communication with Computation
[Figure: timeline of "Get Data" and "Compute" phases, labeled "Memory is idle", "CPU is idle", and "synchronization point"]

Limits in Pipelining Communication
● Computation to communication ratio limits performance gains from pipelining
[Figure: pipelined "Get Data" and "Compute" timelines]
● Where else to look for performance?
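The cost model and the pipelining limit above can be made concrete with a small sketch. It assumes an idealized two-stage pipeline over k equal-sized chunks with enough buffering that the fetch of one chunk fully overlaps the compute of the previous one. The function names (`comm_cost`, `serial_time`, `pipelined_time`) and parameters are hypothetical, chosen for illustration only.

```c
#include <assert.h>

/* Cost model from the slide: C = f * (o + l + (n/m)/B + t - overlap).
   All arguments use the slide's symbols; units are up to the caller. */
double comm_cost(double f, double o, double l, double n, double m,
                 double B, double t, double overlap) {
    assert(m > 0.0 && B > 0.0);
    return f * (o + l + (n / m) / B + t - overlap);
}

/* No overlap: each of the k chunks is fetched, then computed on. */
double serial_time(int k, double t_get, double t_comp) {
    assert(k > 0);
    return k * (t_get + t_comp);
}

/* Idealized pipeline (an assumption): the first fetch and the last
   compute cannot be hidden; in between, the slower stage sets the pace. */
double pipelined_time(int k, double t_get, double t_comp) {
    assert(k > 0);
    double slower = (t_get > t_comp) ? t_get : t_comp;
    return t_get + t_comp + (k - 1) * slower;
}
```

Under these assumptions the gain is serial_time / pipelined_time, which approaches (t_get + t_comp) / max(t_get, t_comp) as k grows. That ratio is at most 2 when the two stages take equal time and shrinks toward 1 as one stage dominates, which is one way to read the slide's point about the computation to communication ratio.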
Artifactual Communication
● Determined by program implementation and interactions with the architecture
● Examples:
    Poor distribution of data across distributed memories
    Unnecessarily fetching data that is not used
    Redundant data fetches

Lessons From Uniprocessors
● In uniprocessors, CPU communicates with memory
● Loads and stores are to uniprocessors as “get” and “put” are to distributed memory multiprocessors
● How is communication overlap enhanced in uniprocessors?
    Spatial locality
    Temporal locality

Spatial Locality
● CPU asks for data at address 1000
● Memory sends data at address 1000 … 1064
    Amount of data sent depends on architecture parameters such as the cache block size
● Works well if CPU actually ends up using data from 1001, 1002, …, 1064
● Otherwise wasted bandwidth and cache capacity

Temporal Locality
● Main memory access is expensive
● Memory hierarchy adds small but fast memories (caches) near the CPU
    Memories get bigger as distance from CPU increases
    [Figure: memory hierarchy: cache (level 1), cache (level 2), main memory]
● CPU asks for data at address 1000
● Memory hierarchy anticipates more accesses to same address and stores a local copy
● Works well if CPU actually ends up using data from 1000 over and over and over …
● Otherwise wasted cache capacity
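The spatial locality slide can be illustrated with two traversals of the same array. This is a minimal sketch, assuming a matrix stored row by row in one contiguous buffer; the function names are hypothetical. Both functions compute the same sum, but only the first touches memory in address order.

```c
#include <assert.h>
#include <stddef.h>

/* Row-major walk: consecutive accesses are adjacent in memory, so the
   neighbors brought in with each cache block are used before eviction. */
double sum_row_major(const double *a, size_t rows, size_t cols) {
    assert(a != NULL);
    double sum = 0.0;
    for (size_t r = 0; r < rows; r++) {
        for (size_t c = 0; c < cols; c++) {
            sum += a[r * cols + c];
        }
    }
    return sum;
}

/* Column-major walk over the same layout: consecutive accesses are
   cols * sizeof(double) bytes apart, so for large cols each access
   tends to land in a different cache block. */
double sum_col_major(const double *a, size_t rows, size_t cols) {
    assert(a != NULL);
    double sum = 0.0;
    for (size_t c = 0; c < cols; c++) {
        for (size_t r = 0; r < rows; r++) {
            sum += a[r * cols + c];
        }
    }
    return sum;
}
```

How much slower the second traversal runs depends on the array size relative to the caches and on the cache block size, so any timing difference should be measured on the target machine rather than assumed.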
Reducing Artifactual Costs in Distributed Memory Architectures
● Data is transferred in chunks to amortize communication cost
    Cell: DMA gets up to 16K
    Usually get a contiguous chunk of memory
● Spatial locality
    Computation should exhibit good spatial locality characteristics
● Temporal locality
    Reorder computation to maximize use of data fetched

Single Thread Performance: the last frontier in the search for performance?

Single Thread Performance
● Tasks mapped to execution units (threads)
● Threads run on individual processors (cores)
[Figure: execution timeline with sequential and parallel phases; finish line: sequential time + longest parallel time]
● Two keys to faster execution
    Load balance the work among the processors
    Make execution on each processor faster

Understanding Performance
● Need some way of measuring performance
    Coarse grained measurements

    % gcc sample.c
    % time a.out
    2.312u 0.062s 0:02.50 94.8%
    % gcc sample.c -O3
    % time a.out
    1.921u 0.093s 0:02.03 99.0%

    … but did we learn much about what’s going on?

    #include <stdio.h>
    #include <string.h>

    #define N (1 << 23)
    #define T (10)

    double a[N], b[N];

    /* Zero the array one element at a time. */
    void cleara(double a[N]) {
        int i;
        for (i = 0; i < N; i++) {
            a[i] = 0;
        }
    }

    int main() {
        double s = 0, s2 = 0;
        int i, j;
        for (j = 0; j < T; j++) {
            for (i = 0; i < N; i++) {
                b[i] = 0;
            }
            cleara(a);
            memset(a, 0, sizeof(a));
            /* record start time */
            for (i = 0; i < N; i++) {
                s += a[i] * b[i];
                s2 += a[i] * a[i] + b[i] * b[i];
            }
        }
        /* record stop time */
        printf("s %f s2 %f\n", s, s2);
        return 0;
    }
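One way to learn more than `time` reports is to time a single region of the program. The sketch below wraps the dot-product loop from the sample between a start and a stop reading. It uses the standard C `clock()` function, which reports processor time rather than wall-clock time or hardware events, so it is a stand-in for the "record start time" and "record stop time" markers and not the lecture's own instrumentation. The `dot_result` type and `timed_dot` name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Results of the measured region plus the processor time it consumed. */
typedef struct {
    double s;        /* sum of a[i] * b[i] */
    double s2;       /* sum of a[i]^2 + b[i]^2 */
    double seconds;  /* processor time spent in the loop */
} dot_result;

dot_result timed_dot(const double *a, const double *b, size_t n) {
    assert(a != NULL && b != NULL);
    dot_result r = {0.0, 0.0, 0.0};

    clock_t start = clock();               /* record start time */
    for (size_t i = 0; i < n; i++) {
        r.s += a[i] * b[i];
        r.s2 += a[i] * a[i] + b[i] * b[i];
    }
    clock_t stop = clock();                /* record stop time */

    r.seconds = (double)(stop - start) / CLOCKS_PER_SEC;
    return r;
}
```

The resolution of `clock()` is implementation-defined, so short regions may report zero. Repeating the region several times, or swapping in a finer-grained timer available on the target platform, are the usual remedies.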
Measurements Using Counters
● Increasingly possible to get accurate measurements using performance counters
    Special registers in the hardware to measure events
● Insert code to start, read, and stop counter
    Measure exactly what you want, anywhere you want
    Can measure communication and computation duration
    But requires manual changes
    Monitoring nested scopes is an issue
    Heisenberg effect: counters can perturb execution time
[Figure: code region bracketed by "clear/start" and "stop" markers along a time axis]

Dynamic Profiling
● Event-based profiling
    Interrupt execution when an event counter reaches a threshold
● Time-based profiling
    Interrupt execution every t seconds
● Works without modifying your code
    Does not require that you know where problem might be
    Supports multiple languages and programming models
    Quite efficient for
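The idea behind event-based profiling can be emulated in software. The sketch below is an illustration only: a real profiler relies on hardware counters and interrupts and reads the interrupted program counter, so it needs no changes to the program. Here a software counter stands in for the hardware event counter, and an explicit region tag stands in for the program counter. All names (`sampler`, `sampler_event`, and so on) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_REGIONS 8

typedef struct {
    unsigned long threshold;             /* events between samples */
    unsigned long counter;               /* events since the last sample */
    int current_region;                  /* stand-in for the program counter */
    unsigned long samples[MAX_REGIONS];  /* samples attributed to each region */
} sampler;

void sampler_init(sampler *p, unsigned long threshold) {
    assert(p != NULL && threshold > 0);
    p->threshold = threshold;
    p->counter = 0;
    p->current_region = 0;
    for (size_t i = 0; i < MAX_REGIONS; i++) {
        p->samples[i] = 0;
    }
}

/* Mark which region of the program is executing. */
void sampler_enter(sampler *p, int region) {
    assert(p != NULL && region >= 0 && region < MAX_REGIONS);
    p->current_region = region;
}

/* Count one event. When the counter reaches the threshold, take a
   "sample": charge it to the current region and reset the counter. */
void sampler_event(sampler *p) {
    assert(p != NULL);
    if (++p->counter == p->threshold) {
        p->samples[p->current_region]++;
        p->counter = 0;
    }
}
```

A region that generates three times as many events as another collects roughly three times as many samples, so the histogram in `samples` approximates where the events occur while handling only one in every `threshold` of them.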