MIT 6.189 - Lecture 10: Performance Monitoring and Optimizations

MIT OpenCourseWare
http://ocw.mit.edu

6.189 Multicore Programming Primer, January (IAP) 2007

Please use the following citation format: Rodric Rabbah, 6.189 Multicore Programming Primer, January (IAP) 2007. (Massachusetts Institute of Technology: MIT OpenCourseWare). http://ocw.mit.edu (accessed MM DD, YYYY). License: Creative Commons Attribution-Noncommercial-Share Alike.

Note: Please use the actual date you accessed this material in your citation. For more information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms

6.189 IAP 2007 MIT
Lecture 10: Performance Monitoring and Optimizations
Dr. Rodric Rabbah
© Copyrights by IBM Corp. and by other(s) 2007

Review: Keys to Parallel Performance
● Coverage or extent of parallelism in algorithm
  - Amdahl's Law
● Granularity of partitioning among processors
  - Communication cost and load balancing
● Locality of computation and communication
  - Communication between processors or between processors and their memories

Communication Cost Model

$$C = f \cdot \left( o + l + \frac{n/m}{B} + t - \text{overlap} \right)$$

where
  f        frequency of messages
  o        overhead per message (at both ends)
  l        network delay per message
  n        total data sent
  m        number of messages
  B        bandwidth along path (determined by network)
  t        cost induced by contention per message
  overlap  amount of latency hidden by concurrency with computation

Overlapping Communication with Computation
[Figure: timeline of "Get Data" and "Compute" phases separated by a synchronization point; labels: "CPU is idle", "Memory is idle".]

Limits in Pipelining Communication
● Computation to communication ratio limits performance gains from pipelining
  [Figure: pipelined timeline of "Get Data" and "Compute" phases.]
● Where else to look for performance?
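
As a rough illustration of the cost model and of the pipelining limit, the C sketch below evaluates C = f * (o + l + (n/m)/B + t - overlap) and compares a serial get/compute schedule with an idealized pipelined one. The function names, the convention that every term is expressed in the same time unit, and the idealized schedule (one transfer overlaps exactly one compute step, with no contention) are assumptions made for this example and are not taken from the lecture.

```c
#include <assert.h>

/* Communication cost per the model: f * (o + l + (n/m)/B + t - overlap).
 * Assumption: o, l, t and overlap share one time unit, and (n/m)/B
 * evaluates to that same unit. */
double comm_cost(double f, double o, double l, double n, double m,
                 double B, double t, double overlap)
{
    assert(m > 0.0 && B > 0.0);   /* guard against division by zero */
    return f * (o + l + (n / m) / B + t - overlap);
}

/* Total time when each of k steps first fetches its data and then
 * computes on it, with no overlap between the two phases. */
double serial_time(int k, double get, double compute)
{
    assert(k > 0);
    return k * (get + compute);
}

/* Idealized pipelined schedule: the fetch for step i+1 runs while step i
 * computes. The first fetch and the last compute cannot be hidden; each
 * of the k-1 stages in between costs the slower of the two phases. */
double pipelined_time(int k, double get, double compute)
{
    assert(k > 0);
    double slower = get > compute ? get : compute;
    return get + (k - 1) * slower + compute;
}
```

Under these assumptions, four steps with get = 1 and compute = 3 take 16 time units serially and 13 when pipelined. When the two phases are equally long, the pipelined time approaches half the serial time as the number of steps grows, and the gain shrinks as the phases become more unbalanced.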

Artifactual Communication
● Determined by program implementation and interactions with the architecture
● Examples:
  - Poor distribution of data across distributed memories
  - Unnecessarily fetching data that is not used
  - Redundant data fetches

Lessons From Uniprocessors
● In uniprocessors, CPU communicates with memory
● Loads and stores are to uniprocessors as "get" and "put" are to distributed memory multiprocessors
● How is communication overlap enhanced in uniprocessors?
  - Spatial locality
  - Temporal locality

Spatial Locality
● CPU asks for data at address 1000
● Memory sends data at address 1000 … 1064
  - Amount of data sent depends on architecture parameters such as the cache block size
● Works well if CPU actually ends up using data from 1001, 1002, …, 1064
● Otherwise wasted bandwidth and cache capacity

Temporal Locality
● Main memory access is expensive
● Memory hierarchy adds small but fast memories (caches) near the CPU
  - Memories get bigger as distance from CPU increases
  [Figure: memory hierarchy: cache (level 1), cache (level 2), main memory.]
● CPU asks for data at address 1000
● Memory hierarchy anticipates more accesses to same address and stores a local copy
● Works well if CPU actually ends up using data from 1000 over and over and over …
● Otherwise wasted cache capacity
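
To make the spatial locality point concrete, here is a minimal C sketch that sums the same row-major matrix in two traversal orders. Both functions return the same value; only the memory access pattern differs. The function names and the flat row-major layout are assumptions chosen for illustration, and the actual performance difference depends on the cache block size and the matrix dimensions.

```c
#include <assert.h>
#include <stddef.h>

/* Sum a rows x cols matrix stored in row-major order, visiting elements
 * in the order they sit in memory (stride of one element). Each fetched
 * cache block is used in full before the next one is needed. */
double sum_row_major(const double *a, size_t rows, size_t cols)
{
    assert(a != NULL);
    double sum = 0.0;
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            sum += a[r * cols + c];
    return sum;
}

/* Same sum, walking down each column first. Consecutive accesses are
 * cols elements apart, so for a large matrix a fetched block may supply
 * only one element before it is evicted, wasting bandwidth and capacity. */
double sum_col_major(const double *a, size_t rows, size_t cols)
{
    assert(a != NULL);
    double sum = 0.0;
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            sum += a[r * cols + c];
    return sum;
}
```

Timing the two functions on a matrix much larger than the cache is one way to observe the effect described on the Spatial Locality slide.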

Reducing Artifactual Costs in Distributed Memory Architectures
● Data is transferred in chunks to amortize communication cost
  - Cell: DMA gets up to 16K
  - Usually get a contiguous chunk of memory
● Spatial locality
  - Computation should exhibit good spatial locality characteristics
● Temporal locality
  - Reorder computation to maximize use of data fetched

Single Thread Performance: the last frontier in the search for performance?

Single Thread Performance
● Tasks mapped to execution units (threads)
● Threads run on individual processors (cores)
  [Figure: timeline of sequential and parallel phases; finish line: sequential time + longest parallel time.]
● Two keys to faster execution
  - Load balance the work among the processors
  - Make execution on each processor faster

Understanding Performance
● Need some way of measuring performance
  - Coarse grained measurements

    % gcc sample.c
    % time a.out
    2.312u 0.062s 0:02.50 94.8%
    % gcc sample.c -O3
    % time a.out
    1.921u 0.093s 0:02.03 99.0%

  - … but did we learn much about what's going on?

    #include <stdio.h>    /* printf */
    #include <string.h>   /* memset */

    #define N (1 << 23)
    #define T (10)

    double a[N], b[N];

    /* Clear the array one element at a time. */
    void cleara(double a[N]) {
        int i;
        for (i = 0; i < N; i++) {
            a[i] = 0;
        }
    }

    int main(void) {
        double s = 0, s2 = 0;
        int i, j;
        for (j = 0; j < T; j++) {
            for (i = 0; i < N; i++) {
                b[i] = 0;
            }
            cleara(a);
            memset(a, 0, sizeof(a));
            /* record start time */
            for (i = 0; i < N; i++) {
                s += a[i] * b[i];
                s2 += a[i] * a[i] + b[i] * b[i];
            }
        }
        /* record stop time */
        printf("s %f s2 %f\n", s, s2);
        return 0;
    }
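
The "record start time" and "record stop time" placeholders in the sample can be filled in several ways. The sketch below shows one option, assuming a POSIX system that provides clock_gettime with CLOCK_MONOTONIC; the lecture does not say which timer it used, and the function names seconds_between and time_dot_product are hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* expose clock_gettime under strict C modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Elapsed wall-clock seconds between two timespec readings. */
double seconds_between(struct timespec start, struct timespec stop)
{
    return (double)(stop.tv_sec - start.tv_sec)
         + (double)(stop.tv_nsec - start.tv_nsec) / 1e9;
}

/* Run the accumulation loop from the sample over n elements and return
 * the wall-clock time it took. The two sums are written to *s and *s2
 * so the work has an observable result. */
double time_dot_product(const double *a, const double *b, size_t n,
                        double *s, double *s2)
{
    struct timespec start, stop;
    double acc = 0.0, acc2 = 0.0;

    assert(a != NULL && b != NULL && s != NULL && s2 != NULL);

    clock_gettime(CLOCK_MONOTONIC, &start);   /* record start time */
    for (size_t i = 0; i < n; i++) {
        acc  += a[i] * b[i];
        acc2 += a[i] * a[i] + b[i] * b[i];
    }
    clock_gettime(CLOCK_MONOTONIC, &stop);    /* record stop time */

    *s = acc;
    *s2 = acc2;
    return seconds_between(start, stop);
}
```

Unlike the coarse `time a.out` figures, this isolates the loop of interest from the array initialization. On older glibc versions, linking with -lrt may be required for clock_gettime.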

Measurements Using Counters
● Increasingly possible to get accurate measurements using performance counters
  - Special registers in the hardware to measure events
● Insert code to start, read, and stop counter
  - Measure exactly what you want, anywhere you want
  - Can measure communication and computation duration
  - But requires manual changes
  - Monitoring nested scopes is an issue
  - Heisenberg effect: counters can perturb execution time
  [Figure: time axis with "clear/start" and "stop" markers.]

Dynamic Profiling
● Event-based profiling
  - Interrupt execution when an event counter reaches a threshold
● Time-based profiling
  - Interrupt execution every t seconds
● Works without modifying your code
  - Does not require that you know where problem might be
  - Supports multiple languages and programming models
  - Quite efficient for

