ME964 — High Performance Computing for Engineering Applications
Parallel Computing using OpenMP [Part 1 of 2]
March 31, 2011
© Dan Negrut, 2011, ME964 UW-Madison

"The competent programmer is fully aware of the strictly limited size of his own skull; therefore he approaches the programming task in full humility, and among other things he avoids clever tricks like the plague." — Edsger W. Dijkstra

Before We Get Started…
- Last time:
  - Learned how to run an MPI executable on Newton
  - Point-to-point communication with MPI
  - Collective communication in MPI
- Today:
  - Parallel computing using OpenMP, part 1 of 2
- Other issues:
  - Assignment 7 was posted on the class website, due on April 7
  - The class website includes a link to the OpenMP 3.0 Application Programming Interface: http://www.openmp.org/mp-documents/spec30.pdf

Acknowledgements
- The overwhelming majority of the slides used for discussing OpenMP issues are from Intel's library of presentations for promoting OpenMP
- The slides are used herein with permission
- Credit is given where due by a "Credit: IOMPP" or "Includes material from IOMPP" note at the bottom of the slide
- IOMPP stands for "Intel OpenMP Presentation"

Data vs. Task Parallelism
- Data parallelism:
  - You have a large number of data elements, and each element (or possibly a subset of elements) needs to be processed to produce a result
  - When this processing can be done in parallel, we have data parallelism
  - Example: adding two long arrays of doubles to produce yet another array of doubles
- Task parallelism:
  - You have a collection of tasks that need to be completed
  - If these tasks can be performed in parallel, you are faced with a task-parallel job
  - Examples: reading the newspaper, drinking coffee, and scratching your back; breathing with your lungs, the beating of your heart, liver function, controlling swallowing, etc.

Objectives
- Understand OpenMP at the level where you can:
  - Implement data parallelism
  - Implement task parallelism
(Credit: IOMPP)

Work Plan
- What is OpenMP?
- Parallel regions
- Work sharing
- Data environment
- Synchronization
- Advanced topics
(Credit: IOMPP)

OpenMP: Target Hardware
- CUDA: targets parallelism on the GPU
- MPI: targets parallelism on a cluster (distributed computing)
  - Note that an MPI implementation can transparently handle an SMP architecture, such as a workstation with two hex-core CPUs sharing a large amount of memory
- OpenMP: targets parallelism on SMP architectures
- Handy when:
  - You have a machine with 12 cores, probably 24 if HTT is accounted for
  - You have a large amount of shared memory backed by a 64-bit OS

OpenMP: What to Expect
- If you have 12 cores available to you, it is *highly* unlikely that you will get a speedup of more than 12 (superlinear)
- Recall the trick that helped the GPU hide latency:
  - Overcommitting the SPs and hiding memory access latency with warp execution
- This mechanism of hiding latency by overcommitment does not *explicitly* exist for parallel computing under OpenMP, beyond what's offered by HTT
OpenMP: What Is It?
- Portable, shared-memory threading API
  - Fortran, C, and C++
  - Multi-vendor support for both Linux and Windows
- Standardizes task-level and loop-level parallelism
- Supports coarse-grained parallelism
- Combines serial and parallel code in a single source
- Standardizes ~20 years of compiler-directed threading experience
- Current spec is OpenMP 3.0 (318 pages): http://www.openmp.org
(Credit: IOMPP)

"pthreads": An OpenMP Precursor
- Before there was OpenMP, a common approach to supporting parallel programming was the use of pthreads
  - "pthread": POSIX thread
  - POSIX: Portable Operating System Interface [for Unix]
- pthreads:
  - Available originally under Unix and Linux
  - Windows ports are also available, some as open-source projects
- Parallel programming with pthreads is relatively cumbersome, prone to mistakes, and hard to maintain/scale/expand
- Moreover, it was not envisioned as a mechanism for writing scientific computing software

"pthreads": Example
(The listing below consolidates the two code slides into one program. The globals `finals`, `barrier1`, and `rootn`, and the integrand `f`, were used but not shown on the slides; they are filled in here, and a few bug fixes are flagged in comments.)

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <pthread.h>

#define SOLARIS 1
#define ORIGIN  2
#define OS SOLARIS

typedef struct {
    int id;
    int noproc;
    int dim;
} parm;

typedef struct {
    int cur_count;
    int cycle;              /* generation counter (added fix: guards reuse of the barrier) */
    pthread_mutex_t barrier_mutex;
    pthread_cond_t barrier_cond;
} barrier_t;

/* Global state shared by all threads (implied by the slide listing, not shown there). */
barrier_t barrier1;
double *finals;             /* per-thread partial results */
int rootn = 100000;         /* number of quadrature intervals (assumed value) */

/* Integrand: the integral of 4/(1+x^2) over [0,1] is pi. */
double f(double x) { return 4.0 / (1.0 + x * x); }

void barrier_init(barrier_t *mybarrier) {   /* must run before spawning the threads */
#if (OS == ORIGIN)
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);          /* bug fix: attr must be initialized before use */
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    pthread_mutexattr_setprioceiling(&attr, 0);
    pthread_mutex_init(&(mybarrier->barrier_mutex), &attr);
#elif (OS == SOLARIS)
    pthread_mutex_init(&(mybarrier->barrier_mutex), NULL);
#else
#error "undefined OS"
#endif
    pthread_cond_init(&(mybarrier->barrier_cond), NULL);
    mybarrier->cur_count = 0;
    mybarrier->cycle = 0;
}

void barrier(int numproc, barrier_t *mybarrier) {
    pthread_mutex_lock(&(mybarrier->barrier_mutex));
    int my_cycle = mybarrier->cycle;
    mybarrier->cur_count++;
    if (mybarrier->cur_count != numproc) {
        while (my_cycle == mybarrier->cycle)  /* loop guards against spurious wakeups */
            pthread_cond_wait(&(mybarrier->barrier_cond), &(mybarrier->barrier_mutex));
    } else {
        mybarrier->cur_count = 0;
        mybarrier->cycle++;
        pthread_cond_broadcast(&(mybarrier->barrier_cond));
    }
    pthread_mutex_unlock(&(mybarrier->barrier_mutex));
}

void *cpi(void *arg) {
    parm *p = (parm *) arg;
    int myid = p->id;
    int numprocs = p->noproc;
    double PI25DT = 3.141592653589793238462643;
    double startwtime = 0.0, endwtime;

    if (myid == 0)
        startwtime = clock();

    barrier(numprocs, &barrier1);
    if (rootn == 0) {
        finals[myid] = 0.0;
    } else {
        double h = 1.0 / (double) rootn;
        double sum = 0.0;
        /* Cyclic distribution: thread myid handles intervals myid+1, myid+1+numprocs, ... */
        for (int i = myid + 1; i <= rootn; i += numprocs) {
            double x = h * ((double) i - 0.5);
            sum += f(x);
        }
        finals[myid] = h * sum;   /* bug fix: store inside the else branch, so the
                                     rootn == 0 result is not overwritten */
    }
    barrier(numprocs, &barrier1);

    if (myid == 0) {
        double pi = 0.0;
        for (int i = 0; i < numprocs; i++)
            pi += finals[i];
        endwtime = clock();
        /* printf completed in the conventional form; the slide listing was truncated here */
        printf("pi is approximately %.16f, error is %.16f\n", pi, fabs(pi - PI25DT));
        printf("CPU time: %f s\n", (endwtime - startwtime) / CLOCKS_PER_SEC);
    }
    return NULL;
}

int main(int argc, char *argv[]) {
    if (argc < 2) return 1;
    int n = atoi(argv[1]);                            /* number of threads */

    pthread_t *threads = (pthread_t *) malloc(n * sizeof(*threads));
    pthread_attr_t pthread_custom_attr;
    pthread_attr_init(&pthread_custom_attr);
    barrier_init(&barrier1);                          /* set up the barrier */
    finals = (double *) malloc(n * sizeof(double));   /* space for per-thread results */
    parm *arg = (parm *) malloc(n * sizeof(parm));

    for (int i = 0; i < n; i++) {                     /* spawn the threads */
        arg[i].id = i;
        arg[i].noproc = n;
        pthread_create(&threads[i], &pthread_custom_attr, cpi, (void *)(arg + i));
    }
    for (int i = 0; i < n; i++)                       /* wait for each thread to complete */
        pthread_join(threads[i], NULL);

    free(arg);
    free(finals);
    free(threads);
    return 0;
}
```