UMD CMSC 714 - Lecture Slides

CMSC 714 – F06 (Lecture 05), copyright 2006 Jeffrey K. Hollingsworth

Slide 1: Introduction
- Reading: papers
- Homework #1 due on Tuesday (in class)

Slide 2: OpenMP
- Supports parallelism for SMPs
  - provides a simple, portable model
  - allows both shared and private data
  - provides parallel do loops
- Includes
  - automatic support for fork/join parallelism
  - reduction variables
  - atomic statement
    - one process executes at a time
  - single statement
    - only one process runs this code (the first thread to reach it)

Slide 3: Sample Code

      program compute_pi
      integer n, i
      double precision w, x, sum, pi, f, a
c     function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)
      print *, 'Enter number of intervals: '
      read *, n
c     calculate the interval size
      w = 1.0d0 / n
      sum = 0.0d0
!$OMP PARALLEL DO PRIVATE(x), SHARED(w)
!$OMP& REDUCTION(+: sum)
      do i = 1, n
         x = w * (i - 0.5d0)
         sum = sum + f(x)
      enddo
      pi = w * sum
      print *, 'computed pi = ', pi
      stop
      end

Slide 4: UPC
- Extension to C for parallel computing
- Target environments
  - distributed memory machines
  - cache-coherent multiprocessors
- Features
  - explicit control of data distribution
  - includes a parallel for statement

Slide 5: UPC Execution Model
- SPMD-based
  - one thread per processor
  - each thread starts with the same entry to main
- Different consistency models possible
  - the "strict" model is based on sequential consistency
  - the "relaxed" model is based on release consistency

Slide 6: Forall Loop
- Forms the basis of parallelism in UPC
- Adds a fourth parameter, "affinity", to the for loop
  - where the code is executed is determined by the affinity expression
- Lacks an explicit barrier before/after execution
  - differs from OpenMP
- Supports nested forall loops

Slide 7: Split-phase Barriers
- Traditional barriers
  - once a process enters the barrier, it busy-waits until everyone arrives
- Split-phase
  - announce the intention to enter the barrier (upc_notify)
  - perform some local operations
  - wait for everyone else (upc_wait)
- Advantage
  - allows work to be done while waiting for other processes to arrive
- Disadvantages
  - must find work to do
  - takes time to communicate both the notify and the wait
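A minimal UPC sketch of these two ideas (the affinity clause of upc_forall and a split-phase barrier) is shown below. It is not from the slides: the array names, sizes, and the placement of the "local work" are illustrative, and it assumes a UPC 1.2 compiler such as Berkeley UPC.

    #include <upc.h>
    #include <stdio.h>

    #define N 1024

    shared double a[N * THREADS];   /* shared arrays; default layout deals   */
    shared double b[N * THREADS];   /* elements out cyclically across threads */

    int main(void) {
        int i;

        /* the fourth clause is the affinity expression: iteration i runs on
           the thread with affinity to b[i], so each write below is local */
        upc_forall (i = 0; i < N * THREADS; i++; &b[i])
            b[i] = (double) i;

        /* split-phase barrier: announce arrival, overlap local work, then wait */
        upc_notify;
        /* ... purely local work can go here while other threads catch up ... */
        upc_wait;

        upc_forall (i = 0; i < N * THREADS; i++; &a[i])
            a[i] = 2.0 * b[i];      /* b[i] is complete and local to this thread */

        if (MYTHREAD == 0)
            printf("ran with %d threads\n", THREADS);
        return 0;
    }

Unlike an OpenMP parallel do, upc_forall has no implicit barrier at its end (slide 6), so the notify/wait pair is what guarantees every b[i] has been written before any thread reads it.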
Slide 8: HPF Model of Computation
- Goal is to generate a loosely synchronous program
  - the original target was distributed memory machines
- Explicit identification of parallel work
  - forall statement
- Extensions to FORTRAN
  - the forall statement has been added to the language
  - the rest of the HPF features are comments
    - any HPF program can be compiled serially
- Key feature: data distribution
  - how should data be allocated to nodes?
  - a critical question for distributed memory machines
  - turns out to be useful for SMPs too, since it defines locality

Slide 9: HPF Language Concepts
- Virtual processor
  - an abstraction of a CPU
  - can have one- and two-dimensional arrays of VPs
  - each VP maps to a physical processor
    - several VPs may map to the same processor
- Template
  - a virtual array (no data)
  - used to describe how real arrays are aligned with each other
  - templates are distributed onto virtual processors
- Align directives
  - express how data in different arrays should be aligned
  - use affine functions
    - e.g., align element I of array A with element I+3 of B

Slide 10: Distribution Options
- BLOCK
  - divide the data into N (one per VP) contiguous units
- CYCLIC
  - assign data in round-robin fashion to each processor
- BLOCK(n)
  - groups of n units of data are assigned to each processor
  - there must be exactly (array size)/n virtual processors
- CYCLIC(n)
  - n units of contiguous data are assigned round robin
  - CYCLIC is the same as CYCLIC(1)
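To make these mappings concrete, here is a small C sketch (plain C, not HPF, and not from the slides; the function names owner_block and owner_cyclic and the sizes are made up for illustration) that computes which processor owns element i of a one-dimensional array under BLOCK and CYCLIC(n) distributions.

    #include <stdio.h>

    /* BLOCK: the array is split into nprocs contiguous chunks */
    int owner_block(int i, int size, int nprocs) {
        int chunk = (size + nprocs - 1) / nprocs;   /* ceiling division */
        return i / chunk;
    }

    /* CYCLIC(n): groups of n consecutive elements are dealt out round robin;
       CYCLIC is CYCLIC(1) */
    int owner_cyclic(int i, int n, int nprocs) {
        return (i / n) % nprocs;
    }

    int main(void) {
        int size = 16, nprocs = 4, i;
        for (i = 0; i < size; i++)
            printf("i=%2d  BLOCK -> P%d   CYCLIC -> P%d   CYCLIC(2) -> P%d\n",
                   i,
                   owner_block(i, size, nprocs),
                   owner_cyclic(i, 1, nprocs),
                   owner_cyclic(i, 2, nprocs));
        return 0;
    }

With size = 16 and nprocs = 4, BLOCK gives P0 elements 0-3 through P3 elements 12-15, CYCLIC deals out single elements round robin, and CYCLIC(2) deals out pairs.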
Slide 11: Computation
- Where should the computation be performed?
- Goals:
  - do the computation near the data
    - non-local data requires communication
  - keep it simple
    - HPF compilers are already complex
- Compromise: "owner computes"
  - the computation is done on the node that contains the lhs of a statement
  - non-local data for the rhs operands is sent to that node as needed

Slide 12: Finding the Data to Use
- Easy case
  - the location of the data is known at compile time
- Challenging case
  - the location of the data is a known (invertible) function of input parameters such as array size
- Difficult case (irregular computation)
  - the data location is a function of the data
  - an indirection array is used to access data: A[index[I],j] = ...

Slide 13: Challenging Case
- Each processor can identify the data it must send/receive
  - use a pre-processing loop to identify the data to move

    for each local element I
        receive_list = global_to_proc(f(I))
        send_list    = global_to_proc(f^-1(I))
    send data in send_list and receive data in receive_list
    for each local rhs element I
        perform the computation

Slide 14: Irregular Computation
- The pre-processing step itself requires data to be sent
  - since we might need to access non-local index arrays
- Two possible cases
  - gather: a(I) = b(u(I))
    - pre-processing builds a receive list for each processor
    - the send list is known based on the data layout
  - scatter: a(u(I)) = b(I)
    - pre-processing builds a send list for each processor
    - the receive list is known based on the data layout
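Below is a rough C sketch of the pre-processing (inspector) step for the gather case a(I) = b(u(I)). It is not from the slides: the CYCLIC(1) layout, the owner() helper, the hard-coded index array, and the list-building arrays are assumptions for illustration, and the actual message exchange is only indicated by a comment.

    #include <stdio.h>

    #define N       16      /* global array length        */
    #define NPROCS   4      /* number of processors       */
    #define ME       1      /* pretend we are processor 1 */

    /* assumed CYCLIC(1) layout: element i of every array lives on i % NPROCS */
    static int owner(int i) { return i % NPROCS; }

    int main(void) {
        /* indirection array u for the gather a(I) = b(u(I)) */
        int u[N] = {3, 7, 2, 9, 0, 11, 5, 14, 8, 1, 6, 13, 4, 10, 15, 12};

        int recv_list[NPROCS][N];     /* remote b indices needed, per owner */
        int recv_count[NPROCS] = {0};
        int i, p;

        /* inspector: walk only the iterations whose lhs a[i] we own */
        for (i = 0; i < N; i++) {
            if (owner(i) != ME) continue;        /* owner-computes rule       */
            p = owner(u[i]);                     /* who holds b[u[i]]?        */
            if (p != ME)                         /* remote operand: record it */
                recv_list[p][recv_count[p]++] = u[i];
        }

        /* the matching send lists would be exchanged here; the executor loop
           then performs a[i] = b[u[i]] using the received values */
        for (p = 0; p < NPROCS; p++) {
            printf("from P%d receive %d element(s):", p, recv_count[p]);
            for (i = 0; i < recv_count[p]; i++)
                printf(" b[%d]", recv_list[p][i]);
            printf("\n");
        }
        return 0;
    }

As slide 14 notes, for a gather only the receive lists need this inspection; the matching send lists follow from the data layout, and the roles reverse for the scatter a(u(I)) = b(I).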

Slide 15: Communication Library
- How is it different from PVM?
  - the abstraction is based on distributed, but global, arrays
    - provides some support for index translation
    - PVM has local arrays
  - multicast is in one dimension of an array only
  - shifts and concatenation are provided
  - special ops for moving vectors of send/recv lists
    - precomp_read
    - postcomp_write
- Goals
  - written in terms of native message passing
  - tries to provide a single, portable abstraction to compile to

Slide 16: Performance Results
- How good are the speedup results?
  - only one application is shown
  - speedup is similar to hand-tuned message passing