CMSC 714 – F06 (lect 3)    copyright 2006 Jeffrey K. Hollingsworth

Announcements
- No class next Thursday
- Class will meet on Friday (9/15) in room 3450 AV Williams, 9:30–10:45

Seismic Code
- Given echo data, compute an under-sea map
- Computation model
  – designed for a collection of workstations
  – uses a variation of the RPC model
  – workers are given an independent trace to compute
    • requires little communication
    • supports load balancing (1,000 traces is typical)
- Performance
  – max MFLOPS = O((F · nz · B*)^(1/2))
  – F – single-processor MFLOPS
  – nz – linear dimension of the input array
  – B* – effective communication bandwidth
    • B* = B / (1 + B·L/w) ≈ B/7 for Ethernet (10 ms latency, w = 1400)
  – the real limit to performance was latency, not bandwidth

Database Applications
- Too much data to fit in memory (or sometimes even on disk)
  – data mining applications (K-Mart has a 4–5 TB database)
  – imaging applications (NASA has a site with 0.25 petabytes)
    • uses a forklift to load tapes by the pallet
- Sources of parallelism
  – within a large transaction
  – among multiple transactions
- Join operation
  – form a single table from two tables based on a common field
  – try to split the join attribute into disjoint buckets
    • if the data distribution is known to be uniform, this is easy
    • if not, try hashing

Speedup in Join Parallelism
- Book claims a speedup of 1/p² is possible
  – split each relation into p buckets
    • each bucket is a disjoint subset of the join attribute
  – each processor only has to consider N/p tuples per relation
    • join is O(N²), so each processor does O((N/p)²) work
    • so speedup is O(N²/p²) / O(N²) = O(1/p²)
- This is a lie!
  – could split into p buckets on one processor
    • time would then be O(p · (N/p)²) = O(N²/p)
    • so speedup is O(N²/p²) / O(N²/p) = O(1/p)
  – Amdahl's law is not violated
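The bucketed-join argument above can be sketched in C. This is a minimal sketch, not from the lecture: the mod-p hash and the toy integer relations are illustrative assumptions, and each bucket stands in for the work one processor would do.

```c
#include <stddef.h>

/* Hypothetical bucketing function: hash a join-attribute value into one
   of p disjoint buckets.  With a uniform key distribution each bucket
   holds ~N/p tuples, so per-bucket work is O((N/p)^2), not O(N^2). */
static int bucket_of(int key, int p) { return key % p; }

/* Count matching (r,s) pairs restricted to one bucket -- i.e. the
   nested-loop join work a single processor would perform. */
static int join_bucket(const int *r, size_t nr,
                       const int *s, size_t ns,
                       int p, int b) {
    int matches = 0;
    for (size_t i = 0; i < nr; i++) {
        if (bucket_of(r[i], p) != b) continue;
        for (size_t j = 0; j < ns; j++)
            if (bucket_of(s[j], p) == b && r[i] == s[j])
                matches++;
    }
    return matches;
}

/* Summing over all p buckets must reproduce the full O(N^2) join,
   because the buckets partition the join attribute disjointly. */
static int join_all_buckets(const int *r, size_t nr,
                            const int *s, size_t ns, int p) {
    int total = 0;
    for (int b = 0; b < p; b++)
        total += join_bucket(r, nr, s, ns, p, b);
    return total;
}
```

Note the slide's point shows up directly here: running all p calls to `join_bucket` on one processor already cuts the work to O(N²/p), so the honest parallel speedup over that baseline is only O(1/p).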
Parallel Search (TSP)
- May appear to be faster than 1/n
  – but this is not really the case either
- Algorithm
  – compute a path on a processor
    • if our path is shorter than the shortest one, send it to the others
    • stop searching a path when it becomes longer than the shortest
  – before computing the next path, check for word of a new minimum path
  – stop when all paths have been explored
- Why it appears to be faster than 1/n speedup
  – we found a path that was shorter sooner
  – however, the reason for this is a different search order!

Ensuring a Fair Speedup
- Tserial = faster of
  – best known serial algorithm
  – simulation of the parallel computation
    • use the parallel algorithm
    • run all processes on one processor
  – parallel algorithm run on one processor
- If it appears to be super-linear
  – check the memory hierarchy
    • increased cache or real memory may be the reason
  – verify the order of operations is the same in the parallel and serial cases

Quantitative Speedup
- Consider master-worker
  – one master and n worker processes
  – communication time increases as a linear function of n
- Derivation:
  – Tp = Tcomp_p + Tcomm_p
  – Tcomp_p = Ts / P
  – 1/Sp = Tp / Ts = 1/P + Tcomm_p / Ts
  – Tcomm_p = P · Tcomm_1
  – 1/Sp = 1/P + P · Tcomm_1 / Ts = 1/P + P / r1, where r1 = Ts / Tcomm_1
  – setting d(1/Sp)/dP = 0 gives Popt = r1^(1/2) and Sopt = 0.5 · r1^(1/2)
- For a hierarchy of masters
  – Tcomm_p = (1 + log P) · Tcomm_1
  – Popt = r1 and Sopt = r1 / (1 + log r1)

MPI
- Goals:
  – standardize previous message-passing systems: PVM, P4, NX
  – support copy-free message passing
  – portable to many platforms
- Features:
  – point-to-point messaging
  – group communications
  – profiling interface: every function has a name-shifted version
- Buffering
  – no guarantee that there are buffers
  – possible that send will block until receive is called
- Delivery order
  – two sends from the same process to the same destination
    will arrive in order
  – no guarantee of fairness between processes on receive

MPI Communicators
- Provide a named set of processes for communication
- All processes within a communicator can be named
  – numbered from 0 … n−1
- Allows libraries to be constructed
  – the application creates communicators
  – the library uses them
  – prevents problems with posting wildcard receives
    • adds a communicator scope to each receive
- All programs start with MPI_COMM_WORLD

Non-Blocking Functions
- Two parts
  – post the operation
  – wait for results
- Also includes a poll option
  – checks if the operation has finished
- Semantics
  – must not alter the buffer while the operation is pending

MPI Misc.
- MPI types
  – all messages are typed
    • base types are pre-defined: int, double, real, {,unsigned}{short, char, long}
    • can construct user-defined types
      – includes non-contiguous data types
- Processor topologies
  – allows construction of Cartesian & arbitrary graphs
  – may allow some systems to run faster
- What's not in MPI-1
  – process creation
  – I/O
  – one-sided communication

MPI Housekeeping Calls
- Include <mpi.h> in your program
- If using mpich, …
- First call MPI_Init(&argc, &argv)
- MPI_Comm_rank(MPI_COMM_WORLD, &myrank)
  – myrank is set to the id of this process
- MPI_Wtime
  – returns wall-clock time
- At the end, call MPI_Finalize()
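Putting the housekeeping calls together, a minimal MPI program might look like the sketch below. This is illustrative, not from the slides: the broadcast value and the reply pattern are assumptions, and it must be compiled with an MPI compiler wrapper (e.g. mpicc) and launched under mpirun.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int myrank, nprocs, value = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);                       /* always first        */
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);       /* my id: 0..n-1       */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);       /* total processes     */

    double t0 = MPI_Wtime();                      /* wall-clock start    */

    if (myrank == 0) value = 42;                  /* root picks a value  */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (myrank != 0) {
        /* each worker sends its rank back to the root */
        MPI_Send(&myrank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else {
        for (int i = 1; i < nprocs; i++) {
            int who;
            /* messages from one sender arrive in order; order BETWEEN
               senders is not guaranteed, hence MPI_ANY_SOURCE */
            MPI_Recv(&who, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            printf("got reply from rank %d\n", who);
        }
        printf("elapsed %.6f s\n", MPI_Wtime() - t0);
    }

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();                               /* always last         */
    return 0;
}
```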
MPI Communication Calls
- Parameters
  – var – a variable
  – num – number of elements in the variable to use
  – type – one of {MPI_INT, MPI_REAL, MPI_BYTE}
  – root – rank of the processor at the root of a collective operation
  – dest – rank of the destination processor
  – source – rank of the sending processor
  – status – variable of type MPI_Status
- Calls (all return a code – check it for MPI_SUCCESS)
  – MPI_Send(var, num, type, dest, tag, MPI_COMM_WORLD)
  – MPI_Recv(var, num, type, source, MPI_ANY_TAG, MPI_COMM_WORLD, &status)
  – MPI_Bcast(var, num, type, root, MPI_COMM_WORLD)
  – MPI_Barrier(MPI_COMM_WORLD)

PVM
- Provides a simple, free, portable parallel environment
- Runs on everything
  – parallel hardware: SMPs, MPPs, vector machines
  – networks of workstations: ATM, Ethernet
    • UNIX machines and PCs running Win*
  – works on a heterogeneous