CMSC 714, Lecture 6: High Performance Fortran (HPF)
Alan Sussman
CMSC 714, Fall05 - Alan Sussman & Jeffrey K. Hollingsworth

Notes
- Programming assignment
  - First part (sequential and either MPI or OpenMP) due next Thursday, Oct. 6
  - You should have received email saying which parallel version to do; if not, send email to Dr. Sussman
- Next class
  - Dr. Sussman will finish talking about Titanium, and take questions on Titanium and OpenMP vs. MPI
  - Nick Rutar will talk about the functional programming language Sisal

HPF Model of Computation
- Goal is to generate a loosely synchronous program
  - Main target was distributed memory machines
- Explicit identification of parallel work
  - The forall statement
- Extensions to Fortran 90
  - The forall statement has been added to the language
  - The rest of the HPF features are comments/pragmas
    - Any HPF program can be compiled serially
- Key feature: data distribution
  - How should data be allocated to nodes?
  - A critical question for distributed memory machines
  - Turns out to be useful for SMPs too, since it defines locality

HPF Language Concepts
- Virtual processor (VP)
  - An abstraction of a CPU
  - Can have one- and two-dimensional arrays of VPs
  - Each VP maps to a physical processor
    - Several VPs may map to the same processor
- Template
  - A virtual array (no data)
  - Used to describe how real arrays are aligned with each other
  - Templates are distributed onto virtual processors
- Align directives
  - Express how data in different arrays should be aligned
  - Use affine functions of array indexes
    - e.g., align element I of array A with element I+3 of B
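The alignment idea above (align element I of A with element I+3 of B) can be sketched in Python. This is an illustrative model, not HPF: the template size, VP count, BLOCK distribution of the template, and all function names are invented for the example.

```python
# Sketch (not HPF itself): how "ALIGN A(I) WITH B(I+3)" pins elements of A
# to the same template cells, and hence the same virtual processor, as B.
# Template size, VP count, and the BLOCK layout are illustrative assumptions.

TEMPLATE_SIZE = 16   # cells in the (data-less) template
NUM_VPS = 4          # virtual processors the template is distributed over

def template_owner(cell):
    """BLOCK distribution of template cells onto virtual processors."""
    block = TEMPLATE_SIZE // NUM_VPS
    return cell // block

def owner_of_B(i):
    """B(i) is aligned identically with template cell i."""
    return template_owner(i)

def owner_of_A(i):
    """ALIGN A(I) WITH B(I+3): A(i) lives wherever template cell i+3 lives."""
    return template_owner(i + 3)

# A(5) and B(8) share template cell 8, so they land on the same VP.
print(owner_of_A(5), owner_of_B(8))  # -> 2 2
```

The point of the indirection through the template is that changing the DISTRIBUTE directive on the template moves both arrays consistently, without touching the alignment.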
Distribution Options
- BLOCK
  - Divide data into N (one per VP) contiguous units
- CYCLIC
  - Assign data in round-robin fashion to each processor
- BLOCK(n)
  - Groups of n units of data are assigned to each processor
  - Must be at least (array size)/n virtual processors
- CYCLIC(n)
  - n units of contiguous data are assigned round-robin
  - CYCLIC is the same as CYCLIC(1)
- Each can be applied separately to each dimension of a multi-dimensional array

Computation
- Where should the computation be performed?
- Goals:
  - Do the computation near the data
    - Non-local data requires communication
  - Keep it simple
    - HPF compilers are already complex
- Compromise: "owner computes"
  - Computation is done on the node that contains the lhs of a statement
  - Non-local data for the rhs operands are sent to that node as needed, often before a forall loop starts

Finding the Data to Use
- Easy case
  - The location of the data is known at compile time
- Challenging case
  - The location of the data is a known (invertible) function of input parameters such as array size
- Difficult case (irregular computation)
  - Data location is a function of the data
  - An indirection array is used to access data: A[index[I],j] = ...

Challenging Case
- Each processor can identify its data to send/receive
  - Use a pre-processing loop to identify the data to move:

    for each local element I
        receive_list = global_to_proc(f(I))
        send_list = global_to_proc(f^-1(I))
    send data in send_list and receive data in receive_list
    for each local rhs element I
        perform the computation
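The pre-processing loop above can be sketched concretely. The snippet below is an illustrative Python model, not HPF runtime code: it assumes a CYCLIC(2) data distribution for global_to_proc, an invertible access function f(I) = I + 5 (mod N), and invented sizes.

```python
# Sketch of the pre-processing loop: each processor scans its local elements
# and builds send/receive lists from the (invertible) access function f and
# the data-to-processor mapping. All names and sizes are illustrative.

N = 24            # global array size (assumption)
P = 4             # number of processors (assumption)
SHIFT = 5         # f(I) = I + SHIFT mod N, so f^-1(I) = I - SHIFT mod N

def global_to_proc(i, n=2):
    """CYCLIC(n) owner: blocks of n elements dealt round-robin to P procs."""
    return (i // n) % P

def f(i):
    return (i + SHIFT) % N

def f_inv(i):
    return (i - SHIFT) % N

def build_lists(me):
    """Inspector: which remote elements processor `me` must receive, and
    which of its local elements it must send, for a(I) = b(f(I))."""
    my_elems = [i for i in range(N) if global_to_proc(i) == me]
    receive_list = {}   # element we read -> processor that owns it
    send_list = {}      # local element -> processor whose rhs reads it
    for i in my_elems:
        src = global_to_proc(f(i))      # owner of b(f(i)), which we read
        if src != me:
            receive_list[f(i)] = src
        dst = global_to_proc(f_inv(i))  # processor computing a(f^-1(i))
        if dst != me:
            send_list[i] = dst
    return receive_list, send_list

recv, send = build_lists(0)
```

Because every processor runs the same inspector, the send list built on one processor matches the receive list built on its partner, so all data motion can happen in a single communication phase before the computation loop runs.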
Irregular Computation
- The pre-processing step itself requires data to be sent/received
  - Since we might need to access non-local index arrays
- Two possible cases
  - Gather: a(I) = b(u(I))
    - Pre-processing builds a receive list for each processor
    - The send list is known from the data layout
  - Scatter: a(u(I)) = b(I)
    - Pre-processing builds a send list for each processor
    - The receive list is known from the data layout

Communication Library
- How is HPF different from PVM/MPI?
  - Abstraction based on distributed, but global, arrays
    - Provides some support for index translation
    - PVM/MPI has only local arrays
  - Multicast is in one dimension of an array only
  - Shifts and concatenation are provided
  - Special ops for moving vectors of send/recv lists in the library for the compiler to use
    - precomp_read
    - postcomp_write
- Goals
  - Written in terms of native message passing
  - Tries to provide a single portable abstraction to compile to

Performance Results
- How good are the speedup results?
  - Only one application shown
  - Speedup is similar to a hand-tuned message passing program
    - One extra log(n) communication operation decreases performance
  - How good is the hand-tuned program?
    - Speedup is only 6 on 16 processors
- What is Figure 4 showing?
  - Compares performance on two different machines
  - No explanation
    - Is it showing that brand X is better than brand Y?
    - Does it show that their compiler doesn't work on brand Y?
  - Lesson: figures should always tell a story
    - Don't require the reader to guess the story

HPF on the Earth Simulator

Earth Simulator - The Building
[photo slide]

Earth Simulator
[photo slide]

Earth Simulator - Processor
[photo slide]
Earth Simulator
[photo slide]

IMPACT-3D
- HPF code
  - Uses data distribution in one dimension
- Vector code
  - Uses the innermost array dimension
- Achieves 14.9 Tflops (45% of peak)
- Got 39% of peak using traditional HPF
  - 45 lines of directives
  - 1,334 lines of executable code
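The IMPACT-3D strategy above (distribute one dimension across processors, keep the innermost dimension contiguous for the vector units) can be mimicked in a rough NumPy sketch. The grid sizes, processor count, and smoothing stencil below are invented for illustration and are not from the real code.

```python
import numpy as np

# Illustrative sketch (sizes and stencil invented): distribute a 3D array
# along its first dimension, as IMPACT-3D's HPF code distributes one
# dimension, and update each local slab with operations over the whole
# innermost dimension, which stays contiguous and vectorizable.

NX, NY, NZ = 8, 8, 64   # small stand-in grid; real runs were far larger
P = 4                   # number of processors (assumption)

def local_slab(a, rank):
    """BLOCK distribution along the first (outermost) dimension."""
    chunk = NX // P
    return a[rank * chunk:(rank + 1) * chunk]

a = np.arange(NX * NY * NZ, dtype=float).reshape(NX, NY, NZ)

# Each "processor" smooths along the innermost dimension of its own slab:
# contiguous memory, so a vector unit (or NumPy) handles whole rows at once.
parts = []
for rank in range(P):
    slab = local_slab(a, rank).copy()
    slab[:, :, 1:-1] = 0.5 * (slab[:, :, :-2] + slab[:, :, 2:])
    parts.append(slab)

result = np.concatenate(parts, axis=0)
```

Note the design point this illustrates: because the stencil reads only along the innermost, undistributed dimension, every slab updates independently with no communication; a stencil along the distributed dimension would need boundary exchanges between neighboring processors.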