1An Integrated Runtime and Compile-Time Approach for Parallelizing Structured and Block Structured ApplicationsGagan Agrawal, Alan Sussman, Joel SaltzPresented by Sam AngiuoliOverview Using a combination of compile-time and run-time analysis to optimize memory access and load balancing Targeted for HPF applications on distributed memory machines where data distribution and access parameters are not known at compile time Compiler extensions insert calls to a library of routines that can optimize data access at runtime2Multigrid Multigrid applications Use a combination of course and fine meshes to accelerate computation via approximations Used for solving PDEs Mesh resolution and nesting can vary at runtime Data distribution may be necessary within a meshes and when propagating values between meshes (0,4)• (1,4)• (2,4)• (3,4)• (4,4)• (0,3)• (1,3)• (2,3)• (3,3)• (4,3)• (0,2)• (1,2)• (2,2)• (3,2)• (4,2)• (0,1)• (1,1)• (2,1)• (3,1)• (4,1)• (0,0)• (1,0)• (2,0)• (3,0)• (4,0)• (0,2) • (1,2) • (2,2) • (0,1) • (1,1) • (2,1) • (0,0) • (1,0) • (2,0) •Multiblock Multiblock applications Sets of meshes (blocks) with non-uniform structure Adjacent blocks may have different mesh sizes Outer loop performs time step Data distribution both within and between blocks Data distribution dependent on the object’s mesh3Runtime library optimizations Multiblock Parti library Dynamic data distribution Regular section moves Partitioning loops via symbolic loop bounds and stridesRuntime primitives Communication schedule Schedule describes data motion between processors Goal is to look at the data distribution at runtime and build a schedule that optimizes data communication Schedules can be reused for similar data distributions Data movement4Data distribution Regular section move Forall loops with array assignments where loop bounds and strides not known at compile time Schedule determines data elements sent and received by each processor Moving array elements between processors ie. From one distributed array to another Regular_section_copy_sched(…) Often only ghost/overlap cells between adjacent blocks/meshes need to copied Overlap_cell_fill_sched(…)Data distribution con’t Loop partitioning Æ handling symbolic loop bounds and strides Owner computes rule Loop iteration performed by process owning LHS array element Loop bound transformations Local_lower_bound()/Local_upper_bound() used to transform loop bounds Mapping for local indices Local_to_global/global_to_local5Compiler support Extensions to accommodate operations in the runtime library Extend processor abstraction to support subspaces PROCESSORS P(N) PSUBSPACE P1 IS P(UPPER:LOWER) Extend Align to specify border/ghost cells ALIGN A(i,j) WITH T(i:2:3,j:2:3)Communication patterns Methodology for analyzing forall loops and performing data moves when necessary Case I Array A,B aligned to different template No information about relationship Case II Array A,B aligned to same template A,B same size and shape Case III Array A,B aligned to same template Different loop bounds, strides6Communication patterns Look at loop bounds/strides and classify loops as follows Not requiring any communication Can be handled by filling overlap/ghost cells Requiring regular section moves7Overhead Cost of copying (I vs. II) < 5% Scheduling time is small Especially true on larger problems with regular access patternsCase studies Comparing compiler optimized with hand optimized codes Both using same runtime library routines Compiler optimized within 10-20% of hand optimized8More results Version I rebuilds schedule during each loop iteration Increased communication from distributing blocks over entire process
View Full Document