UMD CMSC 714 - An Integrated Runtime and Compile-Time Approach

An Integrated Runtime and Compile-Time Approach for Parallelizing Structured and Block Structured Applications
Gagan Agrawal, Alan Sussman, Joel Saltz
Presented by Sam Angiuoli

Overview
- Uses a combination of compile-time and run-time analysis to optimize memory access and load balancing
- Targeted at HPF applications on distributed memory machines where data distribution and access parameters are not known at compile time
- Compiler extensions insert calls to a library of routines that can optimize data access at runtime

Multigrid
- Multigrid applications use a combination of coarse and fine meshes to accelerate computation via approximations
- Used for solving PDEs
- Mesh resolution and nesting can vary at runtime
- Data distribution may be necessary within a mesh and when propagating values between meshes

[Figure: a 5x5 coarse mesh and a 3x3 fine mesh of labeled grid points, illustrating two mesh resolutions]

Multiblock
- Multiblock applications: sets of meshes (blocks) with non-uniform structure
- Adjacent blocks may have different mesh sizes
- An outer loop performs the time step
- Data distribution occurs both within and between blocks
- Data distribution depends on each object's mesh

Runtime library optimizations
- Multiblock Parti library
- Dynamic data distribution
- Regular section moves
- Partitioning loops with symbolic loop bounds and strides

Runtime primitives
- Communication schedule: a schedule describes data motion between processors
- Goal: examine the data distribution at runtime and build a schedule that optimizes data communication
- Schedules can be reused for similar data distributions
- Data movement

Data distribution
- Regular section move: forall loops with array assignments whose loop bounds and strides are not known at compile time
- The schedule determines the data elements sent and received by each processor
- Moves array elements between processors, i.e. from one distributed array to another: Regular_section_copy_sched(…)
- Often only ghost/overlap cells between adjacent blocks/meshes need to be copied: Overlap_cell_fill_sched(…)

Data distribution (cont'd)
- Loop partitioning: handling symbolic loop bounds and strides
- Owner-computes rule: each loop iteration is performed by the processor owning the LHS array element
- Loop bound transformations: Local_lower_bound()/Local_upper_bound() are used to transform loop bounds
- Mapping for local indices: Local_to_global()/Global_to_local()

Compiler support
- Extensions to accommodate operations in the runtime library
- Extend the processor abstraction to support subspaces:
    PROCESSORS P(N)
    PSUBSPACE P1 IS P(LOWER:UPPER)
- Extend ALIGN to specify border/ghost cells:
    ALIGN A(i,j) WITH T(i:2:3, j:2:3)

Communication patterns
- Methodology for analyzing forall loops and performing data moves when necessary
- Case I: arrays A and B aligned to different templates; no information about their relationship
- Case II: arrays A and B aligned to the same template; same size and shape
- Case III: arrays A and B aligned to the same template; different loop bounds and strides

Communication patterns (cont'd)
- Examine loop bounds/strides and classify loops as:
  - Not requiring any communication
  - Handled by filling overlap/ghost cells
  - Requiring regular section moves
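To make the schedule-based primitives from the Data distribution slides concrete, here is a minimal Fortran-style sketch of the inspector/executor structure the compiler would generate. Regular_section_copy_sched and Overlap_cell_fill_sched are the routine names given in the slides; the argument lists, the dist_A/dist_B distribution descriptors, and the data_move and local_sweep calls are assumptions made for illustration, not the actual Multiblock Parti interface.

    ! Sketch only: regular_section_copy_sched and overlap_cell_fill_sched
    ! are named in the slides; every argument list below is assumed.

    ! Build schedules once, after the runtime data distribution is known
    ! (lo, hi, stride are the symbolic section bounds of the forall loop).
    copy_sched = regular_section_copy_sched(dist_B, dist_A, lo, hi, stride)
    fill_sched = overlap_cell_fill_sched(dist_A, ghost_width)

    do step = 1, nsteps
       ! Reuse both schedules on every time step: the distribution is
       ! unchanged, so the schedule-building cost is paid only once.
       call data_move(B, A, copy_sched)   ! regular section move B -> A
       call data_move(A, A, fill_sched)   ! fill A's ghost/overlap cells
       call local_sweep(A)                ! purely local computation
    end do

The point of the structure is the hoisting: the expensive schedule construction sits outside the time-step loop, while the cheap data-move calls sit inside, which is why schedule reuse matters for the overhead numbers later in the talk.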
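Likewise, a minimal sketch of the owner-computes loop partitioning from the Data distribution (cont'd) slide, assuming Case II alignment (A and B aligned to the same template). Local_lower_bound, Local_upper_bound, and Global_to_local are the routines named in the slides; their signatures and the dist_A/dist_B descriptors are assumptions for illustration.

    ! Original global loop:   forall (i = l:u:s)  A(i) = B(i) + 1.0
    ! Transformed so that each processor runs only the iterations
    ! whose LHS element A(i) it owns (owner-computes rule).

    ! Clip the symbolic global bounds l, u (stride s) to this
    ! processor's owned section of A; signatures are assumed.
    my_l = local_lower_bound(dist_A, l, s)
    my_u = local_upper_bound(dist_A, u, s)

    do i = my_l, my_u, s
       ! Map the global index i into the local allocation.
       li = global_to_local(dist_A, i)
       ! Under Case II alignment, B(i) lives on the same processor,
       ! so the RHS reference needs no communication.
       A_local(li) = B_local(global_to_local(dist_B, i)) + 1.0
    end do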
II) < 5% Scheduling time is small  Especially true on larger problems with regular access patternsCase studies Comparing compiler optimized with hand optimized codes Both using same runtime library routines Compiler optimized within 10-20% of hand optimized8More results Version I rebuilds schedule during each loop iteration Increased communication from distributing blocks over entire process

