ME964 High Performance Computing for Engineering Applications
Parallel Programming Patterns
Discussion of Midterm Project
Oct. 21, 2008

Before we get started…
- Last time: Parallel programming patterns
  - The "Finding Concurrency Design Space"
  - The "Algorithm Structure Design Space"
- Today: Parallel programming patterns
  - Discuss the "Supporting Structures Design Space"
  - Discussion of the Midterm Project: assigned today, due on 11/18 or 11/09 (based on level of difficulty)
- Other issues: I will be in Milwaukee on Th; Mik will be the lecturer, tracing the evolution of hardware support for the graphics pipeline

Key Parallel Programming Steps
1) Find the concurrency in the problem
2) Structure the algorithm so that concurrency can be exploited
3) Implement the algorithm in a suitable programming environment
4) Execute and tune the performance of the code on a parallel system

What Comes Next?
- Focus on this (the "Supporting Structures" design space, per the slide figure) for a while

Before We Dive In…
- You hover in this design space when you ponder whether MPI, SPMD, or an event-driven simulation environment is what it takes to optimally design your algorithm
- We are not going to focus on this process too much, since it's clear that we are stuck with SPMD
- All that we'll be doing is comparing SPMD with other parallel programming paradigms to understand when SPMD is good and when it is bad
- Also, we will not cover the "Implementation Mechanisms" design space, since we don't need to wrestle with how to spawn and join threads, how to do synchronization, etc. (this would be Chapter 6 of the book)

Supporting Structures ~ Program and Data Models ~
- Program Models: SPMD, Master/Worker, Loop Parallelism, Fork/Join
- Data Models: Shared Data, Shared Queue, Distributed Array
- These are not necessarily mutually exclusive.

Program Structuring Patterns
- SPMD (Single Program, Multiple Data)
  - All PEs (Processing Elements) execute the same program in parallel
  - Each PE has its own data
  - Each PE uses a unique ID to access its portion of the data
  - Different PEs can follow different paths through the same code
  - This is essentially the CUDA grid model
- Master/Worker
- Loop Parallelism
- Fork/Join

Program Structuring Patterns (the rest of the pack)
- Master/Worker
  - A master thread sets up a pool of worker threads and a bag of tasks
  - Workers execute concurrently, removing tasks until done
- Loop Parallelism
  - Loop iterations execute in parallel
  - FORTRAN do-all (truly parallel), do-across (with dependence)
  - Very common in OpenMP
- Fork/Join
  - The most general way of creating threads (the POSIX standard)
  - Can be regarded as a very low-level approach in which you use the OS to manage parallelism

Review: Algorithm Structure
- Organize by Task: Linear → Task Parallelism; Recursive → Divide and Conquer
- Organize by Data: Linear → Geometric Decomposition; Recursive → Recursive Data
- Organize by Data Flow: Regular → Pipeline; Irregular → Event Driven

Algorithm Structures vs. Supporting Structures
(four smilies is the best; source: Mattson, et al.)

                 Task Parallel  Divide/Conquer  Geometric Decomp.  Recursive Data  Pipeline  Event-based
  SPMD           ☺☺☺☺          ☺☺☺            ☺☺☺☺              ☺☺             ☺☺☺      ☺☺
  Loop Parallel  ☺☺☺☺          ☺☺             ☺☺☺
  Master/Worker  ☺☺☺☺          ☺☺             ☺                  ☺              ☺        ☺
  Fork/Join      ☺☺            ☺☺☺☺           ☺☺                                ☺☺☺☺     ☺☺☺☺

Supporting Structures vs. Program Models

                 OpenMP  MPI   CUDA
  SPMD           ☺☺☺    ☺☺☺☺  ☺☺☺☺
  Loop Parallel  ☺☺☺☺   ☺
  Master/Slave   ☺☺     ☺☺☺
  Fork/Join      ☺☺☺

More on SPMD
- The dominant coding style of scalable parallel computing
  - MPI code is mostly developed in SPMD style
  - Much OpenMP code is also SPMD (next to loop parallelism)
- Particularly suitable for algorithms based on data parallelism, geometric decomposition, and divide and conquer
- Main advantage: tasks and their interactions are visible in one piece of source code; no need to correlate multiple sources
- SPMD is by far the most commonly used pattern for structuring parallel programs.
Typical SPMD Program Phases
- Initialize: establish localized data structures and communication channels
- Obtain a unique identifier: each thread acquires a unique identifier, typically in the range 0 to N-1, where N is the number of threads. Both OpenMP and CUDA have built-in support for this.
- Distribute data: decompose global data into chunks and localize them, or share/replicate the major data structures, using the thread ID to associate a subset of the data with each thread
- Run the core computation: more details on the next slide…
- Finalize: reconcile the global data structure, prepare for the next major iteration

Core Computation Phase
- Thread IDs are used to differentiate the behavior of threads:
  - Use the thread ID in loop index calculations to split loop iterations among threads
  - Use conditions based on the thread ID to branch to thread-specific actions
- The two approaches can have very different performance results and code complexity depending on how they are done.

A Simple Example
- Assume the computation being parallelized has 1,000,000 iterations.
- Sequential code:

  num_steps = 1000000;
  for (i = 0; i < num_steps; i++) { ... }

SPMD Code Version 1
- Assign a chunk of iterations to each thread
- The last thread also finishes up the remainder iterations

  num_steps = 1000000;
  i_start = my_id * (num_steps/num_threads);
  i_end =