Compiling with multicore
Jeehyung Lee
15-745, Spring 2009

Slide 2 - Papers
- "Automatic Thread Extraction with Decoupled Software Pipelining"
  - Fully automatic
  - Fine-grained pipelining
- "A Practical Approach to Exploiting Coarse-Grained Pipeline Parallelism in C Programs"
  - Semi-automatic
  - Coarse-grained pipelining

Slide 3 - First paper
"Automatic Thread Extraction with Decoupled Software Pipelining"
Guilherme Ottoni, Ram Rangan, Adam Stoler, and David August
Princeton University

Slide 4 - What is the paper about?
- Despite the increasing use of multiprocessors, many single-threaded applications do not benefit from them
- Let the compiler automatically extract threads and exploit the lurking pipeline parallelism
- Extract non-speculative, truly decoupled threads through Decoupled Software Pipelining (DSWP)

Slide 5 - Why decoupled pipelining?
- Example: linked-list traversal

Slide 6 - Why decoupled pipelining?
- DOACROSS: time ~ # iterations * (LD latency + communication latency)

Slide 7 - Why decoupled pipelining?
- DSWP: time ~ # iterations * LD latency
- One-way pipelining

Slide 8 - DSWP
- The flow of data (dependences) is acyclic among cores
- With an inter-core queue, the threads can be decoupled
- Result: efficiency + high tolerance for latency

Slide 9 - DSWP algorithm
1. Build the dependence graph
2. Find the strongly connected components (SCCs)
3. Create a DAG of the SCCs
4. Partition the DAG
5. Split the code into partitions
6. Add flows to the partitions

Slide 10 - Build dependence graph
- Include every traditional dependence (data, control, and memory) plus extensions

Slide 11 - Find SCCs
- SCC: instructions that form a dependence cycle in a loop
- Instructions within an SCC cannot be parallelized

Slide 12 - Create DAG of SCCs
- Merge the instructions within each SCC and update the dependence arrows

Slide 13 - Partition DAG
- Partition the DAG nodes into n partitions (n <= # of processors)
- Use a heuristic to maximize load balance:
  - Decide the number of partitions (threads)
  - Starting with partition 1, fill each partition with nodes from the top of the DAG; when the partition is full (estimated by # of cycles), move on to the next partition
  - Find the best number of threads and its partitioning

Slide 14 - Split code and insert flows (done!)
- For each partition, insert the code basic blocks relevant to its SCC nodes
- Add code for the dependence flows

Slide 15 - Result
- 19.4% speedup on important benchmark loops, 9.2% overall
- When core bandwidth is halved:
  - Single-threaded code slows down by 17.1%
  - DSWP code is still slightly faster than single-threaded code running on a full-bandwidth core
- A promising enabler for thread-level parallelism (TLP)?

Slide 16 - Second paper
"A Practical Approach to Exploiting Coarse-Grained Pipeline Parallelism in C Programs"
William Thies, Vikram Chandrasekhar, and Saman Amarasinghe
MIT

Slides 17-18 - What is the paper about?
- Despite the increasing use of multiprocessors, many single-threaded applications do not benefit (as before)
- Coarse-grained pipelining is more desirable, but it is especially hard with obfuscated C code
- Let the programmer define the stages, and learn the practical dependences at runtime... for streaming applications

Slide 19 - Interface
- Add annotations in the body of the top-level loop

Slide 20 - Dynamic analysis
- The system creates a stream graph according to the annotations
- How does it find the dependences?

Slide 21 - Dynamic analysis
- Streaming applications tend to have a fixed pattern of dataflow (a stable flow) among pipeline stages

Slide 22 - Dynamic analysis
- Run the application on training inputs, and record every relevant store-load pair that crosses a pipeline boundary
- This yields the practical dependences

Slide 23 - Interface
- The program shows the complete stream graph
- The user decides whether the pipelining is acceptable:
  - If yes, done!
  - Else, redo the annotations
- Iterate until satisfied

Slide 24 - Actual pipelining
- When compiled, the annotation macros emit code that forks the original program for each pipeline stage

Slide 25 - Result
- Average 2.78x speedup, max 3.89x, on 4 cores
- Seems unsound, but practical
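The decoupling idea from the DSWP slides (linked-list traversal in one thread, per-node work in another, connected by an inter-core queue) can be illustrated with a small sketch. This is only an illustration of the structure, not the paper's implementation: it uses Python threads and a bounded `queue.Queue` as a stand-in for the hardware inter-core queue, so under the GIL it shows the decoupling, not real parallel speedup. The node layout (dicts with `value`/`next`) and the squaring "work" are hypothetical.

```python
import queue
import threading

def traverse(head, q):
    """Producer stage: the traversal SCC (node = node->next) stays here,
    so the pointer-chasing LD latency is confined to this thread."""
    node = head
    while node is not None:
        q.put(node["value"])   # forward the loaded value downstream
        node = node["next"]
    q.put(None)                # sentinel: end of list

def work(q, results):
    """Consumer stage: per-node computation, decoupled from traversal."""
    while True:
        v = q.get()
        if v is None:
            break
        results.append(v * v)  # hypothetical per-node work

def dswp_traversal(head):
    # A bounded queue buffers items, giving the pipeline tolerance
    # for latency spikes in either stage.
    q = queue.Queue(maxsize=32)
    results = []
    producer = threading.Thread(target=traverse, args=(head, q))
    consumer = threading.Thread(target=work, args=(q, results))
    producer.start(); consumer.start()
    producer.join(); consumer.join()
    return results
```

Because data flows only one way (traversal to work), neither thread ever waits on a round trip, which is the contrast the slides draw against DOACROSS.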
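The DSWP algorithm steps (build dependence graph, find SCCs, condense to a DAG, partition) can be sketched as follows. This is a simplified reconstruction, not the paper's compiler pass: the dependence graph is assumed to be given as a dict of successor sets, Tarjan's algorithm is one standard way to find SCCs (the paper does not mandate it), and the greedy topological fill with a per-thread cycle budget merely stands in for the paper's load-balancing heuristic. All names (`find_sccs`, `condense`, `partition`, `cost`) are hypothetical.

```python
from collections import defaultdict

def find_sccs(graph):
    """Tarjan's algorithm: return the SCCs of a dependence graph
    given as {node: set(successor nodes)}."""
    index, low = {}, {}
    stack, on_stack, sccs = [], set(), []
    counter = [0]

    def visit(v):
        index[v] = low[v] = counter[0]; counter[0] += 1
        stack.append(v); on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:       # v is the root of an SCC
            scc = set()
            while True:
                w = stack.pop(); on_stack.discard(w)
                scc.add(w)
                if w == v:
                    break
            sccs.append(frozenset(scc))

    for v in graph:
        if v not in index:
            visit(v)
    return sccs

def condense(graph, sccs):
    """Merge each SCC into a single DAG node; inter-SCC dependences
    become DAG edges (the cycles are now all inside the nodes)."""
    owner = {v: scc for scc in sccs for v in scc}
    dag = defaultdict(set)
    for v, succs in graph.items():
        for w in succs:
            if owner[v] is not owner[w]:
                dag[owner[v]].add(owner[w])
    return dag

def partition(sccs, dag, n_threads, cost):
    """Fill partitions with DAG nodes in topological order, moving on
    once the per-thread cycle budget is exceeded.  Topological order
    guarantees the inter-partition flow of data stays acyclic."""
    indeg = {s: 0 for s in sccs}
    for s, succs in dag.items():
        for t in succs:
            indeg[t] += 1
    ready = [s for s in sccs if indeg[s] == 0]
    order = []
    while ready:
        s = ready.pop()
        order.append(s)
        for t in dag.get(s, ()):
            indeg[t] -= 1
            if indeg[t] == 0:
                ready.append(t)
    budget = sum(cost[s] for s in sccs) / n_threads
    parts, cur, filled = [], [], 0
    for s in order:
        if filled >= budget and len(parts) < n_threads - 1:
            parts.append(cur); cur, filled = [], 0
        cur.append(s); filled += cost[s]
    parts.append(cur)
    return parts
```

Because every node's predecessors land in the same or an earlier partition, the resulting stages need only one-way queues between them, matching the acyclic-flow property the DSWP slides emphasize.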