CMU CS 15745 - Compiling with multicore


Compiling with multicore
Jeehyung Lee
15-745 Spring 2009

Papers
• Automatic Thread Extraction with Decoupled Software Pipelining
  • Fully automatic
  • Fine-grained pipelining
• A Practical Approach to Exploring Coarse-Grained Pipeline Parallelism in C Programs
  • Semi-automatic
  • Coarse-grained pipelining

First paper
• Automatic Thread Extraction with Decoupled Software Pipelining
• Guilherme Ottoni, Ram Rangan, Adam Stoler and David August
• From Princeton University

What is the paper about?
• Despite the increasing use of multiprocessors, many single-threaded applications do not benefit
• Let the compiler automatically extract threads and exploit lurking pipeline parallelism
• Extract non-speculative and truly decoupled threads through Decoupled Software Pipelining (DSWP)

Why decoupled pipelining?
• Example: linked list traversal
• DOACROSS: Iteration * (LD latency + communication latency)
• DSWP: Iteration * LD latency
  • One-way pipelining

DSWP
• Flow of data (dependency) is acyclic among cores
• With the use of an inter-core queue, threads can be decoupled
• Efficiency + high tolerance for latency

DSWP Algorithm
1. Build dependence graph
2. Find strongly connected components (SCCs)
3. Create DAG of SCCs
4. Partition DAG
5. Split code into partitions
6. Add flows to partitions

Build dependence graph
• Include every traditional dependence (data, control, and memory) & extensions

Find SCC
• SCC: instructions that form a dependency cycle in a loop
• Instructions in an SCC cannot be parallelized

Create DAG of SCCs
• Merge instructions within each SCC and update dependency arrows

Partition DAG
• Partition DAG nodes into n partitions (n <= # of processors)
• Use a heuristic to maximize load balance
  • Decide # of partitions (threads)
  • Start filling in from partition 1 with nodes from the top of the DAG. When the partition is full (estimated by # of cycles), move on to the next partition
  • Find the best # of threads and its partition

Split code and insert flows (done!)
• For each partition, insert the basic blocks relevant to the SCC nodes it contains
• Add in code for dependency flow

Result
• 19.4% speedup on important benchmark loops, 9.2% overall
• When core bandwidth is halved
  • Single-threaded code slows down by 17.1%
  • DSWP code is still slightly faster than single-threaded code running on a full-bandwidth core
• Promising enabler for Thread-Level Parallelism (TLP)?

Second paper
• A Practical Approach to Exploring Coarse-Grained Pipeline Parallelism in C Programs
• William Thies, Vikram Chandrasekhar and Saman Amarasinghe
• From MIT

What is the paper about?
• Despite the increasing use of multiprocessors, many single-threaded… (repeated)
• Coarse-grained pipelining is more desirable, but is especially hard with obfuscated C code
• Let people define pipeline stages, and learn practical dependencies at runtime
  • …for streaming applications

Interface
• Add annotations in the body of the top loop

Dynamic analysis
• The system creates a stream graph according to the annotations
• How do they find dependencies?
• Streaming applications tend to have a fixed pattern of dataflow (stable flow) among pipeline stages
• Run the application on training examples, and record every relevant store-load pair across pipeline boundaries
  • This gives us practical dependencies

Interface (continued)
• The program shows a complete stream graph
• The user decides whether he/she likes this pipelining or not
  • If yes, done!
  • Else, redo annotations. Iterate until satisfied

Actual pipelining
• When compiled, the annotation macros emit code that forks the original program for each pipeline stage

Result
• Average 2.78x speedup, max 3.89x on 4 cores
• Seems unsound but practical
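
The first four DSWP algorithm steps (dependence graph, SCCs, DAG of SCCs, partitioning) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's implementation: the instruction names, the cycle estimates in `cost`, and the helper names `find_sccs` and `partition_dag` are all hypothetical, and the partitioner is a simple greedy fill in the spirit of the heuristic described in the slides.

```python
# Toy dependence graph for a linked-list loop (hypothetical instruction names).
# An edge u -> v means "v depends on u".
graph = {
    "ld_next": ["ld_next", "test", "ld_val"],  # p = p->next feeds itself and others
    "test": ["ld_next"],                       # loop-exit test controls next iteration
    "ld_val": ["add"],                         # v = p->val feeds the accumulation
    "add": ["add"],                            # sum += v depends on the previous sum
}
cost = {"ld_next": 3, "test": 1, "ld_val": 3, "add": 1}  # assumed cycle estimates


def find_sccs(graph):
    """Tarjan's algorithm. Returns SCCs in reverse topological order."""
    index, low, stack, on_stack, sccs = {}, {}, [], set(), []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, []):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:  # v is the root of an SCC: pop its members
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == v:
                    break
            sccs.append(sorted(comp))

    for v in graph:
        if v not in index:
            visit(v)
    return sccs


def partition_dag(sccs_topo, cost, n_threads):
    """Greedily fill partitions, walking the SCC DAG from the top."""
    total = sum(cost[i] for scc in sccs_topo for i in scc)
    budget = total / n_threads  # target cycles per partition
    parts = [[] for _ in range(n_threads)]
    p, used = 0, 0
    for scc in sccs_topo:
        c = sum(cost[i] for i in scc)
        # Move on once the current partition would exceed its budget.
        if used > 0 and used + c > budget and p < n_threads - 1:
            p, used = p + 1, 0
        parts[p].append(scc)
        used += c
    return parts


sccs_topo = list(reversed(find_sccs(graph)))  # sources of the DAG first
print(partition_dag(sccs_topo, cost, 2))
```

Because SCCs are assigned in topological order to non-decreasing partition numbers, every cross-partition dependence points from an earlier partition to a later one, which is the acyclic flow among cores that DSWP relies on.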

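The decoupling idea for the linked list traversal example can be shown with two threads and a queue. This is a minimal sketch, not the code DSWP generates: the `Node` class, the stage functions, and the per-node work (squaring the value) are assumptions made for illustration, and Python's `queue.Queue` stands in for the inter-core queue. Python threads will not show a real speedup here; the point is only the one-way flow of data.

```python
import queue
import threading


class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next


def build_list(values):
    """Build a singly linked list holding the given values in order."""
    head = None
    for v in reversed(values):
        head = Node(v, head)
    return head


_DONE = object()  # sentinel that marks the end of the stream


def traverse_stage(head, out_q):
    """Stage 1: pointer chasing only; never waits on stage 2's results."""
    node = head
    while node is not None:
        out_q.put(node.value)
        node = node.next
    out_q.put(_DONE)


def work_stage(in_q, result):
    """Stage 2: consumes values and does the per-node work."""
    total = 0
    while True:
        item = in_q.get()
        if item is _DONE:
            break
        total += item * item
    result.append(total)


def run_pipeline(values, queue_size=32):
    """Run both stages concurrently, connected by a bounded queue."""
    q = queue.Queue(maxsize=queue_size)
    result = []
    t1 = threading.Thread(target=traverse_stage, args=(build_list(values), q))
    t2 = threading.Thread(target=work_stage, args=(q, result))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return result[0]


print(run_pipeline([1, 2, 3, 4]))  # sum of squares: 30
```

Data only ever moves from the traversal stage to the work stage, so communication latency affects the queue's fill level rather than the critical path of the traversal.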

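The dynamic analysis in the second paper records store-load pairs that cross pipeline boundaries during training runs. A rough sketch of that bookkeeping follows. It is not the authors' tool: the `DependenceTracer` class, the stage names, and the symbolic addresses are hypothetical, and a real system would instrument actual memory accesses rather than take explicit calls.

```python
class DependenceTracer:
    """Records which stage last stored to each address and which stage loads it."""

    def __init__(self):
        self.last_writer = {}  # address -> stage that last stored to it
        self.edges = set()     # (producer stage, consumer stage, address)

    def store(self, stage, addr):
        self.last_writer[addr] = stage

    def load(self, stage, addr):
        writer = self.last_writer.get(addr)
        # Only pairs that cross a stage boundary are dependencies of interest.
        if writer is not None and writer != stage:
            self.edges.add((writer, stage, addr))

    def stream_graph(self):
        """Stage-to-stage edges observed during the training run."""
        return sorted({(src, dst) for src, dst, _ in self.edges})


def trace_training_run(n_iterations):
    """Simulate a three-stage streaming loop (assumed stages and buffers)."""
    tracer = DependenceTracer()
    for _ in range(n_iterations):
        tracer.store("read", "buf")      # stage 1 fills an input buffer
        tracer.load("decode", "buf")     # stage 2 consumes it ...
        tracer.store("decode", "frame")  # ... and produces a frame
        tracer.load("write", "frame")    # stage 3 consumes the frame
    return tracer


print(trace_training_run(3).stream_graph())
# [('decode', 'write'), ('read', 'decode')]
```

The result reflects only what the training inputs exercised, which is why the slides call the approach unsound but practical: a dependence that never shows up during training is simply not recorded.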