UW-Madison CS 838 - Profiling and Parallelization of the Multifacet


Outline
Slide 1: CS 838: Pervasive Parallelism, Profiling and Parallelization of the Multifacet GEMS Simulation Infrastructure
Slide 2: Problem
Slide 3: More Motivation – Why Parallelize?
Slide 4
Slide 5: Summary
Slide 6
Slide 7: What Next?
Slide 8, Slide 9, Slide 10, Slide 11, Slide 12, Slide 13, Slide 14
Slide 15: Finding Parallelism
Slide 16: Experiment 1: Ruby’s Event Queue
Slide 17: Results 1 – Ruby’s Event Queue
Slide 18: Experiment 2 – Opal’s Per-Processor Parallelism
Slide 19, Slide 20
Slide 21: Experiment 3 – Simics API Calls
Slide 22
Slide 23: Intermediate Conclusions
Slide 24: Experiment 4 – “NULL” Modules
Slide 25, Slide 26, Slide 27, Slide 28, Slide 29
Slide 30: “NULL” Module Observations
Slide 31: Parallelizing Ruby
Slide 32, Slide 33, Slide 34
Slide 35: Closing Remarks
Slide 36: The End

Slide 1
CS 838: Pervasive Parallelism
Profiling and Parallelization of the Multifacet GEMS Simulation Infrastructure
Jake [email protected] [email protected]:Mark D. Hill
CS 838 (C) 2005

Slide 2: Problem
• Simulation is (really) slow!
  – Simics alone runs at ~5 MIPS (fast!)
  – Add Ruby: ~50 KIPS
  – Add Opal: ~20 KIPS
• Fast simulations lead to faster evaluation of new ideas.
  – Running many simulations in parallel (via Condor, for instance) is great for shrinking error bars, less useful for development.
• Fast simulations useful for educational purposes
  – Remember how long it took to simulate HW 5, HW 6?
    » Simulations of long-running commercial workloads can take hours or DAYS, even on top-of-the-line hardware

Slide 3: More Motivation – Why Parallelize?
Chips currently look like this:
[Figure: Dual-Core AMD Opteron die photo. Labels: "A couple of cores", "Memory & I/O Control", "On-Chip Cache". From: Microprocessor Report: Best Servers of 2004]

Slide 4: More Motivation – Why Parallelize?
Soon, chips may look like this:
[Figure: eight blocks labeled CORE, an Interconnect, and four blocks labeled $ BANK. Annotations: "More cores!", "Many more threads"]
The free lunch is over: To get speedup out of multithreaded processors, programmers must implement parallel programs. (for now)

Slide 5: Summary
• Good News: Found parallelism in GEMS
  – Ruby’s event queue often contains independent events
  – Opal has some implicit parallelism, as it simulates many logically independent processors
• Bad News: Speedup potential is limited
  – In most cases, execution within Simics dominates execution time
  – Amdahl’s Law suggests parallelization of GEMS will yield small increases in performance
• Good News: Discovered inefficiencies
  – The way GEMS uses Simics greatly affects Simics
  – Isolated troublesome API calls and stalled processor effects
• Bad News: Simics isn’t very thread-friendly
  – No thread-safe functionality
  – Calling Simics API requires a (costly) thread switch!

Slide 6: Summary
• More Bad News: Parallelization of Ruby was not (entirely) successful
  – Demonstrated little/no performance gain
  – Suffers from deadlock
    » We have a good excuse for this…
  – Nondeterministic
    » Fixable, minor effect
  – Assumptions of non-concurrent execution
    » Ready()/Operate() pairs

Slide 7: What Next?
• Overview of Simics/Ruby/Opal
  – Lengthy example
• Profiling Experiments
  – Description of profiling experiments
  – Results
• Effects Ruby / Opal have on Simics
  – “Null” module experiments
• Parallel Ruby
  – …and its catastrophic failure
• Observations
• Conclusions

Slide 8: Simics / Ruby / Opal Overview - 1
[Figure: GEMS block diagram. Recoverable labels: "Detailed Processor Model", "Opal", "Simics", "Microbenchmarks", "Random Tester", "Deterministic", "Contended locks", "Trace file", "Simics loadable modules", E1 to E5. Graphics borrowed from GEMS ISCA-32 Tutorial Presentation, www.cs.wisc.edu/gems]

Slides 9 to 14: Opal + Ruby + Simics Operation
[Figure, repeated on each slide: "Detailed Processor Model", "Opal", "Simics", E1 to E5. Graphics borrowed from GEMS ISCA-32 Tutorial Presentation, www.cs.wisc.edu/gems]
The same instruction sequence appears on each of these slides:

    loop:     add R2 R2 R3
              beqz R2 loop
              add R1 R2 R3
              sub R7 R8 R9
              ld R2 A
              ld R8 B
              beq R2 R8 eq
              call my_func1
    eq:       ld R2 A
              beq R2 R4 eq
              call my_func2
    my_func1: ld R8 C
              ret

Per-slide annotations (the layout of the stage markers did not survive extraction):
• Slide 9: "Install Module" (twice), "Start Sim", "Instruction Fetches", "I-Fetch Complete" (twice). Markers shown: F
• Slide 10: "Instruction Fetches", "API Calls for Decoding". Markers shown: D, F
• Slide 11: "Step 1 Instr." Markers shown: X, S, D, F, C, W
• Slide 12: "Step 3 Instrs.", "ld A", "ld B". Markers shown: C, M, S, W, X
• Slide 13: "ld C", "A=1", "B=1". Markers shown: S, W, M
• Slide 14: "Step 4 Instrs.", "I-Fetch call". Markers shown: C, X, F
Simple, Right?

Slide 15: Finding Parallelism
• Lots of parallelism opportunities in the example!
  – Ruby/Opal (as described) could be run by separate threads!
  – Ruby is a discrete event simulator…
    » Can we apply Fujimoto’s PDES strategies directly?
• Places we found parallelism:
  – Ruby’s Event Queue (Experiment 1)
  – Opal in general, on a per-processor basis (Experiment 2)
  – Modular structure (not explored)
• But how much speedup can we gain through parallelism?

Slide 16: Experiment 1: Ruby’s Event Queue
• Ruby is already a discrete event simulator (DES)
  – Making it a parallel DES (PDES) à la Fujimoto might be a way to speed things up!
  – Already has implicit lookahead of 1, due to existing event scheduling constraints.
• How many events are available for processing in a given cycle of the event
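
The question raised in Experiment 1, how many events the queue holds for a given cycle, can be illustrated with a small sketch. This is a toy discrete-event queue written in Python for illustration only; it is not Ruby's implementation, and the names `EventQueue`, `schedule`, `run`, and `demo` are hypothetical. It assumes a lookahead of 1, meaning an event can only schedule new events for a later cycle, so every event belonging to the current cycle is already in the queue when that cycle begins. The per-cycle count is only an upper bound on usable parallelism, since the sketch does not check whether same-cycle events are actually independent.

```python
import heapq
from collections import Counter


class EventQueue:
    """Toy discrete-event queue (hypothetical; not Ruby's actual code)."""

    def __init__(self):
        self._heap = []  # entries: (cycle, sequence number, callback)
        self._seq = 0    # tie-breaker: keeps same-cycle events in FIFO order
        self.now = 0     # current simulated cycle

    def schedule(self, delay, callback):
        # Lookahead of 1: events may only be scheduled for a later cycle.
        if delay < 1:
            raise ValueError("delay must be at least 1 cycle")
        heapq.heappush(self._heap, (self.now + delay, self._seq, callback))
        self._seq += 1

    def run(self):
        """Process all events; return {cycle: events available that cycle}."""
        per_cycle = Counter()
        while self._heap:
            self.now = self._heap[0][0]
            # Nothing processed in this cycle can add work to this cycle,
            # so the whole batch can be collected before running any of it.
            batch = []
            while self._heap and self._heap[0][0] == self.now:
                batch.append(heapq.heappop(self._heap))
            per_cycle[self.now] = len(batch)
            for _, _, callback in batch:
                callback(self)
        return dict(per_cycle)


def demo():
    queue = EventQueue()

    def leaf(q):
        pass  # an event that schedules nothing further

    def fan_out(q):
        # Each fan_out event schedules two follow-up events one cycle later.
        q.schedule(1, leaf)
        q.schedule(1, leaf)

    for _ in range(3):
        queue.schedule(1, fan_out)  # three events in cycle 1
    queue.schedule(5, leaf)         # one isolated event in cycle 5
    return queue.run()


if __name__ == "__main__":
    print(demo())
```

Collecting the whole batch before running any callback is what the lookahead assumption buys: in a parallel variant, the events in one batch are the candidates that could be handed to separate threads.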

