UMD CMSC 714 - Impact of Workload and System Parameters on Next Generation Cluster Scheduling Mechanisms - D2687400

School: University of Maryland, College Park

Course: CMSC 714 - High Performance Computing Systems

Impact of Workload and System Parameters on Next Generation Cluster Scheduling Mechanisms
Yanyong Zhang, Anand Sivasubramaniam, José Moreira, and Hubertus Franke
Presented by Gary Jackson

Introduction
• Scheduling is important for communication
• After simulating some scheduling schemes
  • Periodic Boost (PB) is the best
  • Certain PB heuristics are better than others

Need
• Chaos on clusters does not work well
• Basic problem: waiting on a message from an unscheduled process
• Thus, scheduling and communication need to be considered together

Existing Schemes: Space Sharing
• Wasted time on unscheduled nodes
• Wasted time if there’s a lot of I/O
[Figure: space-sharing schedule of jobs j0, j1, and j2; axes: Time, Nodes]

Existing Schemes: Coscheduling (GS)
• Not robust for node failure
• Requires good synchronization
• Long time quanta
• Poor performance for many classes of jobs
• Wasted time waiting for I/O
[Figure: gang-scheduling time slices for jobs j0, j1, j2, and j3; axes: Time, Nodes]

Solution: Dynamic Coscheduling
• Approximating coscheduled execution
• Based on messaging events
• In this paper, relies on User Level Networking (ULN)
• Have already demonstrated PB and Spin Yield (SY) on an existing cluster

User Level Networking (ULN)
• Send:
  • Process enqueues message
  • NIC polls memory, dequeues and sends
• Receive:
  • Process busy-waits on receive queue
  • NIC enqueues received message
  • Process dequeues message

Why is this good?
• Normally, processes block when reading messages from a network
• This causes a context switch
• In ULN, the busy-wait receive means less frequent context switches
• But nothing is getting done

Scheduling Strategies
• Multiprogramming Level (MPL): number of simultaneous processes on a node
• Local Scheduling: unmodified local scheduler
• Gang Scheduling (GS): conventional coscheduling

When waiting...
• Spin Block (SB)
  • Spin for a fixed amount of time
  • Then block
• Spin Yield (SY)
  • Spin for a fixed amount of time
  • Then lower own priority

When a message arrives...
• Demand-Based Coscheduling (DCS)
  • NIC checks the running process periodically
  • On message arrival
    • If the recipient is not running
    • Interrupt the processor to raise the priority of the recipient

When a message arrives...
• Periodic Boost (PB)
  • Kernel checks arrival queues periodically
  • Uses a heuristic to raise the priority of receiving processes
  • Heuristic used here:
    • Boost receiving processes that are waiting on a message

Scheme Summary

                                     How do you wait for a message?
What do you do on message arrival? | Busy Wait | Spin Block | Spin Yield
-----------------------------------+-----------+------------+-----------
No Reschedule                      | Local     | SB         | SY
Interrupt & Reschedule             | DCS       | DCS-SB     | DCS-SY
Periodically Reschedule            | PB        | PB-SB      | PB-SY

Experimental Platform
• Sophisticated simulator
• Workloads
  • Statistical models based on real-world observation
  • Eight different workloads
  • Four different communication patterns
• Twelve different parameters for experiments

Metrics
• Response = Wait + Execution time
• Slowdown
• Utilization
• Fairness
• Performance Profile

Experiment: Load
• Test with a fixed frequency and nominal length of job
• Measure average response time and utilization
• Conclusion: PB gets better response time and utilization than the others

Experiment: Load
• What does this graph mean? There are two dependent axes.
• What do they mean by “saturation”?

Experiment: Nature of Workload
• Test on different workload types
• Measure response time
• Conclusions
  • PB is superior for communication
  • PB, SB, PB-SY, and PB-SB were otherwise indistinguishable
  • Local and SY were universally bad

Experiment: Multiprogramming Level
• Test the simulator with three different workloads and MPL=2, 5, and 16
• Measure response time and performance profile
• Conclusion: PB is better at lower MPL, and is about the same as SB at higher MPL

Experiment: Skewness
• Skewness: variation introduced into the computation and I/O phases to cause the programs to be out of synchronization
• Test with 20% skewness and 150% skewness and an even workload
• Measure response time and performance profile

Experiment: System Overhead
• Test with three sets of overhead costs
• Measure response time
• Conclusion: blocking based schemes reacted the best to shortened overhead costs

Experiment: Communication Patterns
• Test with four different communication patterns
• Measure response time
• Conclusion: GS is best for AA, PB is best overall
[Figure panels: Nearest Neighbor (NN), All-to-All, Linear, Tree]

Experiment: Number of Nodes
• Test with 16 and 32 nodes
• Job sizes remain the same
• Measure response time and performance profile
• Conclusion: Idle times go up in the blocking schemes, indicating limited scalability
• Also, partitioning doesn’t do any good

Experiment: Fairness
• Test mix of CPU, I/O, and communication intensive jobs
• Measure slowdown for each individual job type
• Conclusion: GS is the best, local is the worst, and blocking schemes do better than spinning schemes (no surprises)

Experiment Summary

Experiment            | Winners                           | Losers
----------------------+-----------------------------------+------------------------
Load                  | PB                                | Local
Nature of Workload    | PB                                | Local, SY, DCS variants
MPL                   | PB lower, SB higher               | Local, DCS, SY
Skewness              | PB and SB variants                |
System Overheads      | SB and PB-SB benefited from lower |
Communication Pattern | GS for AA, PB overall             | Local, SY, DCS variants
Number of Nodes       |                                   | SB variants
Fairness              | GS                                | Local

Performance Boost Heuristics
• Decisions based on process states
• Can be divided along
  • What the process is doing
  • Whether the process has unconsumed messages

                      Phase
Unconsumed Messages | Compute/Send | Receive
--------------------+--------------+--------
No                  | S1           | S4
Yes                 | S2           | S3

Performance Boost Heuristics
A: S3 → {S2, S1}*
B: S3 → S2 → S1
C: {S3, S2, S1}
D: {S3, S2} → S1
E: S2 → S3 → S1
* Brackets mean equal round-robin consideration

Tests
• Tested all five with three different workloads
  • CPU intensive, balanced, communication intensive
• Measured response time, conditions causing boost, position of boosted process

Tests
• Conclusion: D performs the best
• B fails in communication due to lower likelihood of coscheduling
• D is more fair than E

More Fairness
• Added fair-share policy to all heuristics
• Compared heuristics with and without fair-share
• Conclusion
  • Fair sharing makes A’, B’, C’, and D’ more fair
  • Slight improvement in response time

Conclusion
• ULN requires changes in scheduling schemes
  • As evidenced by local scheduler performance
• GS as tested doesn’t do as well as dynamic coscheduling
• PB is the best until skewness is large, then PB-SB might be better
• Heuristic D’ is the best PB heuristic
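The process-state table (S1 through S4) and the five boost heuristics (A through E) from the slides can be illustrated with a small sketch. This is not the simulator or kernel code from the paper. It assumes that each heuristic is read as a priority order from left to right, that states in one bracket share a priority level and are served round-robin, and that S4 is never boosted because it appears in none of the heuristics. The names `classify`, `HEURISTICS`, `pick_boost`, and `last_boosted` are hypothetical and exist only for this illustration.

```python
def classify(phase, has_unconsumed):
    """Map a process to a state S1..S4 following the slide's state table.

    phase: "receive" or anything else (treated as compute/send).
    has_unconsumed: True if messages are waiting in the process's queue.
    """
    if phase == "receive":
        return "S3" if has_unconsumed else "S4"
    return "S2" if has_unconsumed else "S1"


# Each heuristic is a list of priority groups, highest priority first.
# States within one group are treated as equal (assumed reading of the braces).
HEURISTICS = {
    "A": [{"S3"}, {"S2", "S1"}],
    "B": [{"S3"}, {"S2"}, {"S1"}],
    "C": [{"S3", "S2", "S1"}],
    "D": [{"S3", "S2"}, {"S1"}],
    "E": [{"S2"}, {"S3"}, {"S1"}],
}


def pick_boost(processes, heuristic, last_boosted=None):
    """Return the pid to boost, or None if no process qualifies.

    processes: list of (pid, phase, has_unconsumed) tuples in queue order.
    last_boosted: pid boosted in the previous period (hypothetical parameter),
    used to rotate among equal-priority candidates.
    """
    for group in HEURISTICS[heuristic]:
        candidates = [pid for pid, phase, unconsumed in processes
                      if classify(phase, unconsumed) in group]
        if not candidates:
            continue  # nothing at this priority level, try the next one
        if last_boosted in candidates:
            # Round-robin: take the candidate after the one boosted last time.
            nxt = (candidates.index(last_boosted) + 1) % len(candidates)
            return candidates[nxt]
        return candidates[0]
    return None  # only S4 processes (or none at all) are present


if __name__ == "__main__":
    procs = [
        (1, "compute", False),  # S1
        (2, "compute", True),   # S2
        (3, "receive", True),   # S3
        (4, "receive", False),  # S4
    ]
    for name in "ABCDE":
        print(name, pick_boost(procs, name))
```

Under these assumptions, heuristic B always prefers the S3 process while heuristic D alternates between the S2 and S3 processes from one period to the next, which is one way to picture the difference between a strict ordering and a bracketed group.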
