Grid/MetaComputing Lecture 7
Resource Management and Scheduling
From Cluster Batch Schedulers to MetaComputing
Spring 2003
Dr Graham Fagg, with support material from Mark Baker and Michael Resch

Lecture 7 Overview
- Making it all work
- Scheduling MPI jobs
- All we really need to do

Contents
- Resource Management and Scheduling
- Clusters, MetaComputing and Super Computers
- Motivation for CMS
- Motivation for using Clusters
- Cycle Stealing
- Management Software
- Problems with Distributed Computing
- Resource Management
- Cluster Management Software Systems
- From Clusters to MetaComputing

Cluster and Metacomputing
- Metacomputing is still a research area; current implementations are limited, mostly applying to LANs rather than WANs.
  - Except for when SuperComputing conferences are taking place.
  - The situation is getting better all the time.
- LAN implementations are Cluster Management Software (CMS) or Cluster Computing Environments (CCE).
- MetaComputing systems can be built by extending LAN systems or by using many LAN systems together.

Cluster, Super Computers and Metacomputing
- A main goal of distributed computing research is to provide users with simple and transparent access to a distributed set of heterogeneous computing resources.
- This is often called Metacomputing: the user submits jobs to a virtual distributed computer rather than specifying particular computers.
- Super Computing, on the other hand, is designed to provide massive computational power to users. Performance is more important than usability.
- MetaComputing promises both.

Cluster and Metacomputing
- Cluster Management Software
  - Manages clusters of (mostly) workstations as a distributed compute resource.
  - Built on top of the existing OS.
- Cluster Computing Environments
  - Software that allows the cluster to be used as an applications environment, similar to Distributed Shared Memory (DSM) systems.
  - Built into the OS kernel for improved performance.

Cluster and Metacomputing
- The World Wide Web is now so ubiquitous that it is becoming the platform of choice for distributed computing (Internet or Intranet) and Metacomputing.
  - See the Webflow project later.
  - Mostly used as the server-to-client binding structure, i.e. submit a web form rather than a job request ticket.

Motivation for Resource Management
- Users want to be able to submit their jobs without having to worry about where they run, i.e. submit jobs to a metacomputer (virtual computer) rather than search for spare cycles on a real computer.
  - Ease of use.
  - Requires both distributed code and distributed data.
- Large organisations (companies, universities, national labs, etc.) typically have hundreds or thousands of powerful workstations for use by employees, which are a major under-utilised compute resource.
  - Check the lecture 3 notes on Resource Management.
  - What are spare cycles, and do we want to use them?

Motivation for using Clusters
- Surveys show utilisation of CPU cycles of desktop workstations is typically 10%.
- Performance of workstations and PCs is rapidly improving (my laptop: 60 Mflop/s on Fortran 77 code).
- As performance grows, percent utilisation will decrease even further.
- Organisations are reluctant to buy large supercomputers, due to the large expense and short useful life span.

Motivations for Clusters
- The communications bandwidth between workstations is increasing as new networking technologies and protocols are implemented in LANs and WANs.
- Workstation clusters are easier to integrate into existing networks than special parallel computers.
  - Install Linux from the local NFS copy (look at NASA Beowulf).
  - MPPs require special HIPPI switches and interface hookups.
- Many MetaComputers will be made from clusters, although most of the larger research efforts prefer to integrate different MPPs rather than clusters.
  - I.e. more output for less effort from the MetaComputing system itself.
  - I.e. getting the Bell Award.

Usage
- Usage depends on the class of the users.
- As shown here: Meteorology versus Psychology.

Motivation for using Clusters
- The development tools for workstations are more mature than the contrasting proprietary solutions for parallel computers, mainly due to the non-standard nature of many parallel systems.
- Workstation clusters are a cheap and readily available alternative to specialised High Performance Computing (HPC) platforms.
  - CPUs and disks are a lot cheaper, but some of the better interconnection cards (Myrinet, Gigabit, etc.) are expensive.
- Use of clusters of workstations as a distributed compute resource is very cost effective: incremental growth of the system.
  - Well, almost: try upgrading all your systems to PIIIs after you just bought all the PIIs with the wrong motherboards.
  - Do a few a week, maybe.

Cycle Stealing
- Usually a workstation will be owned by an individual, group, department, or organisation; they are dedicated to the exclusive use of the owners.
  - Unless it is a dedicated cluster like the TORC cluster. If it is managed correctly, that is.
  - TORC runs too many different tests and configurations, i.e. it is not a stable platform.
- This brings problems when attempting to form a cluster of workstations for running distributed applications.
- Typically there are three types of owner, who use their workstations mostly for:
  1. Sending and receiving mail and preparing documents.
  2. Software development: the edit, compile, debug and test cycle.
  3. Running compute-intensive applications.

Cycle Stealing
- Cluster computing aims to steal spare cycles from (1) and (2) to provide resources for (3).
- However, this requires overcoming the ownership hurdle: people are very protective of their workstations.
- Usually requires an organizational mandate that computers are to be used in this way.

Management Software
- Software for managing clusters or metacomputers must handle many complex issues:
  - Heterogeneous environments (computer and network hardware, software, OS, protocols, etc.).
  - Resource management: CPUs, disk arrays, and sometimes long-haul network connections.
  - Job scheduling, including handling multiple schedulers at the same time.
  - Job allocation policy and prioritisation.
  - Security and authentication.
  - Cycle stealing from desktop computers without impacting interactive use by the owner.

Cycle Stealing
- Stealing cycles outside standard work hours (e.g. overnight) is easy; stealing idle cycles during work hours without impacting interactive users (both CPU and memory) is much harder.

Management Software
- Fault tolerance.
- Support for batch and interactive jobs.
- Should support all programming paradigms (sequential, data parallel, message passing, shared memory, threads, etc.).
- Should support legacy applications.
- User interface and job specification mechanism.

Problems with Distributed Computing
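As an illustration of the cycle-stealing distinction drawn in the slides (overnight is easy; work hours require care with both CPU and memory), here is a minimal Python sketch of an eligibility check. It is not taken from any real cluster management system: the `WorkstationState` fields, the thresholds, and the `can_steal_cycles` function are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from datetime import time


@dataclass
class WorkstationState:
    """Snapshot of one desktop machine (all fields are hypothetical)."""
    load_average: float    # 1-minute CPU load average
    free_memory_mb: int    # memory not currently used by the owner
    idle_minutes: float    # time since last keyboard/mouse activity


# Illustrative thresholds only; a real site would set these by policy.
WORK_START = time(9, 0)
WORK_END = time(17, 0)
MAX_LOAD = 0.3
MIN_FREE_MEMORY_MB = 256
MIN_IDLE_MINUTES = 15


def can_steal_cycles(state: WorkstationState, now: time) -> bool:
    """Return True if a guest job may run without disturbing the owner."""
    outside_work_hours = now < WORK_START or now >= WORK_END
    if outside_work_hours:
        # The easy case: assume the owner is away overnight.
        return True
    # The hard case: during work hours, require that the owner is idle
    # and that both CPU and memory have headroom.
    return (
        state.idle_minutes >= MIN_IDLE_MINUTES
        and state.load_average <= MAX_LOAD
        and state.free_memory_mb >= MIN_FREE_MEMORY_MB
    )
```

A scheduler built along these lines would call such a check before placing a job on a desktop machine, and again periodically so the job can be suspended or migrated when the owner returns. Treating every out-of-hours moment as safe is a simplifying assumption of this sketch.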


UTK CS 594 - Grid/MetaComputing Lecture 7 Notes
