UW-Madison CS 739 - High-Performace Sorting on Networks of Workstations - D2829176

Home> Schools> University of Wisconsin, Madison> (CS) > CS 739> High-Performace Sorting on Networks of Workstations

UW-Madison CS 739 - High-Performace Sorting on Networks of Workstations

School name University of Wisconsin, Madison

Pages 3

Download Save

Unformatted text preview:

High Performace Sorting on Networks of Workstations Meenali Rungta CS 739 Distributed Systems University of Wisconsin Madison 1 March 2006 Spring 2006 1 Overview The paper demonstrates that parallel sorting on a Network of Workstations NOW is competitive to sorting on largescale SMPs The clusters consist of commodity hardware and software and the authors also evaluate the suitability of existing hardware interfaces and mechanisms to the sorting application reaches the application layer The application then decides what has to be done with the arriving message Active Messages on the other hand let the sender specify a handler with the message which is executed remotely and atomically with respect to other message arrivals Since the various layers of TCP IP are not present the round trip latency of the network is much less 2 Problem Statement and Assumptions The authors measure the performance of NOW Sort only on key values with uniform distributions Further they assume that the initial number of records on each workstation is equal The paper presents a NOW approach to the disk to disk sorting problem Thus the authors show that NOWs are 3 Methodology well suited to I O intensive applications They have a single application in mind and hence the system is relatively The authors develop a collection of sorting algorithms of easier to build and optimize Also the system is build on increasing complexity from a single pass single node to the following assumptions the final two pass parallel algorithm They then implement the algorithm and study the system in detail to tune The workstations in the cluster are connected to each and configure it for the most optimal performance other by high speed switched ethernet Though each workstation runs an independent unix based OS the disparate resources are unified under a distributed OS like GLUnix The authors use this primarily as a parallel program launcher 4 4 1 Design and Implementation One Pass Single Node Sorting The single pass single node sort turns out to be the right The communication layer in the nodes of the clus building block for the sorting application In particular ter is the lightweight ActiveMessage layer built spe this implementation contributes in studying the following cially to take advantage of the low latency and high bandwidth of switch based networks Normally The optimal disk configuration the authors develop when a message arrives TCP IP buffers it till it a library to enable them to stripe larger blocks on 1 faster disks to make use of the heterogenity in disk hardware This is expected to common phenomenon in large clusters especially as once deployed clusters have a mix of old and new disks in due course of time of separate threads performs the best on a host of disk configurations They also note that the previous optimal disk configuration of 4 disks doesn t go well at all with communication involved via the same S bus due to latter s saturation and that reading from 2 disks and writing The study suggests that out of the existing hardware to all 4 gives the new best configuration The saturation 4 disk configuration gives the best bandwidth It observed for S bus is pretty acute the theoretical peak is is notable that later in studying parallel version of 80 MB s whereas the observed peak is 36 MB s sort the authors find that due to communication and disk traffic over a common S bus interconnect it becomes a bottleneck and 2 disks over the wide SCSI 4 3 Two Pass Single Node Sorting turns out to be the best disk configuration They how This consists of two phases A Create Runs phase where ever use the 2 disk disks over fast narrow SCSI due the one pass sort is repeated to create multiple sorted to hardware limitations Such variablilty in optimal runs on disks and a Merge phase where sorted runs are configurations of the variety of hardware in a cluster merged into a single sorted file Here the authors exploit environment seems to be noteworthy the overlapping between read of a subsequent run with the The usefulness of software OS interface They write of the previous run in the Create Runs phase They find mmap with madvise adequate for read also exploit the overlap between reading of records from ing records from disk but need to develop a tool for all individual runs for merging and writing of part of the determining the amount of available memory The final sorted file tool is useful to determine the number of records that can be sorted with a one pass sort 4 4 4 2 One Pass Parallel Sorting Two Pass Parallel Sorting Here the authors explore overlapping reading and sending as in one pass parallel algorithm and also reading and writing as in two pass single node sort While all four operations being overlapped gives best performance in case of 2 disks the S bus is a point of contention in case of 4 disks with read write and network traffic all passing through it at the same time Hence in 4 disks configuration the best overlaps reading and sending but performs reading and writing synchronously It s interesting to contrast the use of Active Messages as compared to Distributed File System in this case NOWSort doesn t use a DFS at all while it uses distributed OS GLUnix mainly to launch parallel programs simultaneously The use of ActiveMessage layer utilizes locality of data the input files are read from local disks keys are sent to remote buckets if needed using the communication layer and final output files are written to local disks as well A DFS on the other hand would read from remote files unless information about locality is available would send keys to remote buckets if needed and write to local disks A DFS built on Petal would probably not get to even write final files locally as the virtual disk interface exposed by Petal completely abstracts the location information from a FS like Frangipani The authors also exploit overlapping of the two operations reading from local disks and communicating with remote nodes They develop synchronous interleaved and threaded versions of the algorithm and observe that threaded which overlaps reading and sending by means 5 Evaluation The paper evaluates the system incrementally for every sorting implementation and for various configurations It s interesting that NOWSort sets new record for MinuteSort 6 GB sorted in one minute on 64 processors while doing twice the amount of I O sorting being done in 2 passes owing to insufficient memory It also sets a new Datamation Benchmark record of

View Full Document


School:
Email:
New Password:
Confirm Password:

UW-Madison CS 739 - High-Performace Sorting on Networks of Workstations

Sign up for free to view:

Please select your school