DOC PREVIEW
UW-Madison CS 739 - High-Performace Sorting on Networks of Workstations

This preview shows page 1 out of 3 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 3 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 3 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

High-Performace Sorting on Networks of WorkstationsMeenali RungtaCS 739: Distributed SystemsUniversity of Wisconsin, Madison1 March 2006Spring 20061 OverviewThe paper demonstrates that parallel sorting on a Networkof Workstations (NOW) is competitiveto sorting on large-scale SMPs.The clusters consist of commodity hardwareand software, and the authors also evaluate the suitabilityof existing hardware, interfaces and mechanisms to thesorting application.2 Problem Statement and Assump-tionsThe paper presents a NOW approach to the disk-to-disksorting problem. Thus the authors show that NOWs arewell-suited to I/O-intensive applications. They have a sin-gle application in mind, and hence the system is relativelyeasier to build and optimize. Also, the system is build onthe following assumptions:• The workstations in the cluster are connected to eachother by high-speed switched ethernet.• Though each workstation runs an independent unixbased OS, the disparate resources are unified undera distributed OS like GLUnix. The authors use thisprimarily as a parallel program launcher.• The communication layer in the nodes of the clus-ter is the lightweight ActiveMessage layer, built spe-cially to take advantage of the low latency and highbandwidth of switch-based networks. Normally,when a message arrives, TCP/IP buffers it till itreaches the application layer. The application thendecides what has to be done with the arriving mes-sage. Active Messages, on the other hand, let thesender specify a handler with the message whichis executed remotely and atomically with respect toother message arrivals. Since the various layers ofTCP/IP are not present, the round-trip latency of thenetwork is much less.• The authors measure the performance of NOW-Sortonly on key values with uniform distributions. Fur-ther, they assume that the initial number of recordson each workstation is equal.3 MethodologyThe authors develop a collection of sorting algorithms ofincreasing complexity, from a single-pass single node tothe final two-pass parallel algorithm. They then imple-ment the algorithm and study the system in detail to tuneand configure it for the most optimal performance.4 Design and Implementation4.1 One-Pass Single-Node SortingThe single-pass single node sort turns out to be the right‘building block’ for the sorting application. In particular,this implementation contributes in studying the following:• The optimal disk configuration: the authors developa library to enable them to stripe larger blocks on1faster disks, to make use of the heterogenity in diskhardware. This is expected to common phenomenonin large clusters, especially as once-deployed clus-ters have a mix of old and new disks in due course oftime.The study suggests that out of the existing hardware,4-disk configuration gives the best bandwidth. Itis notable that later, in studying parallel version ofsort, the authors find that due to communication anddisk traffic over a common S-bus interconnect, it be-comes a bottleneck and 2 disks over the wide SCSIturns out to be the best disk configuration. They how-ever, use the 2-disk disks over fast-narrow SCSI dueto hardware limitations. Such variablilty in optimalconfigurations of the variety of hardware in a clusterenvironment seems to be noteworthy.• The usefulness of software (OS) interface: Theyfind mmap() with madvise() adequate for read-ing records from disk, but need to develop a tool fordetermining the amount of available memory. Thetool is useful to determine the number of records thatcan be sorted with a one-pass sort.4.2 One-Pass Parallel SortingIt’s interesting to contrast the use of Active Messages ascompared to Distributed File System in this case. NOW-Sort doesn’t use a DFS at all while it uses distributed OS(GLUnix) mainly to launch parallel programs simultane-ously. The use of ActiveMessage layer utilizes locality ofdata - the input files are read from local disks, keys aresent to remote buckets if needed using the communica-tion layer, and final output files are written to local disksas well. A DFS, on the other hand, would read from re-mote files (unless information about locality is available),would send keys to remote buckets if needed, and writeto local disks. A DFS built on Petal would probably notget to even write final files locally, as the virtual disk in-terface exposed by Petal completely abstracts the locationinformation from a FS like Frangipani.The authors also exploit overlapping of the two opera-tions - reading from local disks and communicating withremote nodes. They develop synchronous, interleavedand threaded versions of the algorithm and observe thatthreaded (which overlaps reading and sending by meansof separate threads) performs the best on a host of diskconfigurations. They also note that the previous optimaldisk configuration of 4-disks doesn’t go well at all withcommunication involved via the same S-bus due to lat-ter’s saturation, and that reading from 2-disks and writingto all 4 gives the new best configuration. The saturationobserved for S-bus is pretty acute: the theoretical peak is80 MB/s, whereas the observed peak is 36 MB/s.4.3 Two-Pass Single-Node SortingThis consists of two phases: A ‘Create Runs’ phase wherethe one-pass sort is repeated to create multiple sortedruns on disks, and a ‘Merge’ phase where sorted runs aremerged into a single sorted file. Here the authors exploitthe overlapping between read of a subsequent run with thewrite of the previous run in the ‘Create Runs’ phase. Theyalso exploit the overlap between reading of records fromall individual runs for merging and writing of part of thefinal sorted file.4.4 Two-Pass Parallel SortingHere the authors explore overlapping reading and send-ing (as in one-pass parallel algorithm) and also readingand writing (as in two-pass single node sort). While allfour operations being overlapped gives best performancein case of 2 disks, the S-bus is a point of contention in caseof 4 disks (with read, write and network traffic all passingthrough it at the same time). Hence in 4-disks configura-tion, the best overlaps reading and sending, but performsreading and writing synchronously.5 EvaluationThe paper evaluates the system incrementally, for everysorting implementation and for various configurations.It’s interesting that NOWSort sets new record for Minute-Sort (6 GB sorted in one minute on 64 processors) whiledoing twice the amount of I/O (sorting being done in 2passes


View Full Document

UW-Madison CS 739 - High-Performace Sorting on Networks of Workstations

Documents in this Course
Load more
Download High-Performace Sorting on Networks of Workstations
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view High-Performace Sorting on Networks of Workstations and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view High-Performace Sorting on Networks of Workstations 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?