UW-Madison CS 764 - A Cache-Sensitive Parallel External Sort


VLDB Journal, 4, 603-627 (1995), Stanley Y.W. Su, Editor

AlphaSort: A Cache-Sensitive Parallel External Sort

Chris Nyberg, Tom Barclay, Zarka Cvetanovic, Jim Gray, and Dave Lomet

Received September 8, 1994; revised version received March 28, 1995; accepted March 28, 1995.

Abstract. A new sort algorithm, called AlphaSort, demonstrates that commodity processors and disks can handle commercial batch workloads. Using commodity processors, memory, and arrays of SCSI disks, AlphaSort runs the industry-standard sort benchmark in seven seconds. This beats the best published record, on a 32-CPU, 32-disk Hypercube, by 8:1. On another benchmark, AlphaSort sorted more than a gigabyte in one minute. AlphaSort is a cache-sensitive, memory-intensive sort algorithm. We argue that modern architectures require algorithm designers to re-examine their use of the memory hierarchy. AlphaSort uses clustered data structures to get good cache locality, file striping to get high disk bandwidth, QuickSort to generate runs, and replacement-selection to merge the runs. It uses shared-memory multiprocessors to break the sort into subsort chores. Because startup times are becoming a significant part of the total time, we propose two new benchmarks: (1) MinuteSort: how much can you sort in one minute, and (2) PennySort: how much can you sort for one penny.

Key Words. Sort, cache, disk, memory, striping, parallel, Alpha, DEC 7000.

1. Introduction

In 1985, an informal group of 25 database experts from a dozen companies and universities defined three basic benchmarks to measure the transaction processing performance of computer systems:
1. DebitCredit: A market-basket of database reads and writes, terminal IO, and transaction commits, to measure on-line transaction processing (OLTP) performance. This benchmark evolved to become the TPC-A transactions-per-second and dollars-per-transaction-per-second metrics (Gray, 1991).

2. Scan: Copy one thousand 100-byte records from disk to disk with transaction protection. This simple mini-batch transaction measures the ability of a file system or database system to pump data through a user application.

3. Sort: A disk-to-disk sort of one million 100-byte records. This has become the standard test of batch and utility performance in the database community (Bitton, 1981; Tsukerman, 1986; Weinberger, 1986; Beck et al., 1988; Baugsto and Greipsland, 1989; Lorie and Young, 1989; Baugsto et al., 1990; Graefe, 1990; Gray, 1991; DeWitt et al., 1992; Graefe and Thakkar, 1992). Sort tests the processor's, IO subsystem's, and operating system's ability to move data.

DebitCredit is a simple interactive transaction. Scan is a mini-batch transaction. Sort is an IO-intensive batch transaction. Together they cover a broad spectrum of basic commercial operations.

(Author affiliations: Chris Nyberg, M.S., is President, Ordinal Technology Corp., 20 Crestview Dr., Orinda, CA 94563, [email protected]. Tom Barclay, B.S., is Program Manager, Microsoft Corp., One Microsoft Way, Redmond, WA 98052, [email protected]. Zarka Cvetanovic, Ph.D., is Software Consultant Engineer, Digital Equipment Corp., 60 Codman Hill Rd., Boxborough, MA 01717, [email protected]. Jim Gray, Ph.D., is Senior Researcher, 310 Filbert St., San Francisco, CA 94133, [email protected]. David Lomet, Ph.D., is Senior Researcher, Microsoft Corp., One Microsoft Way, Redmond, WA 98052, [email protected].)

2. The Sort Benchmark and Prior Work on Sort

The Datamation article (Anon et al., 1985) defined the sort benchmark as:

• Input is a disk-resident file of one million 100-byte records.
• Records have 10-byte key fields and can't be compressed.
• The input record keys are in random order.
• The output file must be a permutation of the input file sorted in key-ascending order.

The performance metric is the elapsed time of the following seven steps:

(1) Launch the sort program.
(2) Open the input file and create the output file.
(3) Read the input file.
(4) Sort the records in key-ascending order.
(5) Write the output file.
(6) Close the files.
(7) Terminate the program.

The implementation may use all the "mean tricks" that are typical of operating-system utilities. It can access the files via low-level interfaces; it can use undocumented interfaces; and it can use as many disks and processors, and as much memory, as it likes.

Sort's price-performance metric normalizes variations in software and hardware configuration. The basic idea is to compute the 5-year cost of the hardware and software, and then prorate that cost over the elapsed time of the sort (Anon et al., 1985; Gray, 1991). A one-minute sort on a machine with a 5-year cost of one million dollars would cost 38 cents ($0.38).

Table 1. Published sort performance on the Datamation 100 MB benchmark, in chronological order.

System                 Seconds  $/sort  Cost M$  CPUs  Disks  Reference
Tandem                 3600     4.61    2        2     2      Tsukerman, 1986
Beck                   6000     1.92    .1       4     4      Beck, 1988
Tsukerman + Tandem     980      1.25    .2       3     6      Salzberg, 1990
Weinberger + Cray      26       1.25    7.5      1     1      Weinberger, 1986
Kitsuregawa            320*     0.41    .2       1+    1      Kitsuregawa, 1989
Baugsto                180      0.23    .2       16    16     Baugsto, 1989
Graefe + Sequent       83       0.27    .5       8     4      Graefe, 1992
Baugsto                40       0.26    1        100   100    Baugsto, 1989
DeWitt + Intel iPSC/2  58       0.37    1.0      32    32     DeWitt, 1992
DEC Alpha AXP 7000     9.1      0.022   .4       1     16     Nyberg, 1993
DEC Alpha AXP 4000     8.2      0.011   .2       2     14     Nyberg, 1993
DEC Alpha AXP 7000     7        0.014   .5       3     28     Nyberg, 1993

Extrapolations are marked by (*). Prices are estimated.
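The prorating arithmetic behind the price-performance metric can be sanity-checked in a few lines. This is a sketch, not the paper's code; the function name and the 365-day-year constant are my own choices, but they reproduce the paper's 38-cent figure:

```python
# Price-performance sketch: prorate a system's 5-year cost over the
# elapsed time of one sort run.
FIVE_YEARS_SECONDS = 5 * 365 * 24 * 3600   # about 158 million seconds

def sort_cost(system_cost_dollars, elapsed_seconds):
    """Dollar cost of one sort: 5-year system cost prorated over elapsed time."""
    return system_cost_dollars * elapsed_seconds / FIVE_YEARS_SECONDS

# A one-minute sort on a machine with a 5-year cost of $1,000,000:
print(round(sort_cost(1_000_000, 60), 2))  # -> 0.38
```

The same function applied to the Table 1 entries recovers their $/sort column (to within the stated price estimates).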
In 1985, as reported by Tsukerman, typical systems needed 15 minutes to perform this sort benchmark (Bitton, 1981; Anon et al., 1985; Tsukerman, 1986). As a supercomputer response to Tsukerman's efforts, Weinberger (1986) of AT&T wrote a program to read a disk file into memory, sort it using replacement-selection as records arrived, and then write the sorted data to a file. This code postulated 8-byte keys, a natural size for the Cray, and made some other simplifications. The disks transferred at 8 MB/s, so you might guess that it took 12.5 seconds to read and 12.5 seconds to write, for a grand total of 25 seconds. However, there was about one second of overhead in setup,
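Replacement-selection, the technique Weinberger's program used to sort records as they arrived, keeps a heap of in-memory records and repeatedly emits the smallest one that still fits the current sorted run; on random input the runs average about twice the memory size. The following is a minimal Python sketch under my own simplifying assumptions (list-valued runs, input at least as long as memory), not the paper's or Weinberger's code:

```python
import heapq

def replacement_selection(stream, memory_size):
    """Sketch of replacement-selection run formation. Keeps a heap of
    `memory_size` records; each step emits the heap minimum, then admits
    the next incoming record to the current run only if it is >= the
    record just emitted, otherwise defers it to the next run."""
    it = iter(stream)
    heap = [next(it) for _ in range(memory_size)]  # assumes enough input
    heapq.heapify(heap)
    runs, current, frozen = [], [], []
    for rec in it:
        smallest = heapq.heappop(heap)
        current.append(smallest)
        if rec >= smallest:
            heapq.heappush(heap, rec)   # still fits the current run
        else:
            frozen.append(rec)          # too small; defer to next run
        if not heap:                    # all records deferred: new run
            runs.append(current)
            current, heap, frozen = [], frozen, []
            heapq.heapify(heap)
    current.extend(heapq.nsmallest(len(heap), heap))  # drain, in order
    runs.append(current)
    if frozen:
        frozen.sort()
        runs.append(frozen)
    return runs

runs = replacement_selection([5, 3, 9, 1, 7, 2, 8, 4], memory_size=3)
assert all(run == sorted(run) for run in runs)  # every run comes out sorted
```

The sorted runs it produces would then be combined by a k-way merge, which is the second phase of the external sort the paper describes.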

