UW-Madison CS 764 - Parallel Database Systems - The Future of High Performance Database Processing - D315675



School name University of Wisconsin, Madison

Course CS 764 - Topics in Database Management Systems

Pages 26



Parallel Database Systems: The Future of High Performance Database Processing¹

David J. DeWitt²
Computer Sciences Department
University of Wisconsin
1210 W. Dayton St.
Madison, WI. 53706
dewitt@cs.wisc.edu

Jim Gray
San Francisco Systems Center
Digital Equipment Corporation
455 Market St. 7th floor
San Francisco, CA. 94105-2403
Gray@SFbay.enet.dec.com

January 1992

Abstract: Parallel database machine architectures have evolved from the use of exotic hardware to a software parallel dataflow architecture based on conventional shared-nothing hardware. These new designs provide impressive speedup and scaleup when processing relational database queries. This paper reviews the techniques used by such systems, and surveys current commercial and research systems.

1. Introduction

Highly parallel database systems are beginning to displace traditional mainframe computers for the largest database and transaction processing tasks. The success of these systems refutes a 1983 paper predicting the demise of database machines [BORA83]. Ten years ago the future of highly-parallel database machines seemed gloomy, even to their staunchest advocates. Most database machine research had focused on specialized, often trendy, hardware such as CCD memories, bubble memories, head-per-track disks, and optical disks. None of these technologies fulfilled their promises; so there was a sense that conventional cpus, electronic RAM, and moving-head magnetic disks would dominate the scene for many years to come. At that time, disk throughput was predicted to double while processor speeds were predicted to increase by much larger factors. Consequently, critics predicted that multi-processor systems would soon be I/O limited unless a solution to the I/O bottleneck were found.

While these predictions were fairly accurate about the future of hardware, the critics were certainly wrong about the overall future of parallel database systems.
Over the last decade Teradata, Tandem, and a host of startup companies have successfully developed and marketed highly parallel database machines.

¹ Appeared in Communications of the ACM, Vol. 36, No. 6, June 1992.

² This research was partially supported by the Defense Advanced Research Projects Agency under contract N00039-86-C-0578, by the National Science Foundation under grant DCR-8512862, and by research grants from Digital Equipment Corporation, IBM, NCR, Tandem, and Intel Scientific Computers.

Why have parallel database systems become more than a research curiosity? One explanation is the widespread adoption of the relational data model. In 1983 relational database systems were just appearing in the marketplace; today they dominate it. Relational queries are ideally suited to parallel execution; they consist of uniform operations applied to uniform streams of data. Each operator produces a new relation, so the operators can be composed into highly parallel dataflow graphs. By streaming the output of one operator into the input of another operator, the two operators can work in series giving pipelined parallelism. By partitioning the input data among multiple processors and memories, an operator can often be split into many independent operators each working on a part of the data. This partitioned data and execution gives partitioned parallelism (Figure 1).

The dataflow approach to database system design needs a message-based client-server operating system to interconnect the parallel processes executing the relational operators. This in turn requires a high-speed network to interconnect the parallel processors. Such facilities seemed exotic a decade ago, but now they are the mainstream of computer architecture.
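
The two forms of parallelism described above can be illustrated with a small sketch. It is not taken from the paper: the operator names `select` and `project`, the helpers `pipeline` and `partitioned`, and the `emp` relation are all hypothetical, and Python generators and threads merely stand in for operators running on separate processors that exchange messages.

```python
from concurrent.futures import ThreadPoolExecutor

def select(rows, predicate):
    """Select operator: yield each input tuple that satisfies the predicate."""
    for row in rows:
        if predicate(row):
            yield row

def project(rows, columns):
    """Project operator: keep only the requested column positions."""
    for row in rows:
        yield tuple(row[c] for c in columns)

def pipeline(rows, predicate, columns):
    """Pipelined parallelism (modeled): project consumes select's output one
    tuple at a time, so no intermediate relation is materialized."""
    return project(select(rows, predicate), columns)

def partitioned(partitions, predicate, columns):
    """Partitioned parallelism: one independent copy of the pipeline runs per
    partition, and the partial results are concatenated."""
    def run(part):
        return list(pipeline(part, predicate, columns))
    with ThreadPoolExecutor(max_workers=max(1, len(partitions))) as pool:
        results = pool.map(run, partitions)  # results come back in partition order
        return [row for part in results for row in part]

# Hypothetical relation emp(id, name, salary), split across three partitions.
emp_partitions = [
    [(1, "ann", 50), (2, "bob", 90)],
    [(3, "cho", 120), (4, "dee", 40)],
    [(5, "eli", 95)],
]
high_paid = partitioned(emp_partitions, lambda r: r[2] >= 90, columns=(0, 1))
print(high_paid)
```

Because every operator takes a stream of tuples and produces a stream of tuples, the same `pipeline` function can be replicated over any number of partitions without modification, which is the property the text attributes to relational operators.
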
The client-server paradigm using high-speed LANs is the basis for most PC, workstation, and workgroup software. Those same client-server mechanisms are an excellent basis for distributed database technology.

[Figure 1 diagram: Source Data -> Scan -> Sort streams, with the partitioned streams feeding a Merge. Panel labels: "pipeline parallelism" and "partitioned data allows partitioned parallelism".]

Figure 1. The dataflow approach to relational operators gives both pipelined and partitioned parallelism. Relational data operators take relations (uniform sets of records) as input and produce relations as outputs. This allows them to be composed into dataflow graphs that allow pipeline parallelism (left) in which the computation of one operator proceeds in parallel with another, and partitioned parallelism in which operators (sort and scan in the diagram at the right) are replicated for each data source, and the replicas execute in parallel.

Mainframe designers have found it difficult to build machines powerful enough to meet the CPU and I/O demands of relational databases serving large numbers of simultaneous users or searching terabyte databases. Meanwhile, multi-processors based on fast and inexpensive microprocessors have become widely available from vendors including Encore, Intel, NCR, nCUBE, Sequent, Tandem, Teradata, and Thinking Machines. These machines provide more total power than their mainframe counterparts at a lower price. Their modular architectures enable systems to grow incrementally, adding MIPS, memory, and disks either to speedup the processing of a given job, or to scaleup the system to process a larger job in the same time.

In retrospect, special-purpose database machines have indeed failed; but, parallel database systems are a big success. The successful parallel database systems are built from conventional processors, memories, and disks.
They have emerged as major consumers of highly parallel architectures, and are in an excellent position to exploit massive numbers of fast-cheap commodity disks, processors, and memories promised by current technology forecasts.

A consensus on parallel and distributed database system architecture has emerged. This architecture is based on a shared-nothing hardware design [STON86] in which processors communicate with one another only by sending messages via an interconnection network. In such systems, tuples of each relation in the database are partitioned (declustered) across disk storage units³ attached directly to each processor. Partitioning allows multiple processors to scan large relations in parallel without needing any exotic
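
The declustering idea in the preceding paragraph can be sketched as follows. This is an illustration under stated assumptions, not the paper's design: the `Node` class, `home_node`, `decluster`, and `parallel_scan` are hypothetical names, hash partitioning is an assumed placement strategy (the excerpt does not name one), an in-memory list stands in for each processor's attached disk, and threads stand in for independent processors.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

class Node:
    """Hypothetical shared-nothing node: private storage, no shared state."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.local_tuples = []  # stands in for the node's directly attached disk

    def store(self, row):
        self.local_tuples.append(row)

    def scan(self, predicate):
        """Scan only the local fragment; the returned list models a reply message."""
        return [row for row in self.local_tuples if predicate(row)]

def home_node(key, num_nodes):
    """Assumed placement rule: a stable hash of the key, modulo the node count."""
    return zlib.crc32(str(key).encode("utf-8")) % num_nodes

def decluster(relation, nodes, key_index=0):
    """Spread the tuples of one relation across all nodes."""
    for row in relation:
        nodes[home_node(row[key_index], len(nodes))].store(row)

def parallel_scan(nodes, predicate):
    """Ask every node to scan its own fragment, then combine the replies."""
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        fragments = list(pool.map(lambda n: n.scan(predicate), nodes))
    return [row for fragment in fragments for row in fragment]

# Hypothetical relation with ten tuples, declustered over four nodes.
relation = [(i, f"name{i}") for i in range(1, 11)]
nodes = [Node(i) for i in range(4)]
decluster(relation, nodes)
even_ids = sorted(parallel_scan(nodes, lambda r: r[0] % 2 == 0))
print(even_ids)
```

Each node touches only its own fragment, so the work per node shrinks as nodes are added, and the only cross-node traffic is the result tuples sent back to the requester.
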
