MIT 6 830 - Requirements for Science Data Bases and SciDB - D2509996

School name Massachusetts Institute of Technology

Course 6 830- Database Systems

CIDR Perspectives 2009

Requirements for Science Data Bases and SciDB

Michael Stonebraker, MIT
Jacek Becla, SLAC
David DeWitt, Microsoft
Kian-Tat Lim, SLAC
David Maier, Portland State University
Oliver Ratzesberger, eBay, Inc.
Stan Zdonik, Brown University

Abstract: For the past year, we have been assembling requirements from a collection of scientific data base users from astronomy, particle physics, fusion, remote sensing, oceanography, and biology. The intent has been to specify a common set of requirements for a new science data base system, which we call SciDB. In addition, we have discovered that very complex business analytics share most of the same requirements as “big science”. We have also constructed a partnership of companies to fund the development of SciDB, including eBay, the Large Synoptic Survey Telescope (LSST), Microsoft, the Stanford Linear Accelerator Center (SLAC) and Vertica. Lastly, we have identified two “lighthouse customers” (LSST and eBay) who will run the initial system, once it is constructed. In this paper, we report on the requirements we have identified and briefly sketch some of the SciDB design.

I INTRODUCTION

XLDB-1 in October 2007 brought together a collection of “big science” and commercial Internet users with extreme data base requirements. Also present were a collection of vendors and David DeWitt and Michael Stonebraker. The users complained about the inadequacy of current commercial DBMS offerings. DeWitt and Stonebraker countered with the fact that various researchers in the DBMS community have been working on science databases for years and have even built prototypes (e.g. Sequoia 2000 with Postgres [1], Paradise [2], the Sloan Digital Sky Survey [3], and extensions to MonetDB [4]).
Moreover, they also said “if you can define a common set of requirements across several science disciplines, then we will try to build it.” The result was a meeting at Asilomar in March 2008 between a collection of science users and a collection of DBMS researchers to define requirements, followed by a more detailed design exercise over the summer. Additional use cases were solicited, and parallel fund raising was carried out.

This paper presents the results of this requirements exercise in Section 2 and sketches some of the SciDB design. Intermingled are a collection of research topics that require attention. It concludes with a summary of the state of the project in Section 3.

II REQUIREMENTS

These requirements come from particle physics (the LHC project at CERN, the BaBar project at SLAC and Fermilab), biology and remote sensing applications (Pacific Northwest National Laboratory), remote sensing (University of California at Santa Barbara), astronomy (Large Synoptic Survey Telescope), oceanography (Oregon Health & Science University and the Monterey Bay Aquarium Research Institute), and eBay.

There is a general realization in these communities that the past practice (build custom software for each new project from the bare metal on up) will not work in the future. The software stack is getting too big, too hard to build and too hard to maintain. Hence, the community seems willing to get behind a single project in the DBMS area. They also realize that science DBMSs are a “zero billion dollar” industry. Hence, getting the attention of the large commercial vendors is simply not going to occur.

2.1 Data Model

While most scientific users can use relational tables and have been forced to do so by current systems, we can find only a few users for whom tables are a natural data model that closely matches their data. Few are satisfied with SQL as the interface language.
Although the Sloan Digital Sky Survey has been very successful in the astronomy area, they had perhaps the world’s best support engineer (Jim Gray) helping them. Also, a follow-on project, PanSTARRS, is actively engaged in extending the system to meet their needs [5].

The Sequoia 2000 project realized in the mid 1990s that their users wanted an array data model, and that simulating arrays on top of tables was difficult and resulted in poor performance. A similar conclusion was reached in the ASAP prototype [6] which found that the performance penalty of simulating arrays on top of tables was around two orders of magnitude. It appears that arrays are a natural data model for a significant subset of science users (specifically astronomy, oceanography, fusion and remote sensing). Moreover, a table with a primary key is merely a one-dimensional array. Hence, an array data model can subsume the needs of users who are happy with tables.

Seemingly, biology and genomics users want graphs and sequences. They will be happy with neither a table nor an array data model. Chemistry users are in the same situation. Lastly, users with solid modelling applications want a mesh data model [7] and will be unhappy with tables or arrays. The net result is that “one size will not fit all”, and science users will need a mix of specialized DBMSs.

Our project is exploring an array data model, primarily because it makes a considerable subset of the community happy and is easier to build than a mesh model. We support a multi-dimensional, nested array model with array cells containing records, which in turn can contain components that are multi-dimensional arrays.

Specifically, arrays can have any number of dimensions, which may be named. Each dimension has contiguous integer values between 1 and N (the high-water-mark). Each combination of dimension values defines a cell.
Every cell has the same data type(s) for its value(s), which are one or more scalar values and/or one or more arrays. An array A with dimensions I and J and values x and y would be addressed as:

    A[7, 8]            indicates the contents of the (7, 8)th cell
    A[I = 7, J = 8]    more verbose notation for the (7, 8)th cell
    A[7, 8].x          indicates the x value of the contents of the (7, 8)th cell

As in SQL, an array can be defined, and then multiple instances can be created. The basic syntax for defining an array is:

    define ArrayType ({name = Type-1}) ({dname})

The dimensions of the array, which must be integer-valued, are inside the second (…). The value(s) of the array are inside the first (…). Each value has a name and a data type, which can be either an array or a scalar.

For example, consider a 2-D remote sensing array with each element consisting of 3 different types
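
The array model described in this section (named integer dimensions running from 1 to a high-water mark, with every cell holding the same named, typed values) can be illustrated with a short Python sketch. This is not SciDB syntax or SciDB code: the classes `ArrayType` and `Array`, and the methods `create` and `cell`, are hypothetical names chosen for illustration, and the sparse dictionary storage is an assumption made only to keep the example small. Because Python does not allow keyword arguments inside a subscript, the verbose form A[I = 7, J = 8] is approximated by a `cell(I=7, J=8)` method.

```python
from dataclasses import dataclass, field
from types import SimpleNamespace


@dataclass
class ArrayType:
    """Hypothetical analogue of: define ArrayType ({name = Type}) ({dname})."""
    values: dict  # value name -> Python type, e.g. {"x": int, "y": float}
    dims: tuple   # dimension names, e.g. ("I", "J")

    def create(self, *high_water_marks):
        """Create one instance; each dimension runs from 1 to its high-water mark."""
        if len(high_water_marks) != len(self.dims):
            raise ValueError("need one high-water mark per dimension")
        return Array(self, tuple(high_water_marks))


@dataclass
class Array:
    array_type: ArrayType
    bounds: tuple
    cells: dict = field(default_factory=dict)  # index tuple -> cell record

    def _key(self, index):
        """Normalize an index to a tuple and check it against the bounds."""
        key = index if isinstance(index, tuple) else (index,)
        if len(key) != len(self.bounds):
            raise IndexError("wrong number of dimensions")
        for i, n in zip(key, self.bounds):
            if not 1 <= i <= n:
                raise IndexError(f"index {i} is outside 1..{n}")
        return key

    def __setitem__(self, index, record):
        # Every cell must carry the same named, typed values.
        if set(record) != set(self.array_type.values):
            raise ValueError("cell does not match the array type's value names")
        for name, expected in self.array_type.values.items():
            if not isinstance(record[name], expected):
                raise TypeError(f"{name} must be of type {expected.__name__}")
        self.cells[self._key(index)] = SimpleNamespace(**record)

    def __getitem__(self, index):
        return self.cells[self._key(index)]  # KeyError if the cell was never set

    def cell(self, **named):
        """Verbose addressing by dimension name, like A[I = 7, J = 8]."""
        return self[tuple(named[d] for d in self.array_type.dims)]


# A 2-D array with dimensions I and J and values x and y, as in the text.
xy_type = ArrayType(values={"x": int, "y": float}, dims=("I", "J"))
A = xy_type.create(10, 10)          # one instance of the defined type
A[7, 8] = {"x": 3, "y": 0.5}
print(A[7, 8].x)                    # x value of the (7, 8)th cell
print(A.cell(I=7, J=8).y)           # same cell, addressed by dimension name

# A table with a primary key viewed as a one-dimensional array.
table = ArrayType(values={"name": str}, dims=("key",)).create(100)
table[42] = {"name": "M31"}
print(table[42].name)
```

Nesting falls out of the same sketch: a value type may itself be `Array`, so a cell can hold a record whose component is another multi-dimensional array.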
