Chico CSCI 693 - Sybase IQ Multiplex – Designed For Analytics - D2797145

Home> Schools> California State University, Chico> Computer Science (CSCI) > CSCI 693> Sybase IQ Multiplex – Designed For Analytics

Chico CSCI 693 - Sybase IQ Multiplex – Designed For Analytics

School name California State University, Chico

Course Csci 693- Research Methods in Computer Science

Pages 4

Download Save

Unformatted text preview:

Sybase IQ Multiplex – Designed For Analytics Roger MacNicol Blaine French Sybase Inc 561 Virginia Road Concord, MA 01742 [email protected] [email protected] Abstract The internal design of database systems has traditionally given primacy to the needs of transactional data. A radical re-evaluation of the internal design giving primacy to the needs of complex analytics shows clear benefits in large databases for both single servers and in multi-node shared-disk grid computing. This design supports the trend to keep more years of more finely grained data online by ameliorating the data explosion problem. 1. Background Historically, databases have been designed around the requirement to support large numbers of small concurrent updates. The primary design criterion has been to minimize the portion of stored data that must be locked for exclusive access and the length of time that locks are held. Consequently, when the writer of data has primacy, the internal design generally adheres to the following rules 1. Data should be stored in rows to enable a single disk i/o to write the modified data out to the table. 2. Data should be stored on small page sizes to minimize the i/o cost and portion of disk locked for exclusive access. 3. The database must support the imperfect notion of isolation levels to enable reports to run in acceptable time while negotiating held locks. 4. Few columns should be indexed because locks on tree structures may deny access to more rows than row-page locks and increase the time locks are held. 5. Data pages should typically not be compressed because of poor amortization of the compression cost. Rows typically do not compress well because of the mix of data types stored adjacently. 6. Adding or dropping a column or index may be expensive since page storage may need to be updated for all rows. 7. Searched update statements may be relatively expensive since the entire row must be read and written for a single column to be updated. Given these consequences, IQ’s designers started by asking one fundamental question “What would a database look like internally if it were designed from the ground up for complex analytics on massive amounts of data”? It was understood that a reader-friendly analytics server must also support many concurrent update streams and, increasingly, it would be an operational data store rather than a store for cleansed and summarized data. But, data update streams in such a server are likely to be well-bounded. Once the primary criterion for the internal design becomes the performance of complex analytics, rather than row-oriented updates, a radical volte-face in design becomes self-obvious. 2. Designed for analytics Typical analytical queries access relatively few columns of the storage-dominating fact tables and may access a notable proportion of the rows stored in the fact tables. While CPU performance and available cache memory has increased dramatically with 64-bit servers and lower memory prices, disk performance has not kept up and, consequently, disk-bound performance is typical for many analytics. So, were design primacy to be given to complex analytics, many of the previous rules are reversed as follows: Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment Proceedings of the 30th VLDB Conference, Toronto, Canada, 2004 12271. Data would be stored in columns, not rows, so only the columns required to answer a query need be read – in effect every possible ad-hoc query could have performance akin to a covering index. Since data is only written once but read many times any increase in load time would be more than offset by improved query performance. Column storage would enhance main cache efficiency since commonly used columns would be more likely to be cached. 2. Data would be stored in a large page size so that many cells of a column can be retrieved in a single read and the larger page size plays to technological changes in how physical disk reads are structured. Traditional DBMS cannot use large pages sizes since each read drags along unneeded columns in the row. 3. The database would use page-level snapshot versioning so that data modification happens without interfering with running reports. Consequently, there would be no need for the notion of isolation levels – every query would see an internally consistent database state while navigating a nearly lockless environment. 4. Every column could be indexed: data will be written once and read many times so the cost will be amortized. Column storage at the physical layer greatly simplifies parallel index updates and adding an index requires only that column to be read, not the entire row. 5. Data could be compressed on disk. The compression cost would be well amortized and the homogenous datatype of a stored column would offer optimal compression. 6. Adding or dropping a column or index to reflect changing business requirements would be cheap, as no other data would be accessed. 7. Searched updates would be relatively cheap since only the columns being modified would need to be read and written. From these design principles, IQ was developed from the ground up to be a 64-bit DBMS using kernel threads giving design primacy to the needs of complex analytics without compromising load performance. 3. Designed for scalability (Multiplex) Sybase IQ optimizes workloads across multiple servers through Sybase IQ-M, which is a multi-node, shared-storage, parallel database system whose design focus is on large-scale data warehousing workloads. IQ-M offers grid-computing through multiple IQ instances running on multiple server nodes connected to a single shared IQ data store, which can comprise multiple disk arrays. Each node (instance) sees the entire database and has direct physical access to it, unlike the horizontally partitioned databases (such as MPP (massively parallel processors) with shared-nothing architecture). There are two types of IQ-M nodes, writer (one per Multiplex) and reader (all other nodes in the Multiplex). 1. The

View Full Document


School:
Email:
New Password:
Confirm Password:

Chico CSCI 693 - Sybase IQ Multiplex – Designed For Analytics

Sign up for free to view:

Please select your school