UW-Madison CS 764 - THE DESIGN OF THE POSTGRES STORAGE SYSTEM

Unformatted text preview:

THE DESIGN OF THE POSTGRES STORAGE SYSTEMMichael StonebrakerEECS DepartmentUniversity of CaliforniaBerkeley, Ca., 94720AbstractThis paper presents the design of the storage system for the POSTGRES data base systemunder construction at Berkeley. It is novel in several ways. First, the storage manager supportstransaction management but does so without using a conventional write ahead log (WAL). Infact, there is no code to run at recovery time, and consequently recovery from crashes is essen-tially instantaneous. Second, the storage manager allows a user to optionally keep the entire pasthistory of data base objects by closely integrating an archival storage system to which historicalrecords are spooled. Lastly, the storage manager is consciously constructed as a collection ofasynchronous processes. Hence, a large monolithic body of code is avoided and opportunities forparallelism can be exploited. The paper concludes with a analysis of the storage system whichsuggests that it is performance competitive with WAL systems in many situations.1. INTRODUCTIONThe POSTGRES storage manager is the collection of modules that provide transactionmanagement and access to data base objects. The design of these modules was guided by threegoals which are discussed in turn below. The first goal was to provide transaction managementwithout the necessity of writing a large amount of specialized crash recovery code. Such code ishard to debug, hard to write and must be error free. If it fails on an important client of the datamanager, front page news is often the result because the client cannot access his data base and hisbusiness will be adversely affected. To achieve this goal, POSTGRES has adopted a novelstorage system in which no data is ever overwritten; rather all updates are turned into insertions.The second goal of the storage manager is to accomodate the historical state of the data baseon a write-once-read-many (WORM) optical disk (or other archival medium) in addition to thecurrent state on an ordinary magnetic disk. Consequently, we have designed an asynchronousprocess, called the vacuum cleaner which moves archival records off magnetic disk and onto anarchival storage system.The third goal of the storage system is to take advantage of specialized hardware. In partic-ular, we assume the existence of non-volatile main memory in some reasonable quantity. Suchmemory can be provide through error correction techniques and a battery-back-up scheme orfrom some other hardware means. In addition, we expect to have a few low level machineinstructions available for specialized uses to be presently explained. We also assume that archi-tectures with several processors will become increasingly popular. In such an environment, thereThis research was sponsored by the Navy Electronics Systems Command under contract N00039-84-C-0039.1is an opportunity to apply multiple processors to running the DBMS where currently only one isutilized. This requires the POSTGRES DBMS to be changed from the monolithic single-flow-of-control architectures that are prevalent today to one where there are many asynchronousprocesses concurrently performing DBMS functions. Processors with this flavor include theSequent Balance System [SEQU85], the FIREFLY, and SPUR [HILL85].The remainder of this paper is organized as follows. In the next section we present thedesign of our magnetic disk storage system. Then, in Section 3 we present the structure and con-cepts behind our archival system. Section 4 continues with some thoughts on efficient indexes forarchival storage. Lastly, Section 5 presents a performance comparison between our system andthat of a conventional storage system with a write-ahead log (WAL) [GRAY78].2. THE MAGNETIC DISK SYSTEM2.1. The Transaction SystemDisk records are changed by data base transactions, each of which is given a unique tran-saction identifier (XID). XIDs are 40 bit unsigned integers that are sequentially assigned start-ing at 1. At 100 transactions per second (TPS), POSTGRES has sufficient XIDs for about 320years of operation. In addition, the remaining 8 bits of a composite 48 bit interaction identifier(IID) is a command identifier (CID) for each command within a transaction. Consequently, atransaction is limited to executing at most 256 commands.In addition there is a transaction log which contains 2 bits per transaction indicating itsstatus as:committedabortedin progressA transaction is started by advancing a counter containing the first unassigned XID and using thecurrent contents as a XID. The coding of the log has a default value for a transaction as ‘‘in pro-gress’’ so no specific change to the log need be made at the start of a transaction. A transaction iscommitted by changing its status in the log from ‘‘in progress’’ to ‘‘committed’’ and placing theappropriate disk block of the log in stable storage. Moreover, any data pages that were changedon behalf of the transaction must also be placed in stable storage. These pages can either beforced to disk or moved to stable main memory if any is available. Similarly, a transaction isaborted by changing its status from ‘‘in progress’’ to ‘‘aborted’’.The tail of the log is that portion of the log from the oldest active transaction up to thepresent. The body of the log is the remainder of the log and transactions in this portion cannot be‘‘in progress’’ so only 1 bit need be allocated. The body of the log occupies a POSTGRES rela-tion for which a special access method has been built. This access method places the status of65536 transactions on each POSTGRES 8K disk block. At 1 transaction per second, the bodyincreases in size at a rate of 4 Mbytes per year. Consequently, for light applications, the log forthe entire history of operation is not a large object and can fit in a sizeable buffer pool. Undernormal circumstances several megabytes of memory will be used for this purpose and the statusof all historical transactions can be readily found without requiring a disk read.In heavier applications where the body of the log will not fit in main memory, POSTGRESapplies an optional compression technique. Since most transactions commit, the body of the logcontains almost all ‘‘commit’’ bits. Hence, POSTGRES has an optional bloom


View Full Document
Download THE DESIGN OF THE POSTGRES STORAGE SYSTEM
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view THE DESIGN OF THE POSTGRES STORAGE SYSTEM and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view THE DESIGN OF THE POSTGRES STORAGE SYSTEM 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?