DOC PREVIEW
USC CSCI 585 - indexing

This preview shows page 1-2-16-17-18-34-35 out of 35 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 35 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 35 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 35 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 35 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 35 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 35 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 35 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 35 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

107 Di-82-0989 ORGANIZATION AND MAINTENANCE OF LARGE ORDERED INDICES by R. Bayer and E. McCreight Mathematical and Information Sciences Report No. 20 Mathematical and Information Sciences Laboratory BOEING SCIENTIFIC RESEARCH LABORATORIES July 1970i0 ABSTRACT Organization and maintenance of an index for a dynamic random access file is considered. It is assumed that the index must be kept on some pseudo random access backup store like a disc or a drum. The index organization described allows retrieval, insertion, and deletion of keys in time proportional to lOgkl where I is the size of the index and k is a device dependent natural number such that the per- formance of the scheme becomes near optimal. Storage utilization is at least 50% but generally much higher. The pages of the index are organized in a special data-structure, so-called B-trees. The scheme is analyzed~ performance bounds are obtained, and a near optimal k is computed. Experiments have been performed with indices up to i00,000 keys. An index of size 15,000 (i00,000) can be maintained with an average of 9 (at least 4) transactions per second on an IBM 360/44 with a 2311 disc. Key Words and Phrases: Data structures, random access files, dynamic index maintenance, key insertion, key deletion, key retrieval, paging, information retrieval. CR Categories: 3.70, 3.73, 3.74.109 i. Introduction In this paper we consider the problem of organizing and maintaining an index for a dynamically changing random access file. By an index we mean a collection of index elements which are pairs (x,e) of fixed size physically adjacent data items, namely a key x and some associated information ~. The key x identifies a unique element in the index, the associated information is typically a pointer to a record or a collection of records in a random access file. For this paper the associated information is of no further interest. We assume that the index itself is so voluminous that only rather small parts of it can be kept in main store at one time. Thus the bulk of the index must be kept on some backup store. The class of backup stores considered are pseudo random access d~uides which have a rather long access or wait time--as opposed to a true random access device like core store--and a rather high data rate once the transmission of physically sequential data has been initiated. Typical pseudo random access devices are: fixed and moving head discs, drums, and data cells. Since the data file itself changes, it must be possible not only to search the index and to retrieve elements, but also to delete and to insert keys--more accurately index elements--economically. The index organization described in this paper always allows retrieval, insertiol~, and deletion of keys in time proportional to lOgkI or better, where I is the size of the index, and k is a device dependent natural number which describes the page size such that the performance of the maintenanceii0 and retrieva] scheme becomes near optimal. In more illustrative terms theoretical analysis and actual experi- ments show that it is possible to maintain an index of size 15000 with an average of 9 retrievals, insertions, and deletions per second in real time on an IBM 360/44 with a 2311 disc as backup store. According to our theoretical analysis, it should be possible to maintain an index of size 1500000 with at least two transactions per second on such a configuration in real time. The index is organized in pages of fixed size capable of holding up to 2k keys, but pages need only be partially filled. Pages are the blocks of information transferred between main store and backup store. The pages themselves are the nodes of a rather specialized tree, a so-called B-tree, described in the next section. In this paper these trees grow and contract in only one way, namely nodes split off a brother, or two brothers are merged or "catenated" into a single node. The splitting and catenation processes are initiated at the leaves only and propagate toward the root. If the root node splits, a new root must be introduced, and this is the only way in which the height of the tree can increase. The opposite process occurs if the tree contracts. There are, of course, many competitive schemes, e.g., hash-coding, for organizing an index. For a large class of applications the scheme presented in this paper offers significant advantages over others:1tl i) ii) iii) iv) Storage utilization is at least 50% "at any time and should be considerably better in the average. Storage is requested and released as the file grows and con- tracts. There is no congestion problem or degradation of performance if the storage occupancy is very high. The natural order of the keys is maintained and allows pro- cessing based on that order like: find predecessors and successors; search the file sequentially to answer queries; skip, delete, retrieve a number of records starting from a glven key. If retrievals, insertions, and deletions come in batches, very efficient essentially sequential processing of the index is possible by presorting the transactions on their keys and by using a simple prepaging algorithm.2. B-Trees Def. 2.1: Let h ~ 0 be an integer, k a natural number. A directed tree T is in the class ~(k,h) of B-trees if T is either empty (h=0) or has the following properties: i) ii) Each path from the root to any leaf has the same length h, also called the height of T, i.e., h = number of nodes in path. Each node except the root and the leaves has at least k + 1 sons. The root is a leaf or has at least two sons. i11) Each node has at most 2k + 1 sons. Number of Nodes in B-Trees: Let N . and N nun max maximal number of nodes in a B-tree T c ~(k,h). be the minimal and Then ( "'" (k+l) h-2) K2 ((k+l) h-l-l) N . = i + 2 (k+l) 0 + (k+l) l + + = i + ~- mln for h >_ 2. This also holds for h


View Full Document

USC CSCI 585 - indexing

Download indexing
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view indexing and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view indexing 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?