USC CSCI 585 - indexing - D2736656

Home> Schools> University of Southern California> Computer Science (CSCI) > CSCI 585> indexing

USC CSCI 585 - indexing

School name University of Southern California

Pages 35

Download Save

Unformatted text preview:

107 Di 82 0989 ORGANIZATION AND MAINTENANCE OF LARGE ORDERED INDICES by R Bayer and E McCreight Mathematical and Information Sciences Report No 20 Mathematical and Information Sciences Laboratory BOEING SCIENTIFIC RESEARCH LABORATORIES July 1970 i0 ABSTRACT Organization and maintenance of an index for a dynamic random access file is considered It is assumed that the index must be kept on some pseudo random access backup store like a disc or a drum The index organization described allows retrieval insertion and deletion of keys in time proportional to index and k lOgkl where I is the size of the is a device dependent natural number such that the per formance of the scheme becomes near optimal least 50 but generally much higher Storage utilization is at The pages of the index are organized in a special data structure so called B trees The scheme is analyzed performance bounds are obtained and a near optimal k is computed Experiments have been performed with indices up to i00 000 keys An index of size 15 000 i00 000 can be maintained with an average of 9 at least 4 transactions per second on an IBM 360 44 with a 2311 disc Key Words and Phrases Data structures random access files dynamic index maintenance key insertion key deletion key retrieval paging information retrieval CR Categories 3 70 3 73 3 74 109 i Introduction In this paper we consider the problem of organizing and maintaining an index for a dynamically we mean a collection of index elements which are pairs size physically adjacent data items namely a key information The key the associated information By an index changing random access file x identifies x x e of fixed and some associated a unique element in the index is typically a pointer to a record or a collection of records in a random access file For this paper the associated information is of no further interest We assume that the index itself is so voluminous that only rather small parts of it can be kept in main store at one time of the index must be kept on some backup store Thus the bulk The class of backup stores considered are pseudo random access d uides which have a rather long access or wait time as core store and opposed to a true random access device like a rather high data rate once the transmission sequential data has been initiated are fixed and moving head discs of physically Typical pseudo random access devices drums and data cells Since the data file itself changes it must be possible not only to search the index and to retrieve elements but also to delete and to insert keys more organization accurately index elements economically described in this paper always allows retrieval and deletion of keys in time proportional is the size of the index which describes and k to lOgkI The index insertiol or better where I is a device dependent natural number the page size such that the performance of the maintenance ii0 and retrieva scheme becomes near optimal In more illustrative terms theoretical ments show that it is possible with an average of 9 to maintain retrievals analysis and actual experian index of size 15000 insertions and deletions per second in real time on an IBM 360 44 with a 2311 disc as backup store to our theoretical analysis According it should be possible to maintain an index of size 1500000 with at least two transactions per second on such a configuration in real time The index is organized in pages of fixed size capable of holding up to 2k keys but pages need only be partially filled blocks of information The pages Pages are the transferred between main store and backup store themselves are the nodes of a rather specialized a so called B tree described in the next section tree In this paper these trees grow and contract in only one way namely nodes split off a brother or two brothers are merged or catenated and catenation processes toward the root into a single node The splitting are initiated at the leaves only and propagate If the root node splits a new root must be introduced and this is the only way in which the height of the tree can increase The opposite process occurs if the tree contracts There are of course many competitive schemes e g hash coding for organizing an index For a large class of applications presented in this paper offers significant advantages the scheme over others 1tl i Storage utilization is at least 50 at any time and should be considerably better in the average ii Storage is requested and released as the file grows and contracts There is no congestion problem or degradation of performance if the storage occupancy is very high iii The natural order of the keys is maintained and allows processing based on that order like find predecessors and successors search the file sequentially to answer queries skip delete retrieve a number of records starting from a glven key iv If retrievals insertions and deletions come in batches very efficient essentially sequential processing of the index is possible by presorting the transactions on their keys and by using a simple prepaging algorithm 2 B Trees Def 2 1 tree T Let h 0 be an integer is in the class k h k a natural number B trees of if T A directed is either empty h 0 or has the following properties i Each path from the root to any leaf has the same length also called the ii height of T i e h number of nodes in path Each node except the root and the leaves has at least sons i11 h k 1 The root is a leaf or has at least two sons Each node has at most Number of Nodes in B Trees 2k 1 sons Let N nun maximal number of nodes in a B tree and N max T c k h be the minimal and Then K2 k l h l l N i 2 k l 0 k l l k l h 2 i mln for h 2 This also holds for Nma x h i 2k l i i i 0 h i Similarly one obtains 2k l h l h i Upper and lower bounds for the number N T of nodes of T E k h are given by N T 0 if T T k O 2 k l h l i N T i Note that the classes k h 1 2k l hl need not be disjoint 2 1 otherwise 113 3 The Data Structure and Retrieval Algorithm To repeat a B tree the pages on which T e m k h the index is stored are the nodes of and can hold up to 2k keys In addition the data structure for the index has the following properties i Each page holds between k and 2k keys index elements except the root page which may hold between 1 and 2k keys ii Let the number of keys on a page iii Then P has Within each …

View Full Document


School:
Email:
New Password:
Confirm Password:

USC CSCI 585 - indexing

Sign up for free to view:

Please select your school