DOC PREVIEW
UCF COT 4810 - External Memory Algorithms and Data Structures

This preview shows page 1-2-3-4-29-30-31-32-33-60-61-62-63 out of 63 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 63 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 63 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 63 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 63 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 63 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 63 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 63 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 63 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 63 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 63 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 63 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 63 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 63 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 63 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

External Memory Algorithms and Data Structures:Dealing withMassive DataJEFFREY SCOTT VITTERDuke UniversityData sets in large applications are often too massive to fit completely inside thecomputer’s internal memory. The resulting input/output communication (or I/O)between fast internal memory and slower external memory (such as disks) can be amajor performance bottleneck. In this article we survey the state of the art in the designand analysis of external memory (or EM) algorithms and data structures, where thegoal is to exploit locality in order to reduce the I/O costs. We consider a variety of EMparadigms for solving batched and online problems efficiently in external memory. Forthe batched problem of sorting and related problems such as permuting and fast Fouriertransform, the key paradigms include distribution and merging. The paradigm of diskstriping offers an elegant way to use multiple disks in parallel. For sorting, however,Categories and Subject Descriptors: B.4.3 [Input /Output and DataCommunications]: Interconnections—parallel I/O; E.1 [Data Structures]: graphsand networks, trees; E.5 [Files]: sorting/searching; F.1.1 [Computation by AbstractDevices]: Models of Computation—bounded action devices, relations between models;F.2.2 [Analysis of Algorithms and Problem Complexity]: NonnumericalAlgorithms and Problems—computations on discrete structures, geometrical problemsand computations, sorting and searching; H.2.2 [Database Management]: PhysicalDesign—access methods; H.2.8 [Database Management]: Database Applications—spatial databases and GIS; H.3.2 [Information Storage and Retrieval]: InformationStorage—file organization; H.3.3 [Information Storage and Retrieval]: InformationSearch and Retrieval—information filtering, search processGeneral Terms: Algorithms, Design, Experimentation, Performance, TheoryAdditional Key Words and Phrases: Batched, block, B-tree, disk, dynamic, extendiblehashing, external memory, hierarchical memory, I/O, multidimensional access methods,multilevel memory, online, out-of-core, secondary storage, sortingThis work was supported in part by Army Research Office MURI grant DAAH04-96-1-0013 and by NationalScience Foundation research grants CCR-9522047, EIA-9870734, and CCR-9877133.Part of this work was done at BRICS, University of Aarhus, Aarhus, Denmark and at INRIA, SophiaAntipolis, France. Earlier, shorter versions of some of this material appeared in Vitter [1998; 1999a; 1999b;1999c].Author’s address: Department of Computer Science, Levine Science Research Center, Duke University Box90129, Durham, NC 27708-0129, e-mail: [email protected], URL: http://www.cs.duke.edu/ ˜jsv.Permission to make digital or hard copies of part or all of this work for personal or classroom use is grantedwithout fee provided that copies are not made or distributed for profit or direct commercial advantage andthat copies show this notice on the first page or initial screen of a display along with the full citation.Copyrights for components of this work owned by others than ACM must be honored. Abstracting withcredit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use anycomponent of this work in other works, requires prior specific permission and/or a fee. Permissions maybe requested from Publications Dept, ACM Inc., 1515 Broadway, New York, NY 10036 USA, fax +1 (212)869-0481, or [email protected]°2001 ACM 0360-5611/01/0700-0001 $5.00ACM Computing Surveys, Vol. 33, No. 2, June 2001, pp. 209–271.210 Jeffrey Scott Vitterdisk striping can be nonoptimal with respect to I/O, so to gain further improvements wediscuss distribution and merging techniques for using the disks independently. We alsoconsider useful techniques for batched EM problems involving matrices (such as matrixmultiplication and transposition), geometric data (such as finding intersections andconstructing convex hulls), and graphs (such as list ranking, connected components,topological sorting, and shortest paths). In the online domain, canonical EMapplications include dictionary lookup and range searching. The two important classesof indexed data structures are based upon extendible hashing and B-trees. Theparadigms of filtering and bootstrapping provide a convenient means in online datastructures to make effective use of the data accessed from disk. We also reexaminesome of the above EM problems in slightly different settings, such as when the dataitems are moving, when the data items are variable-length (e.g., text strings), or whenthe allocated amount of internal memory can change dynamically. Programming toolsand environments are available for simplifying the EM programming task. During thecourse of the survey, we report on some experiments in the domain of spatial databasesusing the TPIE system (transparent parallel I/O programming environment). Thenewly developed EM algorithms and data structures that incorporate the paradigms wediscuss are significantly faster than methods currently used in practice.1. INTRODUCTION1.1. BackgroundFor reasons of economy, general-purposecomputer systems usually contain a hier-archy of memory levels, each level withits own cost and performance character-istics. At the lowest level, CPU registersand caches are built with the fastest butmost expensive memory. For internal mainmemory, dynamic random access memory(DRAM) is typical. At a higher level, in-expensive but slower magnetic disks areused for external mass storage, and evenslower but larger-capacity devices suchas tapes and optical disks are used forarchival storage. Figure 1 depicts a typicalmemory hierarchy and its characteristics.Most modern programming languagesare based upon a programming model inwhich memory consists of one uniform ad-dress space. The notion of virtual memoryallows the address space to be far largerthan what can fit in the internal memoryof the computer. Programmers have a nat-ural tendency to assume that all memoryreferences require the same access time.In many cases, such an assumption is rea-sonable (or at least doesn’t do any harm),especially when the data sets are not large.The utility and elegance of this program-ming model are to a large extent why ithas flourished, contributing to the produc-tivity of the software industry.However, not all memory references arecreated equal. Large address spaces spanmultiple levels of the memory hierar-chy, and accessing the data in the low-est levels of memory is orders of


View Full Document

UCF COT 4810 - External Memory Algorithms and Data Structures

Documents in this Course
Spoofing

Spoofing

25 pages

CAPTCHA

CAPTCHA

18 pages

Load more
Download External Memory Algorithms and Data Structures
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view External Memory Algorithms and Data Structures and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view External Memory Algorithms and Data Structures 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?