CIS 5930 Advanced Topics in Data Management
Feifei Li, Fall 2008
(Many slides were made available by Ke Yi)

Massive Data
• Massive datasets are being collected everywhere
• Storage management software is a billion-$ industry
Examples (2002):
• Phone: AT&T 20TB phone call database, wireless tracking
• Consumer: WalMart 70TB database, buying patterns
• Web: Web crawl of 200M pages and 2000M links; Akamai stores 7 billion clicks per day
• Geography: NASA satellites generate 1.2TB per day

Example: LIDAR Terrain Data
• Massive (irregular) point sets (1-10m resolution)
  – Becoming relatively cheap and easy to collect
• Appalachian Mountains: between 50GB and 5TB
• Exceeds memory limit and needs to be stored on disk

Example: Network Flow Data
• AT&T IP backbone generates 500 GB per day
• Gigascope: a data stream management system
  – Compute certain statistics
• Can we do computation without storing the data?

Random Access Machine Model
• Standard theoretical model of computation:
  – Infinite memory
  – Uniform access cost
• Simple model crucial for success of computer industry
[Figure: RAM]

Hierarchical Memory
• Modern machines have a complicated memory hierarchy
  – Levels get larger and slower further away from CPU
  – Data moved between levels using large blocks
[Figure: L1, L2, RAM]

Slow I/O
• Disk access is $10^6$ times slower than main memory access
  – Disk systems try to amortize the large access time by transferring large contiguous blocks of data (8-16 Kbytes)
  – Important to store/access data to take advantage of blocks (locality)
[Figure: disk with track, magnetic surface, read/write arm]
“The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one’s desk or by taking an airplane to the other side of the world and using a sharpener on someone else’s desk.” (D. Comer)

Scalability Problems
• Most programs are developed in the RAM model
  – They run on large datasets because the OS moves blocks as needed
• Modern OS utilizes sophisticated paging and prefetching strategies
  – But if the program makes scattered accesses, even a good OS cannot take advantage of block access
Scalability problems!
[Figure: running time vs. data size]

Solution 1: Buy More Memory
• Expensive
• (Probably) not scalable
  – Growth rate of data is higher than the growth of memory

Solution 2: Cheat! (by random sampling)
• Provide approximate solution for some problems
  – average, frequency of an element, etc.
• What if we want the exact result?
• Many problems can’t be solved by sampling
  – maximum, and all problems mentioned later

Solution 3: Using the Right Computation Model
• External Memory Model
• Streaming Model
• Uncertain Data Model

External Memory Model
N = # of items in the problem instance
B = # of items per disk block
M = # of items that fit in main memory
T = # of items in output
I/O: Move block between memory and disk
We assume (for convenience) that $M > B^2$
[Figure labels: D, P, M, Block I/O]

Fundamental Bounds
                Internal        External
• Scanning:     $N$             $N/B$
• Sorting:      $N \log N$      $\frac{N}{B} \log_{M/B} \frac{N}{B}$
• Permuting:    $N$             $\min\{N, \frac{N}{B} \log_{M/B} \frac{N}{B}\}$
• Searching:    $\log_2 N$      $\log_B N$
• Note:
  – Linear I/O: $O(N/B)$
  – Permuting not linear
  – Permuting and sorting bounds are equal in all practical cases
  – B factor VERY important: $\frac{N}{B} < \frac{N}{B} \log_{M/B} \frac{N}{B} \ll N$
  – Cannot sort optimally with search tree

Queues and Stacks
• Queue:
  – Maintain push and pop blocks in main memory
  – $O(1/B)$ amortized I/Os per Push/Pop operation
• Stack:
  – Maintain push/pop block in main memory
  – $O(1/B)$ amortized I/Os per Push/Pop operation
[Figure: Push, Pop]

Puzzle #1: Majority Counting
• A huge file of characters stored on disk
• Question: Is there a character that appears > 50% of the time?
• Solution 1: sort + scan
  – A few passes ($O(\log_{M/B} N)$): will come to it later
• Solution 2: divide-and-conquer
  – Load a chunk into memory: N/M chunks
  – Count them, return majority
  – The overall majority must be the majority in >50% chunks
  – Iterate until < M
  – Very few passes ($O(\log_M N)$), geometrically decreasing
• Solution 3: O(1) memory, 2 passes (answer to be posted later)
Example file: b a e c a d a a d a a e a b a a f a g b

Sorting
• < M/B sorted lists (queues) can be merged in $O(N/B)$ I/Os
[Figure: M/B blocks in main memory]
• Merge sort:
  – Create N/M memory-sized sorted lists
  – Repeatedly merge lists together, $\Theta(M/B)$ at a time
  – $O(\log_{M/B} \frac{N}{M})$ phases using $O(N/B)$ I/Os each, so $O(\frac{N}{B} \log_{M/B} \frac{N}{B})$ I/Os in total
[Figure: number of lists per level: $\frac{N}{M}$, $\frac{N}{M} / \frac{M}{B}$, $\frac{N}{M} / (\frac{M}{B})^2$, …, 1]

2-Way Sort: Requires 3 Buffers
• Phase 1: PREPARE.
  – Read a page, sort it, write it.
  – Only one buffer page is used
• Phase 2, 3, …, etc.: MERGE:
  – Three buffer pages used.
[Figure: Disk → main memory buffers (INPUT 1, INPUT 2, OUTPUT) → Disk]

Two-Way External Merge Sort
• Idea: Divide and conquer: sort subfiles and merge into larger sorts
[Figure: merge tree, one page holds two items]
  Input file:            3,4 | 6,2 | 9,4 | 8,7 | 5,6 | 3,1 | 2
  PASS 0, 1-page runs:   3,4 | 2,6 | 4,9 | 7,8 | 5,6 | 1,3 | 2
  PASS 1, 2-page runs:   2,3 4,6 | 4,7 8,9 | 1,3 5,6 | 2
  PASS 2, 4-page runs:   2,3 4,4 6,7 8,9 | 1,2 3,5 6
  PASS 3, 8-page runs:   1,2 2,3 3,4 4,5 6,6 7,8 9
• Cost per pass: all pages
• # of passes: height of tree
• Total cost: product of the above
• Each pass we read + write each page in the file.
• N/B pages in file => 2N/B I/Os per pass
• Number of passes: $\lceil \log_2 \frac{N}{B} \rceil + 1$
• So total cost is: $2 \frac{N}{B} \left( \lceil \log_2 \frac{N}{B} \rceil + 1 \right)$

External Merge Sort
• What if we had more buffer pages?
• How do we utilize them wisely?
  – Two main ideas!

Phase 1: Prepare
• Construct as large as possible starter lists.
[Figure: Disk → M/B main memory buffers (INPUT 1, INPUT 2, …, INPUT M/B) → Disk]

Phase 2: Merge
• Compose as many sorted sublists as possible into one long sorted list.
[Figure: Disk → M/B main memory buffers (INPUT 1, INPUT 2, …, INPUT M/B-1, OUTPUT) → Disk]

General External Merge Sort
• How can we utilize more than 3 buffer pages?
• To sort a file with N/B pages using M/B buffer pages:
  – Pass 0: use M/B buffer pages. Produce $\lceil \frac{N/B}{M/B} \rceil$ sorted runs of M/B pages each.
  – Pass 1, 2, …, etc.: merge M/B-1 runs.
[Figure: Disk → M/B main memory buffers (INPUT 1, INPUT 2, …, INPUT M/B-1, OUTPUT) → Disk]

Selection Algorithm
• In internal memory, the (deterministic) quicksort split element (median) is found using linear-time selection
• Selection algorithm: Finding i’th element in sorted order
1) Select
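
The pass counts and total costs stated for two-way and general external merge sort can be checked numerically. The following is a minimal sketch, assuming sizes are given in pages (N/B pages in the file, M/B buffer pages) and that every pass reads and writes each page once; the function names are illustrative and do not come from the slides.

```python
def merge_passes(num_runs, fan_in):
    """Number of merge passes needed to reduce num_runs runs to one run."""
    passes = 0
    while num_runs > 1:
        num_runs = -(-num_runs // fan_in)  # ceiling division
        passes += 1
    return passes


def two_way_sort_cost(num_pages):
    """(passes, page I/Os) for two-way external merge sort."""
    # Pass 0 produces 1-page runs; each later pass merges runs pairwise.
    passes = 1 + merge_passes(num_pages, 2)
    return passes, 2 * num_pages * passes


def general_sort_cost(num_pages, buffer_pages):
    """(passes, page I/Os) for external merge sort with buffer_pages buffers."""
    if buffer_pages < 3:
        raise ValueError("need at least 3 buffer pages (2 inputs + 1 output)")
    # Pass 0 fills all buffers and writes runs of buffer_pages pages each.
    initial_runs = -(-num_pages // buffer_pages)
    # Later passes merge (buffer_pages - 1) runs at a time; one page is output.
    passes = 1 + merge_passes(initial_runs, buffer_pages - 1)
    return passes, 2 * num_pages * passes


if __name__ == "__main__":
    print(two_way_sort_cost(7))       # the 7-page file from the merge tree
    print(general_sort_cost(108, 5))  # hypothetical file size and buffer count
```

For the 7-page file in the merge-tree figure, the two-way variant needs 4 passes (PASS 0 to PASS 3), which agrees with the formula for the number of passes.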
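
The two phases of general external merge sort (prepare memory-sized runs, then merge M/B-1 runs at a time) can be simulated in memory. This is a sketch under stated assumptions: "disk" runs are plain Python lists, `page_size` plays the role of B, `buffer_pages` the role of M/B, and `external_merge_sort` is a hypothetical name. It does not perform real block I/O.

```python
import heapq


def external_merge_sort(records, page_size, buffer_pages):
    """Sort records by simulated external merge sort.

    Returns (sorted_records, number_of_passes).
    """
    if buffer_pages < 3:
        raise ValueError("need at least 3 buffer pages (2 inputs + 1 output)")

    # Pass 0: load as many records as fit in all buffer pages, sort, write a run.
    run_len = page_size * buffer_pages
    runs = [sorted(records[i:i + run_len])
            for i in range(0, len(records), run_len)]
    passes = 1

    # Merge passes: one buffer page is reserved for output,
    # so at most (buffer_pages - 1) runs are merged at a time.
    fan_in = buffer_pages - 1
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + fan_in]))
                for i in range(0, len(runs), fan_in)]
        passes += 1

    return (runs[0] if runs else []), passes


if __name__ == "__main__":
    data = [3, 4, 6, 2, 9, 4, 8, 7, 5, 6, 3, 1, 2]  # input file of the merge tree
    print(external_merge_sort(data, page_size=2, buffer_pages=3))
```

With three buffer pages the same 13 records are sorted in 3 passes instead of the 4 passes of the two-way variant, because pass 0 already produces 3-page runs.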
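
Solution 3 of Puzzle #1 is left open in the slides. One well-known algorithm that fits the stated budget of O(1) memory and 2 passes is the Boyer-Moore majority vote; whether this is the answer the lecturer intended is an assumption. In the sketch below, `read_pass` is a hypothetical callable that starts a fresh sequential scan of the file each time it is called.

```python
from typing import Callable, Iterable, Optional


def majority(read_pass: Callable[[], Iterable[str]]) -> Optional[str]:
    """Return the character occurring in > 50% of positions, or None."""
    # Pass 1: keep one candidate and one counter (O(1) memory).
    # A different character cancels one occurrence of the candidate.
    candidate, count = None, 0
    for ch in read_pass():
        if count == 0:
            candidate, count = ch, 1
        elif ch == candidate:
            count += 1
        else:
            count -= 1

    # Pass 2: the survivor is only a candidate, so verify it by counting.
    total = occurrences = 0
    for ch in read_pass():
        total += 1
        if ch == candidate:
            occurrences += 1

    return candidate if 2 * occurrences > total else None


if __name__ == "__main__":
    print(majority(lambda: "aabab"))
    print(majority(lambda: "abcabc"))
```

Both passes are sequential scans, so on a file of N characters each pass costs O(N/B) I/Os in the external memory model.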


FSU CIS 5930r - lecture1

Download lecture1
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view lecture1 and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view lecture1 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?