CS 361A (Advanced Data Structures and Algorithms)Hashing for Massive/Streaming DataMassive Data SetsAlgorithmic ProblemsAlgorithmic IssuesStream Model of Computation“Toy” Example – Network MonitoringFrequency Related ProblemsExample 1 – Distinct ValuesNaïve ApproachMain Memory Approach Algorithm MMRandomized AlgorithmsExternal Memory ModelJustification?External Memory Algorithm?Algorithm EMProblem with Algorithm EMSampling-based ApproachesNegative Result for Sampling [Charikar, Chaudhuri, Motwani, Narasayya 2000]Scenario AnalysisProofProof (continued)Proof (conclusion)Streaming ModelNegative ResultRandomized ApproximationAnalysisBoosting AccuracySampling versus CountingDistinct Sampling for Streams [Gibbons 2001]Hash FunctionOverall IdeaAlgorithm DS (Distinct Sample)Slide 34ReferencesCS 361A 1CS 361A CS 361A (Advanced Data Structures and Algorithms)(Advanced Data Structures and Algorithms)Lecture 15 (Nov 14, 2005)Hashing for Massive/Streaming DataRajeev MotwaniCS 361A2Hashing for Massive/Streaming DataHashing for Massive/Streaming DataNew TopicNovel hashing techniques + randomized data structuresMotivated by massive/streaming data applicationsGame PlanProbabilistic Counting: Flajolet-Martin & Frequency MomentsMin-HashingLocality-Sensitive HashingBloom FiltersConsistent HashingP2P Hashing…CS 361A3Massive Data SetsMassive Data SetsExamplesWeb (40 billion pages, each 1-10 KB, possibly 100TB of text)Human Genome (3 billion base pairs)Walmart Market-Basket Data (24 TB)Sloan Digital Sky Survey (40 TB)AT&T (300 million call-records per day)Presentation?Network Access (Web)Data Warehouse (Walmart)Secondary Store (Human Genome)Streaming (Astronomy, AT&T)CS 361A4Algorithmic ProblemsAlgorithmic ProblemsExamplesStatistics (median, variance, aggregates)Patterns (clustering, associations, classification)Query Responses (SQL, similarity)Compression (storage, communication)Novelty?Problem size – simplicity, near-linear timeModels – external memory, streamingScale of data – emergent behavior?CS 361A5Algorithmic IssuesAlgorithmic Issues Computational Model Streaming data (or, secondary memory) Bounded main memoryTechniquesNew paradigms neededNegative results and ApproximationRandomization Complexity Measures Memory Time per item (online, real-time) # Passes (linear scan in secondary memory)CS 361A6Stream Model of ComputationStream Model of Computation10111010011Increasing time Main Memory (Synopsis Data Structures)Data StreamMemory: poly(1/ε, log N)Query/Update Time: poly(1/ε, log N)N: # items so far, or window sizeε: error parameterCS 361A7““Toy” Example – Network MonitoringToy” Example – Network MonitoringRegisterMonitoringQueriesDSMSScratch StoreNetwork measurements,Packet traces,…IntrusionWarningsOnlinePerformanceMetricsArchiveLookupTablesCS 361A8 Frequency Related ProblemsFrequency Related Problems1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20Find all elements with frequency > 0.1%Top-k most frequent elementsWhat is the frequency of element 3?What is the total frequency of elements between 8 and 14?Find elements that occupy 0.1% of the tail.Mean + Variance?Median?How many elements have non-zero frequency?Analytics on Packet Headers – IP AddressesAnalytics on Packet Headers – IP AddressesCS 361A9Example 1 – Distinct ValuesExample 1 – Distinct Values ProblemSequenceDomain Compute D(X) number of distinct values in XRemarksAssume stream size n is finite/known (e.g., n is window size)Domain could be arbitrary (e.g., text, tuples)Study impact of …different presentation modelsdifferent algorithmic models… and thereby understand model definitions...,x...,,x,xXn21}1...,,2,1,0{ mUCS 361A10Naïve ApproachNaïve ApproachCounter C(i) for each domain value iInitialize counters C(i) 0Scan X incrementing appropriate countersProblemMemory size M << nSpace O(m) – possibly m >> n(e.g., when counting distinct words in web crawl)In fact, Time O(m) – but tricks to do initialization?CS 361A11Main Memory ApproachMain Memory ApproachAlgorithm MMAlgorithm MMPick r = θ(n), hash function h:U [1..r]Initialize array A[1..r] and D = 0For each input value xiCheck if xi occurs in list stored at A[h(i)]If not, D D+1 and add xi to list at A[h(i)]Output DFor “random” h, few collisions & most list-sizes O(1)ThusSpace O(n)Time O(1) per item [Expected]CS 361A12Randomized AlgorithmsRandomized AlgorithmsLas Vegas (preceding algorithm)always produces right answerrunning-time is random variableMonte Carlo (will see later)running-time is deterministicmay produce wrong answer (bounded probability)Atlantic City (sometimes also called M.C.)worst of both worldsCS 361A13External Memory ModelExternal Memory ModelRequired when input X doesn’t fit in memoryM words of memoryInput size n >> MData stored on diskDisk block size B << MUnit time to transfer disk block to memoryMemory operations are freeCS 361A14Justification?Justification?Block read/write?Transfer rate ≈ 100 MB/sec (say)Block size ≈ 100 KB (say)Block transfer time << Seek timeThus – only count number of seeksLinear Scaneven better as avoids random seeksFree memory operations?Processor speeds – multi-GHzDisk seek time ≈ 0.01 secCS 361A15External Memory Algorithm?External Memory Algorithm?Question – Why not just use Algorithm MM?ProblemArray A does not fit in memory For each value, need a random portion of AEach value involves a disk block readThus – Ω(n) disk block accessesLinear time – O(n/B) in this modelCS 361A16Algorithm EMAlgorithm EMMerge SortPartition into M/B groupsSort each group (recursively)Merge groups using n/B block accesses(need to hold 1 block from each group in memory)Sorting Time – Compute D(X) – one more passTotal Time – EXERCISE – verify details/analysisnlogBnM/Bnn/B)log(1M/BCS 361A17Problem with Algorithm EMProblem with Algorithm EMNeed to sort and reorder blocks on diskDatabasesTuples with multiple attributesData might need to be ordered by attribute YAlgorithm EM reorders by attribute XIn any case, sorting is too expensiveAlternate ApproachSample portions of dataUse sample to estimate distinct valuesCS 361A18Sampling-based ApproachesSampling-based ApproachesNaïve samplingRandom Sample R (of size r) of n values in XCompute D(R)EstimatorNoteBenefit – sublinear space Cost – estimation error Why? –
View Full Document