Stanford CS 361A - Hashing for Massive/Streaming Data

Unformatted text preview:

CS 361A (Advanced Data Structures and Algorithms)Hashing for Massive/Streaming DataMassive Data SetsAlgorithmic ProblemsAlgorithmic IssuesStream Model of Computation“Toy” Example – Network MonitoringFrequency Related ProblemsExample 1 – Distinct ValuesNaïve ApproachMain Memory Approach Algorithm MMRandomized AlgorithmsExternal Memory ModelJustification?External Memory Algorithm?Algorithm EMProblem with Algorithm EMSampling-based ApproachesNegative Result for Sampling [Charikar, Chaudhuri, Motwani, Narasayya 2000]Scenario AnalysisProofProof (continued)Proof (conclusion)Streaming ModelNegative ResultRandomized ApproximationAnalysisBoosting AccuracySampling versus CountingDistinct Sampling for Streams [Gibbons 2001]Hash FunctionOverall IdeaAlgorithm DS (Distinct Sample)Slide 34ReferencesCS 361A 1CS 361A CS 361A (Advanced Data Structures and Algorithms)(Advanced Data Structures and Algorithms)Lecture 15 (Nov 14, 2005)Hashing for Massive/Streaming DataRajeev MotwaniCS 361A2Hashing for Massive/Streaming DataHashing for Massive/Streaming DataNew TopicNovel hashing techniques + randomized data structuresMotivated by massive/streaming data applicationsGame PlanProbabilistic Counting: Flajolet-Martin & Frequency MomentsMin-HashingLocality-Sensitive HashingBloom FiltersConsistent HashingP2P Hashing…CS 361A3Massive Data SetsMassive Data SetsExamplesWeb (40 billion pages, each 1-10 KB, possibly 100TB of text)Human Genome (3 billion base pairs)Walmart Market-Basket Data (24 TB)Sloan Digital Sky Survey (40 TB)AT&T (300 million call-records per day)Presentation?Network Access (Web)Data Warehouse (Walmart)Secondary Store (Human Genome)Streaming (Astronomy, AT&T)CS 361A4Algorithmic ProblemsAlgorithmic ProblemsExamplesStatistics (median, variance, aggregates)Patterns (clustering, associations, classification)Query Responses (SQL, similarity)Compression (storage, communication)Novelty?Problem size – simplicity, near-linear timeModels – external memory, streamingScale of data – emergent behavior?CS 361A5Algorithmic IssuesAlgorithmic Issues Computational Model Streaming data (or, secondary memory) Bounded main memoryTechniquesNew paradigms neededNegative results and ApproximationRandomization Complexity Measures Memory Time per item (online, real-time) # Passes (linear scan in secondary memory)CS 361A6Stream Model of ComputationStream Model of Computation10111010011Increasing time Main Memory (Synopsis Data Structures)Data StreamMemory: poly(1/ε, log N)Query/Update Time: poly(1/ε, log N)N: # items so far, or window sizeε: error parameterCS 361A7““Toy” Example – Network MonitoringToy” Example – Network MonitoringRegisterMonitoringQueriesDSMSScratch StoreNetwork measurements,Packet traces,…IntrusionWarningsOnlinePerformanceMetricsArchiveLookupTablesCS 361A8 Frequency Related ProblemsFrequency Related Problems1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20Find all elements with frequency > 0.1%Top-k most frequent elementsWhat is the frequency of element 3?What is the total frequency of elements between 8 and 14?Find elements that occupy 0.1% of the tail.Mean + Variance?Median?How many elements have non-zero frequency?Analytics on Packet Headers – IP AddressesAnalytics on Packet Headers – IP AddressesCS 361A9Example 1 – Distinct ValuesExample 1 – Distinct Values ProblemSequenceDomain Compute D(X) number of distinct values in XRemarksAssume stream size n is finite/known (e.g., n is window size)Domain could be arbitrary (e.g., text, tuples)Study impact of …different presentation modelsdifferent algorithmic models… and thereby understand model definitions...,x...,,x,xXn21}1...,,2,1,0{  mUCS 361A10Naïve ApproachNaïve ApproachCounter C(i) for each domain value iInitialize counters C(i) 0Scan X incrementing appropriate countersProblemMemory size M << nSpace O(m) – possibly m >> n(e.g., when counting distinct words in web crawl)In fact, Time O(m) – but tricks to do initialization?CS 361A11Main Memory ApproachMain Memory ApproachAlgorithm MMAlgorithm MMPick r = θ(n), hash function h:U  [1..r]Initialize array A[1..r] and D = 0For each input value xiCheck if xi occurs in list stored at A[h(i)]If not, D D+1 and add xi to list at A[h(i)]Output DFor “random” h, few collisions & most list-sizes O(1)ThusSpace O(n)Time O(1) per item [Expected]CS 361A12Randomized AlgorithmsRandomized AlgorithmsLas Vegas (preceding algorithm)always produces right answerrunning-time is random variableMonte Carlo (will see later)running-time is deterministicmay produce wrong answer (bounded probability)Atlantic City (sometimes also called M.C.)worst of both worldsCS 361A13External Memory ModelExternal Memory ModelRequired when input X doesn’t fit in memoryM words of memoryInput size n >> MData stored on diskDisk block size B << MUnit time to transfer disk block to memoryMemory operations are freeCS 361A14Justification?Justification?Block read/write?Transfer rate ≈ 100 MB/sec (say)Block size ≈ 100 KB (say)Block transfer time << Seek timeThus – only count number of seeksLinear Scaneven better as avoids random seeksFree memory operations?Processor speeds – multi-GHzDisk seek time ≈ 0.01 secCS 361A15External Memory Algorithm?External Memory Algorithm?Question – Why not just use Algorithm MM?ProblemArray A does not fit in memory For each value, need a random portion of AEach value involves a disk block readThus – Ω(n) disk block accessesLinear time – O(n/B) in this modelCS 361A16Algorithm EMAlgorithm EMMerge SortPartition into M/B groupsSort each group (recursively)Merge groups using n/B block accesses(need to hold 1 block from each group in memory)Sorting Time – Compute D(X) – one more passTotal Time – EXERCISE – verify details/analysisnlogBnM/Bnn/B)log(1M/BCS 361A17Problem with Algorithm EMProblem with Algorithm EMNeed to sort and reorder blocks on diskDatabasesTuples with multiple attributesData might need to be ordered by attribute YAlgorithm EM reorders by attribute XIn any case, sorting is too expensiveAlternate ApproachSample portions of dataUse sample to estimate distinct valuesCS 361A18Sampling-based ApproachesSampling-based ApproachesNaïve samplingRandom Sample R (of size r) of n values in XCompute D(R)EstimatorNoteBenefit – sublinear space Cost – estimation error Why? –


View Full Document

Stanford CS 361A - Hashing for Massive/Streaming Data

Documents in this Course
Load more
Download Hashing for Massive/Streaming Data
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Hashing for Massive/Streaming Data and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Hashing for Massive/Streaming Data 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?