DOC PREVIEW
UH COSC 6340 - Querying and Mining Data Streams

This preview shows page 1-2-3-4-5-6-7-8-58-59-60-61-62-63-64-65-66-118-119-120-121-122-123-124-125 out of 125 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 125 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 125 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 125 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 125 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 125 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 125 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 125 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 125 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 125 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 125 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 125 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 125 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 125 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 125 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 125 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 125 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 125 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 125 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 125 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 125 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 125 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 125 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 125 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 125 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 125 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 125 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

1GarofalakisGarofalakis, Gehrke,, Gehrke,RastogiRastogi, VLDB’02 #, VLDB’02 #Querying and Mining Data Streams: Querying and Mining Data Streams: You Only Get One LookYou Only Get One LookA TutorialA TutorialMinos Minos Garofalakis Garofalakis Johannes Gehrke Johannes Gehrke Rajeev Rastogi Rajeev Rastogi Bell LaboratoriesCornell University2GarofalakisGarofalakis, Gehrke, Rastogi, VLDB’02 #, Gehrke, Rastogi, VLDB’02 #OutlineOutline• Introduction & Motivation– Stream computation model, Applications• Basic stream synopses computation – Samples, Equi-depth histograms, Wavelets• Mining data streams– Decision trees, clustering, association rules• Sketch-based computation techniques– Self-joins, Joins, Wavelets, V-optimal histograms• Advanced techniques – Sliding windows, Distinct values, Hot lists• Future directions & Conclusions3GarofalakisGarofalakis, Gehrke, Rastogi, VLDB’02 #, Gehrke, Rastogi, VLDB’02 #Processing Data Streams: MotivationProcessing Data Streams: Motivation• A growing number of applications generate streams of data– Performance measurements in network monitoring and traffic management– Call detail records in telecommunications– Transactions in retail chains, ATM operations in banks– Log records generated by Web Servers– Sensor network data• Application characteristics– Massive volumes of data (several terabytes)– Records arrive at a rapid rate •Goal:Mine patterns, process queries and compute statistics on data streams in real-time4GarofalakisGarofalakis, Gehrke, Rastogi, VLDB’02 #, Gehrke, Rastogi, VLDB’02 #Data Streams: Computation ModelData Streams: Computation Model• A data stream is a (massive) sequence of elements:• Stream processing requirements– Single pass: Each record is examined at most once – Bounded storage: Limited Memory (M) for storing synopsis– Real-time: Per record processing time (to maintain synopsis) must be lowStream ProcessingEngine(Approximate)AnswerSynopsis in MemoryData Streamsnee ,...,15GarofalakisGarofalakis, Gehrke, Rastogi, VLDB’02 #, Gehrke, Rastogi, VLDB’02 #Network Management ApplicationNetwork Management Application• Network Management involves monitoring and configuring network hardware and software to ensure smooth operation– Monitor link bandwidth usage, estimate traffic demands– Quickly detect faults, congestion and isolate root cause– Load balancing, improve utilization of network resourcesNetwork Operations Center Network MeasurementsAlarms6GarofalakisGarofalakis, Gehrke, Rastogi, VLDB’02 #, Gehrke, Rastogi, VLDB’02 #IP Network Measurement DataIP Network Measurement Data• IP session data (collected using Cisco NetFlow)• AT&T collects 100 GBs of NetFlow data each day!Source Destination Duration Bytes Protocol10.1.0.2 16.2.3.7 12 20K http18.6.7.1 12.4.0.3 16 24K http13.9.4.3 11.6.8.2 15 20K http15.2.2.9 17.1.2.1 19 40K http12.4.3.8 14.8.7.4 26 58K http10.5.1.3 13.0.0.1 27 100K ftp11.1.0.6 10.3.4.5 32 300K ftp19.7.1.2 16.5.5.8 18 80K ftp7GarofalakisGarofalakis, Gehrke, Rastogi, VLDB’02 #, Gehrke, Rastogi, VLDB’02 #Network Data ProcessingNetwork Data Processing• Traffic estimation– How many bytes were sent between a pair of IP addresses?– What fraction network IP addresses are active? – List the top 100 IP addresses in terms of traffic• Traffic analysis– What is the average duration of an IP session?– What is the median of the number of bytes in each IP session?•Fraud detection– List all sessions that transmitted more than 1000 bytes– Identify all sessions whose duration was more than twice the normal• Security/Denial of Service – List all IP addresses that have witnessed a sudden spike in traffic– Identify IP addresses involved in more than 1000 sessions8GarofalakisGarofalakis, Gehrke, Rastogi, VLDB’02 #, Gehrke, Rastogi, VLDB’02 #Data Stream Processing Data Stream Processing AlgorithmsAlgorithms• Generally, algorithms compute approximate answers– Difficult to compute answers accurately with limited memory• Approximate answers - Deterministic bounds– Algorithms only compute an approximate answer, but bounds on error • Approximate answers - Probabilistic bounds– Algorithms compute an approximate answer with high probability• With probability at least , the computed answer is within a factor of the actual answer• Single-pass algorithms for processing streams also applicable to (massive) terabyte databases!δ−1ε9GarofalakisGarofalakis, Gehrke, Rastogi, VLDB’02 #, Gehrke, Rastogi, VLDB’02 #OutlineOutline• Introduction & Motivation• Basic stream synopses computation– Samples: Answering queries using samples, Reservoir sampling– Histograms: Equi-depth histograms, On-line quantile computation– Wavelets: Haar-wavelet histogram construction & maintenance• Mining data streams• Sketch-based computation techniques• Advanced techniques• Future directions & Conclusions10GarofalakisGarofalakis, Gehrke, Rastogi, VLDB’02 #, Gehrke, Rastogi, VLDB’02 #Sampling: BasicsSampling: Basics• Idea: A small random sample S of the data often well-represents all the data– For a fast approx answer, apply “modified” query to S–Example:select agg from R where R.e is odd(n=12)–If aggis avg, return average of odd elements in S –If aggis count, return average over all elements e in S of• n if e is odd• 0 if e is evenUnbiased: For expressions involving count, sum, avg: the estimatoris unbiased, i.e., the expected value of the answer is the actual answerData stream: 9 3 5 2 7 1 6 5 8 4 9 1Sample S: 9 5 1 8answer: 5answer: 12*3/4 =911GarofalakisGarofalakis, Gehrke, Rastogi, VLDB’02 #, Gehrke, Rastogi, VLDB’02 #Probabilistic Guarantees Probabilistic Guarantees • Example: Actual answer is within 5 ± 1 with prob ≥ 0.9• Use Tail Inequalities to give probabilistic bounds on returned answer–Markov Inequality– Chebyshev’s Inequality– Hoeffding’s Inequality– Chernoff Bound12GarofalakisGarofalakis, Gehrke, Rastogi, VLDB’02 #, Gehrke, Rastogi, VLDB’02 #Tail Inequalities Tail Inequalities • General bounds on tail probabilityof a random variable (that is, probability that a random variable deviates far from its expectation)• Basic Inequalities:Let X be a random variable with expectation and variance Var[X]. Then for any µεµµεProbabilitydistributionTail


View Full Document

UH COSC 6340 - Querying and Mining Data Streams

Download Querying and Mining Data Streams
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Querying and Mining Data Streams and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Querying and Mining Data Streams 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?