Berkeley COMPSCI 262A - A Quick Introduction to Data Stream Algorithmics

Unformatted text preview:

A Quick Introduction to Data Stream Algorithmics Minos Garofalakis Yahoo Research UC Berkeley minos acm org Streams A Brave New World Traditional DBMS data stored in finite persistent data sets Data Streams distributed continuous unbounded rapid time varying noisy Data Stream Management variety of modern applications 2 Network monitoring and traffic engineering Sensor networks Telecom call detail records Network security Financial applications Manufacturing processes Web logs and clickstreams Other massive data sets A Quick Intro to Data Stream Algorithmics CS262 Massive Data Streams Data is continuously growing faster than our ability to store or index it There are 3 Billion Telephone Calls in US each day 30 Billion emails daily 1 Billion SMS IMs Scientific data NASA s observation satellites generate billions of readings each per day IP Network Traffic up to 1 Billion packets per hour per router Each ISP has many hundreds routers Whole genome sequences for many species now available each megabytes to gigabytes in size 3 A Quick Intro to Data Stream Algorithmics CS262 Massive Data Stream Analysis Must analyze this massive data Scientific research monitor environment species System management spot faults drops failures Business intelligence marketing rules new offers For revenue protection phone fraud service abuse Else why even measure this data 4 A Quick Intro to Data Stream Algorithmics CS262 Example IP Network Data Networks are sources of massive data the metadata per hour per IP router is gigabytes Fundamental problem of data stream analysis Too much information to store or transmit So process data as it arrives One pass small space the data stream approach Approximate answers to many questions are OK if there are guarantees of result quality 5 A Quick Intro to Data Stream Algorithmics CS262 IP Network Monitoring Application SNMP RMON NetFlow records Peer Network Operations Center NOC Converged IP MPLS Core Source 10 1 0 2 18 6 7 1 13 9 4 3 15 2 2 9 12 4 3 8 10 5 1 3 11 1 0 6 19 7 1 2 Destination 16 2 3 7 12 4 0 3 11 6 8 2 17 1 2 1 14 8 7 4 13 0 0 1 10 3 4 5 16 5 5 8 Duration 12 16 15 19 26 27 32 18 Bytes 20K 24K 20K 40K 58K 100K 300K 80K Protocol http http http http http ftp ftp ftp Example NetFlow IP Session Data Enterprise Networks FR ATM IP VPN DSL Cable Broadband Internet Access Networks Voice over IP 24x7 IP packet flow data streams at network elements Truly massive streams arriving at rapid rates PSTN AT T Sprint collect 1 Terabyte of NetFlow data each day Often shipped off site to data warehouse for off line analysis 6 A Quick Intro to Data Stream Algorithmics CS262 Packet Level Data Streams Single 2Gb sec link say avg packet size is 50bytes Number of packets sec 5 million Time per packet 0 2 microsec If we only capture header information per packet src dest IP time no of bytes etc at least 10bytes Space per second is 50Mb Space per day is 4 5Tb per link ISPs typically have hundreds of links Analyzing packet content streams whole different ballgame 7 A Quick Intro to Data Stream Algorithmics CS262 Network Monitoring Queries Back end Data Warehouse DBMS Oracle DB2 What are the top most frequent 1000 source dest pairs seen over the last month Off line analysis slow expensive Network Operations Center NOC Peer How many distinct source dest pairs have been seen by both R1 and R2 but not R3 R1 R2 Enterprise Networks DSL Cable Networks 8 Set Expression Query R3 SELECT COUNT R1 source R2 dest FROM R1 R2 WHERE R1 dest R2 source PSTN SQL Join Query Extra complexity comes from limited space and time Solutions exist for these and other problems A Quick Intro to Data Stream Algorithmics CS262 Real Time Data Stream Analysis Network Operations Center NOC DSL Cable Networks PSTN BGP Must process network streams in real time and one pass Critical NM tasks fraud DoS attacks SLA violations IP Network Real time traffic engineering to improve utilization Tradeoff result accuracy vs space time communication Fast responses small space time Minimize use of communication resources 9 A Quick Intro to Data Stream Algorithmics CS262 Sensor Networks Wireless sensor networks becoming ubiquitous in environmental monitoring military applications Many 100s 103 106 sensors scattered over terrain Sensors observe and process a local stream of readings Measure light temperature pressure Detect signals movement radiation Record audio images motion 10 A Quick Intro to Data Stream Algorithmics CS262 Query sensornet through a remote base station Sensor nodes have severe resource constraints Limited battery power memory processor radio range Communication is the major source of battery drain transmitting a single bit of data is equivalent to 800 instructions Madden et al 02 base station root coordinator 11 A Quick Intro to Data Stream Algorithmics CS262 http www intel com research exploratory motes htm Sensornet Querying Application Lecture Outline Motivation Streaming Applications Centralized Stream Processing Basic streaming models and tools Stream synopses and applications Sampling sketches Conclusions 12 A Quick Intro to Data Stream Algorithmics CS262 Data Streaming Model Underlying signal One dimensional array A 1 N with values A i all initially zero Multi dimensional arrays as well e g row major Signal is implicitly represented via a stream of update tuples j th update is x c j implying A x A x c j c j can be 0 0 Goal Compute functions on A subject to Small space Fast processing of updates Fast function computation Complexity arises from massive length and domain size N of streams 13 A Quick Intro to Data Stream Algorithmics CS262 Example IP Network Signals Number of bytes packets sent by a source IP address during the day 2 32 sized one d array increment only Number of flows between a source IP destination IP address pair during the day 2 64 sized two d array increment only aggregate packets into flows Number of active flows per source IP address 2 32 14 sized one d array increment and decrement A Quick Intro to Data Stream Algorithmics CS262 Streaming Model Special Cases Time Series Model Only x th update updates A x i e A x c x Cash Register Model Arrivals Only Streams c x is always 0 Typically c x 1 so we see a multi set of items in one pass Example x 3 y 2 x 2 encodes x the arrival of 3 copies of item x y 2 copies of y then 2 copies of x Could represent e g packets on a network power usage 15 A Quick Intro to Data Stream Algorithmics CS262 Streaming Model Special Cases Turnstile Model Arrivals


View Full Document

Berkeley COMPSCI 262A - A Quick Introduction to Data Stream Algorithmics

Download A Quick Introduction to Data Stream Algorithmics
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view A Quick Introduction to Data Stream Algorithmics and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view A Quick Introduction to Data Stream Algorithmics 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?