Lecture 21 Stream Data Management Nov 14 2007 ChengXiang Zhai Many slides are taken from Jennifer Widom s presentation about the STREAM CS511 Advanced Database Management Systems project 1 Data Streams Traditional DBMS data stored in finite persistent data sets New applications data as multiple continuous rapid time varying data streams Network monitoring and traffic engineering Security applications Telecom call records Financial applications Web logs and click streams Sensor networks Manufacturing processes CS511 Advanced Database Management Systems 2 Making Things Concrete Database two streams of mobile call records Outgoing connectionID caller start end Incoming connectionID callee start end Query language SQL FROM clauses can refer to streams and or relations CS511 Advanced Database Management Systems 3 Query Example 1 Find all outgoing calls longer than 2 minutes relational selection SELECT O connectionID O caller FROM Outgoing O WHERE O end O start 2 Result requires unbounded storage Can provide result as data stream CS511 Advanced Database Management Systems 4 Query Example 2 Pair up callers and callees relational join SELECT O caller I callee FROM Outgoing O Incoming I WHERE O connectionID I connectionID Can still provide result as data stream Requires unbounded temporary storage without additional assumptions CS511 Advanced Database Management Systems 5 Query Example 3 Find total connection time for each caller relational grouping and aggregation SELECT O caller sum O end O start FROM Outgoing O GROUP BY O caller Cannot provide result in append only stream CS511 Advanced Database Management Systems 6 Challenges Multiple continuous rapid time varying streams of data Queries may be continuous not just one time Evaluated continuously as stream data arrives Answer updated over time Queries may be complex Beyond element at a time processing Beyond stream at a time processing CS511 Advanced Database Management Systems 7 Using Traditional Database User Application Query Query Result Result Loader CS511 Advanced Database Management Systems 8 New Approach for Data Streams User Application Register Query Result Stream Query Processor CS511 Advanced Database Management Systems 9 New Approach for Data Streams User Application Register Query Result Stream Query Processor Data Stream Management System DSMS Scratch Space Memory and or Disk CS511 Advanced Database Management Systems 10 DBMS versus DSMS CS511 Advanced Database Management Systems 11 DBMS versus DSMS Persistent relations CS511 Advanced Database Management Systems Transient streams and persistent relations 12 DBMS versus DSMS Persistent relations Transient streams and persistent relations One time queries Continuous queries CS511 Advanced Database Management Systems 13 DBMS versus DSMS Persistent relations Transient streams and persistent relations One time queries Continuous queries Random access CS511 Advanced Database Management Systems Sequential access 14 DBMS versus DSMS Persistent relations Transient streams and persistent relations One time queries Continuous queries Random access Access plan determined by query processor and physical DB design CS511 Advanced Database Management Systems Sequential access Unpredictable data arrival and characteristics 15 DBMS versus DSMS Persistent relations Transient streams and persistent relations One time queries Continuous queries Bounded main memory Random access Access plan determined by query processor and physical DB design Unbounded disk store CS511 Advanced Database Management Systems Sequential access Unpredictable data arrival and characteristics 16 Major Projects Aurora Brown Workflow oriented Cougar Cornell Sensor DB Gigascope AT T Network monitoring NiagaraCQ Wisconsin XML Web OpenCQ GA Tech Web StatStream NYU Online statistics from many streams Stream Stanford All purpose approximate query TelegraphCQ Berkeley Sensor data adaptivity CS511 Advanced Database Management Systems 17 Requirements of a DSMS Data model Query semantics Must allow order based e g last 100 values and time based operations e g last 5 minutes Storage Can t store a complete stream often store approximate summary structures i e synopses or digests Answers will be inexact but should be as accurate as possible Backtracking over a data stream is not feasible Streaming query plans may not use blocking operators Query processing On line stream algorithms are restricted to making only one pass over the data Applications that monitor streams in real time must react quickly to unusual data values Long running queries may encounter changes in system conditions throughout their execution lifetimes e g variable stream rates Shared execution of many continuous queries is needed to ensure scalability CS511 Advanced Database Management Systems 18 Query Processing Architecture Synopses Output Stream Query Plans Running Op X Ready Op Applications register continuous queries Waiting Op Input Data Streams CS511 Advanced Database Management Systems X Users issue continuous and ad hoc queries Administrator can monitor query execution and adjust run time 19 parameters Data Model Database relations data streams Stream characteristics Type of data schema Data distribution Flow rate Stability of distribution and flow Ordering and other constraints Synchronization of multiple streams Distributed streams CS511 Advanced Database Management Systems 20 Window Models Direction of movement of the endpoints Fixed window both fixed Sliding window both moving Landmark window 1 fixed 1 moving Physical time based vs logical count based window Update interval Eager updating after seeing every new tuple Lazy Batch updating after seeing many tuples may lead to tumbling windows CS511 Advanced Database Management Systems 21 Data Stream Queries Basic Issues Answer availability One time Multiple time Continuous standing stored or streamed Registration time Predefined Ad hoc Stream access Arbitrary Sliding window special case size 1 CS511 Advanced Database Management Systems 22 Fundamental Stream Operations Selection e g monitoring network traffic from an IP Multiplexing demultiplexing similar to GroupBy and Union Frequent items queries top k or threshold queries Joins multi stream join join a stream with static data Nested aggregation e g comparing a minimum with a running average Stream mining pattern matching similarity search forecasting Windowed queries any query can be constrained to return results within a window CS511 Advanced
View Full Document