Data Mining on Streams Using Decision Trees
CS 536: Machine Learning
Instructor: Michael Littman
TA: Yihua Wu

Outline
• Introduction to data streams
• Overview of traditional decision tree (DT) learning algorithms
• DT learning algorithms on streams
• Open issues of learning on streams

What is Data Mining?
• A technique for finding and describing structural patterns in data.
• A tool for helping to explain data and make predictions from it.
• E.g., supermarket chains, stock market analysis.

The Data Stream Phenomenon
• Highly detailed, automatic, rapid data feeds.
– Radar: meteorological observations.
– Satellite: geodetics, radiation, …
– Astronomical surveys: optical, IR, radio, …
– Internet: traffic logs, user queries, email, financial data, …
– Sensor nodes: many more "observation points".
• Need for near-real-time analysis of data feeds.
– Detect outliers, extreme events, fraud, intrusion, anomalous activity, complex correlations; classification, …
– Monitoring.

Models of Data Streams
• Signal s[1…n], where n is the universe size.
• Three models:
– Time series model: the values s(1), s(2), …, s(t), … arrive in order.
– Cash register model: updates s(j) = s(j) + a(k) with a(k) > 0 (insert only).
– Turnstile model: updates s(j) = s(j) + u(k) (both insert and delete).
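A minimal Python sketch of the three update models (the function names and the toy universe size are our own, for illustration only):

    # Illustrative sketch of the three stream models; s is a signal
    # over a universe of size n, updated as stream items arrive.
    n = 8
    s = [0] * n

    def timeseries_update(s, t, value):
        # Time series model: s(t) itself is observed, in order of t.
        s[t] = value

    def cash_register_update(s, j, a_k):
        # Cash register model: s(j) = s(j) + a(k), a(k) > 0 (insert only).
        assert a_k > 0
        s[j] += a_k

    def turnstile_update(s, j, u_k):
        # Turnstile model: s(j) = s(j) + u(k); u(k) may be negative,
        # so both inserts and deletes are possible.
        s[j] += u_k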
Traditional vs. Stream Mining

                     Traditional   Stream
  num of passes      multiple      single
  time               unlimited     strict
  num of concepts    one           multiple
  result             accurate      approximate
  memory             unlimited     strict

Desiderata
– Per-item processing time
– Space stored
– Time for computing functions on s
– Must all be polylog(n, ||s||).
• And…
– A one-pass algorithm.

Decision Tree
• A decision tree is a classification model; its basic structure is a general tree.
– Internal node: a test on an example's attribute values.
– Leaf node: a class label.
• Input: a set of examples (including both attribute and class variables).
• Output:
– from training data: a decision tree structure;
– on test data: the predicted class.

Example

  outlook    temp   humidity   windy   play
  sunny      hot    high       false   no
  sunny      hot    high       true    no
  overcast   hot    high       false   yes
  rainy      mild   high       false   yes
  rainy      cool   normal     false   yes
  rainy      cool   normal     true    no
  overcast   cool   normal     true    yes
  sunny      mild   high       false   no
  sunny      cool   normal     false   yes
  rainy      mild   normal     false   yes
  sunny      mild   normal     true    yes
  overcast   mild   high       true    yes
  overcast   hot    normal     false   yes
  rainy      mild   high       true    no

The tree learned from this data tests outlook at the root: the overcast branch is a yes leaf, the sunny branch tests humidity (high → no, normal → yes), and the rainy branch tests windy (false → yes, true → no). Query to classify: (sunny, hot, normal, true, ?).
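As a sketch of how such a tree classifies an example (the dict encoding is ours, not from the slides), the query above walks from the root to a leaf:

    # The example tree, encoded as nested dicts: internal nodes test an
    # attribute, plain strings are leaf labels.
    tree = {"attr": "outlook",
            "sunny":    {"attr": "humidity", "high": "no", "normal": "yes"},
            "overcast": "yes",
            "rainy":    {"attr": "windy", "false": "yes", "true": "no"}}

    def classify(node, example):
        # Follow attribute tests until a leaf (a plain label) is reached.
        while isinstance(node, dict):
            node = node[example[node["attr"]]]
        return node

    query = {"outlook": "sunny", "temp": "hot",
             "humidity": "normal", "windy": "true"}
    print(classify(tree, query))   # -> yes (sunny, then humidity = normal)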
Building Decision Trees
• Key idea:
1) Pick an attribute to test at the root.
2) Divide the training data into subsets D_i, one for each value the attribute can take on.
3) Build the tree for each D_i and splice it in under the appropriate branch at the root.
• Picking an attribute: find the attribute that divides the data into subsets that are as pure as possible.

Entropy Value
• Given P_1, …, P_m with 0 ≤ P_i ≤ 1 for 1 ≤ i ≤ m and \sum_{i=1}^{m} P_i = 1:

    entropy(P_1, P_2, …, P_m) = -\sum_{i=1}^{m} P_i \log_2 P_i

• (Plot on slide: entropy(P_1, 1 − P_1) as a function of P_1 over [0, 1], peaking at P_1 = 1/2.)
• The larger the entropy value, the less pure the data is.

Gain Value
• For each class c_i and given data D, let n_i = |D_{c=c_i}| (m is the number of classes) and n = |D|. Then

    Data-Info([n_1, n_2, …, n_m]) = entropy(n_1/n, n_2/n, …, n_m/n).

• For feature f_i with values v_1, …, v_{a_i},

    Split-Info(f_i, D) = \sum_{j=1}^{a_i} w_j \cdot Data-Info([ |D_{f_i=v_j, c=c_1}|, …, |D_{f_i=v_j, c=c_m}| ]),

  where w_j = |D_{f_i=v_j}| / |D|.
• Gain(f_i, D) = Data-Info([n_1, n_2, …, n_m]) − Split-Info(f_i, D).
• Pick the f_i with the largest Gain(f_i, D).

Sufficient Statistics
• Sufficient statistics: n_{ijk}, the number of examples whose i-th attribute has the j-th value and that are classified into the k-th class.
• With n_{ij} = \sum_k n_{ijk}, the split evaluation can be computed from these counts alone:

    Split-Info(f_i, D) = -\sum_j \sum_k (n_{ij}/n) (n_{ijk}/n_{ij}) \log_2 (n_{ijk}/n_{ij}).
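A minimal Python sketch of these formulas (function and variable names are ours); gain() takes the n_ijk counts of one attribute and returns Gain(f_i, D):

    from math import log2

    def entropy(probs):
        # entropy(P1,...,Pm) = -sum Pi log2 Pi  (0 log 0 taken as 0)
        return -sum(p * log2(p) for p in probs if p > 0)

    def gain(n_jk):
        # n_jk[j][k]: number of examples with the attribute's j-th value
        # that belong to the k-th class (the n_ijk table for one attribute).
        n = sum(sum(row) for row in n_jk)
        n_k = [sum(col) for col in zip(*n_jk)]    # class counts over all of D
        data_info = entropy([c / n for c in n_k])
        split_info = sum((sum(row) / n) * entropy([c / sum(row) for c in row])
                         for row in n_jk if sum(row) > 0)
        return data_info - split_info

    # Counts for 'outlook' on the example data: rows sunny/overcast/rainy,
    # columns (yes, no).
    print(gain([[2, 3], [4, 0], [3, 2]]))   # ≈ 0.247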
Criteria
• (Diagram on slide: training data → decision tree structure; test data → predicted class; "Equal?" compares the predicted class against the real class.)
• The criterion is the accuracy of the resulting decision trees: the fraction of test examples whose predicted class equals the real class.

Drawbacks
• One pass over the data is needed for each internal node, so multiple passes in total. (Streams: only one pass.)
• Once the decision on an attribute is made, it is never reconsidered. (Streams: concepts drift.)

Hoeffding Bound
• Consider a real-valued random variable r whose range is R. Suppose we have n independent observations of this variable and compute their mean \bar{r}. The Hoeffding bound states that, with probability 1 − δ, the true mean of the variable is at least \bar{r} − ε, where

    ε = \sqrt{ R^2 \ln(1/δ) / (2n) }.

Properties
• The Hoeffding bound is independent of the probability distribution generating the observations.
• With high probability, the attribute chosen using n examples is the same one that would be chosen using infinitely many examples.

Hoeffding Tree Algorithm (1)
• Inputs:
– S, a sequence of examples;
– X, a set of discrete attributes;
– G(·), a split evaluation function;
– δ, one minus the desired probability of choosing the correct attribute at any given node.
• Output: HT, a decision tree.

Hoeffding Tree Algorithm (2)

Procedure HoeffdingTree(S, X, G, δ)
  Let HT be a tree with a single leaf l_1 (the root).
  For each class y_k
    For each value x_ij of each attribute X_i ∈ X
      Let n_ijk(l_1) = 0.
  For each example (x, y_k) in S
    Sort (x, y_k) into a leaf l using HT.
    For each x_ij in x such that X_i ∈ X_l
      Increment n_ijk(l).
    If the examples seen so far at l are not all of the same class, then
      Compute G_l(X_i) for each attribute X_i ∈ X_l using n_ijk(l).
      Let X_a be the attribute with the highest G_l.
      Let X_b be the attribute with the second-highest G_l.
      Compute ε using the Hoeffding bound.
      If G_l(X_a) − G_l(X_b) > ε, then
        Replace l by an internal node that splits on X_a.
        For each branch of the split
          Add a new leaf l_m, and let X_m = X − {X_a}.
          For each class y_k and each value x_ij of each attribute X_i ∈ X_m
            Let n_ijk(l_m) = 0.
  Return HT.
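A compressed Python sketch of the split test at a single leaf, reusing the gain() helper above (the function names and the choice R = log2(#classes) as the range of information gain are our assumptions, not from the slides):

    from math import log, log2, sqrt

    def hoeffding_epsilon(R, delta, n):
        # ε = sqrt(R² ln(1/δ) / (2n)), from the Hoeffding bound slide.
        return sqrt(R * R * log(1.0 / delta) / (2.0 * n))

    def maybe_split(leaf_counts, n_examples, num_classes, delta):
        # leaf_counts[i] is the n_ijk table of attribute X_i at this leaf.
        scored = sorted((gain(c), i) for i, c in enumerate(leaf_counts))
        (g_b, _), (g_a, attr_a) = scored[-2], scored[-1]  # second-best, best
        eps = hoeffding_epsilon(log2(num_classes), delta, n_examples)
        if g_a - g_b > eps:
            return attr_a   # confident: replace the leaf with a split on X_a
        return None         # not confident yet; keep reading the stream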
Problems in Practice
• More than one attribute may be very close to the current best.
• How much time is spent on a single example?
• How much memory is needed as the tree expands?
• How many candidate attributes should be kept at each node?

VFDT System
• The Very Fast Decision Tree learner, based on the Hoeffding tree algorithm.
• Refinements:
– Ties. If ΔG < ε < τ, where τ is a user-specified threshold, split on the current best attribute.
– G computation. Specify a number n_min of examples that must be accumulated at a leaf before G is recomputed.
– Memory. If the maximum available memory is reached, VFDT deactivates the least promising leaves (those with the lowest p_l · e_l) to make room for new ones; a deactivated leaf can be reactivated if it later looks more promising.
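The ties and n_min refinements change the split test only slightly; a sketch reusing gain() and hoeffding_epsilon() from above (the default values of τ and n_min are illustrative, and the modulo check stands in for fuller per-leaf bookkeeping):

    def vfdt_maybe_split(leaf_counts, n_examples, num_classes, delta,
                         tau=0.05, n_min=200):
        # G computation: only recompute G once n_min more examples
        # have accumulated at this leaf.
        if n_examples % n_min != 0:
            return None
        scored = sorted((gain(c), i) for i, c in enumerate(leaf_counts))
        (g_b, _), (g_a, attr_a) = scored[-2], scored[-1]
        eps = hoeffding_epsilon(log2(num_classes), delta, n_examples)
        # Split when the best attribute wins outright (ΔG > ε), or when
        # ε has shrunk below τ, i.e. a near-tie more data cannot break.
        if g_a - g_b > eps or eps < tau:
            return attr_a
        return None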