Data Association for Topic Intensity TrackingDocument classificationA more difficult exampleTypical email traffic (Enron data)Identifying both topics and burstsData association problemThe TaskHow to reason about topic deltas?Generating message arrival timesGenerative Model (conceptual)Do we really need ETA vectors?Slide 12Exponential distribution appropriate?Experimental setupResults (Synthetic data)Slide 16Slide 17Slide 18Inference comparison (Synthethic data)Slide 20Slide 21Slide 22Experiments on real document streamsIntensity identification for Enron dataSlide 25Slide 26Slide 27Reuters news archiveWhat about classification?Classification performanceGeneralizationsTracking topic drift over timeConclusion1Data Association for Topic Intensity TrackingAndreas KrauseJure LeskovecCarlos GuestrinSchool of Computer Science, Carnegie Mellon University2Document classificationEmails from two topics: Conference and HikingWill you go toICML too?Let’s go hikingon Friday!P(C | words) = .9P(C | words) = .1 Conference Hiking3A more difficult exampleEmails from two topics: Conference and HikingWhat if we had temporal information?How about modeling emails as HMM?Let’s have dinnerafter the talk.Should we go onFriday?P(C | words) = .7P(C | words) = .5Could refer to both topics! 2:00 pm2:03 pm ConferenceC1D1C2D2CtDtCt+1Dt+1Assumes equal time steps,“smooth” topic changes.Valid assumptions?40 50 100 150 20005101520Time (days)Number of emails0 50 100 150 200024681012Time (days)Number of emailsTypical email traffic (Enron data)Email traffic very burstyCannot model with uniform time steps! Topic intensities change over time, separately per topicBursts tell us, how intensely a topic is pursued Bursts are potentially very interesting!BurstsNo emailsTopic 1Topic 25Identifying both topics and burstsGiven:A stream of documents (emails): d1, d2, d3, …and corresponding document inter-arrival times (time between consecutive documents):Δ1, Δ2, Δ3, ...Simultaneously:Classify (or cluster) documents into K topicsPredict the topic intensities – predict time between consecutive documents from the same topic6Data association problemIf we know the email topics, we can identify burstsIf we don’t know the topics, we can’t identify bursts!Two-step solution: First classify documents, then identify bursts [Kleinberg ’03] Can fail badly! This paper: Simultaneously identify topics and bursts! High intensity for “Conference”,Low intensity for “Hiking”Low intensity for “Conference”,High intensity for “Hiking”timeIntensity for “Conference” ???Intensity for “Hiking” ???ConferenceHiking7The TaskHave to solve a data association problem: We observe:Message Deltas – time between the arrivals of consecutive documentsWe want to estimate:Topic Deltas – time between messages of the same topicWe can then compute the topic intensity L = E[ 1/Therefore, need to associate each document with a topicNeed topics to identify intensityNeed intensity toclassify (better)Chicken and Eggproblem:8How to reason about topic deltas?Associate with each email time vectors per topicC: 2:00 pmH: 2:30 pmC: 4:15pmH: 2:30 pmC: 4:15 pmH: 7:30 pmConferenceHikingMessage = min 2 – min 1(= 30min) Topic = 2h 15min (consecutive msg. of same topic)τ1τ2τ3Expected arrival times per topicΔ2C2Topic C= argmin 2(= Hiking) Email 1,ConferenceAt 2:00 pmEmail 2,HikingAt 2:30 pmEmail 3,ConferenceAt 4:15 pm9Generating message arrival timesC: 2:00 pmH: 2:30 pmC: 4:15pmH: 2:30 pmC: 4:15 pmH: 7:30 pmτ1τ2τ3Does not change, as topic not “active”.Incremented by exponential distribution,parameter Exp(L(C))L(H)1L(H)2L(H)3L(C)1L(C)2L(C)2Want generative model for the time vectors 10L(H)t-1L(H)tL(H)t+1τt-1τtτt+1CtDtΔtGenerative Model (conceptual)L(C)t-1L(C)tL(C)t+1ETA per topicMessage TopicDocumentIntensity for“Conference”Intensity for“Hiking”Problem: Need to reason about entire history of timesteps t!(Domain of t grows linearly with time.)Makes inference intractable, even for few topics!11Do we really need ETA vectors?We know Message t = min t – min t-1.Since Topic follow exponential distribution, memorylessness impliesP(t+1(C) > 4pm | t (C) = 2pm, it’s now 3pm) = P(t+1(C) > 4pm | t (C) = 3pm, it’s now 3pm)Hence t distributed as min {Exp(Lt(C)),Exp(Lt(H))}Closed form: t ~ Exp(Lt(C) + Lt(H) )Similarly, Ct ~ argmin {Exp(Lt(C)),Exp(Lt(H))}Closed form: Ct ~ Bernoulli( Lt(C) / (Lt(C) + Lt(H) ) )Last arrival time irrelevant!Can discard ETA vectors ! Quite general modeling trick!12L(H)t-1L(H)tL(H)t+1τt-1τtτt+1CtDtΔtGenerative Model (conceptual)L(C)t-1L(C)tL(C)t+1Implicit Data Association (IDA) ModelTurns model (essentially) into Factorial HMMMany efficient inference techniques available! 13Exponential distribution appropriate?Previous work on document streams (E.g., Kleinberg ‘03)Frequently used to model transition timesWhen adding hidden variables, can model arbitrary transition distributions (Nodelman et al)14Experimental setupInference ProceduresFull (conceptual) model:Particle filterSimplified Model:Particle filterFully factorized mean fieldExtract inferenceComparison to the two-step approach (first classify, then identify bursts)15Results (Synthetic data)Periodic message arrivals (uninformative Δ) with noisy class assignments: ABBBABABABBB…Naïve Bayesmisclassifies based on features0 20 40 60 80051015202530Message numberTopic deltaTopic 16Results (Synthetic data)Periodic message arrivals (uninformative Δ) with noisy class assignments: ABBBABABABBB…0 20 40 60 80015202530Message numberTopic deltaTopic Part. Filt.(Full model)Naïve Bayesmisclassifies based on features17Results (Synthetic data)Periodic message arrivals (uninformative Δ) with noisy class assignments: ABBBABABABBB…0 20 40 60 80015202530Message numberTopic deltaTopic Part. Filt.(Full model)ExactinferenceNaïve Bayesmisclassifies based on features18Naïve Bayesmisclassifies based on featuresResults (Synthetic data)Periodic message arrivals (uninformative Δ) with noisy class assignments: ABBBABABABBB…0 20 40 60 80051015202530Message numberTopic deltaTopic Part. Filt.(Full model)ExactinferenceWeighted automaton(first classify, then bursts)Implicit Data
or
We will never post anything without your permission.
Don't have an account? Sign up