Data Association for Topic Intensity Tracking

Home> Academic Documents> Data Association for Topic Intensity Tracking

DOC PREVIEW

This preview shows page 1-2-15-16-17-32-33 out of 33 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 33 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 33 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 33 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 33 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 33 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 33 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 33 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 33 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

Data Association for Topic Intensity TrackingDocument classificationA more difficult exampleTypical email traffic (Enron data)Identifying both topics and burstsData association problemThe TaskHow to reason about topic deltas?Generating message arrival timesGenerative Model (conceptual)Do we really need ETA vectors?Slide 12Exponential distribution appropriate?Experimental setupResults (Synthetic data)Slide 16Slide 17Slide 18Inference comparison (Synthethic data)Slide 20Slide 21Slide 22Experiments on real document streamsIntensity identification for Enron dataSlide 25Slide 26Slide 27Reuters news archiveWhat about classification?Classification performanceGeneralizationsTracking topic drift over timeConclusion1Data Association for Topic Intensity TrackingAndreas KrauseJure LeskovecCarlos GuestrinSchool of Computer Science, Carnegie Mellon University2Document classificationEmails from two topics: Conference and HikingWill you go toICML too?Let’s go hikingon Friday!P(C | words) = .9P(C | words) = .1 Conference  Hiking3A more difficult exampleEmails from two topics: Conference and HikingWhat if we had temporal information?How about modeling emails as HMM?Let’s have dinnerafter the talk.Should we go onFriday?P(C | words) = .7P(C | words) = .5Could refer to both topics! 2:00 pm2:03 pm ConferenceC1D1C2D2CtDtCt+1Dt+1Assumes equal time steps,“smooth” topic changes.Valid assumptions?40 50 100 150 20005101520Time (days)Number of emails0 50 100 150 200024681012Time (days)Number of emailsTypical email traffic (Enron data)Email traffic very burstyCannot model with uniform time steps! Topic intensities change over time, separately per topicBursts tell us, how intensely a topic is pursued Bursts are potentially very interesting!BurstsNo emailsTopic 1Topic 25Identifying both topics and burstsGiven:A stream of documents (emails): d1, d2, d3, …and corresponding document inter-arrival times (time between consecutive documents):Δ1, Δ2, Δ3, ...Simultaneously:Classify (or cluster) documents into K topicsPredict the topic intensities – predict time between consecutive documents from the same topic6Data association problemIf we know the email topics, we can identify burstsIf we don’t know the topics, we can’t identify bursts!Two-step solution: First classify documents, then identify bursts [Kleinberg ’03] Can fail badly! This paper: Simultaneously identify topics and bursts! High intensity for “Conference”,Low intensity for “Hiking”Low intensity for “Conference”,High intensity for “Hiking”timeIntensity for “Conference” ???Intensity for “Hiking” ???ConferenceHiking7The TaskHave to solve a data association problem: We observe:Message Deltas – time between the arrivals of consecutive documentsWe want to estimate:Topic Deltas – time between messages of the same topicWe can then compute the topic intensity L = E[ 1/Therefore, need to associate each document with a topicNeed topics to identify intensityNeed intensity toclassify (better)Chicken and Eggproblem:8How to reason about topic deltas?Associate with each email time vectors  per topicC: 2:00 pmH: 2:30 pmC: 4:15pmH: 2:30 pmC: 4:15 pmH: 7:30 pmConferenceHikingMessage = min 2 – min 1(= 30min) Topic = 2h 15min (consecutive msg. of same topic)τ1τ2τ3Expected arrival times per topicΔ2C2Topic C= argmin 2(= Hiking) Email 1,ConferenceAt 2:00 pmEmail 2,HikingAt 2:30 pmEmail 3,ConferenceAt 4:15 pm9Generating message arrival timesC: 2:00 pmH: 2:30 pmC: 4:15pmH: 2:30 pmC: 4:15 pmH: 7:30 pmτ1τ2τ3Does not change, as topic not “active”.Incremented by exponential distribution,parameter Exp(L(C))L(H)1L(H)2L(H)3L(C)1L(C)2L(C)2Want generative model for the time vectors 10L(H)t-1L(H)tL(H)t+1τt-1τtτt+1CtDtΔtGenerative Model (conceptual)L(C)t-1L(C)tL(C)t+1ETA per topicMessage TopicDocumentIntensity for“Conference”Intensity for“Hiking”Problem: Need to reason about entire history of timesteps t!(Domain of t grows linearly with time.)Makes inference intractable, even for few topics!11Do we really need ETA vectors?We know Message t = min t – min t-1.Since Topic  follow exponential distribution, memorylessness impliesP(t+1(C) > 4pm | t (C) = 2pm, it’s now 3pm) = P(t+1(C) > 4pm | t (C) = 3pm, it’s now 3pm)Hence t distributed as min {Exp(Lt(C)),Exp(Lt(H))}Closed form: t ~ Exp(Lt(C) + Lt(H) )Similarly, Ct ~ argmin {Exp(Lt(C)),Exp(Lt(H))}Closed form: Ct ~ Bernoulli( Lt(C) / (Lt(C) + Lt(H) ) )Last arrival time irrelevant!Can discard ETA vectors ! Quite general modeling trick!12L(H)t-1L(H)tL(H)t+1τt-1τtτt+1CtDtΔtGenerative Model (conceptual)L(C)t-1L(C)tL(C)t+1Implicit Data Association (IDA) ModelTurns model (essentially) into Factorial HMMMany efficient inference techniques available! 13Exponential distribution appropriate?Previous work on document streams (E.g., Kleinberg ‘03)Frequently used to model transition timesWhen adding hidden variables, can model arbitrary transition distributions (Nodelman et al)14Experimental setupInference ProceduresFull (conceptual) model:Particle filterSimplified Model:Particle filterFully factorized mean fieldExtract inferenceComparison to the two-step approach (first classify, then identify bursts)15Results (Synthetic data)Periodic message arrivals (uninformative Δ) with noisy class assignments: ABBBABABABBB…Naïve Bayesmisclassifies based on features0 20 40 60 80051015202530Message numberTopic deltaTopic 16Results (Synthetic data)Periodic message arrivals (uninformative Δ) with noisy class assignments: ABBBABABABBB…0 20 40 60 80015202530Message numberTopic deltaTopic Part. Filt.(Full model)Naïve Bayesmisclassifies based on features17Results (Synthetic data)Periodic message arrivals (uninformative Δ) with noisy class assignments: ABBBABABABBB…0 20 40 60 80015202530Message numberTopic deltaTopic Part. Filt.(Full model)ExactinferenceNaïve Bayesmisclassifies based on features18Naïve Bayesmisclassifies based on featuresResults (Synthetic data)Periodic message arrivals (uninformative Δ) with noisy class assignments: ABBBABABABBB…0 20 40 60 80051015202530Message numberTopic deltaTopic Part. Filt.(Full model)ExactinferenceWeighted automaton(first classify, then bursts)Implicit Data


School:
Email:
New Password:
Confirm Password:

This preview shows page 1-2-15-16-17-32-33 out of 33 pages.

Please select your school