Automatic Blog Monitoring and Summarization
Ka Cheung “Richard” Sia
PhD Prospectus

With/without organized access

Inaccessible?
[Figure: % of feeds vs. # of subscribers. x-axis: # of subscribers (1+, 20+, 50+, 1000+, 5000+); y-axis: % of feeds (0% to 100%). Source: AskJeeves]

Introduction
- Organized access to blogs
  - Full coverage
  - Reflect changes quickly
  - Filtered and organized presentation
- Intended contributions
  - Efficient techniques to harvest blogs
  - Algorithms to monitor frequently changing data sources
  - Algorithms to reconstruct implicit networks and compose topic summaries

Modules
- Monitoring
- Collection (future work)
- Topic detection and tracking (future work)
- Conclusion

Monitoring
Preliminary results

Framework
- A central server monitors data source changes and provides succinct summaries to users

Overview
- New challenges
  - Content changes more rapidly, with recurring patterns
  - More time-sensitive requirements
- Modeling of posting updates
- Definition of delay
- Strategies for allocation and scheduling

Characteristics
- Homogeneous Poisson model: $\lambda(t) = \lambda$ at any $t$
- Periodic inhomogeneous Poisson model: $\lambda(t) = \lambda(t - nT)$, $n = 1, 2, \ldots$

Definition of metrics
- Delay of a post published at time $t_i$ and picked up by the retrieval at time $\tau_j$: $D(t_i) = \tau_j - t_i$
- Delay of a data source (sum of elapsed time for every post): $D(O) = \sum_{i=1}^{k} D(t_i)$
- Delay experienced by the aggregator: $D(A) = \sum_{i=1}^{n} w_i D(O_i)$

Definition of metrics (cont.)
- $\tau_j$: retrieval time
- $\lambda(t)$: posting rate
- Expected delay between consecutive retrievals $\tau_j$ and $\tau_{j+1}$
  - Homogeneous Poisson model: $D(O) = \frac{\lambda (\tau_{j+1} - \tau_j)^2}{2}$
  - Inhomogeneous Poisson model: $D(O) = \int_{\tau_j}^{\tau_{j+1}} \lambda(t)\,(\tau_{j+1} - t)\,dt$

Problem formulation
- Minimize the expected delay experienced by the aggregator under the constraint of limited resources
- Schedule the $\tau_j$'s such that $D(A) = \sum_{i=1}^{n} w_i D(O_i)$ is minimized

Approach
- Resource allocation
  - How often should we contact each data source?
  - If O1 is more active than O2, how much more often should we contact O1 than O2?
- Retrieval scheduling
  - When should we contact a data source?
  - If 3 retrievals are allocated to O1, when should these 3 retrievals be placed?

Resource allocation
- Consider n data sources $O_1, \ldots, O_n$
  - $\lambda_i$: posting rate of $O_i$
  - $w_i$: weight of $O_i$
  - $N$: total number of retrievals per day
  - $m_i$: number of retrievals per day allocated to $O_i$
- Optimal allocation: $m_i \propto \sqrt{w_i \lambda_i}$

Retrieval scheduling
- m retrieval(s) per day are allocated to a data source O. How should we schedule these m retrievals?
  - m = 1
  - m > 1

Single retrieval per period
- $\lambda(t) = 1$ for $t \in [0, 1]$, $\lambda(t) = 0$ for $t \in [1, 2]$
- Periodicity T = 2
- τ = 0.5, expected delay = 0.75
- τ = 1, expected delay = 0.5
- τ = 2, expected delay = 1.5

Single retrieval per period (cont.)
- For a data source with posting rate $\lambda(t)$ and period $T$, the expected delay when retrieved at time $\tau$ is given by:
  $D(\tau) = \int_{0}^{\tau} \lambda(t)\,(\tau - t)\,dt + \int_{\tau}^{T} \lambda(t)\,(T + \tau - t)\,dt$
- Criteria for optimality: $\lambda(\tau) = \frac{1}{T} \int_{0}^{T} \lambda(t)\,dt$ and $\frac{d\lambda(\tau)}{d\tau} < 0$

Multiple retrievals per period
- When m retrievals per period are allocated and scheduled at times $\tau_1, \ldots, \tau_m$, the expected delay is given by:
  $D(O) = \sum_{i=1}^{m} \int_{\tau_i}^{\tau_{i+1}} \lambda(t)\,(\tau_{i+1} - t)\,dt$, where $\tau_{m+1} = T + \tau_1$
- Criteria for optimality: $\lambda(\tau_j)\,(\tau_{j+1} - \tau_j) = \int_{\tau_{j-1}}^{\tau_j} \lambda(t)\,dt$

Example
- 6 retrievals for $\lambda(t) = 2 + 2\sin(2\pi t)$
- Criteria for optimality: $\lambda(\tau_j)\,(\tau_{j+1} - \tau_j) = \int_{\tau_{j-1}}^{\tau_j} \lambda(t)\,dt$

Experiment
- Data: 10k RSS feeds over Oct to Dec 2004

Performance
- CGM03: optimizes for “age”
- Ours: both resource allocation and retrieval scheduling

Size of estimation window
- Resource constraint: 4 retrievals per day per feed on average
- 2 weeks is an appropriate choice

Predictability of posting rate
- 90% of the RSS feeds post consistently

Summaries and extensions
- Resource allocation is more aggressive
- Retrieval scheduling optimizes within an individual data source
- Include user access patterns
- Variable retrieval cost

Collection
Future work

Collection
- Blog hosting websites
- Central repository
  - ~5.3M URLs from weblogs.com
  - Limited and contaminated
- Crawling
  - Retrieve the maximum number of blogs while reducing the number of irrelevant pages downloaded

Domain              Count     Category
spaces.msn.com      839,663   Blog
blogspot.com        362,957   Blog
wretch.cc           116,161   Blog
search-net101.com    89,750   Spam/ads
abalty.com           86,329   Spam/ads
search-now854.com    80,109   Spam/ads
bigebiz.org          79,059   Spam/ads

Collection (cont.)
- Blogs are inter-connected (blogrolls)
- Selectively follow links, discovering hubs for blogs
[1] Chakrabarti et al., “Focused Crawling: A New Approach to Topic-specific Web Resource Discovery”, The International WWW Conference, 1999

Relinquishment of blogs
- Detection of abandoned blogs to save resources
[2] D.R. Cox, “Regression models and life-tables (with discussion)”, Journal of the Royal Statistical Society, B(34), 1972
[3] Gina Venolia, “A Matter of Life or Death: Modeling Blog Mortality”, Technical report, Microsoft Research

Topic detection and tracking
Future work

Overview
- Characteristics
  - Document stream
  - Traces of information propagation among blogs
- Challenges
  - Modeling growth and death of a topic
  - Ranking of blog articles
  - Malicious content

Influence network in blogs
- Information is “diffused” among blogs
- Indicator of popularity
- Social relationships among bloggers

Influence network in blogs (cont.)
- Four major patterns of propagation
- Reconstruction of implicit network
- Ranking (source authority)
- Advertising campaign

Data characteristics
- ~97 to 98% of daily content is new

Data characteristics (cont.)
- The same content lasts for ~8 days

Topics
- Topics with different lifespans
  - Bursty
  - Mid-range
  - Sustaining
- Evolution of topics
[4] J. Kleinberg, “Bursty and Hierarchical Structure in Streams”, in SIGKDD 2002
[5] J. Kleinberg, “Temporal Dynamics of On-Line Information Streams”, Data Stream Management: Processing High-Speed Data Streams, Springer 2005

Document similarity
- Sparse and diverse
- ~400 articles clustered into 21 clusters out of 10,000 daily articles (by DBSCAN)

Framework
- Document stream
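
Worked sketch for the "Resource allocation" slide in the Monitoring part. The slide asks how much more often an active source should be contacted than a quiet one. The sketch below is illustrative only and rests on assumptions not spelled out in the slides: the homogeneous Poisson model, evenly spaced retrievals within a one-day period, and a square-root rule m_i proportional to sqrt(w_i * λ_i), which is what minimizing the sum of w_i * λ_i / (2 * m_i) subject to a fixed total N yields. The function names `allocate` and `aggregate_delay` and the sample rates are hypothetical, and fractional allocations are left unrounded.

```python
import math


def allocate(rates, weights, total):
    """Split `total` retrievals per day so that m_i is proportional to sqrt(w_i * rate_i).

    Assumption: homogeneous Poisson posting. Results may be fractional.
    """
    scores = [math.sqrt(w * lam) for lam, w in zip(rates, weights)]
    norm = sum(scores)
    return [total * s / norm for s in scores]


def aggregate_delay(rates, weights, alloc, period=1.0):
    """Expected weighted delay per period with evenly spaced retrievals.

    Source i has m_i intervals of length period / m_i, each contributing
    rate_i * (period / m_i) ** 2 / 2, so its total is rate_i * period**2 / (2 * m_i).
    """
    return sum(
        w * lam * period ** 2 / (2.0 * m)
        for lam, w, m in zip(rates, weights, alloc)
    )


# Hypothetical example: O1 posts four times as often as O2, equal weights.
rates = [8.0, 2.0]
weights = [1.0, 1.0]
total = 6.0

sqrt_alloc = allocate(rates, weights, total)
prop_alloc = [total * lam / sum(rates) for lam in rates]  # proportional to rate
even_alloc = [total / len(rates)] * len(rates)            # same for every source

for name, alloc in [("sqrt", sqrt_alloc), ("proportional", prop_alloc), ("even", even_alloc)]:
    print(name, [round(m, 3) for m in alloc], round(aggregate_delay(rates, weights, alloc), 4))
```

Under these assumptions, a source that is four times as active receives only twice as many retrievals, and that split gives a lower aggregate delay than either allocating in proportion to the posting rate or splitting evenly.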