



Automatic Blog Monitoring and Summarization
Ka Cheung "Richard" Sia
PhD Prospectus

With/without organized access

Inaccessible?
[Figure: % of feeds vs. # of subscribers, with subscriber bins 1+, 20+, 50+, 1000+, 5000+. Source: AskJeeves.]

Introduction
- Organized access to blogs
  - Full coverage
  - Reflect changes quickly
  - Filtered and organized presentation
- Intended contributions
  - Efficient techniques to harvest blogs
  - Algorithms to monitor frequently changing data sources
  - Algorithms to reconstruct implicit networks and compose topic summaries

Modules
- Monitoring
- Collection (future work)
- Topic detection and tracking (future work)
- Conclusion

Monitoring
Preliminary results

Framework
- A central server monitors data source changes and provides succinct summaries to users

Overview
- New challenges
  - Content changes more rapidly, with recurring patterns
  - More time-sensitive requirements
- Modeling of posting updates
- Definition of delay
- Strategies for allocation and scheduling

Characteristics
- Homogeneous Poisson model: $\lambda(t) = \lambda$ at any $t$
- Periodic inhomogeneous Poisson model: $\lambda(t) = \lambda(t - nT)$, $n = 1, 2, \ldots$

Definition of metrics
- Delay of a post published at time $t_i$ and picked up by the retrieval at time $\tau_j$:
  $$D(t_i) = \tau_j - t_i$$
- Delay of a data source: sum of the elapsed time for every post
  $$D(O) = \sum_{i=1}^{k} D(t_i)$$
- Delay experienced by the aggregator:
  $$D(A) = \sum_{i=1}^{n} w_i \, D(O_i)$$

Definition of metrics
- $\tau_j$: retrieval time
- $\lambda(t)$: posting rate
- Expected delay between two consecutive retrievals
  - Homogeneous Poisson model:
    $$D(O) = \frac{\lambda (\tau_j - \tau_{j-1})^2}{2}$$
  - Inhomogeneous Poisson model:
    $$D(O) = \int_{\tau_{j-1}}^{\tau_j} \lambda(t)\,(\tau_j - t)\,dt$$

Problem formulation
- Minimize the expected delay experienced by the aggregator under the constraint of limited resources.
- Schedule the $\tau_j$'s such that
  $$D(A) = \sum_{i=1}^{n} w_i \, D(O_i)$$
  is minimized.

Approach
- Resource allocation
  - How often to contact data sources?
  - $O_1$ is more active than $O_2$; how much more often should we contact $O_1$ than $O_2$?
- Retrieval scheduling
  - When to contact a data source?
  - 3 retrievals are allocated to $O_1$; when should these 3 retrievals be placed?

Resource allocation
- Consider $n$ data sources $O_1, \ldots, O_n$
  - $\lambda_i$: posting rate of $O_i$
  - $w_i$: weight of $O_i$
  - $N$: total number of retrievals per day
  - $m_i$: number of retrievals per day allocated to $O_i$
- Optimal allocation:
  $$m_i \propto \sqrt{w_i \lambda_i}$$

Retrieval scheduling
- $m$ retrieval(s) per day are allocated to a data source $O$; how should we schedule these $m$ retrievals?
  - $m = 1$
  - $m > 1$

Single retrieval per period
- $\lambda(t) = 1$ for $t \in [0, 1]$, $\lambda(t) = 0$ for $t \in [1, 2]$
- Periodicity $T = 2$
- $\tau = 0.5$: expected delay = 0.75
- $\tau = 1$: expected delay = 0.5
- $\tau = 2$: expected delay = 1.5

Single retrieval per period
- For a data source with posting rate $\lambda(t)$ and period $T$, the expected delay when retrieved at time $\tau$ is given by:
  $$D(\tau) = \int_{0}^{\tau} \lambda(t)\,(\tau - t)\,dt + \int_{\tau}^{T} \lambda(t)\,(T + \tau - t)\,dt$$
- Criteria for optimality:
  $$\lambda(\tau) = \frac{1}{T} \int_{0}^{T} \lambda(t)\,dt \quad \text{and} \quad \frac{d\lambda(\tau)}{d\tau} < 0$$

Multiple retrievals per period
- When $m$ retrievals per period are allocated and scheduled at times $\tau_1, \ldots, \tau_m$, the expected delay is given by:
  $$D(O) = \sum_{i=1}^{m} \int_{\tau_i}^{\tau_{i+1}} \lambda(t)\,(\tau_{i+1} - t)\,dt, \qquad \tau_{m+1} = T + \tau_1$$
- Criteria for optimality:
  $$\lambda(\tau_j)\,(\tau_{j+1} - \tau_j) = \int_{\tau_{j-1}}^{\tau_j} \lambda(t)\,dt$$

Example
- 6 retrievals for $\lambda(t) = 2 + 2\sin(2\pi t)$
- Criteria for optimality:
  $$\lambda(\tau_j)\,(\tau_{j+1} - \tau_j) = \int_{\tau_{j-1}}^{\tau_j} \lambda(t)\,dt$$

Experiment
- Data: 10k RSS feeds over Oct to Dec 2004

Performance
- CGM03: optimizes for "age"
- Ours: both resource allocation and retrieval scheduling

Size of estimation window
- Resource constraint: 4 retrievals per day per feed on average
- 2 weeks is an appropriate choice

Predictability of posting rate
- 90% of the RSS feeds post consistently

Summaries and extensions
- Resource allocation is more aggressive
- Retrieval scheduling optimizes within an individual data source
- Include user access patterns
- Variable retrieval cost

Collection
Future work

Collection
- Blog hosting websites
- Central repository
  - ~5.3M URLs from weblogs.com
  - Limited and contaminated
- Crawling
  - Retrieve the maximum number of blogs while reducing the number of irrelevant pages downloaded

Domain               Count     Category
spaces.msn.com       839,663   Blog
blogspot.com         362,957   Blog
wretch.cc            116,161   Blog
search-net101.com     89,750   Spam/ads
abalty.com            86,329   Spam/ads
search-now854.com     80,109   Spam/ads
bigebiz.org           79,059   Spam/ads

Collection
- Blogs are inter-connected (blogrolls)
- Selectively following links, discovering hubs for blogs
[1] Chakrabarti et al., "Focused Crawling: A New Approach to Topic-specific Web Resource Discovery", The International WWW Conference, 1999

Relinquishment of blogs
- Detection of abandoned blogs to save resources
[2] D.R. Cox, "Regression models and life-tables (with discussion)", Journal of the Royal Statistical Society, B(34), 1972
[3] Gina Venolia, "A Matter of Life or Death: Modeling Blog Mortality", Technical report, Microsoft Research

Topic detection and tracking
Future work

Overview
- Characteristics
  - Document stream
  - Traces of information propagation among blogs
- Challenges
  - Modeling growth and death of a topic
  - Ranking of blog articles
  - Malicious content

Influence network in blogs
- Information is "diffused" among blogs
- Indicator of popularity
- Social relationships among bloggers

Influence network in blogs
- Four major patterns of propagation
- Reconstruction of implicit network
- Ranking (source authority)
- Advertising campaign

Data characteristics
- ~97-98% of daily content is new

Data characteristics
- The same content lasts for ~8 days

Topics
- Topics with different lifespans
  - Bursty
  - Mid-range
  - Sustaining
- Evolution of topics
[4] J. Kleinberg, "Bursty and Hierarchical Structure in Streams", in SIGKDD 2002
[5] J. Kleinberg, "Temporal Dynamics of On-Line Information Streams", Data Stream Management: Processing High-Speed Data Streams, Springer 2005

Document similarity
- Sparse and diverse
- ~400 articles clustered into 21 clusters out of 10,000 daily articles (by DBSCAN)

Framework
- Document stream
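
The Monitoring slides describe two steps: resource allocation (how many retrievals each source gets) and retrieval scheduling (when they happen within a period). The sketch below is one illustrative reading of them, not the author's implementation. It assumes the allocation rule m_i proportional to sqrt(w_i λ_i) and the expected delay defined as the integral of λ(t) times the wait until the next retrieval. The names `expected_delay`, `allocate` and `step_rate` are hypothetical, and the integral is approximated with a simple midpoint rule.

```python
import math
from typing import Callable, List, Sequence


def expected_delay(rate: Callable[[float], float], period: float,
                   times: Sequence[float], steps: int = 20000) -> float:
    """Expected delay per period when retrievals happen at `times` in every period.

    Midpoint-rule integration of rate(t) * (next retrieval - t) over [0, period].
    Each entry of `times` is assumed to lie in (0, period].
    """
    taus = sorted(times)
    h = period / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        # First retrieval at or after t; if none is left in this period,
        # the post waits for the first retrieval of the next period.
        nxt = next((tau for tau in taus if tau >= t), taus[0] + period)
        total += rate(t) * (nxt - t) * h
    return total


def allocate(rates: Sequence[float], weights: Sequence[float],
             budget: float) -> List[float]:
    """Split `budget` retrievals so that m_i is proportional to sqrt(w_i * lambda_i)."""
    scores = [math.sqrt(w * lam) for lam, w in zip(rates, weights)]
    scale = budget / sum(scores)
    return [scale * s for s in scores]


def step_rate(t: float) -> float:
    """Rate from the single-retrieval example: 1 on [0, 1), 0 on [1, 2), period 2."""
    return 1.0 if (t % 2.0) < 1.0 else 0.0


print(round(expected_delay(step_rate, 2.0, [1.0]), 3))  # slide value: 0.5
print(round(expected_delay(step_rate, 2.0, [2.0]), 3))  # slide value: 1.5
# A source posting 4x as often, with equal weight, gets 2x the retrievals.
print(allocate([4.0, 1.0], [1.0, 1.0], 6))
```

With equal weights, a source that posts four times as often receives only twice as many retrievals, which is what the square-root rule implies. Evaluating the same delay formula at τ = 0.5 gives 1.0 rather than the 0.75 listed on the slide, so that value may rest on a different reading of the example.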

