Stanford CS 276B - Lecture 13 - Text Mining I

Course: CS 276B - Text Information Retrieval, Mining, and Exploitation

Pages 65

Slide outline:
- CS276B Text Information Retrieval, Mining, and Exploitation
- Today’s Topics
- First Story Detection
- Slide 4
- Examples
- Slide 6
- The First-Story Detection Task
- Definitions
- TDT Tasks
- First Story Detection (FSD)
- Patterns in Event Distributions
- Similar Events over Time
- TDT: The Corpus
- Ideas?
- Approach 1: KNN
- Variant: Single Pass Clustering
- Approach 2: KNN + Time
- FSD - Results
- FSD Error vs. Classification Error
- Discussion
- Summarization
- What is a Summary?
- Why Automatic Summarization?
- Summary Length (Reuters)
- Summary Compression (Reuters)
- Slide 26
- Summarization Algorithms
- Sentence Extraction
- Sentence Extraction: Example
- Slide 30
- Feature Representation
- Feature Representation (cont.)
- Training
- Evaluation
- Slide 35
- Multi-Document (MD) Summarization
- Types of MD Summaries
- Determine MD Summary Type
- MultiGen Architecture (Columbia)
- Generation
- Processing
- Performance (Columbia System)
- Newsblaster (Columbia)
- Query-Specific Summarization
- Genre
- Non-Text Summaries
- Slide 47
- Coreference Resolution
- Coreference
- Types of Coreference
- Preferences in Pronoun Interpretation
- Slide 52
- Algorithm for Coreference Resolution
- Salience Weights (Lappin&Leass)
- Lappin&Leass (cont’d)
- Algorithm (Lappin&Leass)
- Example
- Example (cont’d)
- Slide 59
- Slide 60
- Slide 61
- Slide 62
- Observations
- MUC Information Extraction: State of the Art c. 1997
- Resources

CS276B
Text Information Retrieval, Mining, and Exploitation
Lecture 13
Text Mining II
Feb 27, 2003
(includes slides borrowed from J. Allan, G. Doddington, G. Neumann, M. Venkataramani, and D.
Radev)

Today’s Topics
- First story detection (FSD)
- Summarization
- Coreference resolution

First Story Detection

First Story Detection
- Automatically identify the first story on a new event from a stream of text
- Topic Detection and Tracking (TDT)
  - “Bake-off” sponsored by US government agencies
- Applications
  - Intelligence services
  - Finance: be the first to trade a stock

Examples
- 2002 Presidential Elections
- Thai Airbus Crash (11.12.98)
  - On topic: stories reporting details of the crash, injuries and deaths; reports on the investigation following the crash; policy changes due to the crash (new runway lights were installed at airports).
- Euro Introduced (1.1.1999)
  - On topic: stories about the preparation for the common currency (negotiations about exchange rates and financial standards to be shared among the member nations); official introduction of the Euro; economic details of the shared currency; reactions within the EU and around the world.

First Story Detection
- Other technologies don’t work for this
  - Information retrieval
  - Text classification
- Why?
  - There is no supervised topic training (like Topic Detection)
[Figure: stories plotted along a time axis, marked as first stories or not first stories, for Topic 1 and Topic 2]

The First-Story Detection Task
- To detect the first story that discusses a topic, for all topics.

Definitions
- Event: A reported occurrence at a specific time and place, and the unavoidable consequences.
  Specific elections, accidents, crimes, natural disasters.
- Activity: A connected set of actions that have a common focus or purpose - campaigns, investigations, disaster relief efforts.
- Topic: A seminal event or activity, along with all directly related events and activities.
- Story: A topically cohesive segment of news that includes two or more DECLARATIVE independent clauses about a single topic.

TDT Tasks
- First story detection (FSD)
  - Detect the first story on a new topic
- Topic tracking
  - Once a topic has been detected, identify subsequent stories about it
  - Standard text classification task
  - However, very small training set (initially: 1!)

First Story Detection (FSD)
- First story detection is an unsupervised learning task.
- On-line vs. retrospective
  - On-line: flag the onset of new events from live news feeds as stories come in
  - Retrospective: detection consists of identifying the first story looking back over a longer period
- Lack of advance knowledge of new events, but access to unlabeled historical data as a contrast set
- FSD input: stream of stories in chronological order, simulating a real-time incoming document stream
- FSD output: YES/NO decision per document

Patterns in Event Distributions
- News stories discussing the same event tend to be temporally proximate
- A time gap between bursts of topically similar stories is often an indication of different events
  - Different earthquakes
  - Airplane accidents
- A significant vocabulary shift and rapid changes in term frequency are typical of stories reporting a new event, including previously unseen proper nouns
- Events are typically reported in a relatively brief time window of 1-4 weeks

Similar Events over Time

TDT: The Corpus
- TDT evaluation corpora consist of text and transcribed news from the 1990s.
- A set of target events (e.g., 119 in TDT2) is used for evaluation
- Corpus is tagged for these events (including first story)
- TDT2 consists of 60,000 news stories, Jan-June 1998; about 3,000 are “on topic” for one of 119 topics
- Stories are arranged in
chronological order

Ideas?

Approach 1: KNN
- On-line processing of each incoming story
- Compute similarity to all previous stories
  - Cosine similarity
  - Language model
  - Prominent terms
  - Extracted entities
- If similarity is below threshold: new story
- If similarity is above threshold for previous document d: assign to topic of d
- Optimal threshold can be chosen based on historical data
- Threshold is not topic specific!

Variant: Single Pass Clustering
- Assign each incoming document to one of a set of topic clusters
- A topic cluster is represented by its centroid (vector average of members)
- For an incoming story, compute similarity s with each centroid
- As before:
  - s > θ: add document to corresponding cluster
  - s < θ: first story!

Approach 2: KNN + Time
- Only consider documents in a (short) time window
- Compute similarity in a time-weighted fashion:
  - m: number of documents in window; d_i: i-th document in window
- Time weighting significantly increases performance.

FSD - Results
- UMass, CMU: Single-Pass Clustering

FSD Error vs. Classification Error

Discussion
- Hard problem
- Becomes harder the more topics need to be tracked. Why?
- Second story detection is much easier than first story detection
- Example: retrospective detection of the first 9/11 story is easy, on-line detection is hard

Summarization

What is a Summary?
- Informative summary
  - Purpose: replace original document
  - Example: executive summary
- Indicative summary
  - Purpose: support decision: do I want to read original document yes/no?
  - Example: headline, scientific abstract

Why Automatic Summarization?
- Algorithm for reading
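
Looking back at the first story detection slides, Approach 1 (KNN) and the time window of Approach 2 can be illustrated with a small sketch. This is not the implementation used by any TDT system. The whitespace tokenizer, the raw term counts, the threshold value of 0.3, and the function names are all assumptions made for illustration. The linear i/m down-weighting of older stories is also an assumption, since the weighting formula did not survive in the slide text.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect_first_stories(stories, threshold=0.3, window=None):
    """Return one YES/NO (True/False) first-story decision per story.

    stories   -- texts in chronological order
    threshold -- assumed value; the slides suggest tuning it on historical data
    window    -- if set, compare only against the last `window` stories and
                 weight the i-th of m stories in the window by i/m
                 (an assumed linear decay, not taken from the slides)
    """
    seen = []        # term-count vectors of all previous stories
    decisions = []
    for text in stories:
        vec = Counter(text.lower().split())
        recent = seen if window is None else seen[-window:]
        m = len(recent)
        best = 0.0
        for i, prev in enumerate(recent, start=1):
            weight = 1.0 if window is None else i / m
            best = max(best, weight * cosine(vec, prev))
        # Below the threshold: nothing similar has been seen, so flag it.
        decisions.append(best < threshold)
        seen.append(vec)
    return decisions


stories = [
    "airbus crash thailand runway",
    "euro introduced common currency",
    "thailand airbus crash investigation",
]
print(detect_first_stories(stories))
```

With these toy stories, the third one shares three of its four terms with the first and is therefore not flagged as a first story. Shrinking the window to a single story hides that earlier report, which shows why the window size matters as much as the threshold.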
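
The single pass clustering variant from the first story detection slides can be sketched in a similar way. Again this is only a minimal illustration under stated assumptions: the whitespace tokenizer, raw term counts, the value of theta, and the function names are hypothetical, and a story whose best similarity equals theta is treated as a first story because the slide leaves that case open.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def single_pass_clustering(stories, theta=0.3):
    """Return (first_story_flags, cluster_ids), one entry per story."""
    # Each cluster is stored as the sum of its members' term counts.
    # Cosine similarity ignores vector length, so comparing against the sum
    # gives the same value as comparing against the centroid (sum / size).
    sums = []
    flags, labels = [], []
    for text in stories:
        vec = Counter(text.lower().split())
        best_sim, best_id = 0.0, None
        for cid, total in enumerate(sums):
            sim = cosine(vec, total)
            if sim > best_sim:
                best_sim, best_id = sim, cid
        if best_id is not None and best_sim > theta:
            sums[best_id].update(vec)      # s > theta: join existing cluster
            flags.append(False)
            labels.append(best_id)
        else:
            sums.append(Counter(vec))      # otherwise: first story, new cluster
            flags.append(True)
            labels.append(len(sums) - 1)
    return flags, labels


stories = [
    "airbus crash thailand runway",
    "euro introduced common currency",
    "thailand airbus crash investigation",
]
flags, labels = single_pass_clustering(stories)
print(flags, labels)
```

Compared with the KNN sketch, each incoming story is compared against one vector per topic rather than against every earlier story, so the cost per story grows with the number of clusters instead of the length of the stream.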
