CS276B Text Information Retrieval, Mining, and Exploitation
Lecture 13: Text Mining II
Feb 27, 2003
(includes slides borrowed from J. Allan, G. Doddington, G. Neumann, M. Venkataramani, and D. Radev)

Today's Topics
- First story detection (FSD)
- Summarization
- Coreference resolution

First Story Detection
- Automatically identify the first story on a new event from a stream of text
- Topic Detection and Tracking (TDT)
  - "Bake-off" sponsored by US government agencies
- Applications
  - Intelligence services
  - Finance: be the first to trade a stock

Examples
- 2002 Presidential Elections
- Thai Airbus Crash (11.12.98)
  - On topic: stories reporting details of the crash, injuries and deaths; reports on the investigation following the crash; policy changes due to the crash (new runway lights were installed at airports).
- Euro Introduced (1.1.1999)
  - On topic: stories about the preparation for the common currency (negotiations about exchange rates and financial standards to be shared among the member nations); official introduction of the Euro; economic details of the shared currency; reactions within the EU and around the world.

First Story Detection
- Other technologies don't work for this
  - Information retrieval
  - Text classification
- Why?
  - There is no supervised topic training (like Topic Detection)
[Figure: stories on a time axis, marked as first stories or not first stories, for Topic 1 and Topic 2]

The First-Story Detection Task
- To detect the first story that discusses a topic, for all topics.

Definitions
- Event: A reported occurrence at a specific time and place, and the unavoidable consequences.
  Specific elections, accidents, crimes, natural disasters.
- Activity: A connected set of actions that have a common focus or purpose: campaigns, investigations, disaster relief efforts.
- Topic: A seminal event or activity, along with all directly related events and activities.
- Story: A topically cohesive segment of news that includes two or more declarative independent clauses about a single topic.

TDT Tasks
- First story detection (FSD)
  - Detect the first story on a new topic
- Topic tracking
  - Once a topic has been detected, identify subsequent stories about it
  - Standard text classification task
  - However, very small training set (initially: 1!)

First Story Detection (FSD)
- First story detection is an unsupervised learning task.
- On-line vs. retrospective
  - On-line: flag the onset of new events from live news feeds as stories come in
  - Retrospective: identify the first story while looking back over a longer period
- No advance knowledge of new events, but access to unlabeled historical data as a contrast set
- FSD input: stream of stories in chronological order, simulating a real-time incoming document stream
- FSD output: YES/NO decision per document

Patterns in Event Distributions
- News stories discussing the same event tend to be temporally proximate
- A time gap between bursts of topically similar stories is often an indication of different events
  - Different earthquakes
  - Airplane accidents
- A significant vocabulary shift and rapid changes in term frequency are typical of stories reporting a new event, including previously unseen proper nouns
- Events are typically reported in a relatively brief time window of 1-4 weeks

Similar Events over Time

TDT: The Corpus
- TDT evaluation corpora consist of text and transcribed news from the 1990s.
- A set of target events (e.g., 119 in TDT2) is used for evaluation
- Corpus is tagged for these events (including first story)
- TDT2 consists of 60,000 news stories, Jan-June 1998; about 3,000 are "on topic" for one of 119 topics
- Stories are arranged in chronological order

Ideas?

Approach 1: KNN
- On-line processing of each incoming story
- Compute similarity to all previous stories
  - Cosine similarity
  - Language model
  - Prominent terms
  - Extracted entities
- If similarity is below threshold: new story
- If similarity is above threshold for previous document d: assign to topic of d
- Optimal threshold can be chosen based on historical data
- Threshold is not topic specific!

Variant: Single Pass Clustering
- Assign each incoming document to one of a set of topic clusters
- A topic cluster is represented by its centroid (vector average of members)
- For each incoming story, compute similarity s with each centroid
- As before:
  - s > θ: add document to corresponding cluster
  - s < θ: first story!

Approach 2: KNN + Time
- Only consider documents in a (short) time window
- Compute similarity in a time weighted fashion:
- m: number of documents in window, d_i: ith document in window
- Time weighting significantly increases performance.

FSD - Results
- UMass, CMU: Single-Pass Clustering

FSD Error vs. Classification Error

Discussion
- Hard problem
- Becomes harder the more topics need to be tracked. Why?
- Second Story Detection is much easier than First Story Detection
- Example: retrospective detection of first 9/11 story easy, on-line detection hard

Summarization

What is a Summary?
- Informative summary
  - Purpose: replace original document
  - Example: executive summary
- Indicative summary
  - Purpose: support decision: do I want to read original document yes/no?
  - Example: headline, scientific abstract

Why Automatic Summarization?
- Algorithm for reading
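
Looking back at the first story detection slides (Approach 1, its single pass clustering variant, and Approach 2), the threshold logic can be illustrated with a minimal Python sketch. It is not the UMass or CMU system. It assumes stories are already tokenized and represented as raw term counts, and every name in it (cosine, single_pass_fsd, windowed_knn_fsd, theta, window) is hypothetical. The time weight i/m used in windowed_knn_fsd is an assumption chosen only for illustration, since the weighting formula itself is not given in the text above.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def single_pass_fsd(stories, theta):
    """Single pass clustering: one True/False (YES/NO) flag per story.

    stories: token lists in chronological order.
    theta: one global similarity threshold (not topic specific).
    """
    clusters = []  # one summed term-count vector per topic cluster
    flags = []
    for tokens in stories:
        vec = Counter(tokens)
        best, best_sim = None, 0.0
        for c in clusters:
            # Cosine ignores vector length, so the summed vector gives the
            # same similarity as the centroid (the average of the members).
            s = cosine(vec, c)
            if s > best_sim:
                best, best_sim = c, s
        if best is not None and best_sim > theta:
            best.update(vec)              # s > theta: add to that cluster
            flags.append(False)
        else:
            clusters.append(Counter(vec))  # otherwise: first story
            flags.append(True)
    return flags


def windowed_knn_fsd(stories, theta, window):
    """KNN restricted to the last `window` stories, with time weighting.

    ASSUMPTION: the i-th of the m stories in the window (oldest first)
    is weighted by i/m, so older stories count less.
    """
    seen = []
    flags = []
    for tokens in stories:
        vec = Counter(tokens)
        recent = seen[-window:]
        m = len(recent)
        score = max(
            ((i / m) * cosine(vec, d) for i, d in enumerate(recent, start=1)),
            default=0.0,
        )
        flags.append(not score > theta)
        seen.append(vec)
    return flags


if __name__ == "__main__":
    demo = [
        "plane crash thailand airbus".split(),
        "airbus crash investigation thailand".split(),
        "euro currency introduced europe".split(),
        "euro currency exchange rates".split(),
    ]
    print(single_pass_fsd(demo, theta=0.3))
    print(windowed_knn_fsd(demo, theta=0.3, window=2))
```

In this toy stream both functions flag the first crash story and the first euro story. The two differ once a similar story reappears after a gap: the windowed version discounts or drops the older match, which mirrors the observation that a time gap between bursts of similar stories often indicates a different event.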