CS276B Web Search and Mining
Lecture 14: Text Mining II
(includes slides borrowed from G. Neumann, M. Venkataramani, R. Altman, L. Hirschman, and D. Radev)

Text Mining

Previously in Text Mining
- The general topic
- Lexicons
- Topic detection and tracking
- Question answering

Today's Topics
- Summarization
- Coreference resolution
- Biomedical text mining

Summarization

What is a Summary?
- Informative summary
  - Purpose: replace the original document
  - Example: an executive summary
- Indicative summary
  - Purpose: support a decision: do I want to read the original document, yes or no?
  - Example: a headline, a scientific abstract

Why Automatic Summarization?
- The algorithm for reading in many domains is:
  1) read the summary
  2) decide whether it is relevant or not
  3) if relevant: read the whole document
- The summary is a gate-keeper for a large number of documents (information overload).
- Often the summary is all that is read.
  - Example from last quarter: summaries of search-engine hits
- Human-generated summaries are expensive.

Summary Length (Reuters)
- Goldstein et al. 1999

Summarization Algorithms
- Keyword summaries
  - Display the most significant keywords
  - Easy to do
  - Hard to read; a poor representation of content
- Sentence extraction
  - Extract key sentences
  - Medium hard
  - Summaries often do not read well
  - A good representation of content
- Natural language understanding / generation
  - Build a knowledge representation of the text
  - Generate sentences summarizing that content
  - Hard to do well
- Something between the last two methods?

Sentence Extraction
- Represent each sentence as a feature vector
- Compute a score based on the features
- Select the n highest-ranking sentences
- Present them in the order in which they occur in the text
- Postprocess to make the summary more readable and concise:
  - Eliminate redundant sentences
  - Resolve anaphors/pronouns
  - Delete subordinate clauses and parentheticals
- Oracle Context

Sentence Extraction: Example
- SIGIR '95 paper on summarization by Kupiec, Pedersen, and Chen
- Trainable sentence extraction
- The proposed algorithm is applied to its own description (the paper)
Feature Representation
- Fixed-phrase feature
  - Certain phrases indicate a summary, e.g. "in summary"
- Paragraph feature
  - Paragraph-initial/final sentences are more likely to be important.
- Thematic word feature
  - Repetition is an indicator of importance.
- Uppercase word feature
  - Uppercase often indicates named entities. (Taylor)
- Sentence length cut-off
  - A summary sentence should be more than 5 words long.

Feature Representation (cont.)
- Sentence length cut-off: summary sentences have a minimum length.
- Fixed-phrase feature: true for sentences containing an indicator phrase ("in summary", "in conclusion", etc.)
- Paragraph feature: paragraph initial/medial/final
- Thematic word feature: do any of the most frequent content words occur?
- Uppercase word feature: is an uppercase thematic word introduced?

Training
- Hand-label sentences in a training set (good/bad summary sentences)
- Train a classifier to distinguish good from bad summary sentences
- Model used: Naïve Bayes
- Can rank sentences by score and show the top n to the user

Evaluation
- Compare extracted sentences with the sentences in abstracts

Evaluation of Features
- Baseline (choose the first n sentences): 24%
- Overall performance (42-44%) is not very good.
- However, there is more than one good summary.

Multi-Document (MD) Summarization
- Summarize more than one document
- Why is this harder?
- But the benefit is large (users can't scan hundreds of documents)
- To do well, one needs to adopt a more specific strategy depending on the document set.
- Other components are needed for a production system, e.g., manual post-editing.
- DUC: a government-sponsored bake-off
  - 200- or 400-word summaries
  - Longer → easier

Types of MD Summaries
- A single event/person tracked over a long time period
  - Example: Elizabeth Taylor's bout with pneumonia
  - Give extra weight to the character/event
  - May need to include the outcome (dates!)
- Multiple events of a similar nature
  - Example: marathon runners and races
  - More broad-brush; ignore dates
- An issue with related events
  - Example: gun control
  - Identify key concepts and select sentences accordingly

Determine MD Summary Type
- First, determine which type of summary to generate
- Compute all pairwise similarities
  - Very dissimilar articles → multi-event (marathon)
  - Mostly similar articles: is the most frequent concept a named entity?
    - Yes → single event/person (Taylor)
    - No → issue with related events (gun control)

MultiGen Architecture (Columbia)

Generation
- Ordering according to date
- Intersection
  - Find concepts that occur repeatedly in a time chunk
- Sentence generator

Processing
- Selection of good summary sentences
- Elimination of redundant sentences
- Replacement of anaphors/pronouns with the noun phrases they refer to
  - Needs coreference resolution
- Deletion of non-central parts of sentences

Newsblaster (Columbia)

Query-Specific Summarization
- So far, we have looked at generic summaries.
- A generic summary makes no assumption about the reader's interests.
- Query-specific summaries are specialized for a single information need, the query.
- Summarization is much easier if we have a description of what the user wants.
- Recall from last quarter: Google-type excerpts: simply show
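The trainable sentence-extraction pipeline from earlier in the lecture (represent each sentence as a feature vector, score it, keep the n best, present them in document order) can be sketched as follows. This is a minimal illustration only: the hand-set weights and the additive scorer stand in for the trained Naïve Bayes model of Kupiec et al., and the five boolean features are simplified versions of those listed under Feature Representation.

```python
import re
from collections import Counter

FIXED_PHRASES = ("in summary", "in conclusion")

def sentence_features(sent, position, n_sents, top_words):
    """Boolean features in the spirit of the Kupiec et al. representation."""
    lower = sent.lower()
    words = re.findall(r"[a-z]+", lower)
    return {
        "fixed_phrase": any(p in lower for p in FIXED_PHRASES),
        "paragraph": position == 0 or position == n_sents - 1,  # initial/final
        "thematic": any(w in top_words for w in words),  # frequent content word
        "uppercase": bool(re.search(r"\b[A-Z]{2,}\b", sent)),  # e.g. acronyms
        "length_cutoff": len(words) > 5,  # summary sentence should be > 5 words
    }

def summarize(text, n=2):
    """Score each sentence, keep the n best, return them in document order."""
    sents = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    counts = Counter(w for s in sents for w in re.findall(r"[a-z]+", s.lower()))
    top_words = {w for w, _ in counts.most_common(5)}
    weights = {"fixed_phrase": 3, "paragraph": 1, "thematic": 1,
               "uppercase": 1, "length_cutoff": 1}  # assumed, not trained
    scored = []
    for i, s in enumerate(sents):
        feats = sentence_features(s, i, len(sents), top_words)
        scored.append((sum(weights[f] for f, on in feats.items() if on), i, s))
    # Rank by score, take the top n, then restore document order.
    best = sorted(sorted(scored, reverse=True)[:n], key=lambda t: t[1])
    return [s for _, _, s in best]
```

A trained model would replace the hand-set `weights` with per-feature probabilities estimated from hand-labeled good/bad summary sentences, as described under Training.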
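The "Determine MD Summary Type" decision rule (compute all pairwise similarities, then branch on how similar the articles are and on the most frequent concept) can be sketched with bag-of-words cosine similarity. The 0.3 threshold, the stopword list, and the capitalization check standing in for real named-entity recognition are all illustrative assumptions, not part of the original system.

```python
import math
import re
from collections import Counter
from itertools import combinations

STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "was"}

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values())) *
            math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def md_summary_type(docs, threshold=0.3):  # threshold is an assumption
    """Branch on average pairwise similarity, then on the top concept."""
    bags = [Counter(re.findall(r"[a-z]+", d.lower())) for d in docs]
    sims = [cosine(a, b) for a, b in combinations(bags, 2)]
    if sum(sims) / len(sims) < threshold:
        return "multi-event"               # very dissimilar articles (marathon)
    merged = Counter()
    for bag in bags:
        merged.update(bag)
    top = next(w for w, _ in merged.most_common() if w not in STOPWORDS)
    # Crude named-entity check: does the top content word appear capitalized?
    if any(re.search(r"\b%s\b" % top.capitalize(), d) for d in docs):
        return "single event/person"       # (Taylor)
    return "issue with related events"     # (gun control)
```

A production system would use TF-IDF vectors and a proper named-entity recognizer in place of the raw term counts and the capitalization heuristic.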