CS276B Web Search and Mining
Lecture 14: Text Mining II
(includes slides borrowed from G. Neumann, M. Venkataramani, R. Altman, L. Hirschman, and D. Radev)

Text Mining

Previously in Text Mining
- The general topic
- Lexicons
- Topic detection and tracking
- Question answering

Today's Topics
- Summarization
- Coreference resolution
- Biomedical text mining

Summarization

What is a Summary?
- Informative summary
  - Purpose: replace the original document
  - Example: an executive summary
- Indicative summary
  - Purpose: support a decision: do I want to read the original document, yes or no?
  - Example: a headline, a scientific abstract

Why Automatic Summarization?
- The algorithm for reading in many domains is:
  1) read the summary
  2) decide whether it is relevant or not
  3) if relevant: read the whole document
- The summary is a gate-keeper for a large number of documents (information overload).
- Often the summary is all that is read.
  - Example from last quarter: summaries of search-engine hits
- Human-generated summaries are expensive.

Summary Length (Reuters)
- Goldstein et al. 1999

Summarization Algorithms
- Keyword summaries
  - Display the most significant keywords
  - Easy to do
  - Hard to read; a poor representation of content
- Sentence extraction
  - Extract key sentences
  - Medium hard
  - Summaries often do not read well
  - A good representation of content
- Natural language understanding / generation
  - Build a knowledge representation of the text
  - Generate sentences summarizing that content
  - Hard to do well
- Something between the last two methods?

Sentence Extraction
- Represent each sentence as a feature vector
- Compute a score based on the features
- Select the n highest-ranking sentences
- Present them in the order in which they occur in the text
- Postprocess to make the summary more readable and concise:
  - Eliminate redundant sentences
  - Resolve anaphors/pronouns
  - Delete subordinate clauses and parentheticals
- Oracle Context

Sentence Extraction: Example
- SIGIR '95 paper on summarization by Kupiec, Pedersen, and Chen
- Trainable sentence extraction
- The proposed algorithm is applied to its own description (the paper)
Feature Representation
- Fixed-phrase feature
  - Certain phrases indicate a summary, e.g. "in summary"
- Paragraph feature
  - Paragraph-initial/final sentences are more likely to be important.
- Thematic word feature
  - Repetition is an indicator of importance.
- Uppercase word feature
  - Uppercase often indicates named entities. (Taylor)
- Sentence length cut-off
  - A summary sentence should be more than 5 words long.

Feature Representation (cont.)
- Sentence length cut-off: summary sentences have a minimum length.
- Fixed-phrase feature: true for sentences containing an indicator phrase ("in summary", "in conclusion", etc.)
- Paragraph feature: paragraph initial/medial/final
- Thematic word feature: do any of the most frequent content words occur?
- Uppercase word feature: is an uppercase thematic word introduced?

Training
- Hand-label sentences in a training set (good/bad summary sentences)
- Train a classifier to distinguish good from bad summary sentences
- Model used: Naïve Bayes
- Can rank sentences by score and show the top n to the user

Evaluation
- Compare extracted sentences with the sentences in abstracts

Evaluation of Features
- Baseline (choose the first n sentences): 24%
- Overall performance (42-44%) is not very good.
- However, there is more than one good summary.

Multi-Document (MD) Summarization
- Summarize more than one document
- Why is this harder?
- But the benefit is large (users can't scan hundreds of documents)
- To do well, one needs to adopt a more specific strategy depending on the document set.
- Other components are needed for a production system, e.g., manual post-editing.
- DUC: a government-sponsored bake-off
  - 200- or 400-word summaries
  - Longer → easier

Types of MD Summaries
- A single event/person tracked over a long time period
  - Example: Elizabeth Taylor's bout with pneumonia
  - Give extra weight to the character/event
  - May need to include the outcome (dates!)
- Multiple events of a similar nature
  - Example: marathon runners and races
  - More broad-brush; ignore dates
- An issue with related events
  - Example: gun control
  - Identify key concepts and select sentences accordingly

Determine MD Summary Type
- First, determine which type of summary to generate
- Compute all pairwise similarities
  - Very dissimilar articles → multi-event (marathon)
  - Mostly similar articles: is the most frequent concept a named entity?
    - Yes → single event/person (Taylor)
    - No → issue with related events (gun control)

MultiGen Architecture (Columbia)

Generation
- Ordering according to date
- Intersection
  - Find concepts that occur repeatedly in a time chunk
- Sentence generator

Processing
- Selection of good summary sentences
- Elimination of redundant sentences
- Replacement of anaphors/pronouns with the noun phrases they refer to
  - Needs coreference resolution
- Deletion of non-central parts of sentences

Newsblaster (Columbia)

Query-Specific Summarization
- So far, we have looked at generic summaries.
- A generic summary makes no assumption about the reader's interests.
- Query-specific summaries are specialized for a single information need, the query.
- Summarization is much easier if we have a description of what the user wants.
- Recall from last quarter: Google-type excerpts: simply show
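The trainable sentence-extraction pipeline from earlier in the lecture (represent each sentence as a feature vector, score it, keep the n best, present them in document order) can be sketched as follows. This is a minimal illustration only: the hand-set weights and the additive scorer stand in for the trained Naïve Bayes model of Kupiec et al., and the five boolean features are simplified versions of those listed under Feature Representation.

```python
import re
from collections import Counter

FIXED_PHRASES = ("in summary", "in conclusion")

def sentence_features(sent, position, n_sents, top_words):
    """Boolean features in the spirit of the Kupiec et al. representation."""
    lower = sent.lower()
    words = re.findall(r"[a-z]+", lower)
    return {
        "fixed_phrase": any(p in lower for p in FIXED_PHRASES),
        "paragraph": position == 0 or position == n_sents - 1,  # initial/final
        "thematic": any(w in top_words for w in words),  # frequent content word
        "uppercase": bool(re.search(r"\b[A-Z]{2,}\b", sent)),  # e.g. acronyms
        "length_cutoff": len(words) > 5,  # summary sentence should be > 5 words
    }

def summarize(text, n=2):
    """Score each sentence, keep the n best, return them in document order."""
    sents = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    counts = Counter(w for s in sents for w in re.findall(r"[a-z]+", s.lower()))
    top_words = {w for w, _ in counts.most_common(5)}
    weights = {"fixed_phrase": 3, "paragraph": 1, "thematic": 1,
               "uppercase": 1, "length_cutoff": 1}  # assumed, not trained
    scored = []
    for i, s in enumerate(sents):
        feats = sentence_features(s, i, len(sents), top_words)
        scored.append((sum(weights[f] for f, on in feats.items() if on), i, s))
    # Rank by score, take the top n, then restore document order.
    best = sorted(sorted(scored, reverse=True)[:n], key=lambda t: t[1])
    return [s for _, _, s in best]
```

A trained model would replace the hand-set `weights` with per-feature probabilities estimated from hand-labeled good/bad summary sentences, as described under Training.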
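The "Determine MD Summary Type" decision rule (compute all pairwise similarities, then branch on how similar the articles are and on the most frequent concept) can be sketched with bag-of-words cosine similarity. The 0.3 threshold, the stopword list, and the capitalization check standing in for real named-entity recognition are all illustrative assumptions, not part of the original system.

```python
import math
import re
from collections import Counter
from itertools import combinations

STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "was"}

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values())) *
            math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def md_summary_type(docs, threshold=0.3):  # threshold is an assumption
    """Branch on average pairwise similarity, then on the top concept."""
    bags = [Counter(re.findall(r"[a-z]+", d.lower())) for d in docs]
    sims = [cosine(a, b) for a, b in combinations(bags, 2)]
    if sum(sims) / len(sims) < threshold:
        return "multi-event"               # very dissimilar articles (marathon)
    merged = Counter()
    for bag in bags:
        merged.update(bag)
    top = next(w for w, _ in merged.most_common() if w not in STOPWORDS)
    # Crude named-entity check: does the top content word appear capitalized?
    if any(re.search(r"\b%s\b" % top.capitalize(), d) for d in docs):
        return "single event/person"       # (Taylor)
    return "issue with related events"     # (gun control)
```

A production system would use TF-IDF vectors and a proper named-entity recognizer in place of the raw term counts and the capitalization heuristic.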