
CSC 9010: Text Mining Applications
Document Summarization
Dr. Paula Matuszek
[email protected]
(610) 270-6851
©2003 Paula Matuszek

Document Summarization
Document summarization: provide a meaningful summary for each document.
Examples:
  – Search tool returns “context”
  – Monthly progress reports from multiple projects
  – Summaries of news articles on the human genome
Often part of a document retrieval system, to help the user judge documents better
Surprisingly hard to make sophisticated
Surprisingly easy to make effective

Document Summarization -- How
Three general approaches:
Extract predefined summary.
  – Useful in highly structured environments where you can specify format. Typically very good summaries.
Capture in abstract representation, generate summary.
  – Useful in well-defined domains with clear-cut information needs.
Extract representative sentences/clauses.
  – Useful in arbitrarily complex and unstructured domains; broadly applicable, and gets "general feel".

Extract Predefined Summary
Documents have a well-defined format.
Format includes a summary or abstract explicitly written by document author.
Text mining may reorganize, regroup, restructure summaries.
Example:
  – People working on multiple projects write monthly reports based on what they have done, one sentence/project.
  – Reporting system collects person-level reports and reorganizes into project-level reports.

Extract Predefined Summary: Methods
Extraction using some or all of
  – NLP for document parsing/chunking (finding abstract)
  – standard computer science: database retrieval, string processing, etc.
Reorganizing may be done using
  – explicit fields specified by author
  – keywords searched for in documents
  – business rules which capture knowledge about who is working on what tasks and projects
Grouping can shade into document classification when summaries are long or match the categories only loosely.

Extracting Predefined Summaries: Advantages and Disadvantages
Advantages
  – Summaries reflect intent of author.
  – If part of an overall reporting system, can actually make it simpler for author.
  – Incremental effort for author not large.
Disadvantages
  – Incremental effort for author not zero either.
  – Only feasible in structured situation where requirement can be defined ahead of time.
  – Can't be used to summarize a group of documents.
  – Not all authors write good summaries.

Capture and Generate
Documents can have arbitrary format.
Knowledge needed is well-defined.
Often information need is for summarizations across multiple documents.
Example:
  – Summarizing restaurant reviews. Take newspaper articles and produce price range, kind of food, atmosphere, quality, service.

Capture and Generate: Methods
State of the art:
  – Create "template" or "frame"
      – Represent the knowledge you want to capture
  – Extract information to fill in frame
      – Standard information extraction problem
      – Typically relatively large frames with relatively few relations; mostly facts.
  – Generate based on template
      – Relatively simple "fill-in-the-blank"
      – More complex based on parse tree.
Still basically research: parse entire document into parse tree tied to rich semantic net; apply rules to trim tree; generate continuous narrative.

Capture and Generate: Advantages and Disadvantages
Advantages:
  – Produces very focused summaries.
  – Can readily incorporate multiple documents.
  – Not dependent on authors
Disadvantages
  – Assumes information need is clearly defined.
  – Information extraction component development time is significant
  – Document parsing slow; probably not real-time.
Comment:
  – Makes no attempt to capture author's intent

Extract Representative Sentence
Document format can be arbitrary.
Document content can also be arbitrary; information need not clear-cut.
Summarization consists of text extracted directly from document.
Examples:
  – Context returned by Google for each hit
  – Google News summaries.

Find Representative Sentences: Method
Typically, choose representative individual terms, then broaden to capture the sentence containing the terms. The more terms contained, the more important the sentence.
  – If in response to a search or other information request, the search terms are representative
  – If no prior query, TF*IDF and other bag-of-words (BOW) approaches. May use pairs or n-ary groups of words.
May add a layer of rules using position, some specific phrases such as "In summary,".

Find Representative Sentences: Advantages and Disadvantages
Advantages
  – Can be applied anywhere.
  – Relatively fast (compared to full parse)
  – Provides a good general idea or feel for content.
  – Can do multiple-document summaries.
Disadvantages
  – Often choppy or hard to read
  – Does poorly when document doesn't contain good summary sentences.
  – Can miss major information

Summary
Appropriate approach depends on what is known about the documents, the domain, and the information need.
All of the major approaches in use provide useful information in a reasonable time frame.
None of the automated methods is yet close to a good human summarizer. Research in this area is advancing fast, though.

Some Useful References
This has been a seriously simplified presentation; I am focusing mostly on applications. Here are some references for more detail:
http://www.cs.unm.edu/~storm/TSPresent.html
  Detailed overview of text summarization history, methods and current state.
http://www.summarization.com/
  Bibliography, tools, conferences, research. Some good resources.
http://clg.wlv.ac.uk/help/summarisation.php
  Relatively simple overview with some good links.
http://citeseer.nj.nec.com/525002.html
  Paper on summarization
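
As a rough illustration of the "Find Representative Sentences: Method" slide, the sketch below scores each sentence by the representative terms it contains and returns the top few in their original order. It is not taken from the slides. The function names, the naive sentence splitter, the choice to compute IDF across the sentences of a single document, and the two rule weights are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Hypothetical rule weights; the slides do not give values.
POSITION_BONUS = 0.5   # added to the first sentence
CUE_BONUS = 0.5        # added to sentences opening with a cue phrase
CUE_PHRASES = ("in summary",)


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase bag-of-words tokens; no stemming or stop-word removal."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def term_weights(token_lists, query_terms=None):
    """Weight 1.0 per search term if a query is given, else an IDF weight.

    Assumption: IDF treats each sentence as a 'document', so a term that
    appears in every sentence gets weight log(1) = 0.
    """
    if query_terms:
        return {t.lower(): 1.0 for t in query_terms}
    n = len(token_lists)
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))
    return {term: math.log(n / count) for term, count in df.items()}


def summarize(text, k=2, query_terms=None):
    """Return the k highest-scoring sentences, in document order."""
    sentences = split_sentences(text)
    token_lists = [tokenize(s) for s in sentences]
    weights = term_weights(token_lists, query_terms)
    scored = []
    for i, (sentence, tokens) in enumerate(zip(sentences, token_lists)):
        # Summing a weight per token occurrence gives TF * IDF per term.
        score = sum(weights.get(tok, 0.0) for tok in tokens)
        # Rule layer: position and cue phrases such as "In summary,".
        if i == 0:
            score += POSITION_BONUS
        if sentence.lower().startswith(CUE_PHRASES):
            score += CUE_BONUS
        scored.append((score, i))
    # Highest score first; ties go to the earlier sentence.
    top = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:k]
    return [sentences[i] for _, i in sorted(top, key=lambda pair: pair[1])]
```

For example, on the text "Cats sleep a lot. Dogs bark at cats. In summary, pets are fun.", calling summarize with k=1 and no query selects the "In summary" sentence, while passing query_terms=["dogs"] selects "Dogs bark at cats." instead. Because scores are summed, longer sentences tend to win; normalizing by sentence length would be one way to offset that.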



Villanova CSC 9010 - Document Summarization
