SWARTHMORE CS 97 - Code-name DUTCHMAN: A Text Summarization System

Appeared in: Proceedings of the Class of 2003 Senior Conference, pages 63–68
Computer Science Department, Swarthmore College

code-name DUTCHMAN: A Text Summarization System

Erik Osheim
Swarthmore College
[email protected]

Sproul
Swarthmore College
[email protected]

Abstract

Text summarization is an interesting and challenging problem in natural language processing, and one that has numerous potential applications in the realm of data-mining, text searching, and information retrieval. We have implemented a summarization system appropriate for articles and technical texts.

The system, code-named DUTCHMAN, attempts to identify which sentences in the document are most appropriate for inclusion in a summary based on analysis using key noun-phrases. The system employs WordNet in order to extend the notion of key phrases to key concepts.

The system, in its current instantiation, only achieves mediocre results, but our work does suggest some promising avenues for future research in text summarization.

1 Introduction

The general problem of text summarization is very broad and difficult: Given some document of arbitrary length, can we produce a second document of roughly constant length (i.e., a few sentences) that conveys the general meaning of the original? How can we process a text to determine what the text is about, and then reformulate that information to produce a viable summary? Certainly, humans are able to do this, but this is typically contingent on our ability to not only parse and read, but also understand the document in question. Thus, to fully address text summarization in general, we would need to first solve a large number of difficult and currently unresolved natural language processing and artificial intelligence problems.

One option would be to use a knowledge base to identify semantic facts and topics in a document; without fully solving things like the symbol grounding problem, we can still hope to understand the text’s subject.
Summarizers which take this approach are known as symbolic summarizers. However, there are some difficulties with taking a heavily symbolic approach. First, since it requires a large knowledge base in order to function, the results do not generalize across languages well. Second, symbolic approaches are especially vulnerable to the depth vs. robustness trade-off. Simply put, systems that are created to analyze a certain type of document can restrict themselves to that domain, allowing them to make more assumptions about the source texts and thus perform better, at the expense of generality (Hovy, 2000). Since symbolic summarizers have to make a lot of assumptions about the text’s content, they tend to do especially well when they can specialize; however, this makes a general summarization tool difficult to implement symbolically. Theme and topic recognition (two common methods of symbolic summarization) are staggeringly complex in the most general cases (Mani, 1998).

Fortunately, in certain domains, documents tend to contain sentences which are self-summarizing. For example, journal and newspaper articles, and technical documents, tend to begin with, and, more generally, contain sentences which address the purpose and nature of the document as a whole. We cannot expect this sort of sentence to be found in certain other domains, for example fiction, where no part of the text can be expected to pertain to the text as a whole. Many text summarization systems (e.g., (Barker, 1998), (Szpakowicz, 1996)) choose to adopt such a restricted domain, and thus are able to exploit the self-summarizing nature of such documents.

Within this restricted domain, we can reformulate the problem of text summarization as follows: How do we select the best sentences for inclusion in a summary, and what do we do with these sentences after we have selected them?
We have based our work on the work of the Text Summarization Group at the University of Ottawa Department of Computer Science (Barker, 1998). Their general method involves identifying key noun phrases within a document, and then applying various heuristics to weight sentences based on key phrase frequency, then just concatenating the sentences in order to produce a summary.

Our text summarization system, code-named DUTCHMAN, is structured similarly, but we have extended the key phrase analysis to a form of conceptual analysis based on WordNet, allowing us to increase the emphasis placed on certain key phrases which are representative of general concepts in which other key phrases in the document participate. For example, in a paper about engines, given that engine, camshaft, and piston are all key phrases, the salience of the word engine will be increased, because camshaft and piston are both parts of engines.

2 Related Work

Due to renewed interest in text summarization, several conferences have recently addressed the problem. From these talks, it is obvious that researchers are somewhat divided over the best methods of text summarization. While many researchers favor statistical approaches similar to the one pursued in DUTCHMAN, there are also symbolic summarizers, which place more weight on trying to find important topics through world-level concepts (Hovy, 2000). These systems try to identify an underlying topic (or topics) before ranking phrases and sentences on their score. (?) In this context, DUTCHMAN is a statistical summarizer which utilizes symbolic information (via WordNet) in an attempt to improve its statistically generated keywords.

Most other projects that use symbolic information do so before their statistical processing, or do the two forms of processing independently and then attempt to integrate the results ((Szpakowicz, 1996), (Mani, 1998)).
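
The key-phrase weighting and part-of boosting outlined in the Introduction can be illustrated with a small sketch. This is not DUTCHMAN's actual implementation: it assumes single-word key phrases supplied by the caller, a naive regular-expression sentence splitter, and a hand-written `PART_OF` table standing in for WordNet's part-of relations. The function names and the `boost` parameter are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stand-in for WordNet part-of links (part -> whole).
PART_OF = {"camshaft": "engine", "piston": "engine"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text):
    """Lowercase alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def phrase_weights(text, key_phrases, boost=1.0):
    """Weight each key phrase by its frequency, then raise the weight of a
    'whole' concept by the frequency of each of its parts in the text."""
    counts = Counter(w for w in tokenize(text) if w in key_phrases)
    weights = {phrase: float(c) for phrase, c in counts.items()}
    for part, whole in PART_OF.items():
        if part in counts and whole in counts:
            weights[whole] += boost * counts[part]
    return weights


def summarize(text, key_phrases, n=2):
    """Score sentences by summed key-phrase weight, keep the top n,
    and concatenate them in their original document order."""
    sentences = split_sentences(text)
    weights = phrase_weights(text, key_phrases)

    def score(sentence):
        return sum(weights.get(w, 0.0) for w in tokenize(sentence))

    top = sorted(range(len(sentences)),
                 key=lambda i: score(sentences[i]),
                 reverse=True)[:n]
    return " ".join(sentences[i] for i in sorted(top))


if __name__ == "__main__":
    doc = ("The engine powers the car. A camshaft opens the valves. "
           "The weather was nice. Each piston drives the engine.")
    print(summarize(doc, {"engine", "camshaft", "piston"}, n=2))
```

In this toy version, the weight of engine grows because camshaft and piston also occur in the text, so sentences mentioning engine are more likely to be selected. A fuller version would replace the `PART_OF` table with lookups against WordNet itself and would extract noun phrases rather than single words.
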
However, there are many varieties of symbolic summarizers; it is unclear what the best use of ontologies is, especially given the depth/robustness trade-off. Some examples of symbolic summarization methods from the TIPSTER conference are:

• use a graph of theme nodes linked via a custom thesaurus (CIR)
• use sentences determined to be about frequently mentioned individuals via co-reference resolution (Penn)
• use morphological analysis, name tagging, and co-reference resolution to weight sentences (SRA)

Ad hoc summaries (undirected summaries like the kind DUTCHMAN generates) only comprise some of the goal of summarization systems. Most systems

