SWARTHMORE CS 97 - Code-name DUTCHMAN: A Text Summarization System - D1909328


School name Swarthmore College

Course Cs 97- Computer Perception

Pages 6

Appeared in: Proceedings of the Class of 2003 Senior Conference, pages 63–68
Computer Science Department, Swarthmore College

code-name DUTCHMAN: A Text Summarization System

Erik Osheim, Swarthmore [email protected] Sproul, Swarthmore [email protected]

Abstract

Text summarization is an interesting and challenging problem in natural language processing, and one that has numerous potential applications in the realm of data-mining, text searching, and information retrieval. We have implemented a summarization system appropriate for articles and technical texts.

The system, code-named DUTCHMAN, attempts to identify which sentences in the document are most appropriate for inclusion in a summary based on analysis using key noun-phrases. The system employs WordNet in order to extend the notion of key phrases to key concepts.

The system, in its current instantiation, only achieves mediocre results, but our work does suggest some promising avenues for future research in text summarization.

1 Introduction

The general problem of text summarization is very broad and difficult: Given some document of arbitrary length, can we produce a second document of roughly constant length (i.e., a few sentences) that conveys the general meaning of the original? How can we process a text to determine what the text is about, and then reformulate that information to produce a viable summary? Certainly, humans are able to do this, but this is typically contingent on our ability to not only parse and read, but also understand the document in question. Thus, to fully address text summarization in general, we would need to first solve a large number of difficult and currently unresolved natural language processing and artificial intelligence problems.

One option would be to use a knowledge base to identify semantic facts and topics in a document; without fully solving things like the symbol grounding problem, we can still hope to understand the text’s subject. Summarizers which take this approach are known as symbolic summarizers. However, there are some difficulties with taking a heavily symbolic approach. First, since it requires a large knowledge base in order to function, the results do not generalize well across languages. Second, symbolic approaches are especially vulnerable to the depth vs. robustness trade-off. Simply put, systems that are created to analyze a certain type of document can restrict themselves to that domain, allowing them to make more assumptions about the source texts and thus perform better, at the expense of generality (Hovy, 2000). Since symbolic summarizers have to make a lot of assumptions about the text’s content, they tend to do especially well when they can specialize; however, this makes a general summarization tool difficult to implement symbolically. Theme and topic recognition (two common methods of symbolic summarization) are staggeringly complex in the most general cases (Mani, 1998).

Fortunately, in certain domains, documents tend to contain sentences which are self-summarizing. For example, journal and newspaper articles, and technical documents, tend to begin with, and, more generally, contain sentences which address the purpose and nature of the document as a whole. We cannot expect this sort of sentence to be found in certain other domains, for example fiction, where no part of the text can be expected to pertain to the text as a whole. Many text summarization systems (e.g., (Barker, 1998), (Szpakowicz, 1996)) choose to adopt such a restricted domain, and thus are able to exploit the self-summarizing nature of such documents.

Within this restricted domain, we can reformulate the problem of text summarization as follows: How do we select the best sentences for inclusion in a summary, and what do we do with these sentences after we have selected them?
We have based our work on the work of the Text Summarization Group at the University of Ottawa Department of Computer Science (Barker, 1998). Their general method involves identifying key noun phrases within a document, and then applying various heuristics to weight sentences based on key phrase frequency, then just concatenating the sentences in order to produce a summary.

Our text summarization system, code-named DUTCHMAN, is structured similarly, but we have extended the key phrase analysis to a form of conceptual analysis based on WordNet, allowing us to increase the emphasis placed on certain key phrases which are representative of general concepts in which other key phrases in the document participate. For example, in a paper about engines, given that engine, camshaft, and piston are all key phrases, the salience of the word engine will be increased, because camshaft and piston are both parts of engines.

2 Related Work

Due to renewed interest in text summarization, several conferences have recently addressed the problem. From these talks, it is obvious that researchers are somewhat divided over the best methods of text summarization. While many researchers favor statistical approaches similar to the one pursued in DUTCHMAN, there are also symbolic summarizers, which place more weight on trying to find important topics through world-level concepts (Hovy, 2000). These systems try to identify an underlying topic (or topics) before ranking phrases and sentences on their score. (?) In this context, DUTCHMAN is a statistical summarizer which utilizes symbolic information (via WordNet) in an attempt to improve its statistically generated keywords.

Most other projects that use symbolic information do so before their statistical processing, or do the two forms of processing independently and then attempt to integrate the results ((Szpakowicz, 1996), (Mani, 1998)).
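
As a rough illustration of the sentence-selection idea outlined in the Introduction (weight sentences by key phrases, raise the salience of a phrase whose parts are also key phrases, then concatenate the best sentences in document order), consider the sketch below. It is not the Ottawa group's code or DUTCHMAN's actual implementation. The key phrases are supplied by hand rather than extracted, the `PART_OF` table is a hypothetical stand-in for a WordNet part-of lookup, and the function names, the `boost` value, and the plural-matching rule are all assumptions made for the example.

```python
import re

# Hypothetical part -> whole table standing in for a WordNet part-of lookup.
PART_OF = {"camshaft": "engine", "piston": "engine"}


def _pattern(phrase):
    # Crude matcher (assumption): whole word, with an optional plural "s".
    return re.compile(r"\b" + re.escape(phrase) + r"s?\b")


def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def phrase_weights(text, key_phrases, part_of=PART_OF, boost=1.0):
    # Base weight of each key phrase is its frequency in the document.
    lowered = text.lower()
    weights = {p: float(len(_pattern(p).findall(lowered))) for p in key_phrases}
    # Concept step: a "whole" gains salience for each of its parts
    # that also occurs as a key phrase in the document.
    for part, whole in part_of.items():
        if whole in weights and weights.get(part, 0.0) > 0:
            weights[whole] += boost
    return weights


def summarize(text, key_phrases, n=2, part_of=PART_OF):
    weights = phrase_weights(text, key_phrases, part_of)
    sentences = split_sentences(text)
    scored = []
    for idx, sentence in enumerate(sentences):
        low = sentence.lower()
        # A sentence scores the summed weight of the key phrases it contains.
        score = sum(w for p, w in weights.items() if _pattern(p).search(low))
        scored.append((score, idx))
    # Take the n best sentences (ties go to the earlier sentence),
    # then restore document order before concatenating.
    top = sorted(scored, key=lambda t: (-t[0], t[1]))[:n]
    return " ".join(sentences[i] for i in sorted(i for _, i in top))
```

Called with the key phrases engine, camshaft, and piston, `phrase_weights` raises the weight of engine above the other two because both of its listed parts occur in the text, so sentences mentioning engine are favored by `summarize`. A real system would replace `PART_OF` with actual WordNet queries and would need a proper key-phrase extractor.
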
However, there are many varieties of symbolic summarizers; it is unclear what the best use of ontologies is, especially given the depth/robustness trade-off. Some examples of symbolic summarization methods from the TIPSTER conference are:

• use a graph of theme nodes linked via a custom thesaurus (CIR)
• use sentences determined to be about frequently mentioned individuals via co-reference resolution (Penn)
• use morphological analysis, name tagging, and co-reference resolution to weight sentences (SRA)

Ad hoc summaries (undirected summaries like the kind DUTCHMAN generates) make up only part of the goal of summarization systems. Most systems
