SMU CSE 8331 - Document classification - D2827032




1. Introduction
2. Definitions
3. Document data mining applications
4. Document image understanding
5. Human understanding of similarity
6. Physical structure analysis
   6.1 Algorithm evaluation criteria
   6.2 Noise
   6.3 Skew
   6.4 Segmentation
   6.5 Local and global feature extraction
   6.6 Shape coding
7. Classification of documents
8. Summary
9. References

Document classification

Miroslav [email protected]

1. Introduction

Society is shifting towards the dominant use of digital information. We have the option to receive bank statements, credit card bills, shopping invoices, or proofs of delivery for goods purchased online in the form of electronic documents. Many countries have passed laws making the digital signature an equal counterpart of the traditional one. The electronic representation of documents has allowed us to access the information they convey more efficiently. The storage of electronic information has become more reliable, making the fear of losing archives in events similar to the destruction of the Great Library of Alexandria [33] hopefully a thing of the past. The cost of duplicating electronic information has become insignificant, on the one hand enabling beneficial transfer of knowledge and on the other raising concerns about the collapse of traditional business models for publishing, music, and visual media distribution.

In spite of all the advantages of electronic media, people still prefer paper to record, transfer, and archive information. Paper is still the prevalent medium used in offices to conduct business. This might be because paper has been used in this manner for about 5000 years, since the invention of papyrus [32]. We have learned to trust it as well as to deal with its shortcomings. It is uncertain how long it will take to gain the same level of trust in and familiarity with electronic documents. Before a complete shift to a paperless world can occur, we need to learn how to handle the growing amount of paper we produce today more efficiently.

2. Definitions

The meaning of the term “paper document” can be very broad. In the context of this paper we adopt from [34] the definition of paper documents as “objects created expressly to convey information encoded as iconic symbols”. This definition therefore excludes non-symbolic documents such as portraits or medical and satellite images. An electronic document (sometimes also called an image document [1], or simply a document) is then, for the purpose of this paper, defined as an ordered collection of images created from the paper document using a scanner or fax. Individual images are called pages. A special kind of (paper or image) document is the form. A form differs from other documents because, as described in [16, 25], it contains almost exclusively horizontal and vertical lines creating a Manhattan-type layout (as defined in [18]). In addition to preprinted information it also contains user filled-in data. The locations holding this data are called fields.

Many of the reviewed papers limit themselves even further: instead of considering any kind of paper-based document, they focus on documents consisting of only a single page, ignoring relationships between multiple pages. The area of interest is also mainly fill-in forms, scientific papers, or business letters. This is probably because the most popular benchmark data sets, the University of Washington image databases UW I – III, consist of images of English-language documents selected from scientific and technical journals [9, 14, 17, 19, 20, 21, 23, 25, 26].

Document image processing is the process of recognizing, understanding, and organizing documents based on given criteria [1]. Two distinct phases are generally recognized within this process [20, 25]. During the document analysis phase the image is segmented into regions of interest. During the document understanding phase these regions are classified into different types, such as text or image, based on the geometric and physical characteristics of the objects.
Logical meaning, such as title, abstract, paragraph, address, or summary, is also assigned to the recognized objects, and a reading order among them is established. Both of these phases are often labeled with the common name document image understanding [36].

3. Document data mining applications

Document classification has many practical applications for recognition, understanding, organization, and retrieval of image documents.

In [2] Appiani et al. state: “A medium size bank with some hundred branches produces from 30,000 up to 100,000 account notes and enclosures a day.” To process this amount of paper efficiently in an automated fashion, every page has to be transferred to electronic form, analyzed, and labeled for further computer processing. Once documents are in electronic form, classifying them into predefined categories allows a system to route them automatically for further processing, improving the workflow of data in the organization. A system named STRETCH (Storage and RETrieval by Content of imaged documents) is described in [2]. This system automates the process of classification, archiving, and retrieval of documents. Its classification method is based on a custom decision tree algorithm that creates a DDT (document decision tree). The tree nodes represent modified XY trees obtained during document segmentation.

Transfer of existing archives of paper documents and books into a more accessible electronic form provides another application for document classification. Libraries keep detailed catalogues of books and journals, but producing such a catalogue manually is very labor intensive. Manual creation of 600 bibliographic records a day to populate MEDLINE, the database of the National Library of Medicine in Bethesda, Maryland, required 246 hours of retyping of information [28]. This represents the work of about 30 people manually transferring data from paper to computer 8 hours a day.
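The staffing figure quoted from [28] can be checked with a few lines of arithmetic. The sketch below uses only the numbers given in the text (600 records, 246 hours, an 8-hour working day); the variable names are illustrative and do not come from the cited work.

```python
# Figures quoted in the text from [28]; variable names are illustrative only.
records_per_day = 600        # bibliographic records created per day
typing_hours_per_day = 246   # total hours of retyping needed for those records
shift_hours = 8              # length of one working day, as stated in the text

# Number of full-time people needed to cover the daily typing hours.
people_needed = typing_hours_per_day / shift_hours

# Average retyping time spent on a single record, in minutes.
minutes_per_record = typing_hours_per_day * 60 / records_per_day

print(f"people needed: {people_needed:.2f}")
print(f"minutes per record: {minutes_per_record:.1f}")
```

Under these figures the workload comes to 30.75 full-time typists, which the text rounds to 30 people, and to an average of about 24.6 minutes of retyping per record.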
In [28] the authors describe MARS-2, the Medical Article Record System, which, by classifying scanned pages into known categories of publications, can automatically associate known attributes such as the publisher or the name of the journal with these documents. A familiar class of documents also makes the process of automated recognition and
