Document Classification
Miroslav Halas ([email protected])
CSE 8331 Data Mining
Southern Methodist University

Paper vs Digital
●Shift towards electronic documents, such as invoices, bank statements...
●Advantages: storage, delivery, reliability, cost...
●Disadvantages: copying, authenticity, experience (= trust)...
●Facilitate the shift
  –Improve handling of the increased amount of paper
  –Process it electronically
  –Computers need to understand the documents

Definitions
●Paper document – conveys information as iconic symbols (no photographs)[34]
●Electronic document = document – a collection of images (pages)
●Form – Manhattan-type layout[18], horizontal and vertical lines, preprinted and user-filled data (fields)
●Focus – fill-in forms, business and scientific documents –> benchmark: University of Washington image database I – III [9, 14, 17, 19, 20, 21...]

Real Life Applications
●Italy 2001: a medium-size bank with some 100 branches produces 30,000 – 100,000 account notes and enclosures a day
  –Storage and RETrieval by Content of imaged documents[2] – classification for routing
●US 2002: National Library of Medicine, MD, 600 bibliographical records a day take 246 hours of retyping
  –Medical Article Record System 2[28] – classification to prepopulate data and identify layout

Additional usage
●Duplicate detection[11]
●Image databases
  –Query by example – Intelligent Document Image Retrieval[13]
●Document clustering[6]
  –Document boundary detection
  –Missing pages
  –Invalid order

Document image understanding
●"Formal representation of abstract relationships indicated by the two-dimensional arrangements of the symbols"[34]
●Similarity of documents
  –Look
  –Content
  –Function
●Geometric, functional and logical document description[12]

Human understanding of similarity
●How do you measure the correctness/quality of a result?
●How much detail do humans need to judge similarity?
●Appearance (layout, presentation) is enough to make judgments about similarity (not considering content)
  –Physical structure analysis
  –Thumbnails

Human understanding of similarity
●Similarity experiment[14]
●12 thumbnails used to classify 979 images from the UoW I database

Algorithm evaluation
●High-volume environment – both throughput and total amount matter
●Fast – constant time
●Scalable – data set grows
●Accurate – false positives/false negatives
●Data used
  –Robust – degradation of quality
  –Unique – important for classification
  –Compact – storage and retrieval

Preprocessing
●Noise
  –Remove – filters
  –Ignore – threshold for connected components
●Skew
  –Detect
    ●Hough Transform based – transformation from Cartesian space (x, y) into sinusoidal space (ρ, θ)
    ●Projection Profile based – accumulation of black pixels in different directions
    ●Main optimization: how many pixels do you need to investigate – high/low resolution images
  –Correct or account for

Data in the image
●Geometrical layout
  –Segmentation – detect and describe
  –Classification based on spatial characteristics – Gtree, DDT
●Statistical data
  –Extract global and local features
  –Classification using numeric data – DT, neural networks
●Text
  –Shape coding – describe the look of the text without using OCR
  –Information retrieval

Segmentation
●Divide the document into homogeneous areas of interest
●RLSA – Wong 1982[29]
  –Smear together black pixels which are separated by less than a predetermined number of white pixels
  –Perform in horizontal and vertical direction
    ●2 images
    ●AND between them
  –Black connected areas will be separated by white space – change the threshold to build a hierarchy
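A minimal sketch of the RLSA smearing step just described, assuming a binary page image held as a NumPy array with 1 = black and 0 = white; the threshold values and helper names are illustrative, not taken from Wong's paper.

    import numpy as np

    def smear_rows(img, threshold):
        """Blacken runs of white pixels shorter than `threshold`, row by row."""
        out = img.copy()
        for row in out:                      # iterating a 2-D array yields row views
            run_start = None
            for j, px in enumerate(row):
                if px == 0 and run_start is None:
                    run_start = j            # a white run begins
                elif px == 1 and run_start is not None:
                    if j - run_start < threshold:
                        row[run_start:j] = 1 # smear the short white gap
                    run_start = None
        return out

    def rlsa(img, h_threshold=30, v_threshold=50):
        """Run Length Smoothing Algorithm: horizontal and vertical smearing, then AND."""
        horizontal = smear_rows(img, h_threshold)
        vertical = smear_rows(img.T, v_threshold).T   # vertical pass = smear the transpose
        return horizontal & vertical                  # black only where both passes agree

The black connected regions of the AND image become the candidate blocks; tightening or loosening the two thresholds splits or merges them, which is how the hierarchy mentioned above can be built.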
Segmentation (cont.)
●RXYC
  –Homogeneous blocks of a document can be bounded by rectangular regions
  –Top-down (a recursive-cut sketch follows these notes)
    ●Detect white space and horizontal and vertical lines
      –Projection profiles of black pixels
      –Detect max and min in the histogram
    ●Perform alternately a vertical or a horizontal cut
    ●Output is a hierarchical structure
      –Document is the root
      –Homogeneous regions are nodes
      –Leaves determined by a threshold

Segmentation (cont.)
●RXYC (cont.)
  –Bottom-up
    ●Detect low-level connected components
    ●Classify them as image or text
    ●Projection profiles of connected components
    ●Detect globally maximal valleys in the profiles
    ●Cuts and grouping
  –Comparison
    ●TD more natural (coarse to detailed), maintains the hierarchical structure
    ●BU computationally more expensive, starts from low detail

Segmentation (cont.)
●Bottom-up approach[20]

Segmentation (cont.)
●Problems
  –Parameters – thresholds
  –Training performance – days[25]
●Alternative representation – white space bounding rectangles[7, 8, 9]
  –Characterize the content (black pixels) using the white pixels (background)
  –May provide a more efficient representation
  –Requires presegmentation

Local and global features
●Statistical characteristics of the document image
●Global – identified for the whole image
●Local – identified for a window or a homogeneous region
●Examples (a feature-extraction sketch follows these notes)
  –Average gray area of the region[1, 2]
  –Average word height, character width, line spacing and line indentation[6]
  –Statistics for height and width of connected components (count, sum, mean, median), for area, perimeter...[14, 15]

Shape coding
●Analyze images of text without the use of OCR[10, 11, 13]
  –Detect connected components
  –Map individual symbols into a smaller character set
  –Use statistics such as descender, ascender, x-line crossing count, punctuation
●Analyze the whole image
  –Information retrieval, keyword searches
●Analyze just a characteristic line
  –Classification based on its signature

Classification algorithms
●Document page matching algorithm[15]
  –Classification based on spatial layout is difficult –> has to account for errors in segmentation, deformations, similar shapes
  –Input: segmented image and a database of segmented template images
  –Output: class (template image)
  –Algorithm outline (an overlap-scoring sketch follows these notes):
    ●Identify templates which overlap with the input
    ●Discard matches which have an obviously different layout (vertical overlap) while keeping those which may not match 100% due to errors (horizontal segmentation)
    ●Compute the percentage of overlap and non-overlap
    ●Rank templates accordingly

Classification algorithms (cont.)
●Feature based classification[15] (a decision-tree sketch follows these notes)
  –Documents with variable layout
  –Input: features extracted from an image and a decision tree or neural network
  –Decision Tree (OC1 – linear combinations of features) is built, or Neural Network (Kohonen SOM) is trained, using features extracted from the training set
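The top-down recursive X-Y cut outlined above can be sketched in a few lines. This is a simplified illustration, not the algorithm from the cited papers: it assumes a binary NumPy image (1 = black), and the valley-width and minimum-block-size parameters are invented for the example.

    import numpy as np

    def xy_cut(img, top, bottom, left, right, horizontal=True, valley=10, min_size=20):
        """Recursively split a region at the widest all-white valley of its projection
        profile, alternating the cut direction; return the leaf regions."""
        if bottom - top < min_size or right - left < min_size:
            return [(top, bottom, left, right)]            # too small to split further

        region = img[top:bottom, left:right]
        axis = 1 if horizontal else 0                      # row profile for a horizontal cut
        profile = region.sum(axis=axis)                    # black-pixel count per row/column

        # find the longest run of zero entries (an all-white valley) in the profile
        best_start, best_len, run_start = 0, 0, None
        for i, count in enumerate(profile):
            if count == 0:
                if run_start is None:
                    run_start = i
                if i - run_start + 1 > best_len:
                    best_start, best_len = run_start, i - run_start + 1
            else:
                run_start = None

        if best_len < valley:                              # no valley wide enough: leaf
            return [(top, bottom, left, right)]

        cut = best_start + best_len // 2                   # cut through the middle of the valley
        if horizontal:
            halves = [(top, top + cut, left, right), (top + cut, bottom, left, right)]
        else:
            halves = [(top, bottom, left, left + cut), (top, bottom, left + cut, right)]
        leaves = []
        for t, b, lf, rt in halves:
            leaves += xy_cut(img, t, b, lf, rt, horizontal=not horizontal,
                             valley=valley, min_size=min_size)
        return leaves

Calling xy_cut(page, 0, page.shape[0], 0, page.shape[1]) returns the leaf rectangles; keeping the intermediate splits instead of only the leaves would give the hierarchical structure with the document as the root.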

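A sketch of the connected-component statistics listed under "Local and global features", using SciPy's labelling; the particular feature names and choice of statistics are illustrative rather than the exact feature set of the cited work.

    import numpy as np
    from scipy import ndimage

    def component_features(img):
        """Global statistics over connected components of a binary page (1 = black)."""
        labels, n = ndimage.label(img)                # 4-connected components by default
        if n == 0:
            return {"component_count": 0}             # blank page: nothing to measure
        boxes = ndimage.find_objects(labels)          # one bounding-slice pair per component
        heights = np.array([b[0].stop - b[0].start for b in boxes])
        widths  = np.array([b[1].stop - b[1].start for b in boxes])
        areas   = ndimage.sum(img, labels, index=np.arange(1, n + 1))  # black pixels per component
        return {
            "component_count": n,
            "height_mean": heights.mean(),  "height_median": np.median(heights),
            "width_mean":  widths.mean(),   "width_median":  np.median(widths),
            "area_mean":   areas.mean(),    "area_sum":      areas.sum(),
        }

Local variants of the same features are obtained by running the function on a window or on a single segmented region instead of on the whole page.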

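The page-matching outline reduces to comparing the block layout of the input page against each stored template and ranking by overlap. The sketch below is a simplification under my own assumptions – blocks as (top, left, bottom, right) rectangles and a symmetric overlap score – and it does not implement the paper's tolerance for segmentation errors.

    def area(r):
        return max(0, r[2] - r[0]) * max(0, r[3] - r[1])

    def overlap_area(a, b):
        """Intersection area of two (top, left, bottom, right) rectangles."""
        h = max(0, min(a[2], b[2]) - max(a[0], b[0]))
        w = max(0, min(a[3], b[3]) - max(a[1], b[1]))
        return h * w

    def match_score(page_blocks, template_blocks):
        """Fraction of block area shared between the page layout and a template layout."""
        total = sum(area(b) for b in page_blocks) + sum(area(b) for b in template_blocks)
        shared = sum(overlap_area(p, t) for p in page_blocks for t in template_blocks)
        return 2 * shared / total if total else 0.0

    def rank_templates(page_blocks, templates):
        """templates: dict mapping class name -> list of blocks; best matches first."""
        return sorted(templates, key=lambda name: match_score(page_blocks, templates[name]),
                      reverse=True)

The class assigned to the page is then the top-ranked template, possibly after rejecting templates whose score falls below a threshold.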
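For the feature-based classifier on the last slide, any vector of layout features (for example the output of component_features above) can feed a decision tree. The sketch uses scikit-learn's ordinary axis-parallel tree rather than OC1, which splits on linear combinations of features, and random placeholder data instead of a real training set.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Placeholder feature matrix: one row per training page, columns such as
    # component count, mean height, mean width, line spacing, indentation, ...
    rng = np.random.default_rng(0)
    X = rng.random((200, 6))
    y = rng.choice(["letter", "invoice", "form", "article"], size=200)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    clf = DecisionTreeClassifier(max_depth=8, random_state=0)   # axis-parallel splits, unlike OC1
    clf.fit(X_train, y_train)
    print("held-out accuracy:", clf.score(X_test, y_test))

    # A new page is classified by extracting the same features and calling
    # clf.predict(features.reshape(1, -1)).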