Document Classification
Miroslav Halas ([email protected])
CSE 8331 Data Mining
Southern Methodist University

Paper vs Digital
● Shift towards electronic documents, such as invoices, bank statements...
● Advantages: storage, delivery, reliability, cost...
● Disadvantages: copying, authenticity, experience (= trust)...
● Facilitate the shift
  – Improve handling of the increased amount of paper
  – Process it electronically
  – Computers need to understand the documents

Definitions
● Paper document – conveys information as iconic symbols (no photographs) [34]
● Electronic document = document – a collection of images (pages)
● Form – Manhattan-type layout [18]: horizontal and vertical lines, preprinted and user-filled data (fields)
● Focus – fill-in forms, business and scientific documents –> benchmark: University of Washington image databases I–III [9, 14, 17, 19, 20, 21...]

Real Life Applications
● Italy, 2001: a medium-size bank with some 100 branches produces 30,000–100,000 account notes and enclosures a day
  – Storage and RETrieval by Content of imaged documents [2] – classification for routing
● US, 2002: National Library of Medicine, MD: 600 bibliographical records a day take 246 hours of retyping
  – Medical Article Record System 2 [28] – classification to prepopulate data and identify layout

Additional usage
● Duplicate detection [11]
● Image databases
  – Query by example – Intelligent Document Image Retrieval [13]
● Document clustering [6]
  – Document boundary detection
  – Missing pages
  – Invalid ordering

Document image understanding
● "Formal representation of abstract relationships indicated by the two-dimensional arrangements of the symbols" [34]
● Similarity of documents
  – Look
  – Content
  – Function
● Geometric, functional and logical document description [12]

Human understanding of similarity
● How do you measure the correctness/quality of a result?
● How much detail do humans need to judge similarity?
● Appearance (layout, presentation) is enough to make judgments about similarity (without considering content)
  – Physical structure analysis
  – Thumbnails
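The appearance-only judgment above can be made concrete with a toy sketch: downsample each page into a small "thumbnail" of ink densities and compare pages by the distance between thumbnails. This is a hypothetical illustration assuming pages are 0/1 pixel matrices (1 = black); it is not the protocol of the cited experiment [14].

```python
# Toy sketch: appearance similarity via thumbnails. Pages are assumed to
# be binary matrices (1 = black); grid sizes are illustrative choices.

def thumbnail(page, rows, cols):
    """Downsample a binary page to a rows x cols grid of ink densities."""
    h, w = len(page), len(page[0])
    thumb = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            r0, r1 = r * h // rows, (r + 1) * h // rows
            c0, c1 = c * w // cols, (c + 1) * w // cols
            block = [page[i][j] for i in range(r0, r1) for j in range(c0, c1)]
            thumb[r][c] = sum(block) / max(len(block), 1)
    return thumb

def appearance_distance(a, b):
    """L1 distance between two equally sized thumbnails."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
```

Two pages with ink in the same regions get a small distance regardless of what the ink actually says, which is exactly the content-free notion of similarity the slide describes.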
Human understanding of similarity (cont.)
● Similarity experiment [14]
● 12 thumbnails used to classify 979 images from the UoW I database

Algorithm evaluation
● High-volume environment: both throughput and volume of data matter
● Fast – constant time
● Scalable – as the data set grows
● Accurate – low false positive/false negative rates
● Data used must be
  – Robust – to degradation of image quality
  – Unique – important for classification
  – Compact – for storage and retrieval

Preprocessing
● Noise
  – Remove – filters
  – Ignore – threshold on connected-component size
● Skew
  – Detect
    ● Hough Transform based – transformation from Cartesian space (x, y) into sinusoidal space (ρ, θ)
    ● Projection profile based – accumulation of black pixels in different directions
    ● Main optimization: how many pixels do you need to investigate – high- vs low-resolution images
  – Correct or account for

Data in the image
● Geometrical layout
  – Segmentation – detect and describe
  – Classification based on spatial characteristics – Gtree, DDT
● Statistical data
  – Extract global and local features
  – Classification using numeric data – decision trees, neural networks
● Text
  – Shape coding – describe the look of the text without using OCR
  – Information retrieval

Segmentation
● Divide the document into homogeneous areas of interest
● RLSA – Wong 1982 [29]
  – Smear together black pixels which are separated by fewer than a predetermined number of white pixels
  – Perform in horizontal and vertical directions
    ● 2 images
    ● AND between them
  – Black connected areas will be separated by white space – change the threshold to build a hierarchy

Segmentation (cont.)
● RXYC
  – Homogeneous blocks of the document can be bounded by rectangular regions
  – Top-down
    ● Detect white space and horizontal and vertical lines
      – Projection profiles of black pixels
      – Detect maxima and minima in the histogram
    ● Perform alternately vertical or horizontal cuts
    ● Output is a hierarchical structure
      – Document is the root
      – Homogeneous regions are nodes
      – Leaves determined by a threshold

Segmentation (cont.)
● RXYC (cont.)
  – Bottom-up
    ● Detect low-level connected components
    ● Classify them as image or text
    ● Projection profiles of connected components
    ● Detect
globally maximal valleys in the profiles
    ● Cuts and grouping
● Comparison
  – TD is more natural (coarse to detailed) and maintains the hierarchical structure
  – BU is computationally more expensive and starts from low-level detail

Segmentation (cont.)
● Bottom-up approach [20]

Segmentation (cont.)
● Problems
  – Parameters – thresholds
  – Training performance – days [25]
● Alternative representation – white-space bounding rectangles [7, 8, 9]
  – Characterize the content (black pixels) using the white pixels (background)
  – May provide a more efficient representation
  – Requires presegmentation

Local and global features
● Statistical characteristics of the document image
● Global – identified for the whole image
● Local – identified for a window or a homogeneous region
● Examples
  – Average gray area of the region [1, 2]
  – Average word height, character width, line spacing and line indentation [6]
  – Statistics (count, sum, mean, median) for the height, width, area, perimeter... of connected components [14, 15]

Shape coding
● Analyze images of text without the use of OCR [10, 11, 13]
  – Detect connected components
  – Map individual symbols into a smaller character set
  – Use statistics such as descender, ascender, x-line crossing count, punctuation
● Analyze the whole image
  – Information retrieval, keyword searches
● Analyze just a characteristic line
  – Classification based on its signature

Classification algorithms
● Document page matching algorithm [15]
  – Classification based on spatial layout is difficult –> it has to account for errors in segmentation, deformations, and similar shapes
  – Input: a segmented image and a database of segmented template images
  – Output: class (template image)
  – Algorithm outline:
    ● Identify templates which overlap with the input
    ● Discard matches with obviously different layout (vertical overlap) while keeping those which may not match 100% due to errors (horizontal segmentation)
    ● Compute the percentage of overlap and non-overlap
    ● Rank templates accordingly

Classification algorithms (cont.)
● Feature-based classification [15]
  – Documents with variable layout
  – Input: features extracted
from an image, and a decision tree or neural network
  – A decision tree (OC1 – linear combinations of features) is built, or a neural network (Kohonen SOM) is trained, using features extracted from the training set
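The RLSA smearing described on the Segmentation slides can be sketched in a few lines. Pages are assumed to be 0/1 matrices (1 = black), and the thresholds are illustrative choices, not values prescribed by the slides.

```python
# Sketch of RLSA smearing (Wong 1982): white runs shorter than a threshold
# that lie between black pixels are filled with black. Horizontal and
# vertical smearing produce the "2 images" of the slides, combined by AND.

def smear_row(row, threshold):
    """Fill white runs shorter than `threshold` that lie between black pixels."""
    out = list(row)
    run_start = None            # start index of the current white run
    for i, px in enumerate(row):
        if px == 1:
            if run_start is not None and i - run_start < threshold:
                for j in range(run_start, i):
                    out[j] = 1  # smear: gap is narrower than the threshold
            run_start = i + 1
    return out

def rlsa(image, h_threshold, v_threshold):
    """Smear horizontally and vertically, then AND the two images."""
    horiz = [smear_row(r, h_threshold) for r in image]
    vert_t = [smear_row(list(col), v_threshold) for col in zip(*image)]
    vert = [list(col) for col in zip(*vert_t)]
    return [[a & b for a, b in zip(ra, rb)] for ra, rb in zip(horiz, vert)]
```

For example, `smear_row([1, 0, 0, 1, 0, 0, 0, 0, 1], 3)` fills the two-pixel gap but leaves the four-pixel gap, giving `[1, 1, 1, 1, 0, 0, 0, 0, 1]`; lowering the threshold separates more blocks, which is how the hierarchy-building step on the slide works.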
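The projection-profile cut at the heart of RXYC can also be sketched briefly. This is a simplified single-cut illustration under the same 0/1-matrix assumption; the recursion, the alternation of cut direction, and the leaf-size threshold from the slides are omitted.

```python
# Sketch of one RXYC step: project black pixels onto an axis and split
# the region at the widest all-white valley of the profile.

def projection_profile(image):
    """Number of black pixels in each row (horizontal projection)."""
    return [sum(row) for row in image]

def widest_valley(profile):
    """Return (start, end) of the longest run of empty rows, or None."""
    best, start = None, None
    for i, v in enumerate(profile + [1]):   # sentinel closes a trailing run
        if v == 0 and start is None:
            start = i
        elif v != 0 and start is not None:
            if best is None or i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return best
```

A top-down segmenter would cut the page at the returned valley, then recurse on each half with the projection taken along the other axis.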
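Finally, the template-ranking idea behind the document page matching algorithm [15] can be approximated as follows. This is a simplified sketch that scores templates purely by block overlap; the published algorithm's tolerance for horizontal over-segmentation and its discard step are omitted, and the rectangle representation is an assumption for the example.

```python
# Simplified sketch of layout-based page matching: score each template by
# how much of the input page's block area its blocks cover, then rank.

def overlap_area(a, b):
    """Intersection area of two (x0, y0, x1, y1) rectangles."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0

def match_score(page_blocks, template_blocks):
    """Fraction of the page's block area covered by template blocks."""
    covered = sum(max(overlap_area(p, t) for t in template_blocks)
                  for p in page_blocks)
    total = sum((x1 - x0) * (y1 - y0) for x0, y0, x1, y1 in page_blocks)
    return covered / total if total else 0.0

def rank_templates(page_blocks, templates):
    """Return template names ordered best match first."""
    return sorted(templates,
                  key=lambda n: match_score(page_blocks, templates[n]),
                  reverse=True)
```

The output class of the matcher is then simply the top-ranked template, mirroring the "rank templates accordingly" step of the outline.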