
Optical Character Recognition Errors and Their Effects on Natural Language Processing

International Journal on Document Analysis and Recognition manuscript No. (will be inserted by the editor)

Optical Character Recognition Errors and Their Effects on Natural Language Processing

Daniel Lopresti
Department of Computer Science and Engineering, Lehigh University, 19 Memorial Drive West, Bethlehem, PA 18015, USA

Received December 19, 2008 / Revised August 23, 2009

Abstract. Errors are unavoidable in advanced computer vision applications such as optical character recognition, and the noise induced by these errors presents a serious challenge to downstream processes that attempt to make use of such data. In this paper, we apply a new paradigm we have proposed for measuring the impact of recognition errors on the stages of a standard text analysis pipeline: sentence boundary detection, tokenization, and part-of-speech tagging. Our methodology formulates error classification as an optimization problem solvable using a hierarchical dynamic programming approach. Errors and their cascading effects are isolated and analyzed as they travel through the pipeline. We present experimental results based on a large collection of scanned pages to study the varying impact depending on the nature of the error and the character(s) involved. This dataset has also been made available online to encourage future investigations.

Key words: Performance evaluation – Optical character recognition – Sentence boundary detection – Tokenization – Part-of-speech tagging

1 Introduction

Despite decades of research and the existence of established commercial products, the output from optical character recognition (OCR) processes often contains errors. The more highly degraded the input, the greater the error rate.
Since such systems can form the first stage in a pipeline where later stages are designed to support sophisticated information extraction and exploitation applications, it is important to understand the effects of recognition errors on downstream text analysis routines. Are all recognition errors equal in impact, or are some worse than others? Can the performance of each stage be optimized in isolation, or must the end-to-end system be considered? What are the most serious forms of degradation a page can suffer in the context of natural language processing? In balancing the tradeoff between the risk of over- and under-segmenting characters during OCR, where should the line be drawn to maximize overall performance? The answers to these questions should influence the way we design and build document analysis systems.

Researchers have already begun studying problems relating to processing text data from noisy sources. To date, this work has focused predominantly on errors that arise during speech recognition. For example, Palmer and Ostendorf describe an approach for improving named entity extraction by explicitly modeling speech recognition errors through the use of statistics annotated with confidence scores [18]. The inaugural Workshop on Analytics for Noisy Unstructured Text Data [23] and its follow-up workshops [24,25] have featured papers examining the problem of noise from a variety of perspectives, with most emphasizing issues that are inherent in written and spoken language.

There has been less work, however, in the case of noise induced by optical character recognition. Early papers by Taghva, Borsack, and Condit show that moderate error rates have little impact on the effectiveness of traditional information retrieval measures [21], but this conclusion is tied to certain assumptions about the IR model ("bag of words"), the OCR error rate (not too high), and the length of the documents (not too short). Miller et al.
study the performance of named entity extraction under a variety of scenarios involving both ASR and OCR output [17], although speech is their primary interest. They found that by training their system on both clean and noisy input material, performance degraded linearly as a function of word error rates.

Farooq and Al-Onaizan proposed an approach for improving the output of machine translation when presented with OCR'ed input by modeling the error correction process itself as a translation problem [5].

A paper by Jing, Lopresti, and Shih studied the problem of summarizing textual documents that had undergone optical character recognition and hence suffered from typical OCR errors [10]. From the standpoint of performance evaluation, this work employed a variety of indirect measures: for example, comparing the total number of sentences returned by sentence boundary detection for clean and noisy versions of the same input text, or counting the number of incomplete parse trees generated by a part-of-speech tagger.

Fig. 1. Propagation of OCR errors through NLP stages (the "error cascade").

In two later papers [12,13], we turned to the question of performance evaluation for text analysis pipelines, proposing a paradigm based on the hierarchical application of approximate string matching techniques. This flexible yet mathematically rigorous approach both quantifies the performance of a given processing stage and identifies explicitly the errors it has made.
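The core idea of such an approximate-string-matching evaluation can be illustrated with a standard edit-distance alignment between ground-truth text and OCR output. The sketch below is a simplified, single-level illustration in Python, not the paper's actual hierarchical formulation from [12,13]: it computes a character-level alignment by dynamic programming and classifies each discrepancy as a substitution, insertion, or deletion.

```python
def align(ref, hyp):
    """Align a reference string against a hypothesis (e.g., OCR output).

    Returns (edit_distance, errors), where errors is a list of
    ("sub"|"ins"|"del", ref_char, hyp_char) tuples in reading order.
    A minimal sketch of Wagner-Fischer dynamic programming with backtrace.
    """
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # deleting i reference characters
    for j in range(n + 1):
        d[0][j] = j          # inserting j hypothesis characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # match / substitution
    # Backtrace to recover and classify the individual errors.
    i, j, errors = m, n, []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and \
           d[i][j] == d[i - 1][j - 1] + (0 if ref[i - 1] == hyp[j - 1] else 1):
            if ref[i - 1] != hyp[j - 1]:
                errors.append(("sub", ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif j > 0 and d[i][j] == d[i][j - 1] + 1:
            errors.append(("ins", "", hyp[j - 1]))
            j -= 1
        else:
            errors.append(("del", ref[i - 1], ""))
            i -= 1
    return d[m][n], list(reversed(errors))
```

For example, aligning "recognition" against the OCR'ed "rec0gnition" yields a distance of 1 and a single substitution of "o" by "0". The hierarchical version in [12,13] applies the same principle at multiple levels (lines, words, characters) rather than flat character strings.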
Also presented were the results of pilot studies where small sets of documents (tens of pages) were OCR'ed and then piped through standard routines for sentence boundary detection, tokenization, and part-of-speech tagging, demonstrating the utility of the approach.

In the present paper, we employ this same evaluation paradigm, but using a much larger and more realistic dataset totaling over 3,000 scanned pages, which we are also making available to the community to foster work in this area [14]. We study the impact of several real-world degradations on optical character recognition and the NLP processes that follow it, and plot later-stage performance as a function of the input OCR accuracy. We conclude by outlining possible topics for future research.

2 Stages in Text Analysis

In this section, we describe the prototypical stages that are common to many text analysis systems, discuss some of the problems that can arise, and then list the specific packages we use in our work. The stages, in order, are: (1) optical character recognition, (2) sentence boundary detection, (3) tokenization, and (4) part-of-speech tagging. These basic procedures are of interest because they form the basis for more
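To make the error-cascade idea concrete, the toy Python sketch below shows how a single OCR character error can propagate through the later stages. The sentence splitter and tokenizer here are deliberately naive regex rules invented for illustration; they are not the actual packages evaluated in the paper.

```python
import re

def split_sentences(text):
    # Naive boundary detector: split after ., !, or ? followed by whitespace.
    return re.split(r"(?<=[.!?])\s+", text.strip())

def tokenize(sentence):
    # Naive tokenizer: runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", sentence)

# Hypothetical example: the OCR engine misreads a period as a comma.
clean = "The rate fell. It was 5.2 percent."
noisy = "The rate fell, It was 5.2 percent."

clean_sents = split_sentences(clean)  # two sentences
noisy_sents = split_sentences(noisy)  # one merged sentence
```

A single character-level substitution ("." read as ",") changes the sentence count from two to one, so tokenization and part-of-speech tagging then operate on a wrong sentence unit; this is exactly the cascade depicted in Fig. 1, and why later stages cannot be evaluated in isolation from OCR quality.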

