Chapter 1

QUESTIONED ELECTRONIC DOCUMENTS: EMPIRICAL STUDIES IN AUTHORSHIP ATTRIBUTION

Patrick Juola

Abstract: Forensic analysis of questioned electronic documents is very difficult, because the nature of the documents eliminates many kinds of informative differences. Recent work in authorship attribution demonstrates the practicality of analyzing documents based on authorial style, but the state of the art is confusing. Analyses are difficult to apply, little is known about type or rate of errors, and no "best practices" are available. We present the results of some recent experiments and software development to address these issues, partly through the development of a systematic testbed for multilingual, multigenre authorship attribution accuracy, and partly through the development and concurrent analysis of a uniform and portable software tool that applies multiple methods to analyze electronic documents for authorship based on authorial style.

Keywords: Authorship attribution, stylometrics, software development, text forensics

1. Introduction

The forensic importance of questioned documents is well understood — did Aunt Martha really write this disputed version of "her" will? Document examiners can look at handwriting (or typewriting) and determine authorship with near-miraculous sophistication from the dot of an 'i' or the cross of a 't'. Electronic documents do not contain these clues. Any two flat-ASCII 'A' characters are identical. How can one determine who made a defamatory, but anonymous, post on a blog, for example? Whether the authorship of a purely electronic document can be demonstrated to the demanding standards of a Daubert [7] hearing is an open, but important, research question.

2. The Problem

With the advent of modern computer technology, a substantial amount of "writing" today never involves pen, ink, or paper. This very paper is a good example — born as a PDF file, the first time these words see paper is in the bound volume.
If my authorship of these words were challenged, I would have no physical artifacts for specialists to examine.

Furthermore, the nature of electronic documents makes it substantially easier to "publish" or misappropriate them tracelessly, or even to commit forgery with relative impunity. A network investigation will at best reveal only the specific computer on which the document was written. It is almost impossible to figure out who was at the keyboard — who wrote it.

Chaski [6] describes three incident-based scenarios where it is both necessary to pierce the GUI and impossible to do so with traditional network investigations. In all three cases, there was no question about which computer the documents came from. Instead, the question was whether the purported authorship could be validated. The key question thus can be structured in terms of the message content: can the authorship of an electronic document be inferred reliably from the message content itself?

3. Related Work

3.1 Authorship attribution

Recent studies in authorship attribution suggest that such an inference is possible, but further research may be necessary to meet the stringent criteria of Daubert. As a problem, the question of determining authorship by examining style has a long history. For example, Judges 12:5–6 describes the inference of tribal identity from the pronunciation of a specific word. Such shibboleths could involve specific lexical or phonological items; a person who writes of sitting on a "Chesterfield" is presumptively Canadian [8]. Wellman [27] describes how an individual spelling error — an idiosyncratic spelling of "toutch" — was used in court to validate a document.

At the same time, such tests cannot be relied upon. Idiosyncratic spelling or not, the word "touch" is rather rare (86 tokens in the million-word Brown corpus [22]), and it is unlikely to be found independently in two different samples.
People are also not consistent in their language, and may (mis)spell words differently at different times; often the tests must be able to handle distributions instead of mere presence/absence judgments. The continuing discussion of methods to do this is an active research area: a Google search for "authorship attribution" turned up 70,400 hits on May 4, 2006, up from 49,500 on November 13, 2005, illustrating the continuing activity in this area over just six months.

A key insight in recent research is that the statistical distribution of common patterns, such as the use of prepositions, may be universal enough to be relied upon while still being informative. For this reason, scholars have recently focused on more sophisticated and more reliable statistical tests. Specifically, Burrows [3–5] demonstrated that a statistical analysis of common words in large samples of text could group texts by author. Since then, many additional methods [9, 11, 24, 25, 2, 13, 12, 23, 1, 6] have been proposed. The current state of the art is an ad hoc mess of disparate methods with little cross-comparison to determine which methods work and which don't — or, more accurately, since they all work at least reasonably well (under the conditions discussed below, 90% accuracy is fairly typical for "good" methods; see also [18]), which methods work best.

Authorial analysis can even reveal more subtle aspects, such as the dates of documents. Figure 1 shows such an analysis [16] within a single author (Jack London), clearly dividing works written before 1912 from works written after.
The apparent division is a vertical line at about 3.14 on "Dimension 1." Finding that a newly discovered London manuscript would be placed on the left side of the diagram would be strong evidence that it was written after 1912 as well.

3.2 Test Corpus Development: The Baayen Experiments

With the wide variety of techniques available, it is important and yet very difficult to compare the power and accuracy of different techniques. A fingerprint appropriate to distinguish between Jack London and Rudyard Kipling, for example, may not work to distinguish between Jane Austen and George Eliot. A proper comparison would involve standardized texts of clear provenance, known authorship, and strictly controlled topics, so that the performance of each technique can be measured in a fair and accurate way. Forsyth [10] compiled a first benchmark collection of texts for validating authorship attribution techniques. Baayen [2] has developed a more tightly controlled series of texts produced
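The common-word analysis described in Section 3.1 can be sketched concretely. The toy Python below implements a simplified Burrows-style "Delta" comparison: profile each candidate author by the relative frequencies of the most common words, z-score those profiles, and attribute the questioned text to the author with the smallest mean absolute z-score difference. This is only an illustrative sketch under stated assumptions, not the chapter's actual software; the author names and texts are invented.

```python
# Toy sketch of Burrows-style common-word attribution (a simplified
# "Delta" measure). Authors and texts below are invented examples.
from collections import Counter
from statistics import mean, pstdev

def rel_freqs(text, vocab):
    """Relative frequency of each vocabulary word in one text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    n = len(tokens) or 1
    return [counts[w] / n for w in vocab]

def attribute(known, questioned, n_words=20):
    """Attribute `questioned` to the author in `known` whose common-word
    profile is closest in z-score space (smallest mean absolute
    difference, as in Burrows' Delta)."""
    # Feature set: the most frequent words across all known samples.
    pool = Counter()
    for text in known.values():
        pool.update(text.lower().split())
    vocab = [w for w, _ in pool.most_common(n_words)]

    profiles = {a: rel_freqs(t, vocab) for a, t in known.items()}
    q = rel_freqs(questioned, vocab)

    # Per-feature mean and population stdev over the known samples,
    # used to z-score every profile (fallback 1.0 avoids divide-by-zero).
    cols = list(zip(*profiles.values()))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]
    z = lambda v: [(x - m) / s for x, m, s in zip(v, mus, sds)]

    zq = z(q)
    deltas = {a: mean(abs(x - y) for x, y in zip(zq, z(p)))
              for a, p in profiles.items()}
    return min(deltas, key=deltas.get)
```

With two invented samples, one author leaning on "the" and the other on "a", a questioned text that also favours "the" attributes to the first author; real applications would of course use far larger samples and restrict the vocabulary to genuinely topic-free function words.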

