Stanford CS 224 - Automating Document review

Automating Document Review
CS224n Final Project
Nathaniel Love
June 9, 2006

Abstract

Law firms engaged in litigation expend significant time and resources on document review, a process requiring brief examination of thousands to hundreds of thousands of client documents. Often performed by first-year associates, the document review task is essentially a classification problem: documents need to be sorted into a number of categories, based on their content, their author(s), and a variety of other features. Due to the high volume of documents to be examined, and the often simple rules needed to classify documents, this process is ripe for assistance by a natural language processing system. This paper explores the desired features of an NLP document review system and demonstrates some results using a prototype.

1 Discovery and Document Review

In large litigation cases or government investigations, law firms expend significant effort in the discovery process, during which litigants must produce internal documents relevant to the matter in question, and make them available either to the opposing counsel or in response to a government subpoena. The documents in question may include memoranda, financial statements, internal work papers, and a great deal of email. The widespread use of email by employees of large corporations has significantly increased the volume of documents that need to be reviewed as part of the discovery process. For the purposes of this exploration, we will focus on the task of classifying emails; many of the other documents in question are included as attachments to emails, and references to these documents in the email body may be used for classifying both types of documents.

1.1 Categorization

When performing document review (often abbreviated as “doc review”), law firm associates are generally looking to sort documents among two pairs of categories.
These categories are defined by the parameters of the particular case being litigated: documents may be responsive (relevant to one or more of the specific requests in the subpoena) or non-responsive (not requested by the subpoena), and either privileged or non-privileged. Privileged documents are attorney-client communications; in the corporate context, this can include in-house counsel as well as outside attorneys. As noted above, these categories are fluid: the responsive character of an email may hinge on the sender, the recipient, the topic, the date it was sent, and so on. The responsive category is often expressed as the union of a large group of subcategories; the umbrella term responsive is used because each of these subcategories is effectively defined as a response to a particular request made by the opposing counsel or government subpoena.

1.2 Costs

This initial document review process alone is extraordinarily time-consuming and costly. A major litigation case or government investigation can produce on the order of 500,000 documents needing review. As some attorneys involved in this process have described their work, a rate of 100 documents/hour is on the quick side of typical. This means that reviewing documents for a major case could easily require 5,000 hours of work; when billed at the current standard first-year associate’s rate of around $275/hour, that means a staggering cost of $1.375 million to the client.

1.3 Current Process

Associates are assigned sets of documents (the email in and out boxes of a “custodian” (employee), consisting of thousands of messages) and must pass through and classify all of them. The emails are in electronic format, and may be text-searchable, but the semantic data (sender, recipients, date) may not have been preserved, and may need to be recovered by reading the headers.
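As a minimal sketch of what recovering this metadata from the headers could look like, assuming each message is available as raw RFC 5322-style text, the following uses Python's standard email package. The function name extract_metadata and the sample message are hypothetical illustrations, not part of the prototype described in this paper.

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime


def extract_metadata(raw_email: str) -> dict:
    """Recover sender, recipients, date, and subject from raw headers.

    Hypothetical helper for illustration only.
    """
    msg = message_from_string(raw_email)

    # getaddresses splits comma-separated header values into
    # (display name, address) pairs.
    senders = getaddresses(msg.get_all("From", []))
    recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))

    # The Date header may be missing or malformed in recovered mail,
    # so fall back to None rather than failing.
    sent = None
    date_header = msg.get("Date")
    if date_header:
        try:
            sent = parsedate_to_datetime(date_header)
        except (TypeError, ValueError):
            sent = None

    return {
        "sender": [addr for _, addr in senders],
        "recipients": [addr for _, addr in recipients],
        "date": sent,
        "subject": msg.get("Subject", ""),
    }


# Made-up sample message used only to exercise the helper.
SAMPLE = (
    "From: Alice Smith <alice@example.com>\n"
    "To: Bob Jones <bob@example.com>, carol@example.com\n"
    "Cc: counsel@example.com\n"
    "Date: Fri, 09 Jun 2006 10:30:00 -0700\n"
    "Subject: Q2 work papers\n"
    "\n"
    "Please see the attached memo.\n"
)

meta = extract_metadata(SAMPLE)
```

With fields like these available, documents could be grouped by sender or sorted by date before review, rather than read in arbitrary order.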
Associates looking for a particular term can search forit, but using this search term will likely return significant numbers of documents in which this term is usedin a non-responsive context. Furthermore, documents for a given custodian are not necessarily presented inany particular order– they are just drawn from the mailboxes one at a time and presented for review. Thisslows down the process because associates looking for particular topics, senders, or date ranges (based onthe subpoena) are presented with essentially randomly drawn emails. Jumping from one topic (an NCAAbasketball score update email) to another (a critical internal memo detailing misconduct) in this manneralso increases the likelihood of errors.1.4 Existing Doc Review SystemsOne of the prominent doc review systems in use today is LexisNexis’ Applied Discovery. The strong sellingpoint of this electronic doc review system is the move to PDF represe ntations of all docume nts, from thescanned TIFF files that previous systems used1. This means that documents are now text-searchable, andApplied Discovery can now extract senders, recipients, and dates from emails, and track email threads2.Many of Applied Discovery’s competitors include systems like that still use TIFF files and do not offer textsearch capabilities.These doc review systems have overcome the non-trivial task of acquiring all the relevant documents(from various computer systems, hard copies, etc.) and getting them all in the same electronic format– this1See, for example, http://www.lexisnexis.com/applieddiscovery/lawlibrary/whitePapers/ADI PDFTrumpsTIFF.pdf2Features described at http://www.lexisnexis.com/applieddiscovery/electronicDiscovery/litigation.aspNathaniel Love 3alone has made the document review process much more efficient. 
However, even the most advanced doc review systems are taking only minimal advantage of the opportunities offered by the transition to electronic text.

2 NLP Doc Review

The document review and classification performed by law firm associates is subject to review by other attorneys working on the case; the classifications they perform essentially flag documents that require closer examination. The discovery process is highly sensitive, so several layers of expert oversight are required. For this reason, it is inappropriate to attempt to build and field an NLP doc review system intended to supplant the entire process, or even the initial associate review. Rather, the intention is to apply natural language processing techniques to augment and increase the efficiency of the process. While associates still need to examine documents in person, categorization by an NLP doc review system can greatly reduce both the amount of time spent on this task as

