A Method of Automated Nonparametric Content Analysis for Social Science

Daniel J. Hopkins, Georgetown University
Gary King, Harvard University

American Journal of Political Science, Vol. 54, No. 1, January 2010, Pp. 229–247. ©2010, Midwest Political Science Association. ISSN 0092-5853.

The increasing availability of digitized text presents enormous opportunities for social scientists. Yet hand coding many blogs, speeches, government records, newspapers, or other sources of unstructured text is infeasible. Although computer scientists have methods for automated content analysis, most are optimized to classify individual documents, whereas social scientists instead want generalizations about the population of documents, such as the proportion in a given category. Unfortunately, even a method with a high percent of individual documents correctly classified can be hugely biased when estimating category proportions. By directly optimizing for this social science goal, we develop a method that gives approximately unbiased estimates of category proportions even when the optimal classifier performs poorly. We illustrate with diverse data sets, including the daily expressed opinions of thousands of people about the U.S. presidency. We also make available software that implements our methods and large corpora of text for further analysis.

Daniel J. Hopkins is Assistant Professor of Government, Georgetown University, 681 Intercultural Center, Washington, DC 20057 ([email protected], http://www.danhopkins.org). Gary King is Albert J. Weatherhead III University Professor, Harvard University, Institute for Quantitative Social Science, 1737 Cambridge St., Cambridge, MA 02138 ([email protected], http://gking.harvard.edu).

Replication materials are available at Hopkins and King (2009); see http://hdl.handle.net/1902.1/12898. Our special thanks to our indefatigable undergraduate coders Sam Caporal, Katie Colton, Nicholas Hayes, Grace Kim, Matthew Knowles, Katherine McCabe, Andrew Prokop, and Keneshia Washington. Each coded numerous blogs, dealt with the unending changes we made to our coding schemes, and made many important suggestions that greatly improved our work. Matthew Knowles also helped us track down and understand the many scholarly literatures that intersected with our work, and Steven Melendez provided invaluable computer science wizardry; both are coauthors of the open source and free computer program that implements the methods described herein (ReadMe: Software for Automated Content Analysis; see http://gking.harvard.edu/readme). We thank Ying Lu for her wisdom and advice, Stuart Shieber for introducing us to the relevant computer science literature, and http://Blogpulse.com for getting us started with more than a million blog URLs. Thanks to Ken Benoit, Doug Bond, Justin Grimmer, Matt Hindman, Dan Ho, Pranam Kolari, Mark Kantrowitz, Lillian Lee, Will Lowe, Andrew Martin, Burt Monroe, Stephen Purpura, Phil Schrodt, Stuart Shulman, and Kevin Quinn for helpful suggestions or data. Thanks also to the Library of Congress (PA#NDP03-1), the Center for the Study of American Politics at Yale University, the Multidisciplinary Program on Inequality and Social Policy, and the Institute for Quantitative Social Science for research support.

Efforts to systematically categorize text documents date to the late 1600s, when the Church tracked the proportion of printed texts which were non-religious (Krippendorff 2004). Similar techniques were used by earlier generations of social scientists, including Waples, Berelson, and Bradshaw (1940, which apparently includes the first use of the term “content analysis”) and Berelson and de Grazia (1947). Content analyses like these have spread to a vast array of fields, with automated methods now joining projects based on hand coding, and have increased at least sixfold from 1980 to 2002 (Neuendorf 2002). The recent explosive increase in web pages, blogs, emails, digitized books and articles, transcripts, and electronic versions of government documents (Lyman and Varian 2003) suggests the potential for many new applications. Given the infeasibility of much larger scale human-based coding, the need for automated methods is growing fast. Indeed, large-scale projects based solely on hand coding have stopped altogether in some fields (King and Lowe 2003, 618).

This article introduces new methods of automated content analysis designed to estimate the primary quantity of interest in many social science applications. These new methods take as data a potentially large set of text documents, of which a small subset is hand coded into an investigator-chosen set of mutually exclusive and exhaustive categories. As output, the methods give approximately unbiased and statistically consistent estimates of the proportion of all documents in each category. Accurate estimates of these document category proportions have not been a goal of most work in the classification literature, which has focused instead on increasing the accuracy of classification into individual document categories. Unfortunately, methods tuned to maximize the percent of documents correctly classified can still produce substantial biases in the aggregate proportion of documents within each category. This poses no problem for the task for which these methods were designed, but it suggests that a new approach may be of use for many social science applications.

When social scientists use formal content analysis, it is typically to make generalizations using document category proportions.
Consider examples as far-ranging as Mayhew (1991, chap. 3), Gamson (1992, chaps. 3, 6, 7, and 9), Zaller (1992, chap. 9), Gerring (1998, chaps. 3–7), Mutz (1998, chap. 8), Gilens (1999, chap. 5), Mendelberg (2001, chap. 5), Rudalevige (2002, chap. 4), Kellstedt (2003, chap. 2), Jones and Baumgartner (2005, chaps. 3–10), and Hillygus and Shields (2008, chap. 6). In all these cases and many others, researchers conducted content analyses to learn about the distribution of classifications in a population, not to assert the classification of any particular document (which would be easy to do through a close reading of the document in question). For example, the manager of a congressional office would find useful an automated method of sorting individual constituent letters by policy area so they can be routed to the most informed staffer to draft a response. In contrast, political scientists would be interested primarily in
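
The point made earlier in the text, that a classifier with a high percent of documents correctly classified can still give a biased estimate of category proportions, can be illustrated with simple arithmetic. The sketch below is not the authors' method; it only shows the problem with classifying each document and then counting the labels. All numbers are hypothetical assumptions: 1,000 documents, 200 of them truly in category A, and a classifier that labels 60% of A documents and 99% of B documents correctly. The function name `classify_and_count` is likewise invented for this illustration.

```python
def classify_and_count(n_a, n_b, recall_a, recall_b):
    """Return (accuracy, estimated share of A) for a two-category classifier.

    n_a, n_b: true document counts in categories A and B (hypothetical).
    recall_a, recall_b: share of each category the classifier labels correctly.
    """
    total = n_a + n_b
    correct_a = n_a * recall_a   # A documents labeled A
    correct_b = n_b * recall_b   # B documents labeled B
    wrong_b = n_b - correct_b    # B documents mislabeled as A
    accuracy = (correct_a + correct_b) / total
    # Counting predicted labels: everything labeled A, rightly or wrongly.
    estimated_share_a = (correct_a + wrong_b) / total
    return accuracy, estimated_share_a


# Hypothetical population: 200 documents in A, 800 in B.
true_share_a = 200 / 1000
accuracy, estimated_share_a = classify_and_count(
    200, 800, recall_a=0.60, recall_b=0.99
)
bias = estimated_share_a - true_share_a

print(f"accuracy:        {accuracy:.3f}")
print(f"true share of A: {true_share_a:.3f}")
print(f"estimated share: {estimated_share_a:.3f}")
print(f"bias:            {bias:+.3f}")
```

Under these assumed numbers, 91.2% of documents are classified correctly, yet the estimated share of category A is 12.8% against a true share of 20%. The errors in the two categories do not cancel, so the aggregate proportion is off even though most individual labels are right.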

