MIT 17.871 - How to Use the STATA infile and infix Commands

School name Massachusetts Institute of Technology

Course 17.871 - Political Science Laboratory

17.871
Spring 2007

How to Use the STATA infile and infix Commands

STATA is a very flexible program, allowing you to read in and manipulate data in many different forms. This is good, because social science data come in various formats, requiring great flexibility among the statistical packages social scientists use. Unfortunately, the STATA manual we are using only covers how to input the simplest of data sets. The simplicity of the examples in that book borders on the trivial. The purpose of this handout, therefore, is to introduce you to the use of the STATA infile and infix commands, going a little more in depth than the Hamilton book goes.

An easy case

Let us say that you have data about four students who have taken a standardized test. You have their first names, their ages, and their scores on two tests (Test 1 and Test 2). Here are the data in tabular form:

Name   Age  Test 1  Test 2
Bob    18   95      18
Carol  21   43      27
Ted    14   67       9
Alice  12   23      31

The easiest way to get these data into STATA is for you to fire up the STATA Data Editor and just type the data into the spreadsheet-like interface.

The next-easiest way to get the data into STATA is for you to type the data into a file and then let STATA read it in. Let's say your Athena username is janedoe. You could create a data file using a text editor such as EMACS. Let's say you saved the data in a file in your directory named scores.dat. The file scores.dat looks like the following.[1]

Exhibit 1

Bob 18 95 18
Carol 21 43 27
Ted 14 67 9
Alice 12 23 31

[1] Note that all files that STATA reads must end with a carriage return.

Then, from within STATA you would type the following:

    infile str5 name age test1 test2 using /mit/janedoe/scores.dat

The word infile is the command name. The words name, age, test1, and test2 are the variable names. STATA variable names must be 32 characters long, or shorter,[2] and begin with a letter or underscore (_). STATA generally assumes that variables contain numbers. If the data are not numeric, STATA needs to be told that a variable is non-numeric (a text "string") and the longest the text string can be. That is the function of the word str5 before the word name: to specify that name is a text string that may be no longer than 5 characters.[3]

[2] Earlier versions of STATA (i.e., versions 6 and earlier) limited variable names to 8 characters in length. I continue to maintain that convention, for compatibility reasons.

[3] This is actually no longer true with STATA 8. However, there are copies of STATA 7 floating around, so this statement will always work, regardless of the STATA version you're using.

After you have typed in the infile command, you should then issue the compress command. That is because STATA has some tricky memory management problems, and this command will convert all the variables to their most efficient internal representations.

An example with fixed field data

The above example is the simplest case of reading in data for use by STATA. In addition to being a short, narrow data set, we are able to express this data set using what we call a "free form" format: the data are just freely typed into the computer, with nothing but a space to separate variable values. Data sets are rarely this simple. For instance, if you had someone with a first name of "Mary Jane" you would have to get rid of the space (by typing in something like MaryJane or Mary_Jane). If you had thousands of observations (instead of four) and variables (instead of four), the spaces necessary to delimit individual values might cause the data set to balloon beyond what is really necessary to contain the unique information among the data. For these, and other, reasons, data sets are typically organized using a "fixed format". With fixed format organization, each line begins a new observation[4] and each variable occupies the same column(s) on each line.

[4] There are important exceptions that will be dealt with below.

The fixed format version of the data would look something like the following:

Exhibit 2

Bob  189518
Carol214327
Ted  1467 9
Alice122331

To read in these data, you would use the STATA infix command. This is what you would type to read in the data from Exhibit 2:

    infix str5 name 1-5 age 6-7 test1 8-9 test2 10-11 using scores.dat

A word about missing data

Sometimes data will be missing from a data set. There are three ways of indicating missing data in STATA: (1) the lone period, (2) missing value codes, and (3) blanks.

The lone period

STATA generally represents missing values with a lone period where the value of the variable should be. For instance, say that Ted would not tell us his age. We could account for this fact by placing a period where his age should go, either in free form:

Bob 18 95 18
Carol 21 43 27
Ted . 67 9
Alice 12 23 31

or in fixed format:

Bob  189518
Carol214327
Ted   .67 9
Alice122331

STATA would then exclude Ted from any calculations or procedures that required the use of the age variable.

Missing value codes

Most social science data sets use missing value codes to indicate missing values. It is most common to give someone an impossible value for that variable when the variable's true value is missing, and then to tell the statistical program about that value. So, for instance, ages must be positive. Therefore, we could make the value of -1 indicate a missing age, give Ted an age of -1, and then tell STATA what we've done. The data would look like this:

Bob  189518
Carol214327
Ted  -167 9
Alice122331

There are then two ways to change a missing value code into an actual missing value representation in STATA. The most general way is to use the replace command:

    replace age = . if age == -1

The above command tells STATA to replace values of age with the missing value representation in those cases where age equals -1.

This technique can get tedious if you have lots of variables with the same missing value code. STATA has a command, mvdecode, which converts missing value codes to their proper representation. For instance, if age, test1, and test2 all used -1 for missing data, you could issue the following single command to accommodate the missing values:

    mvdecode age test1 test2, mv(-1)

Beginning with STATA version 8, you can use up to 27 different missing value codes. For instance, let's say you're working on a public opinion survey. One question asks if the person voted for president in the last election. The next question asks for whom the respondent voted, but only if the person reported in the previous question that she had voted for president. In this case, you will have missing data on the second
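As a minimal sketch (not part of the original handout), the free-format steps described above can be collected into a short do-file. It assumes that scores.dat is the Exhibit 1 file saved in the current working directory, with Ted's age entered as -1. The file location, the opening clear, and the closing list command are illustrative assumptions.

```stata
* Minimal sketch under stated assumptions; not the handout's own do-file.
* Assumption: scores.dat holds the free-format data from Exhibit 1,
* sits in the current working directory, and codes Ted's age as -1.
clear
infile str5 name age test1 test2 using scores.dat

* Turn the -1 code into STATA's missing value (.) for all three variables.
* For a single variable this matches: replace age = . if age == -1
mvdecode age test1 test2, mv(-1)

* Store each variable in its most efficient type, then inspect the data.
compress
list
```

Running mvdecode once over the whole variable list avoids repeating a replace command for each variable that shares the same missing value code.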
