UI STAT 5400 - Computing in Statistics

22S:166
Computing in Statistics
Lecture 22
Nov. 10, 2008
Kate Cowles
374 [email protected]

Data checking and validation (“cleaning”)
• making sure that raw data were accurately entered into a computer-readable file
• checking that character variables contain only valid values
• checking that numeric values are within predetermined ranges
• checking whether there are missing values for variables where complete data are necessary
• checking for and eliminating duplicate records
• checking for uniqueness of certain values, such as patient IDs
• checking for invalid date values
• checking that an ID number is present in each of several related files
• verifying that more complex multi-file rules have been followed
  – example: if an adverse event of type X occurs in one dataset, you expect an observation with the same ID number in another data set. In addition, the date of this observation must be after the adverse event and before the end of the study.

from Cody’s Data Cleaning Techniques Using SAS Software by Ron Cody, SAS Institute, 1999.

Example dataset: Patients.dat
(Runs of blanks are collapsed in this copy; the actual column positions are those given in the INPUT statement of the SAS code that follows.)

001M11/11/1998 88140 80 10
002F11/13/1998 84120 78 X0
003X10/21/1998 68190100 31
004F01/01/1999101200120 5A
XX5M05/07/1998 68120 80 10
006 06/15/1999 72102 68 61
007M08/32/1998 88148102 0
   M11/11/1998 90190100 0
008F08/08/1998210 70
009M09/25/1999 86240180 41
010f10/19/1999 40120 10
011M13/13/1998 68300 20 41
012M10/12/98 60122 74 0
013208/23/1999 74108 64 1
014M02/02/1999 22130 90 1
002F11/13/1998 84120 78 X0
003M11/12/1999 58112 74 0
015F 82148 88 31
017F04/05/1999208 84 20
019M06/07/1999 58118 70 0
123M15/12/1999 60 10
321F 900400200 51
020F99/99/9999 10 20 8 0
022M10/10/1999 48114 82 21
023f12/31/1998 22 34 78 0
024F11/09/199876 120 80 10
025M01/01/1999 74102 68 51
027FNOTAVAIL NA 166106 70
028F03/28/1998 66150 90 30
029M05/15/1998 4
006F07/07/1999 82148 84 10

Files on course web page
• data file: patients.dat
• SAS program: patients172.sas

SAS Code to read in the data

*----------------------------------------------------------------*
| PROGRAM NAME: PATIENTS.SAS IN C:\CLEANING                      |
| PURPOSE: TO CREATE A SAS DATA SET CALLED PATIENTS              |
| DATE: MAY 29, 1998                                             |
*----------------------------------------------------------------*;
OPTIONS FORMCHAR = "|----|+|---+=|-/\<>*" LINESIZE = 75 NODATE;
* LIBNAME CLEAN "C:\CLEANING";
*DATA CLEAN.PATIENTS;
DATA PATIENTS;
*INFILE "C:\temp\patients.dat" PAD;
INFILE "/group/ftp/pub/kcowles/datasets/patients.dat" PAD;
INPUT @1  PATNO  $3.
      @4  GENDER $1.
      @5  VISIT  MMDDYY10.
      @15 HR     3.
      @18 SBP    3.
      @21 DBP    3.
      @24 DX     $3.
      @27 AE     $1.;
LABEL PATNO  = "Patient Number"
      GENDER = "Gender"
      VISIT  = "Visit Date"
      HR     = "Heart Rate"
      SBP    = "Systolic Blood Pressure"
      DBP    = "Diastolic Blood Pressure"
      DX     = "Diagnosis Code"
      AE     = "Adverse Event?";
FORMAT VISIT MMDDYY10.;
RUN;

New aspects of this data step
• PAD option on infile statement
  – pads short records with blanks, out to the default logical record length or to a length specified by another infile option, lrecl
  – prevents skipping to the next record of data when a shorter line is encountered
• @ in input statement
  – tells SAS at which column to begin reading each variable
  – needed when there are no delimiters between variable values in the data file
• informats (input formats) after each variable name
  – how many digits or characters in each value
  – identify character variables with $
  – MMDDYY10. is a built-in SAS informat for reading dates
  – must end with a period

Proc contents: Getting SAS to describe the contents of a dataset

/*************************************************************************
Extra code: getting a description of the dataset
*************************************************************************/
PROC CONTENTS DATA = PATIENTS ;
RUN ;

The SAS System
The CONTENTS Procedure

Data Set Name:  WORK.PATIENTS               Observations:          31
Member Type:    DATA                        Variables:             8
Engine:         V8                          Indexes:               0
Created:        8:43 Friday, June 6, 2003   Observation Length:    40
Last Modified:  8:43 Friday, June 6, 2003   Deleted Observations:  0
Protection:                                 Compressed:            NO
Data Set Type:                              Sorted:                NO
Label:

-----Engine/Host Dependent Information-----
Data Set Page Size:          8192
Number of Data Set Pages:    1
First Data Page:             1
Max Obs per Page:            203
Obs in First Data Page:      31
Number of Data Set Repairs:  0
File Name:                   /usr/tmp/SAS_workEB5E00003805_mouse/patients.sas7bdat
Release Created:             8.0202M0
Host Created:                HP-UX
Inode Number:                44658
Access Permission:           rw-------
Owner Name:                  UNKNOWN
File Size (bytes):           16384

-----Alphabetic List of Variables and Attributes-----
#  Variable  Type  Len  Pos  Format     Label
----------------------------------------------------------------------
8  AE        Char    1   39             Adverse Event?
6  DBP       Num     8   24             Diastolic Blood Pressure
7  DX        Char    3   36             Diagnosis Code
2  GENDER    Char    1   35             Gender
4  HR        Num     8    8             Heart Rate
1  PATNO     Char    3   32             Patient Number
5  SBP       Num     8   16             Systolic Blood Pressure
3  VISIT     Num     8    0  MMDDYY10.  Visit Date

Validity checks on character variables

Variable  Valid values
Gender    F, M
DX        numerals 1 through 999
AE        0, 1

• Are there other (invalid) values for these variables in the dataset?
• Are there missing values?
• Which observations in the dataset contain invalid or missing values?

Using proc freq to list all distinct values of a character variable that appear in the dataset

/*************************************************************************
Program 1-2 Using PROC FREQ to list all the unique values for character
Variables
*************************************************************************/
PROC FREQ DATA=CLEAN.PATIENTS;
TITLE "Frequency Counts for Selected Character Variables";
TABLES GENDER DX AE / NOCUM NOPERCENT;
RUN;

Frequency Counts for Selected Character Variables
The FREQ Procedure

Gender
GENDER  Frequency
-------------------
2               1
F              12
M              14
X               1
f               2
Frequency Missing = 1

Diagnosis Code
DX  Frequency
---------------
1           7
2           2
3           3
4           3
5           3
6           1
7           2
X           2
Frequency Missing = 8

Adverse Event?
AE  Frequency
---------------
0          19
1          10
A           1
Frequency Missing = 1

Using proc print to list invalid character values and identify the observations
• a where statement, available in many procedures, excludes observations that don’t meet a given logical condition
• simple logical conditions involve comparing the value in a variable to some specified value
• example: where hr > 150
• example: where gender not in ('M' 'F' ' ')

/*************************************************************************
Program 1-4 Using PROC PRINT to list invalid character values
*************************************************************************/
PROC PRINT DATA=CLEAN.PATIENTS;
TITLE "LISTING OF INVALID
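
The where conditions described above can be combined in a single proc print step. What follows is a minimal sketch, not a reconstruction of the truncated Program 1-4: it assumes the WORK.PATIENTS data set created by the data step earlier, the title text is made up for illustration, and using VERIFY to flag non-numeric DX values is one possible choice rather than the method from the course or the book. Blank (missing) values are deliberately not flagged here, following the gender example above.

```sas
/* Sketch only: list observations with invalid character values.      */
/* Assumes the WORK.PATIENTS data set created by the data step above. */
PROC PRINT DATA=PATIENTS;
   /* Hypothetical title, chosen for this illustration */
   TITLE "Observations with invalid GENDER, DX, or AE values";
   ID PATNO;              /* identify each listed observation by patient number */
   VAR GENDER DX AE;      /* print only the variables being checked */
   /* Keep an observation if any of the three checks fails.           */
   /* VERIFY returns the position of the first character of DX that   */
   /* is not a digit or a blank, and 0 if there is no such character. */
   WHERE GENDER NOT IN ('F' 'M' ' ')
      OR AE NOT IN ('0' '1' ' ')
      OR VERIFY(DX, '0123456789 ') > 0;
RUN;
```

Because blanks are in the allowed list for every check, observations with missing values would need a separate listing, for example with a condition such as where gender = ' '.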

