UI STAT 5400 - Computing in Statistics

22S:166 Computing in Statistics
Data validation and description
Proc format
Lecture 21
Oct. 28, 2009
Kate Cowles
374 SH, [email protected]

Data checking and screening

• important to do prior to any multivariate analyses
• purpose: to identify incorrect, invalid, or otherwise suspect data
• begin with simple descriptive statistics and plots for each variable
• types of checks for binary, nominal, and ordinal data
  – frequency and proportion of invalid categories
  – frequency and proportion of missing classifications
  – adequate representation of categories of interest?
• types of checks for continuous data
  – range screens
  – consistency screens
  – accuracy of measurement

Primary question for the applied statistician: Does this make sense?

Example: the Berkeley Guidance Study

    options linesize = 70 pagesize = 60 nodate nonumber ;

    data berkboy ;
      infile '/group/ftp/pub/kcowles/datasets/berkboy.dat' ;
      input wt2 ht2 wt9 ht9 lg9 st9 wt18 ht18 lg18 st18 soma ;
    run ;

    proc corr ;
    run ;

    proc reg data = berkboy ;
      model soma = ht2 wt2 ht9 wt9 st9 ;
    run ;

    proc reg data = berkboy ;
      model soma = ht9 wt9 st9 ;
    run ;

The CORR Procedure

11 Variables: wt2 ht2 wt9 ht9 lg9 st9 wt18 ht18 lg18 st18 soma

Simple Statistics

    Variable     N        Mean     Std Dev          Sum
    wt2         26   214.53846     8.37726         5578
    ht2         26    13.59231     1.61862    353.40000
    wt9         26    88.40000     3.03592         2298
    ht9         26    31.58462     4.35850    821.20000
    lg9         26   136.54615     5.31603         3550
    st9         26    27.53077     1.89626    715.80000
    wt18        26    71.30769    10.69119         1854
    ht18        26    71.58077    11.56509         1861
    lg18        26   180.03846     6.39619         4681
    st18        26    36.33846     2.72882    944.80000
    soma        26   210.42308    25.26210         5471

Simple Statistics

    Variable      Minimum      Maximum
    wt2         201.00000    228.00000
    ht2          11.30000     17.20000
    wt9          81.30000     92.20000
    ht9          24.50000     43.10000
    lg9         125.40000    146.00000
    st9          24.20000     32.40000
    wt18         45.00000     98.00000
    ht18         50.30000    110.20000
    lg18        169.40000    195.10000
    st18         31.00000     44.10000
    soma        152.00000    252.00000

Pearson Correlation Coefficients, N = 26
Prob > |r| under H0: Rho=0

                wt2        ht2        wt9        ht9        lg9        st9
    wt2     1.00000    0.09354    0.28546   -0.07875    0.08547   -0.07209
                        0.6495     0.1575     0.7022     0.6781     0.7264
    ht2     0.09354    1.00000    0.49768    0.57922    0.38230    0.58066
             0.6495                0.0097     0.0019     0.0539     0.0019
    wt9     0.28546    0.49768    1.00000    0.53122    0.77583    0.28446
             0.1575     0.0097                0.0052     <.0001     0.1590
    ht9    -0.07875    0.57922    0.53122    1.00000    0.62049    0.90553
             0.7022     0.0019     0.0052                0.0007     <.0001
    lg9     0.08547    0.38230    0.77583    0.62049    1.00000    0.35332
             0.6781     0.0539     <.0001     0.0007                0.0766
    ...

The REG Procedure
Model: MODEL1
Dependent Variable: soma

Analysis of Variance

                              Sum of         Mean
    Source             DF    Squares       Square   F Value   Pr > F
    Model               5 2295.78296    459.15659      0.67   0.6491
    Error              20      13659    682.92816
    Corrected Total    25      15954

    Root MSE           26.13289    R-Square    0.1439
    Dependent Mean    210.42308    Adj R-Sq   -0.0701
    Coeff Var          12.41921

Parameter Estimates

                      Parameter     Standard
    Variable    DF     Estimate        Error   t Value   Pr > |t|
    Intercept    1     51.66778    293.72149      0.18     0.8621
    ht2          1      0.42206      4.47469      0.09     0.9258
    wt2          1     -0.22891      0.70601     -0.32     0.7491
    ht9          1      0.29498      4.16397      0.07     0.9442
    wt9          1      1.20330      3.00337      0.40     0.6929
    st9          1      3.13973      8.73447      0.36     0.7230

The REG Procedure
Model: MODEL1
Dependent Variable: soma

Analysis of Variance

                              Sum of         Mean
    Source             DF    Squares       Square   F Value   Pr > F
    Model               3 2216.26058    738.75353      1.18   0.3390
    Error              22      13738    624.45844
    Corrected Total    25      15954

    Root MSE           24.98917    R-Square    0.1389
    Dependent Mean    210.42308    Adj R-Sq    0.0215
    Coeff Var          11.87568

Parameter Estimates

                      Parameter     Standard
    Variable    DF     Estimate        Error   t Value   Pr > |t|
    Intercept    1     33.23337    259.54880      0.13     0.8993
    ht9          1      0.68933      3.65283      0.19     0.8520
    wt9          1      0.90587      2.32088      0.39     0.7001
    st9          1      2.73651      7.41980      0.37     0.7158

• SAS procedures for describing binary, ordinal, and nominal data
  – proc freq
  – proc chart (or gchart)
  – proc tabulate
• SAS procedures for describing quantitative data
  – proc means
  – proc univariate
    ∗ most thorough description
    ∗ also does one-sample t-tests
  – proc tabulate

Example: AIDS Clinical Trials Group (ACTG) Protocol 320

• randomized, double-blind, placebo-controlled clinical trial
• eligibility criteria
  – HIV-infected adults
  – CD4 counts <= 200 and at least 3 months of prior zidovudine therapy
• two treatment groups
  – 3-drug regimen: indinavir, lamivudine, and either zidovudine or stavudine
  – 2-drug regimen: zidovudine and lamivudine
• 1156 patients randomized
• patients stratified according to their CD4 count at study entry
  – ≤ 50 cells/mm^3
  – 50-200 cells/mm^3
• primary endpoint: occurrence of an AIDS-defining event or death

CD4 and RNA data from ACTG 320

• blood specimens collected at study entry and at weeks 4, 8, 24, and 40 during follow-up for analysis of CD4 counts and viral load
• patients included in the present analysis
  – 198 patients who were randomly selected for a virology substudy
  – ACTG 320 dataset available for purchase from National Technical Information Service includes clinical endpoints and CD4 data for all patients but viral load data only on these 198.

The baseline data

• data collected one time on each patient at the time of entry into the study
• documentation that came with purchased data said this file had been written using the following SAS code

    data _null_;
      file "basedata.dat" lrecl=36;
      set a.basedata;
      put
        @1  pidnum
        @7  sex
        @9  raceth
        @11 ivdrug
        @13 hemophil
        @15 karnof
        @19 avecd4
        @25 priorzdv
        @28 age;
    run;

Further documentation

    5) BASEDATA.SSD01
    **************

    #   Variable   Type   Len   Pos   Label
    -----------------------------------------------------------------
    8   AGE        Num      8    56   Age (years)
    6   AVECD4     Num      8    40   Baseline CD4 Count
    4   HEMOPHIL   Num      8    24   Hemophiliac? (1=yes, 0=no)
    3   IVDRUG     Num      8    16   IV drug history
    5   KARNOF     Num      8    32   Karnofsky score
    9   PIDNUM     Num      8    64
    7   PRIORZDV   Num      8    48   Months prior ZDV
    2   RACETH     Num      8     8   Race/ethnicity
    1   SEX        Num      8     0   Sex (1=male, 2=female)

    HEMOPHIL   Hemophiliac? (1=Yes, 0=No)
    IVDRUG     IV drug history (1=Never, 2=Currently, 3=Previously)
    KARNOF     Karnofsky Performance Scale
               coding: 100 = Normal; no complaint; no evidence of disease
                        90 = Normal activity possible; minor signs/symptoms of disease
                        80 = Normal activity with effort; some signs/symptoms of disease
                        70 = Cares for self; normal activity/active work not possible
    RACETH     Race/ethnicity
               coding: 1=White Non-Hispanic
                       2=Black Non-Hispanic
                       3=Hispanic (Regardless of Race)
                       4=Asian, Pacific Islander
                       5=American Indian, Alaskan Native
                       6=Other/unknown
    SEX        Sex (1=Male, 2=Female)

• so I read it in and did descriptive statistics on each variable using following code

    options linesize = 72 ;

    proc format ;
      value sexfmt 1 = 'M' 2 = 'F' ;
      value racefmt 1 = 'W' 2 = 'B' 3 = 'H' 4 = 'A' 5 = 'NA' 6 = 'O' ;
      value yesfmt 1 = 'Y' 0 =
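
As an illustration of the checks listed under "Data checking and screening" (invalid categories, missing values, range screens), here is a minimal sketch in SAS. It is not the lecture's own code. The dataset toy and every value in it are invented for illustration, the age limits of 18 and 100 are an assumed validity rule, and only the codes for sex and karnof and the sexfmt format are taken from the notes above.

```sas
/* Hypothetical toy data: all values are invented for illustration */
data toy ;
  input id sex karnof age ;
  datalines ;
1 1 100 34
2 2 90 41
3 1 85 29
4 3 70 .
5 2 80 230
;
run ;

/* Same format as in the notes: 1 = 'M', 2 = 'F' */
proc format ;
  value sexfmt 1 = 'M' 2 = 'F' ;
run ;

/* Frequency check for categorical variables. The MISSING option
   includes missing values as a category. A code with no label in
   sexfmt (here 3) is printed unformatted, so it stands out. */
proc freq data = toy ;
  tables sex karnof / missing ;
  format sex sexfmt. ;
run ;

/* Range screen for a continuous variable */
proc means data = toy n nmiss min max ;
  var age ;
run ;

/* Keep only records that fail a check. The age limits are an
   assumption made for this sketch, not a rule from the study. */
data suspect ;
  set toy ;
  if sex not in (1, 2)
     or karnof not in (70, 80, 90, 100)
     or missing(age) or age < 18 or age > 100 ;
run ;

proc print data = suspect ;
run ;
```

With this toy data, records 3 (karnof of 85), 4 (sex of 3 and missing age), and 5 (age of 230) end up in suspect. For each of them the question is the one posed earlier: does this make sense?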

