Homework 4 Solutions90-776 Manipulation of Large Data SetsHomework 4 Solutions1) Program/* u:\class\907776\class\HW4P1.SAS does the tasks in Homework 4, problem 1*//* In lab 4, I took the data set l:\academic\90776\data\text\country.txt, created an improved comma delimited ASCII file, then brought it into SASand saved the file as u:\class\90776\data\lab4.sd2. This program cleans upthat data set *//* Rob Greenbaum*//* 4/6/1999*/options pageno=1 ps=150 ls=150;/* create library reference for my SAS data sets */libname class 'l:\academic\90776\data';libname mydat 'u:\class\90776\data\';/* 1a. First, find the means (with proc means and univariat) of the numeric variables */PROC MEANS DATA = MYDAT.LAB4;/* I can see which observations are the outliers by including an ID subcommand in my univariate procedure*/PROC UNIVARIATE DATA=MYDAT.LAB4; VAR YR_IND; ID COUNTRY;/*1b. Let’s set negative yr_ind observations to missing */DATA FIXYR; SET MYDAT.LAB4; IF YR_IND <0 THEN YR_IND=.;/* Let’s see how the mean has changed */PROC MEANS DATA=C4;RUN;/*1c. Change the missing values for the character variables to blanks */DATA FIXCHAR; SET FIXYR;ARRAY ROB[*] _CHARACTER_; DO I=1 TO DIM(ROB); IF ROB[I]="?" OR ROB[I]="-" THEN ROB[I] = " ";END;DROP I;RUN;/* 1d. Print out the data set */PROC PRINT DATA=FIXCHAR;RUN;/* 1e. Check religion */PROC FREQ DATA=FIXCHAR; TABLES RELIGION;RUN;/* For one of the observations, Lutheran was misspelled as Lutherian. That is easy to fix */DATA FIXSPELL;SET FIXCHAR; IF RELIGION = “Lutherian” THEN RELIGION = “Lutheran”;/* MAKE SURE WE FIXED THE PROBLEM */PROC FREQ DATA=FIXSPELL; TABLES RELIGION;RUN;/* 1f. let’s create an ID variable */DATA NEWVAR; SET FIXSPELL;LENGTH CSMALL $2; /* need to reset the length of the new variable*/CSMALL = SUBSTR(COUNTRY,1,2); /* CSMALL is the first 2 chars of country name*/ENDYR = MOD(YR_UN,100); *Get the last 2 digits of the year they entered the UN;YR = PUT(ENDYR, 2.); /* convert ENDYR to CHARACTER*/IF YR = “ .” THEN YR=”00”; /*convert missings to 00*/ID = CSMALL||YR; /* put csmall and yr together*/RUN;/*1g. print out the data */PROC PRINT DATA=NEWVAR ; VAR COUNTRY ID;RUN;1) OutputThe SAS System 16:05 Tuesday, April 6, 1999 1 Variable N Mean Std Dev Minimum Maximum -------------------------------------------------------------------- TOTALPOP 93 55462086.65 151888397 27816.00 1149667000 CAPPOP 91 2196361.60 2379700.09 134393.00 10726000.00 YR_IND 92 1796.02 395.9593435 -1000.00 1980.00 YR_UN 89 1954.08 10.6602754 1945.00 1991.00 --------------------------------------------------------------------1)a. There is at least one observation with a negative value for YR_IND. The SAS System 16:05 Tuesday, April 6, 1999 2 Univariate ProcedureVariable=YR_IND Moments Quantiles(Def=5) Extremes N 92 Sum Wgts 92 100% Max 1980 99% 1980 Lowest ID Highest ID Mean 1796.022 Sum 165234 75% Q3 1960 95% 1964 -1000(Ethiopia) 1964(Zambia ) Std Dev 395.9593 Variance 156783.8 50% Med 1922.5 90% 1962 800(Denmark ) 1971(Banglade) Skewness -4.69388 Kurtosis 27.93015 25% Q1 1821.5 10% 1499 836(Sweden ) 1975(Angola ) USS 3.1103E8 CSS 14267326 0% Min -1000 5% 900 843(France ) 1975(Mozambiq) CV 22.04647 Std Mean 41.28162 1% -1000 900(USSR ) 1980(Zimbabwe) T:Mean=0 43.50658 Pr>|T| 0.0001 Range 2980 Num ^= 0 92 Num > 0 91 Q3-Q1 138.5 M(Sign) 45 Pr>=|M| 0.0001 Mode 1960 Sgn Rank 2134 Pr>=|S| 0.0001 Missing Value . Count 1 % Count/Nobs 1.08 The SAS System 16:05 Tuesday, April 6, 1999 3 Variable N Mean Std Dev Minimum Maximum -------------------------------------------------------------------- TOTALPOP 93 55462086.65 151888397 27816.00 1149667000 CAPPOP 91 2196361.60 2379700.09 134393.00 10726000.00 YR_IND 91 1826.75 265.9089064 800.0000000 1980.00 YR_UN 89 1954.08 10.6602754 1945.00 1991.00 --------------------------------------------------------------------1)b. The mean YR_IND has now increased to 1827 – before it was 1796. Removing the one outlier made a large difference. The SAS System 16:05 Tuesday, April 6, 1999 4 OBS COUNTRY TOTALPOP CAPITAL CAPPOP REGION YR_IND YR_UN RELIGION LANG1 LANG2 LANG3 1 Afghanistan 16922000 Kabul 1424400 Asia 1919 1946 Islam Pashto Persian2 Algeria 25888000 Algiers 1507241 Africa 1962 1962 Islam Arabic 3 Angola 10284000 Luanda 1134000 Africa 1975 1976 Portuguese 4 Argentina 32470000 Buenos A 2922829 Latin America 1816 1945 Catholicism Spanish 5 Australia 17337000 Canberra 310000 Oceania 1901 1945 English 6 Austria 7815000 Vienna 1487577 Europe 1918 1955 German 7 Bangladesh 108760000 Dhaka 5731000 Asia 1971 1974 Islam Bengali 8 Belgium 9978000 Brussels 137966 Europe 1830 1945 Dutch French German 9 Bolivia 7528000 La Paz 669400 Latin America 1825 1945 Catholicism Spanish Aymara Quechua 10 Brazil 153322000 Brasilia 1567709 Latin America 1822 1945 Portuguese 11 Bulgaria
View Full Document