Introduction to Descriptive Statistics
17.871

Types of Variables
  Nominal (qualitative), "categorical"
  ~Nominal (quantitative)
    Ordinal
    Interval or ratio

Describing data: key measures
            Moment                          Non-mean based measure
  Center    Mean                            Mode, median
  Spread    Variance (standard deviation)   Range, interquartile range
  Skew      Skewness                        --
  Peaked    Kurtosis                        --

Key distinction: Population vs. Sample Notation
  Population   Greeks   μ, σ, β
  Sample       Romans   s, b

Mean
  $\frac{\sum_{i=1}^{n} x_i}{n} \equiv \mu \equiv \bar{X}$

Variance, Standard Deviation
  $\sigma^2 \equiv \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}, \qquad \sigma \equiv \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}}$

Variance, S.D. of a Sample
  $s^2 \equiv \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n - 1}, \qquad s \equiv \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n - 1}}$
  The denominator $n - 1$ is the degrees of freedom.

Binary data
  $\bar{X} = \text{prob}(X = 1) = \text{proportion of time } x = 1$
  $s_x^2 = \bar{x}(1 - \bar{x}) \Rightarrow s_x = \sqrt{\bar{x}(1 - \bar{x})}$

Normal distribution example
  IQ
  SAT
  Height
  "No skew", "Zero skew"
  Symmetrical
  Mean = median = mode
  $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x - \mu)^2 / (2\sigma^2)}$
  [Figure: symmetrical distribution; y-axis Frequency, x-axis Value]

Skewness (asymmetrical distribution)
  GPA of MIT students
  "Negative skew", "Left skew"
  [Figure: left-skewed distribution; y-axis Frequency, x-axis Value]

Skewness (asymmetrical distribution)
  Income
  Contribution to candidates
  Populations of countries
  "Residual vote" rates
  "Positive skew", "Right skew"
  [Figure: right-skewed distribution; y-axis Frequency, x-axis Value]

Skewness
  [Figure: y-axis Frequency, x-axis Value]

Kurtosis
  k > 3   leptokurtic
  k = 3   mesokurtic
  k < 3   platykurtic
  [Figure: y-axis Frequency, x-axis Value]

Normal distribution
  Skewness = 0
  Kurtosis = 3
  $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x - \mu)^2 / (2\sigma^2)}$

More words about the normal curve

The z-score, or the "standardized score"
  $z = \frac{x - \bar{x}}{\sigma_x}$

Commands in STATA for getting univariate statistics
  summarize varname
  summarize varname, detail
  histogram varname, bin() start() width() density/fraction/frequency normal
  graph box varnames
  tabulate [NB: compare to table]

Example of Sophomore Test Scores
  High School and Beyond, 1980: A Longitudinal Survey of Students in the United States (ICPSR Study 7896)
  totalscore = % of questions answered correctly minus penalty for guessing
  recodedtype = (1 = public school, 2 = religious private, 3 = non-sectarian private)

Explore totalscore some more
  . table recodedtype,c(mean totalscore)

  --------------------------
  recodedty |
  pe        | mean(totals~e)
  ----------+---------------
          1 |       .3729735
          2 |       .4475548
          3 |        .589883
  --------------------------

Graph totalscore
  . hist totalscore
  [Figure: histogram; y-axis Density, x-axis totalscore]

Divide into "bins" so that each bar represents 1% correct
  . hist totalscore,width(.01)
  (bin=124, start=-.24209334, width=.01)
  [Figure: histogram; y-axis Density, x-axis totalscore]

Add ticks at each 10% mark
  . histogram totalscore, width(.01) xlabel(-.2 (.1) 1)
  (bin=124, start=-.24209334, width=.01)
  [Figure: histogram; y-axis Density, x-axis totalscore, ticks from -.2 to 1 in steps of .1]

Superimpose the normal curve (with the same mean and s.d. as the empirical distribution)
  . histogram totalscore, width(.01) xlabel(-.2 (.1) 1) normal
  (bin=124, start=-.24209334, width=.01)
  [Figure: histogram with normal curve; y-axis Density, x-axis totalscore]

Histograms by category
  . histogram totalscore, width(.01) xlabel(-.2 (.1)1) by(recodedtype)
  (bin=124, start=-.24209334, width=.01)
  [Figure: graphs by recodedtype; panels 1 = Public, 2 = Religious private, 3 = Nonsectarian private; y-axis Density, x-axis totalscore]

Main issues with histograms
  Proper level of aggregation
  Non-regular data categories

A note about histograms with unnatural categories
  From the Current Population Survey (2000), Voter and Registration Survey
  How long (have you/has name) lived at this address?
    -9  No Response
    -3  Refused
    -2  Don't know
    -1  Not in universe
     1  Less than 1 month
     2  1-6 months
     3  7-11 months
     4  1-2 years
     5  3-4 years
     6  5 years or longer

Solution, Step 1
  Map artificial category onto "natural" midpoint
    -9  No Response         →  missing
    -3  Refused             →  missing
    -2  Don't know          →  missing
    -1  Not in universe     →  missing
     1  Less than 1 month   →  1/24 = 0.042
     2  1-6 months          →  3.5/12 = 0.29
     3  7-11 months         →  9/12 = 0.75
     4  1-2 years           →  1.5
     5  3-4 years           →  3.5
     6  5 years or longer   →  10 (arbitrary)

Graph of recoded data
  . histogram longevity, fraction
  [Figure: histogram; y-axis Fraction (top label 0.557134), x-axis longevity, 0 to 10]

Density plot of data
  Total area of last bar = .557
  Width of bar = 11 (arbitrary)
  Solve for: a = w h (or) .557 = 11h => h = .051
  [Figure: density plot; x-axis longevity, 0 to 15]

Density plot template
  Category   Fraction   X-min   X-max   X-length   Height (density)
  < 1 mo.    .0156      0       1/12    .082       .19*
  1-6 mo.    .0909      1/12    ½       .417       .22
  7-11 mo.   .0430      ½       1       .500       .09
  1-2 yr.    .1529      1       2       1          .15
  3-4 yr.    .1404      2       4       2          .07
  5+ yr.     .5571      4       15      11         .05
  * = .0156/.082

Draw the previous graph with a box plot
  . graph box totalscore
  [Figure: box plot of totalscore, annotated with upper quartile, median, lower quartile, inter-quartile range, and 1.5 x IQR]

Draw the box plots for the different types of schools
  . graph box totalscore, by(recodedtype)
  [Figure: box plots of totalscore; graphs by recodedtype, panels 1, 2, 3]

Draw the box plots for the different types of schools using "over" option
  . graph box totalscore, over(recodedtype)
  [Figure: box plots of totalscore for categories 1, 2, 3 on one set of axes]

Three words about pie charts: don't use them
  So, what's wrong with them?
    For non-time series data, it is hard to get a comparison among groups; the eye is very bad at judging the relative size of circle slices
    For time series data, it is hard to grasp cross-time comparisons

Some words about graphical presentation
  Aspects of graphical integrity (following Edward Tufte, Visual Display of Quantitative Information)
    Represent number in direct proportion to numerical quantities presented
    Write clear labels on the graph
    Show data variation, not design variation
    Deflate and standardize money in time
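
A sketch of the recode in Stata
  The midpoint mapping under "Solution, Step 1" could be carried out roughly as below. This is only a sketch under stated assumptions: the variable name tenure_code is hypothetical (the slides do not name the raw CPS variable), and the handful of input values are made-up toy data so that the commands run on their own. The midpoints and the final histogram command are the ones given on the slides.

```stata
* Toy data standing in for the CPS item (values are invented for illustration;
* tenure_code is a hypothetical variable name, not the CPS name)
clear
input tenure_code
-9
-1
1
2
3
4
5
6
6
6
end

* Map each response code to the "natural" midpoint, in years.
* Non-substantive codes (-9, -3, -2, -1) become missing (.).
* The /// line continuation assumes this is run from a do-file.
recode tenure_code (-9 -3 -2 -1 = .) (1 = 0.042) (2 = 0.29) (3 = 0.75) ///
    (4 = 1.5) (5 = 3.5) (6 = 10), generate(longevity)

* Same command as on the slide: bar heights are fractions of observations
histogram longevity, fraction
```

  Writing the result to a new variable with generate() leaves the original codes untouched, so the mapping can be checked afterwards with tabulate tenure_code longevity, missing.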