Examining Distributions - Introduction
Chapter 1

Variables
- A variable records characteristics of individuals/cases (i.e., objects of interest) in its values.
- A variable's distribution describes the counts or relative proportions of its values.

Examining Distributions - Describing Distributions with Graphs
Section 1.1

Some graphical statistics
- Bar graphs and pie charts describe the distribution of a categorical variable.
- A Pareto chart is a bar graph with categories ordered by decreasing frequency.
- Histograms are essentially bar graphs of a quantitative variable.
- Stemplots are back-of-the-envelope histograms drawn with the digits of quantitative values.
- Time plots graph time series values by time.

Histograms
Use equal bar widths and "eyeball" for the best picture.
[Figure: histogram of December 2004 state unemployment rates.]

Interpreting histograms
Look for shape, center, and spread.
Too much detail: visualize a smooth curve highlighting the overall pattern.

Distribution shapes
[Figure panels: symmetric distribution; right-skewed distribution; complex, multimodal distribution.]

Interpreting histograms
Look for deviations, like outliers.
[Figure: histogram with outliers labeled Alaska and Florida.]

Stemplot
[Figure: stemplot (stem, leaves) and split-stem stemplot of December 2004 state unemployment rates.]

Examining Distributions - Describing Distributions with Numbers
Section 1.2

Measure of center: the mean
x̄ = (x1 + x2 + ... + xn) / n
Heights (in.) of 25 women:
58.2 59.5 60.7 60.9 61.9 61.9 62.2 62.2 62.4 62.9 63.9 63.1 63.9
64.0 64.5 64.1 64.8 65.2 65.7 66.2 66.7 67.1 67.8 68.9 69.6

Measure of center: the median
Step 1: Sort x1, ..., xn.
Step 2.a: If n is odd, M = middle value.
Step 2.b: If n is even, M = average of the two middle values.
Sorted data (n = 25):
0.6 1.2 1.5 1.6 1.9 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4
3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 5.6 6.1
With all 25 values (n odd): M = 3.4, the 13th value.
With the first 24 values (n even): M = (3.3 + 3.4) / 2 = 3.35.

Comparisons
[Figure panels: symmetry; left skew; right skew.]
Observe:
- The mean is "pulled" by outliers.
- The median is resistant to outliers.

Measure of spread: the quartiles
- The first quartile, Q1, is the median of values below M.
- The third quartile, Q3, is the median of values above M.
For the 25 values above: Q1 = 2.2, M = 3.4, Q3 = 4.35.

Five-number summary and boxplot
Min = 0.6, Q1 = 2.2, M = 3.4, Q3 = 4.35, Max = 6.1
[Figure: boxplot of years until death for Disease X.]

Measure of spread: the standard deviation
s = sqrt( ((x1 - x̄)² + ... + (xn - x̄)²) / (n - 1) ), where x̄ is the mean.
Note: Calculate by computer.
Data: heights (in.) of the 25 women listed above.

Summarizing distributions
- Five-number summary (Min, Q1, M, Q3, Max): resistant.
- Error bars: not resistant.

Examining Distributions - The Normal Distributions
Section 1.3

Density curves
A density curve is a mathematical idealization of a histogram.
[Figure: actual histogram next to its idealized density curve.]
"Area under the curve" ≈ proportion of observations.

Other idealizations
[Figure: histogram and density curve.]
- The median halves the "area under the curve".
- The mean is the balance point.

Examples
[Figure panels: density curves that have easy mathematical formulas; a density curve with no easy formula.]

Normal distributions
"Exponential" function. The normal curves:
y = (1 / (σ √(2π))) · exp( -(1/2) ((x - μ) / σ)² )
Properties:
- Symmetric, single-peaked, and bell-shaped.
- Indexed by μ and σ, denoted N(μ, σ).
- μ ± σ mark inflection points.

Impact of μ and σ
[Figure panels: same μ, different σ; different μ, same σ.]

The 68-95-99.7 Rule
If x is N(μ, σ):
- 68% of obs. within μ ± σ
- 95% of obs. within μ ± 2σ
- 99.7% of obs. within μ ± 3σ

Standardization
A z-score measures the location of x from μ in units of σ: z = (x - μ) / σ.
Key property: If x is N(μ, σ) then z is N(0, 1).
Benefit: To calculate an "area under the curve" for N(μ, σ), translate to a z-score and use N(0, 1), the "Standard Normal" distribution.

Example calculation: heights
Problem: Heights, x, is N(64.5, 2.5). For what proportion of individuals is x < 67?
Solution:
Ask: How far is c = 67 from μ = 64.5 in units of σ = 2.5?
(c - μ) / σ = (67 - 64.5) / 2.5 = 1
Translate: z = (x - μ) / σ is N(0, 1). For what proportion of individuals is z < 1?
Calculate: normsdist(1) = 0.84

Example calculation: heights (cont.)
68-95-99.7 rule: Proportion with -1 < z < 1 is 0.68.
Equally divide the remaining 0.32 between z < -1 and z > 1 (0.16 each).
Proportion with z < 1 is 0.16 + 0.68 = 0.84.

Calculation of "area between"
Problem: Proportion with c1 < z < c2.
Solution: (prop. with z < c2) - (prop. with z < c1)
Example: Proportion with 1.4 < z < 2.2.
normsdist(2.2) - normsdist(1.4) = 0.9861 - 0.9192 = 0.0669

Backward calculations
Problem: For what c is p the proportion with z < c?
Solution: c = normsinv(p)
Examples:
normsinv(0.84) ≈ 1
normsinv(0.16) ≈ -1

Example calculation: mpg
Problem: MPG, x, of compact cars is N(24.7, 5.88). For what c do 10% of compact cars have x > c?
Solution: First, normsinv(0.90) = 1.28.
Translate: z = (x - μ) / σ is N(0, 1), so 10% of compact cars have z > 1.28 = (c - μ) / σ.
Solve: 1.28 = (c - 24.7) / 5.88 ⇒ c = 24.7 + (1.28)(5.88) = 32.2

Examining Relationships - Scatterplots
Section 2.1

Examining relationships
Often, individuals are measured in more than one variable.
Follow the same approach as before:
- Plot data and calculate numerical summaries.
- Look for overall patterns and deviations.
- Consider suitability of mathematical models (later).

Examining relationships
Additional considerations:
- Do some variables tend to vary together?
- Do some variables explain variability in
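
The numerical summaries of Section 1.2 can be reproduced with a short Python sketch. It is an illustration only, assuming Python's standard statistics module. The helper name five_number_summary and the outlier value 61.0 are hypothetical, and the quartiles follow the slides' rule (median of the values below and above M), which may differ from a library's default quantile method.

```python
from statistics import mean, median, stdev

# Years until death (Disease X): the 25 values from the median slide.
years = [0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
         3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1]

# Heights (in.) of 25 women: the values from the mean slide.
heights = [58.2, 59.5, 60.7, 60.9, 61.9, 61.9, 62.2, 62.2, 62.4, 62.9,
           63.9, 63.1, 63.9, 64.0, 64.5, 64.1, 64.8, 65.2, 65.7, 66.2,
           66.7, 67.1, 67.8, 68.9, 69.6]


def five_number_summary(values):
    """Return (Min, Q1, M, Q3, Max) using the rule from the slides."""
    xs = sorted(values)                      # Step 1: sort
    n = len(xs)
    half = n // 2
    lower = xs[:half]                        # values below M
    # For odd n the middle value itself is excluded from both halves.
    upper = xs[half + 1:] if n % 2 else xs[half:]
    return xs[0], median(lower), median(xs), median(upper), xs[-1]


summary = five_number_summary(years)
print([round(v, 2) for v in summary])        # [0.6, 2.2, 3.4, 4.35, 6.1]

# Mean and standard deviation (n - 1 divisor) of the heights.
print(round(mean(heights), 1))               # 63.9
print(round(stdev(heights), 2))

# Resistance: swap the largest value for a hypothetical outlier.
with_outlier = years[:-1] + [61.0]
print(median(with_outlier) == median(years)) # True: the median does not move
print(mean(with_outlier) > mean(years))      # True: the mean is pulled upward
```

The last two lines illustrate the "Comparisons" slide: replacing one extreme value changes the mean but leaves the median where it was.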
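
The normal-distribution calculations of Section 1.3 can be sketched the same way. This assumes that statistics.NormalDist from Python's standard library stands in for the spreadsheet functions used on the slides: its cdf method plays the role of normsdist and its inv_cdf method the role of normsinv. The helper names below are hypothetical.

```python
from statistics import NormalDist

Z = NormalDist(0, 1)  # the standard normal distribution N(0, 1)


def prop_below(c, mu, sigma):
    """Proportion with x < c when x is N(mu, sigma), via the z-score."""
    z = (c - mu) / sigma
    return Z.cdf(z)


def prop_between(c1, c2):
    """Proportion with c1 < z < c2 for a standard normal z."""
    return Z.cdf(c2) - Z.cdf(c1)


def cutoff_for_upper_tail(p_above, mu, sigma):
    """Value c such that a proportion p_above of N(mu, sigma) lies above c."""
    z = Z.inv_cdf(1 - p_above)   # backward calculation on N(0, 1)
    return mu + z * sigma        # undo the standardization


# Heights: x is N(64.5, 2.5); proportion with x < 67.
print(round(prop_below(67, 64.5, 2.5), 2))               # 0.84

# 68-95-99.7 rule: proportion within one standard deviation.
print(round(prop_between(-1, 1), 2))                     # 0.68

# Area between: proportion with 1.4 < z < 2.2.
print(round(prop_between(1.4, 2.2), 4))                  # 0.0669

# MPG: x is N(24.7, 5.88); cutoff exceeded by 10% of cars.
print(round(cutoff_for_upper_tail(0.10, 24.7, 5.88), 1)) # 32.2
```

Working through the z-score in both directions keeps every calculation on N(0, 1), which is the point of the standardization slide.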