STAT 2120: Notes on Topic 1

Introduction to Examining Distributions:
• A variable records characteristics of cases (i.e., objects of interest) in its values.
• Classify a variable by its possible values:
  o Categorical: records group labels; numeric labels mean nothing, except possible order.
  o Quantitative: records meaningful numbers; may be discrete or continuous.
• A time series is a record of values across time.
• A variable’s distribution describes the counts or relative proportions of its values.
• Exploratory data analysis seeks to describe distributions and relationships in data.

Displaying a distribution with graphs:
• Bar graphs and pie charts describe the distribution of a categorical variable.
  o Bar graphs emphasize counts; pie charts, proportions.
  o A Pareto chart is a bar graph with categories ordered by decreasing frequency.
• Histograms are essentially bar graphs of a quantitative variable.
  o Bar-widths are not absolute; use equal bar-widths and “eyeball” for best picture.
  o Look for overall pattern (shape, center, spread), deviations in shape, and “outlier” deviations.
  o A symmetric distribution is such that its histogram mirrors itself about its center.
  o A right- or left-skewed distribution shows a long tail to the right or left in its histogram.
• Stemplots are back-of-the-envelope histograms drawn with the digits of quantitative values.
  o “Stem” digits define bars; “leaf” digits display counts and sub-counts.
  o Customize by rounding digits and splitting stems.
• Time plots graph time series values by time.
  o Emphasize patterns of change over time, such as trends and seasonal variations.

Describing distributions with numbers:
• Denote by $x_1, \ldots, x_n$ the values of $n$ observations.
• The $p$th percentile is a number such that $p$ percent of values fall on or below it.
• Describe a distribution with numerical summaries of shape, center, and spread.
• A summary is resistant if it is insensitive to changes in skewness or extreme values.
• Measure of center: mean, $\bar{x}$
  o $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$, the arithmetic average.
  o $\bar{x}$ is not resistant.
• Measure of center: median, $M$
  o $M$ is the 50th percentile.
  o Calculate as the middle value or average of two middle values.
  o $M$ is resistant.
• Measure of spread: extreme values
  o Smallest and largest values.
  o Extreme values are not resistant.
• Measure of spread: quartiles, $Q_1$ and $Q_3$
  o $Q_1$ is the 25th percentile; $Q_3$ is the 75th percentile.
  o Calculate $Q_1$ and $Q_3$ as medians of values falling to the left or right of (but not on) $M$.
  o $Q_1$ and $Q_3$ are resistant.
• Measure of spread: standard deviation, $s$
  o $s = \sqrt{s^2}$, where $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$, a rescaled average of squared-deviations from $\bar{x}$.
  o $s^2$ is the variance; $n - 1$ is “degrees of freedom;” square-root to match units with $\bar{x}$.
  o Calculate by computer.
  o $s$ is not resistant.
• Measure of shape: mean-median comparisons:
  o $\bar{x} = M$ if the distribution is symmetric; $\bar{x} > M$ if right-skewed; $\bar{x} < M$ if left-skewed.
• Useful descriptions of a distribution:
  o Summarize center and spread, e.g., as $\bar{x}$ and $s$ (for symmetric, outlier-free distributions).
  o Display $\bar{x}$ and $s$ graphically as “error bars.”
  o Five-number summary: smallest extreme, $Q_1$, $M$, $Q_3$, largest extreme.
  o Display the five-number summary graphically as a box plot.

Normal distributions:
• A density curve is an idealization for describing patterns seen in histograms.
  o Denote by $X$ a variable representing an idealized observation.
  o “Area under the curve” in a range represents the proportion of observations in that range.
  o Total “area under the curve” is one.
  o The median is the point that divides “area under the curve” equally to the left and right.
  o Denote by $\mu$ and $\sigma$ idealizations of $\bar{x}$ and $s$ formulated on a density curve.
  o $\mu$ is the balance point.
• Normal distributions are described by the class of density curves called “Normal curves.”
  o Symmetric, single-peaked, and bell-shaped.
  o Indexed by $\mu$ and $\sigma$, denoted $N(\mu, \sigma)$.
  o $\mu \pm \sigma$ mark a Normal curve’s inflection points.
  o 68-95-99.7 rule: For observations having a Normal distribution: 68% fall within $\mu \pm \sigma$; 95% fall within $\mu \pm 2\sigma$; and 99.7% fall within $\mu \pm 3\sigma$.
  o The standard Normal distribution is $N(0, 1)$.
• Suppose $X$ has a distribution with mean $\mu$ and standard deviation $\sigma$. The z-score, or standardized value, of $X$ is $Z = (X - \mu)/\sigma$.
  o Measures “location from $\mu$ in units of $\sigma$.”
  o If $X$ is $N(\mu, \sigma)$, then $Z$ is $N(0, 1)$.
  o To calculate an “area under the curve” for $X$, translate to a z-score and use $N(0, 1)$.
• Calculations involving $N(\mu, \sigma)$ might be forward (What proportion has $X \le x$?) or backward (For what $x$ is the proportion of $X \le x$ equal to $p$?).
• A Normal quantile plot is a graph of percentiles of $x_1, \ldots, x_n$ plotted against those of $N(0, 1)$.
  o Plots on a straight line indicate a Normal distribution.
  o Calculate by computer.

Introduction to Examining Relationships:
• Approach: plot data, calculate summaries; look for patterns and deviations; consider idealizations.
• An explanatory (or independent) variable explains variability in the response (or dependent) variable.
• Scatterplots: graph two quantitative variables measured on the same set of individuals.
  o Look for overall pattern, general deviations, and “outlier” deviations.
  o Scatterplots are sometimes “smoothed” using algorithms that fit curves to the data.
  o A transformation (e.g., the log transformation) is sometimes applied to skewed data.
  o A scatterplot can be extended by adding categorical variables, color- or symbol-coded.
• The overall pattern of a relationship:
  o The form of a relationship may involve linear patterns, clusters, or lack of any pattern.
  o The direction may be positive or negative.
  o A stronger relationship is observed as points falling more closely to a clear form.

Correlation:
• Measure of direction and strength: correlation, $r$
  o $r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$, the rescaled average of the product of standardized deviations from $\bar{x}$ and $\bar{y}$.
  o Calculate by
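
The numerical summaries and the correlation formula in these notes can be sketched in a few lines of Python. This is only an illustration and not part of the original notes: the function names and the sample data are hypothetical, the quartiles follow the "medians of the values to the left or right of (but not on) the median" rule stated above, and z-scores are computed from the sample mean and standard deviation as stand-ins for the idealized mean and standard deviation.

```python
import math

def mean(values):
    """Arithmetic average: sum of the values divided by n."""
    return sum(values) / len(values)

def median(values):
    """Middle value, or the average of the two middle values."""
    v = sorted(values)
    n = len(v)
    mid = n // 2
    return v[mid] if n % 2 == 1 else (v[mid - 1] + v[mid]) / 2

def quartiles(values):
    """Q1 and Q3 as medians of the values left and right of (not on) the median."""
    v = sorted(values)
    half = len(v) // 2
    lower = v[:half]
    # For odd n the middle observation sits on the median and is excluded.
    upper = v[half + 1:] if len(v) % 2 == 1 else v[half:]
    return median(lower), median(upper)

def std_dev(values):
    """Square root of the variance, which divides by n - 1 degrees of freedom."""
    xbar = mean(values)
    variance = sum((x - xbar) ** 2 for x in values) / (len(values) - 1)
    return math.sqrt(variance)

def five_number_summary(values):
    """Smallest extreme, Q1, median, Q3, largest extreme."""
    q1, q3 = quartiles(values)
    return min(values), q1, median(values), q3, max(values)

def z_scores(values):
    """Standardized values, using the sample mean and standard deviation."""
    xbar, s = mean(values), std_dev(values)
    return [(x - xbar) / s for x in values]

def correlation(xs, ys):
    """Sum of products of standardized deviations, divided by n - 1."""
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)

# Hypothetical data, chosen only to illustrate the definitions.
data = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 50]

print(five_number_summary(data))             # (1, 1.5, 3, 4.5, 5)
print(mean(data), mean(with_outlier))        # 3.0 12.0 (mean is not resistant)
print(median(data), median(with_outlier))    # 3 3 (median is resistant)
print(correlation(data, [2, 4, 5, 4, 5]))
```

Changing one value from 5 to 50 moves the mean from 3 to 12 while the median stays at 3, which is the sense in which the notes call the median resistant and the mean not.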