Sept. 25, 2008 LEC #1 ECON 240A L. Phillips

Exploratory Data Analysis

I. Introduction

At the beginning of the course we will study three branches of statistics: (1) data analysis, (2) probability, and (3) statistical inference.

Data analysis is the gathering, display, and summary of data. We will use visual devices and quantitative measures to accomplish these tasks.

Probability has its origins in gambling and the laws of chance. This topic is interesting in its own right, but we will also use probability as a means to better understand the binomial distribution, the central limit theorem, and the relationship between the binomial distribution and the normal distribution.

II. Data Description

One use of statistics is to describe data with summary measures. Two central notions are central tendency and dispersion.

There are several measures of central tendency. An intuitive and relatively easy measure to use is the mode, i.e. the data value that is observed most frequently. One issue, of course, is that the data may have two or three modes, i.e. multiple peaks.

Another measure of central tendency is the median. The data can be sorted from the highest value to the lowest, and the data point in the middle is the median, with one half of the data values above it and one half below it.

A third measure of central tendency, requiring some arithmetic, is the sample mean of the data: add up all the data values and divide by the number of observations, or data points.

III. Exploratory Data Analysis

John Tukey developed exploratory data analysis to visually describe the characteristics of data.
Two visual tools useful for this purpose are the stem and leaf diagram and the box and whiskers plot.

An example of the methodology of the stem and leaf plot is its application to weight data (in pounds) from males and females at Penn State, taken from Larry Gonick & Woollcott Smith, The Cartoon Guide to Statistics (1993).

Males: 140 145 160 190 155 165 150 190 195 138 160 155 153 145 170 175 175 170 180 135 170 157 130 185 190 155 170 155 215 150 145 155 155 150 155 150 180 160 135 160 130 155 150 148 155 150 140 180 190 145 150 164 140 142 136 123 155

Females: 140 120 130 138 121 125 116 145 150 112 125 130 120 130 131 120 118 125 135 125 118 122 115 102 115 150 110 116 108 95 125 133 110 150 108

For this illustration, the data is pooled without regard to gender. The first step is to determine the range of the data: the minimum weight and the maximum weight are 95 and 215, respectively. The second step is to construct the stem, counting by tens from 9 for 90, 10 for 100, etc., out to 21 for 210.

------------------------------------------------------------
 9
10
11
12
13
14
15
16
17
18
19
20
21
Figure 1: Stem of the Stem and Leaf Diagram
------------------------------------------------------------

The third step is to construct the leaves. Each leaf is the last digit of a weight, placed after the stem formed by its leading digits. Start with the last digit of 95, the lowest weight, which is placed after 9 on the stem. There are three weights between 100 and 110: 102, 108, and 108, so the digits following 10 on the stem are 2, 8, 8. This is a leaf attached to the stem at 10. Continuing in this fashion:

------------------------------------------------------------
 9: 5
10: 2 8 8
11: 6 2 8 8 5 5 0 6 0
12: 3 0 1 5 5 0 0 5 5 2 5
13: 8 5 0 5 0 6 0 8 0 0 1 5 3
14: 0 5 5 5 8 0 5 0 2 0 5
15: 5 0 5 3 7 5 5 0 5 5 0 5 0 5 0 5 0 0 5 0 0 0
16: 0 5 0 0 0 4
17: 0 5 5 0 0 0
18: 0 5 0 0
19: 0 0 5 0 0
20:
21: 5
Figure 2: Preliminary Leaves in the Stem and Leaf Diagram
------------------------------------------------------------

The last step is to order the digits composing the leaves. This provides a visual description of the data, including the minimum, the maximum, the modes, and the median.

------------------------------------------------------------
 9: 5
10: 2 8 8
11: 0 0 2 5 5 6 6 8 8
12: 0 0 0 1 2 3 5 5 5 5 5
13: 0 0 0 0 0 1 3 5 5 5 6 8 8
14: 0 0 0 0 2 5 5 5 5 5 8
15: 0 0 0 0 0 0 0 0 0 0 3 5 5 5 5 5 5 5 5 5 5 7
16: 0 0 0 0 4 5
17: 0 0 0 0 5 5
18: 0 0 0 5
19: 0 0 0 0 5
20:
21: 5
Figure 3: Stem and Leaf Diagram
------------------------------------------------------------

Of course this back-of-the-envelope technology could be combined with using a computer to sort or order the data.

In all there are 92 observations or data points. So the median lies between the 46th and 47th observations, i.e. between 145 and 145, so the median is 145. Note the data is bimodal, with ten 150’s and ten 155’s. The students have a reporting bias, tending to round off to zeros and fives.

IV. Dispersion

One measure of dispersion is the interquartile range, IQR. Sort the data and put the points into four groups with equal numbers of observations. There will be two groups above the median and two groups below the median. If the median is a data point, add it to both the upper group and the lower group. In the case of the weight data, we had an even number of observations, and the median fell between two observations, the 46th and the 47th, which were both equal to 145. Next, find the median of the two high groups taken together, i.e. the third quartile, with 25 percent of the observations above it. Also find the median of the two low groups taken together, i.e. the first quartile, with 25 percent of the observations below it.
The difference between the median for the highs and the median for the lows is the interquartile range.

Having already done the work for the weight data by constructing the stem and leaf diagram, we can use it to determine the first quartile of 125 pounds, which lies between the 23rd observation of 125 pounds and the 24th observation of 125 pounds. The third quartile lies between the 23rd and 24th observations from the top, i.e. between 157 pounds and 155 pounds, so the third quartile is 156 pounds, and the interquartile range is 156 minus 125, or 31 pounds.

John Tukey’s box and whiskers plot displays the interquartile range as well as other features of the data such as outliers. The left edge of the box is the first quartile and the right edge of the box is the third quartile. The median is drawn as a line inside the box.
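
As a rough sketch of how a computer could do the sorting and counting described in these notes, the Python fragment below pools the weight data, builds the ordered stem and leaf diagram, and finds the median and quartiles by taking the median of each half of the sorted data. It is not part of the original lecture: the function names stem_and_leaf and quartiles are illustrative assumptions, and the quartile rule coded here is the "median of each half" rule from Section IV, which can differ slightly from the defaults in statistical packages.

```python
from statistics import median, multimode

# Weight data (pounds) as listed in the notes.
males = [
    140, 145, 160, 190, 155, 165, 150, 190, 195, 138, 160, 155, 153, 145, 170,
    175, 175, 170, 180, 135, 170, 157, 130, 185, 190, 155, 170, 155, 215, 150,
    145, 155, 155, 150, 155, 150, 180, 160, 135, 160, 130, 155, 150, 148, 155,
    150, 140, 180, 190, 145, 150, 164, 140, 142, 136, 123, 155,
]
females = [
    140, 120, 130, 138, 121, 125, 116, 145, 150, 112, 125, 130, 120, 130, 131,
    120, 118, 125, 135, 125, 118, 122, 115, 102, 115, 150, 110, 116, 108, 95,
    125, 133, 110, 150, 108,
]
weights = sorted(males + females)  # pooled without regard to gender


def stem_and_leaf(values):
    """Map each stem (value // 10) to its ordered leaves (value % 10).

    Hypothetical helper: stems with no observations (such as 20 here)
    are kept so that gaps in the data remain visible.
    """
    lo, hi = min(values) // 10, max(values) // 10
    stems = {stem: [] for stem in range(lo, hi + 1)}
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return stems


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule.

    If the number of observations is odd, the median is a data point
    and is included in both the lower and the upper half.
    """
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2
    return median(data[:half]), median(data), median(data[n - half:])


for stem, leaves in stem_and_leaf(weights).items():
    print(f"{stem:>2}: {' '.join(str(leaf) for leaf in leaves)}")

q1, med, q3 = quartiles(weights)
print("observations:", len(weights))
print("modes:", multimode(weights))
print("Q1, median, Q3:", q1, med, q3)
print("IQR:", q3 - q1)
```

Run on the pooled data, this should reproduce Figure 3 and the values worked out by hand above: 92 observations, a median of 145, quartiles of 125 and 156, and an interquartile range of 31 pounds.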