PROBABILITY AND STATISTICS IN COMPUTER SCIENCE AND SOFTWARE ENGINEERING
Chapter 8: Introduction to Statistics

OVERVIEW
- Looking at graphs of distributions is often a great way to get an initial idea about a distribution's parameters.
- Graphs may also help to identify outliers.
- They may also be used to look for correlation between variables.
- There are several types of data plots that are helpful.
- We will explore these in this lecture; note that most statistical packages can create one or more of these plots.

HISTOGRAM
- Histograms show the approximate shape of the pdf (or pmf) of the underlying population.
- Standard construction: create a series of “bins” to collect the data points, then draw columns to represent the number of data points in each bin.
- The overall shape (the height of each column represents the number of data points in its bin) gives a hint as to the distribution.
- A relative frequency histogram is a histogram where the height of each column represents the proportion of the collected sample that falls in that bin.
- In general, if the sample comes from a continuous distribution (e.g. time), we can think of each column in the relative frequency histogram as approximating the area under the pdf curve over its bin.
- Look at the example on page 225 and the curves on page 226: we can form hypotheses about the underlying distribution by looking at the shape of the histogram.
- We may also get an indication that the underlying distribution is actually a mixture of distributions; see the “two hump” distribution on page 226.
- Be careful when selecting bin “widths”: not too wide, not too narrow.

STEM AND LEAF
- These plots are similar to histograms but carry more information.
- We get a better view of how the data are distributed within the columns.
- This works well with integer-valued variables, but other variables can be scaled (translated) to work too.
- Look at the example on page 228: this is basically the histogram turned sideways, with the actual numbers added in.
- See also Example 8.19 on page 229.

BOXPLOT
- Boxplots show more information about the collected sample, including the minimum, maximum, median, and quartiles.
- A typical 5-point summary for a sample of data includes the min, max, median, and first and third quartiles.
- The mean may also be included with a special symbol, like a cross.
- Observations more than 1.5 interquartile ranges below the first quartile or above the third quartile are usually shown as separate dots, indicating possible outliers.
- An example is shown on page 230; note that the sample mean is included.
- Occasionally, we may break the data out into groups, for example grouping by day.
- See the example on page 231; this is also called a “candlestick” plot (stock prices).

SCATTER AND TIME PLOTS
- Scatter plots are used to plot multiple variables (usually two).
- These plots can show a relationship between variables (correlation).
- We will also see (in Chapter 11) how statistical methods can be used to draw inferences about missing data from these plots by finding trend lines.
- One specific example: when one of the variables is known (time), we can plot the data over time and look for trend lines.
- See the example on page
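
The histogram and boxplot constructions described in these slides can be sketched in a few lines of Python. This is only an illustrative sketch, not the textbook's procedure: the function names and the sample values are hypothetical, the bins are assumed to be equal-width, and the quartiles use the "median of each half" convention (statistical packages use several slightly different quartile definitions, so results may differ at the margins).

```python
from statistics import median


def relative_frequency_histogram(data, num_bins):
    """Return (bin_edges, relative_frequencies) for equal-width bins."""
    lo, hi = min(data), max(data)
    if hi == lo:
        raise ValueError("all observations are equal; bins would have zero width")
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for x in data:
        # The largest observation sits on the right-most edge; keep it in the last bin.
        index = min(int((x - lo) / width), num_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    # Column height = proportion of the sample in the bin, so the heights sum to 1.
    return edges, [c / len(data) for c in counts]


def five_point_summary(data):
    """Return (min, Q1, median, Q3, max); quartiles are medians of the two halves."""
    s = sorted(data)
    n = len(s)
    lower_half = s[: n // 2]          # excludes the middle value when n is odd
    upper_half = s[(n + 1) // 2:]
    return s[0], median(lower_half), median(s), median(upper_half), s[-1]


def possible_outliers(data):
    """Observations more than 1.5 IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_point_summary(data)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low_fence or x > high_fence]


if __name__ == "__main__":
    sample = [2, 3, 3, 4, 5, 5, 6, 7, 8, 30]  # hypothetical data, not from the textbook
    print(relative_frequency_histogram(sample, 4))
    print(five_point_summary(sample))
    print(possible_outliers(sample))
```

For the hypothetical sample, the value 30 lies above the upper fence and would be drawn as a separate dot on a boxplot. Changing `num_bins` shows the effect of bin width: with too few bins the shape is hidden, and with too many most columns hold zero or one observation.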