PROBABILITY AND STATISTICS IN COMPUTER SCIENCE AND SOFTWARE ENGINEERING
Chapter 8: Introduction to Statistics

OVERVIEW
- We have seen how to compute the mean, median, and quartiles/percentiles/quantiles for populations and samples.
- We now explore other statistics that are often used: the variance and the standard deviation.
- Recall that these values provide a measure of how much a distribution can "vary."
- We then see how these concepts apply to populations and samples.
- We also define the interquartile range, and see how it helps us detect outliers.

RECALL
- For a random variable X, we had an expectation (or mean) μ = E(X) and a variance σ² = Var(X), defined by Var(X) = E[(X − μ)²].
- If we took a sample of observations X₁, …, Xₙ, we could estimate the population mean μ with the sample mean X̄ = (1/n) Σᵢ Xᵢ.
- This estimator is unbiased (E(X̄) = μ), consistent, and asymptotically normal.
- We also saw (page 213) that Var(X̄) = σ²/n, and hence Std(X̄) = σ/√n.

VARIANCE
- Suppose we have a sample X₁, …, Xₙ. The sample variance is defined by the formula
      s² = (1/(n − 1)) Σᵢ (Xᵢ − X̄)².
- This measures the variability among the observations and estimates the population variance σ².
- The sample standard deviation is the square root of the sample variance, s = √s². It measures variability in the same units as X and estimates the population standard deviation σ.
- Population and sample variance are in squared units.

VARIANCE
- A computationally simpler formula is
      s² = (1/(n − 1)) (Σᵢ Xᵢ² − n X̄²).
- The book (page 220) shows how this is equivalent to the other formulation.
- See Example 8.16 (same page) for a demonstration of how to compute this statistic.
- The book also shows why the (n − 1) term in both formulations is useful: it ensures that the sample variance is unbiased.
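As a quick sanity check on the two variance formulas, here is a small Python sketch (the data values are made up for illustration) that evaluates both forms and the sample standard deviation:

```python
import math

def sample_variance(xs):
    """Definition form: s^2 = sum((x - xbar)^2) / (n - 1)."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

def sample_variance_computational(xs):
    """Equivalent computational form: (sum(x^2) - n * xbar^2) / (n - 1)."""
    n = len(xs)
    xbar = sum(xs) / n
    return (sum(x * x for x in xs) - n * xbar ** 2) / (n - 1)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative sample
s2_def = sample_variance(data)
s2_comp = sample_variance_computational(data)
s = math.sqrt(s2_def)  # sample standard deviation, same units as the data
print(s2_def, s2_comp, s)
```

Both forms give the same answer; the computational form is "simpler" because it needs only the running sums of Xᵢ and Xᵢ², i.e. a single pass over the data. (One caveat worth knowing: in floating-point arithmetic the one-pass form can lose precision when the mean is large relative to the spread of the data.)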
- See the derivation on pages 220–221.
- It can be shown that, under certain assumptions, the sample variance and standard deviation are also consistent and asymptotically normal.

STANDARD ERRORS OF ESTIMATES
- We can use the concepts of variance and standard deviation in another way: to measure the precision and reliability of estimators.
- Basically, we can approximate the variance of an estimator; this helps us judge, for example, how far the estimator is likely to fall from the parameter it estimates.
- Suppose θ̂ is an estimator of some population parameter θ. We define the standard error of the estimator to be its standard deviation, σ(θ̂) = Std(θ̂).
- Given a set of estimates, we can of course compute the standard deviation of this sample, s(θ̂). These standard errors show how much estimators of the same parameter may vary when computed from different samples.

STANDARD ERRORS OF ESTIMATES
- Consider the diagrams at the top of page 222: we can think of these as a series of estimates of some population parameter θ, perhaps computed from multiple samples.
- The diagrams show estimators that are biased and unbiased, and that have low and high standard error.
- "Biased" means the estimator tends to fall to one side or the other of the true value.
- The standard error of an estimator measures how likely it is that the estimator will be close to the actual parameter value; it measures how "spread out" the distribution of the estimator is.
- Obviously, we would like our estimator to be unbiased and to have a low standard error; see Example 8.17 on page 221.

INTERQUARTILE RANGES
- Generally, we would like some mechanism for identifying outliers: sample observations that fall outside the "normal range" and thus may dramatically affect our computations of the sample mean and variance.
- One approach: consider interquartile ranges.
- We define the interquartile range as the distance between the first and third quartiles, i.e.
      IQR = Q₃ − Q₁.
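The IQR just defined can be estimated directly from a sample. A minimal sketch using Python's standard library (the dataset is made up; note that several quartile conventions exist, so `statistics.quantiles` may differ slightly from the textbook's definition):

```python
import statistics

def interquartile_range(xs):
    """Estimate IQR = Q3 - Q1 from sample quartiles.

    Uses statistics.quantiles with the 'inclusive' method; other
    quartile conventions give slightly different values.
    """
    q1, _q2, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return q3 - q1

data = [1, 3, 4, 6, 7, 8, 10, 12]  # illustrative sample
print(interquartile_range(data))
```

Because the IQR depends only on the middle half of the data, it is far less sensitive to extreme observations than the sample variance is, which is exactly what makes it useful for outlier detection.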
- This is another measure of the variability of the data, and it can be estimated by the sample quartiles: IQR ≈ Q̂₃ − Q̂₁.

OUTLIERS
- How do we identify outliers?
- One general rule is the 1.5(IQR) rule.
- For normal distributions, 99.3% of the distribution lies between Q₁ − 1.5(IQR) and Q₃ + 1.5(IQR). Check this for a standard normal distribution by looking at Table A4.
- The idea is to identify any observation that lies outside this range as an outlier.
- Check the example on page
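The 1.5(IQR) rule is easy to apply in code. The sketch below (made-up data; sample quartiles computed with `statistics.quantiles`, whose convention may differ slightly from the textbook's) flags observations outside the fences, and also verifies the 99.3% figure for a standard normal distribution in place of Table A4:

```python
import statistics
from statistics import NormalDist

def detect_outliers(xs):
    """Flag observations outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _q2, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in xs if x < lo or x > hi]

data = [5, 6, 6, 7, 7, 8, 8, 9, 30]  # 30 is suspiciously large
print(detect_outliers(data))  # flags only the extreme observation

# Check the 99.3% claim: for a standard normal, how much probability
# lies between Q1 - 1.5*IQR and Q3 + 1.5*IQR?
nd = NormalDist()                      # standard normal, mu=0, sigma=1
q1, q3 = nd.inv_cdf(0.25), nd.inv_cdf(0.75)
iqr = q3 - q1
coverage = nd.cdf(q3 + 1.5 * iqr) - nd.cdf(q1 - 1.5 * iqr)
print(round(coverage, 3))
```

The fences work out to roughly ±2.7 standard deviations for a normal distribution, so an observation flagged by this rule would be quite unusual if the data really were normal.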