
1. Aggregation. Fundamentally, statistics is about learning from data. One of the most common mistakes in everyday life is over-extrapolating from a sample size of 1 or a too-small sample size, but how should data be combined? The most basic way to reduce data $y_1, y_2, \dots, y_n$ to a single-number summary is via the sample mean, $\bar{y} = \frac{y_1 + y_2 + \dots + y_n}{n}$. This is a special case of integration. How and whether to aggregate data often leads to subtle questions (e.g., Simpson's paradox can arise), and it may not be immediately clear which data are relevant to use for answering which questions. At the President's Invited Address at the 2014 Joint Statistical Meetings (which is by far the largest statistics conference in the US), Stephen Stigler introduced Stigler's seven pillars of statistical wisdom, in which aggregation was the first pillar.

2. Probability. In probability, going from the PDF (probability density function) of a continuous random variable $X$ to the probability that $X$ is between $a$ and $b$ requires an integral: we integrate the PDF from $a$ to $b$. Likewise, if $X$ is a discrete random variable, we do a sum.

3. Averaging. This overlaps with aggregation, but here I specifically want to talk about $E(X)$, the expected value of a random variable $X$. This is the most common notion of the average of a random variable (though there are others too, such as the median). It generalizes the sample mean (e.g., it can be a weighted average), but again is a special case of integration. Again and again in statistics, we want to study the average value of a quantity, or see how an average evolves across time or space, or study how the sample mean of a data set relates to the expected value of the population from which the data were drawn (e.g., this is done in the law of large numbers and the central limit theorem).

4. Summarizing the shape of a distribution. Probability problems are often content to find the expected value $E(X)$, but statistics problems tend to emphasize quantifying the uncertainty and variability of estimates. The most common way to do that is with the standard deviation, which is a measure of the spread of a distribution. At first that might seem to require completely different mathematical concepts, but in fact it is defined by $SD(X) = \sqrt{E\left[(X - E(X))^2\right]}$. So by computing two expected values, we can get the standard deviation. Going beyond expected value and standard deviation, two of the most common descriptions of distributions are the skewness and kurtosis, both of which are also defined in terms of expected value (and thus in terms of integration).

5. Loss functions. The expected value $E(X)$ is the most popular summary of $X$, but is it the best in some sense? Why not use the median or some other summary? Optimality questions can be made more precise by defining a loss function and then trying to minimize the expected loss. For example, if the loss associated with guessing that $X$ will equal $c$ is $(X - c)^2$, then choosing $c = E(X)$ minimizes the expected loss. In practice, this loss function (known as mean squared error) is very widely used. If instead the loss is $|X - c|$, then choosing $c$ to be a median of the distribution of $X$ minimizes the expected loss. So understanding medians requires understanding expectation. Furthermore, in computing the median for a continuous distribution, we look for the point where the area under the PDF curve to the left of the point is $1/2$, which again involves the notion of integration.

6. Prediction. One of the most important tasks in statistics is to predict future observations. Most commonly, this is done with a regression model. If we want to predict $Y$ based on some observed variable $X$, a main approach is to compute the conditional expectation $E(Y \mid X)$, which is, in the mean squared error sense, the best prediction of $Y$ as a function of $X$. Many different models and criteria are used in practice for prediction. In linear regression, the most common criterion is ordinary least squares (OLS), which is based on minimizing the sum of squares of the differences between predicted and actual values of the variable we're predicting (again, this is integration). An example of a fancier form of regression that is very popular these days is Lasso, which involves both the sum of squared differences and the sum of the absolute values of the estimated coefficients.

7. Bias-variance tradeoff. A principle that is ubiquitous in statistics is the bias-variance tradeoff. It is fundamental in evaluating estimation techniques, and in choosing a model to capture the data reasonably well but avoid overfitting. See "What is an intuitive explanation for bias-variance tradeoff?" and "What is the best way to explain the bias-variance trade-off in layman's terms?" The bias of an estimator for an unknown parameter is the expected difference of the estimator and the parameter, and the variance is the square of the standard deviation. Both bias and variance are defined in terms of expectation, and hence rely on integration.

From https://www.quora.com

Stats 101: What You Need To Know About Statistics - eLearning Industry

A statistic is a quantity calculated from a set of data. Useful statistics help describe the characteristics of a data set. This lesson will introduce three basic statistics: the mean, median, and mode.

Mean

A mean is the average of a set of numbers. The mean is calculated by summing all of the numbers, then dividing this total by the count of numbers in the data set. To demonstrate, we'll calculate the mean of the following data set. This data set contains the test scores of a group of 15 test takers.

Scores: 55, 68, 71, 72, 72, 72, 76, 80, 84, 84, 88, 90, 90, 95, 98

Adding up all of these scores gives us 1,195; therefore, we can calculate the mean test score as $1{,}195 / 15 \approx 79.7$.

Try calculating the mean of this data set: 16, 19, 19, 20, 21, 25. You should get 20.

Median

If you sort a data set in ascending or descending order, the number in the middle is the median. The data set of test scores from above is already sorted. The middle number in the list is 80, which is the median. The second data set is also sorted, but it has an even number of data points, so there is no middle number in the list. When this occurs, the median is calculated by averaging the two middle numbers, in this case 19 and 20. Thus our median is $(19 + 20) / 2 = 19.5$. Data sets will not always …
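
The mean and median calculations described in the lesson can be checked with a short Python sketch. It uses the standard library's `statistics` module; the helper name `describe` and the variable names `scores` and `small` are illustrative assumptions, not part of the original lesson.

```python
from statistics import mean, median

# Test scores from the lesson (15 values, already sorted).
scores = [55, 68, 71, 72, 72, 72, 76, 80, 84, 84, 88, 90, 90, 95, 98]

# Second data set from the lesson (6 values, already sorted).
small = [16, 19, 19, 20, 21, 25]


def describe(data):
    """Return (total, mean, median) for a list of numbers.

    Hypothetical helper for illustration only.
    """
    # mean() sums the values and divides by how many there are.
    # median() sorts the data and takes the middle value, or averages
    # the two middle values when the count is even.
    return sum(data), mean(data), median(data)


scores_total, scores_mean, scores_median = describe(scores)
small_total, small_mean, small_median = describe(small)

print("scores:", scores_total, round(scores_mean, 1), scores_median)
print("small: ", small_total, small_mean, small_median)
```

With an odd number of values (the 15 test scores) the median is a value from the data set, while with an even number (the 6-value set) it is the average of the two middle values and need not appear in the data.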
