Statistical Description of Data

Cf. NRiC, Chapter 14.

Statistics provides tools for understanding data. In the wrong hands these tools can be dangerous!

Here's a typical data analysis cycle:
1. Apply some formula to the data to compute a "statistic".
2. Find where the value falls in a probability distribution computed on the basis of some "null hypothesis".
3. If it falls in an unlikely spot (on a tail of the distribution), conclude that the null hypothesis is false for your data set.

Statistics

Statistics and probability theory are closely related. Statistics can never prove things, only disprove them by ruling out hypotheses.

Distinguish between model-independent statistics (this class, e.g. mean, median, mode) and model-dependent statistics (next class, e.g. least-squares fitting).

We will make use of special functions (e.g. the gamma function) described in NRiC, Chapter 6.

Moments of a Distribution

Cf. NRiC §14.1.

The mean, median, and mode of distributions are called measures of central tendency. The most common description of data involves its moments, sums of integer powers of the values. The most familiar moment is the mean:

    \bar{x} = \langle x \rangle = \frac{1}{N} \sum_{j=1}^{N} x_j

Variance

The width about the central value is estimated by the second moment, called the variance:

    \mathrm{Var}(x_1, \ldots, x_N) = \frac{1}{N-1} \sum_{j=1}^{N} (x_j - \bar{x})^2

or its square root, the standard deviation:

    \sigma = \sqrt{\mathrm{Var}}

Why N - 1? If the mean is known a priori, i.e. if it's not measured from the data, then use N, else N - 1. If this matters to you, then N is probably too small!

More on Moments

A clever way to minimize round-off error when computing the variance is to use the corrected two-pass algorithm. First compute \bar{x}, then do:

    \mathrm{Var} = \frac{1}{N-1} \left\{ \sum_{j=1}^{N} (x_j - \bar{x})^2 - \frac{1}{N} \left[ \sum_{j=1}^{N} (x_j - \bar{x}) \right]^2 \right\}

The second sum would be zero if \bar{x} were exact, but otherwise it does a good job of correcting the round-off error in Var.

Higher moments, like skewness (3rd moment) and kurtosis (4th moment), are also sometimes used.
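The corrected two-pass algorithm can be sketched in Python as follows (a minimal sketch; the function name is my own, not from NRiC):

```python
def corrected_two_pass_variance(data):
    """Variance via the corrected two-pass algorithm (cf. NRiC 14.1).

    First pass computes the mean; second pass accumulates both
    sum((x - mean)^2) and sum(x - mean). The second sum would be zero
    in exact arithmetic, so it acts as a round-off correction.
    Uses the N - 1 normalization (mean estimated from the data).
    """
    n = len(data)
    if n < 2:
        raise ValueError("need at least two data points")
    mean = sum(data) / n
    sum_sq = 0.0   # sum of squared deviations
    sum_dev = 0.0  # sum of deviations (round-off correction term)
    for x in data:
        d = x - mean
        sum_sq += d * d
        sum_dev += d
    return (sum_sq - sum_dev * sum_dev / n) / (n - 1)

# Example: sample variance of a small data set
print(corrected_two_pass_variance([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
```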
Distribution Functions

A distribution function (DF) p(x) gives the probability of finding a value between x and x + dx. The expected mean data value is:

    \langle x \rangle = \frac{\int_{-\infty}^{\infty} x\, p(x)\, dx}{\int_{-\infty}^{\infty} p(x)\, dx}

For a discrete DF:

    \langle x \rangle = \frac{\sum_i x_i\, p_i}{\sum_i p_i}

This is similar to weighted means, e.g. the center of mass.

Median

The median of a DF is the value x_med for which larger and smaller values of x are equally probable:

    \int_{-\infty}^{x_{\rm med}} p(x)\, dx = \frac{1}{2} = \int_{x_{\rm med}}^{\infty} p(x)\, dx

For discrete values, sort in ascending order, then:

    x_{\rm med} = x_{(N+1)/2} (N odd),   x_{\rm med} = \frac{1}{2}\left(x_{N/2} + x_{N/2+1}\right) (N even)

Mode

The mode of a probability DF p(x) is the value of x where the DF takes on a maximum value. It is most useful when there is a single, sharp maximum, in which case it estimates the central value. Sometimes a distribution will be bimodal, with two relative maxima. In this case the mean and median are not very useful, since they give only a "compromise" value between the two peaks.

Comparing Distributions

Often we want to know if two distributions have different means or variances (NRiC §14.2):
1. Student's t-test for significantly different means.
   a) Find the number of standard errors (~σ/N^{1/2}) between the two means.
   b) Compute the statistic using a nasty formula.
   c) A small value of the resulting probability indicates a significant difference.
2. F-test for significantly different variances.
   a) Compute F = Var1/Var2 and plug it into a nasty formula.
   b) A small value indicates a significant difference.

Comparing Distributions, Cont'd

Given two sets of data, we can generalize to a single question: are the sets drawn from the same DF? Recall that we can only disprove, not prove. We may have continuous or binned data, and may want to compare one data set with a known DF, or two unknown data sets with each other.

A popular technique for binned data is the χ² test. For continuous data, use the KS test.
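The t and F statistics above can be sketched as follows (a minimal sketch; function names are my own, and the "nasty formulas" mapping each statistic to a probability — incomplete beta functions in NRiC — are omitted):

```python
import math

def student_t(x, y):
    """Student's t statistic for two samples, using the pooled variance
    (the classic case where both samples share one underlying variance)."""
    n1, n2 = len(x), len(y)
    m1, m2 = sum(x) / n1, sum(y) / n2
    ss1 = sum((v - m1) ** 2 for v in x)
    ss2 = sum((v - m2) ** 2 for v in y)
    # Pooled standard error of the difference of the means
    sd = math.sqrt((ss1 + ss2) / (n1 + n2 - 2) * (1.0 / n1 + 1.0 / n2))
    return (m1 - m2) / sd

def f_statistic(x, y):
    """F = Var1/Var2, with the larger variance on top by convention,
    so F >= 1 always."""
    def var(d):
        m = sum(d) / len(d)
        return sum((v - m) ** 2 for v in d) / (len(d) - 1)
    v1, v2 = var(x), var(y)
    return max(v1, v2) / min(v1, v2)
```

In practice the statistic is then fed to the appropriate distribution (Student's distribution with n1 + n2 - 2 degrees of freedom for t; the F-distribution for F) to get the significance.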
Chi-Square (χ²) Test

Cf. NRiC §14.3.

Suppose we have N_i events in the ith bin but expect n_i:

    \chi^2 = \sum_i \frac{(N_i - n_i)^2}{n_i}

A large value of χ² indicates an unlikely match. Compute the probability Q(χ²|ν) from the incomplete gamma function, where ν is the number of degrees of freedom.

For two binned data sets with events R_i and S_i:

    \chi^2 = \sum_i \frac{(R_i - S_i)^2}{R_i + S_i}

Kolmogorov-Smirnov (KS) Test

Appropriate for unbinned distributions. From the sorted list of data points, construct an estimate S_N(x) of the cumulative DF of the probability DF from which the data were drawn:
- S_N(x) gives the fraction of data points to the left of x.
- It is constant between the x_i's and jumps by 1/N at each x_i.
- Note S_N(x_min) = 0, S_N(x_max) = 1.
- The behavior between x_min and x_max distinguishes distributions.

KS Test, Cont'd

The statistic is the maximum value of the absolute difference between two cumulative DFs. To compare a data set to a known cumulative DF P(x):

    D = \max_{-\infty < x < \infty} \left| S_N(x) - P(x) \right|

To compare two unknown data sets:

    D = \max_{-\infty < x < \infty} \left| S_{N_1}(x) - S_{N_2}(x) \right|

Plug D and N (or N_e = N_1 N_2 / (N_1 + N_2)) into a nasty formula to get the numerical value of the significance.
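The χ² and one-sample KS statistics above can be sketched as follows (a minimal sketch; function names are my own, and the significance formulas — the incomplete gamma function for χ², the KS probability series for D — are omitted):

```python
def chi_square(observed, expected):
    """Chi-square statistic for binned data: sum over bins of
    (N_i - n_i)^2 / n_i, where N_i is observed and n_i is expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi_square_two_sets(r, s):
    """Chi-square for two binned data sets R_i and S_i:
    sum of (R_i - S_i)^2 / (R_i + S_i), skipping empty bin pairs."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(r, s) if a + b > 0)

def ks_statistic(data, cdf):
    """One-sample KS statistic: D = max |S_N(x) - P(x)|, where P is a
    known cumulative DF.

    S_N is a step function jumping by 1/N at each sorted data point, so
    the extreme deviation at x_i is against either i/N (just after the
    jump) or (i - 1)/N (just before it).
    """
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        p = cdf(x)
        d = max(d, abs(i / n - p), abs((i - 1) / n - p))
    return d
```

For two unknown data sets, one compares the two empirical S_N's instead and uses N_e = N1*N2/(N1 + N2) in place of N in the significance formula.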