Analysis of Variance

Bret Larget
Department of Statistics, University of Wisconsin-Madison
Statistics 371, Fall 2004
November 18, 2004

Analysis of Variance

Analysis of variance (ANOVA) is a statistical procedure for analyzing data that may be treated as multiple independent samples with a single quantitative measurement for each sampled individual. ANOVA is a generalization of the methods we saw earlier in the course for two independent samples.

The bucket of balls model is that we have $I$ different buckets of balls, each of which contains numbered balls. The population means and standard deviations of the numbers in each bucket are $\mu_i$ and $\sigma_i$, respectively, for $i = 1, \ldots, I$. In ANOVA we often assume that all of the population standard deviations are equal.

Cuckoo Birds

Cuckoo birds lay their eggs in other birds' nests. The other birds then raise and care for the newly hatched cuckoos. Cuckoos return year after year to the same territory and lay their eggs in the nests of a particular host species. Furthermore, cuckoos appear to mate only within their territory. Therefore, geographical sub-species develop, each with a dominant foster-parent species.

A general question is: are the eggs of the different sub-species distinct, so that they are adapted to a particular foster-parent species? Specifically, we can ask: are the mean lengths of the cuckoo eggs the same in the different sub-species?

Display of Cuckoo Bird Egg Lengths

Here is a plot of egg lengths (mm) of cuckoo bird eggs, categorized by the species of the host bird.

[Figure: egg length (mm, axis from 20 to 25) by birdSpecies: HedgeSparrow, MeadowPipet, PiedWagtail, Robin, TreePipet, Wren]

A Dotplot of the Data

[Figure: dotplot of egg length (mm, axis from 20 to 25) by host species: HedgeSparrow, MeadowPipet, PiedWagtail, Robin, TreePipet, Wren]

The Big Picture

ANOVA is a statistical procedure where we test the null hypothesis that all population means are equal versus the alternative hypothesis that they are not all equal. The test statistic is a ratio of the variability among sample means over the variability within samples. When this ratio is large, this indicates evidence against the null hypothesis.

The test statistic will have a different form than what we have previously seen. The null distribution is an F distribution, named after Ronald Fisher. An ANOVA table is an accounting method for computing the test statistic. We introduce a lot of new notation on the way.

Notation

$y_{ij}$ = the $j$th observation in the $i$th group
$I$ = the number of groups
$n_i$ = the $i$th sample size
$\bar{y}_i$ = the mean of the $i$th sample
$n = \sum_{i=1}^{I} n_i$ = the total number of observations
$\bar{y} = \frac{1}{n} \sum_{i=1}^{I} \sum_{j=1}^{n_i} y_{ij}$ = the grand mean

This notation is used to describe calculations of variability within samples and variability among samples, although for historical reasons (and at the cost of poor grammar) the term "between samples" is more commonly used.

Sums of Squares within Groups

We measure variability by sums of squared deviations. The sums of squares within groups, or SS within, is a combined measure of the variability within all groups.

$$\text{SS within} = \sum_{i=1}^{I} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2 = \sum_{i=1}^{I} (n_i - 1) s_i^2$$

Notice that this measure of variability is a weighted sum of the sample variances, where the weights are the degrees of freedom for each respective sample.

Sums of Squares Between (Among) Means

We measure variability by sums of squared deviations. The sums of squares between groups, or SS between, is a measure of the variability among sample means.

$$\text{SS between} = \sum_{i=1}^{I} n_i (\bar{y}_i - \bar{y})^2$$

Notice that this measure of variability is a weighted sum of the squared deviations of the sample means from the grand mean, weighted by sample size.

Degrees of Freedom

The degrees of freedom within samples is simply the sum of the degrees of freedom for each sample. This is equal to the total number of observations minus the number of groups.

$$\text{df within} = \sum_{i=1}^{I} (n_i - 1) = n - I$$

The degrees of freedom between samples is simply the number of groups minus one.

$$\text{df between} = I - 1$$

Mean Square Within

In ANOVA, a mean square is the ratio of a sum of squares over the corresponding degrees of freedom.

$$\text{MS within} = \frac{\text{SS within}}{\text{df within}} = \frac{(n_1 - 1) s_1^2 + \cdots + (n_I - 1) s_I^2}{n - I}$$

In other words, the mean square within is a weighted average of the sample variances, where the weights are the degrees of freedom within each sample. The mean square within is the estimate of the common variance for all $I$ populations, and its square root is the estimate of the common standard deviation.

$$s_{\text{pooled}} = \sqrt{\text{MS within}}$$

Mean Square Between

In ANOVA, a mean square is the ratio of a sum of squares over the corresponding degrees of freedom.

$$\text{MS between} = \frac{\text{SS between}}{\text{df between}} = \frac{\sum_{i=1}^{I} n_i (\bar{y}_i - \bar{y})^2}{I - 1}$$

The F Statistic

The F statistic is the ratio of the mean square between over the mean square within.

$$F = \frac{\text{MS between}}{\text{MS within}}$$

If the populations are normal, the population means are all equal, the standard deviations are all equal, and all observations are independent, then the F statistic has an F distribution with $I - 1$ and $n - I$ degrees of freedom.

An F distribution is positive and skewed right, like the chi-square distribution, but it has two separate degrees of freedom: the numerator degrees of freedom and the denominator degrees of freedom. If $X_1$ and $X_2$ are independent $\chi^2$ random variables with $k_1$ and $k_2$ degrees of freedom, respectively, then

$$F = \frac{X_1 / k_1}{X_2 / k_2}$$

has an F distribution with $k_1$ and $k_2$ degrees of freedom.

Total Sum of Squares

If we treated all observations as coming from a single population (which would be the case if all population means were equal and all population standard deviations were equal as well), then it would make sense to measure deviations from the grand mean. This is the total sum of squares.

$$\text{SS total} = \sum_{i=1}^{I} \sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2$$

ANOVA Table for the Cuckoo Example

    > fit = aov(eggLength ~ birdSpecies)
    > anova(fit)
    Analysis of Variance Table

    Response: eggLength
                 Df Sum Sq Mean Sq F value    Pr(>F)
    birdSpecies   5 42.940   8.588  10.388 3.152e-08 ***
    Residuals   114 94.248   0.827
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In R, the columns are in an unconventional order and there is no row for totals. R names the row corresponding to "between" by the corresponding categorical variable (here, birdSpecies). R names the row corresponding to "within" Residuals. There are six groups and 120 total observations, which explains the degrees of freedom column: $6 - 1 = 5$ and $120 - 6 = 114$. Each mean square is the ratio of the corresponding sum of squares over the corresponding degrees of freedom.
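The accounting behind an ANOVA table can be sketched in a few lines. The following is a minimal Python illustration, not the course's R workflow: the function name `one_way_anova` and the three tiny groups are hypothetical and chosen only so the arithmetic is easy to follow (they are not the cuckoo data). The last lines recompute the mean squares and F value from the sums of squares and degrees of freedom printed in the R table for the cuckoo example.

```python
from statistics import mean, variance


def one_way_anova(groups):
    """Compute the one-way ANOVA table quantities for a list of samples."""
    num_groups = len(groups)                      # I
    n = sum(len(g) for g in groups)               # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n

    # SS within: weighted sum of sample variances, weights n_i - 1
    ss_within = sum((len(g) - 1) * variance(g) for g in groups)
    # SS between: squared deviations of sample means from the grand mean,
    # weighted by sample size
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # SS total: squared deviations of every observation from the grand mean
    ss_total = sum((y - grand_mean) ** 2 for g in groups for y in g)

    df_between = num_groups - 1
    df_within = n - num_groups
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "ss_between": ss_between, "ss_within": ss_within, "ss_total": ss_total,
        "df_between": df_between, "df_within": df_within,
        "ms_between": ms_between, "ms_within": ms_within,
        "F": ms_between / ms_within,
    }


# Hypothetical toy data (NOT the cuckoo egg lengths): three groups of three.
toy = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(toy)

# Recompute the mean squares and F from the values shown in the R table.
ms_between_cuckoo = 42.940 / 5      # Sum Sq / Df for birdSpecies
ms_within_cuckoo = 94.248 / 114     # Sum Sq / Df for Residuals
f_cuckoo = ms_between_cuckoo / ms_within_cuckoo
print(round(ms_between_cuckoo, 3), round(ms_within_cuckoo, 3), round(f_cuckoo, 3))
```

For the toy data, SS between plus SS within equals SS total, which is the decomposition the table is keeping track of. Computing a p-value would additionally require the F distribution's tail probability, which this sketch leaves out.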