Counting data

Thus far, we have encountered data that were the results of measurements made on some process. Often, however, data take the form of counts of things, where a "thing" could be a quality, category, or any similar non-numeric division. For example, we might count the number of dendritic branchings that occurred on each of 5 types of growth media. The data (# of branchings) might look like the following:

Medium:            A    B    C    D    E
# of branchings:   12   15   20   19   26

Obviously, we probably have scientific reasons for believing that one or more growth media will do better than the others, but a statistical analysis allows us to compute the probability that the observed pattern of results could be due to chance alone. The total number of branchings is N = 92 so, intuitively, we would expect about 92/5 branchings in each group if chance alone were operating. Of course, we won't always get 92/5 branchings per group because of the very chance fluctuations with which we are concerned (in fact, we will never get 92/5 branchings because 92/5 = 18.4 and counts, by definition, are integers).

In each group we observe Og branches and, hence, N – Og "non-branches". If chance alone is operating and all growth media are equally supportive, the number of counts Og in each cell on a given experiment would be a random number from a binomial distribution with N = 92 and p = 1/5. To simulate how 92 branches would spread over 5 growth media by chance alone, we will generate 92 uniformly distributed random numbers and then count how many fall in each of 5 even intervals between 0 and 1:

branchloc = rand(92,1);
simcounts(1) = sum(branchloc < 1/5);
simcounts(2) = sum(branchloc > 1/5 & branchloc < 2/5);
simcounts(3) = sum(branchloc > 2/5 & branchloc < 3/5);
simcounts(4) = sum(branchloc > 3/5 & branchloc < 4/5);
simcounts(5) = sum(branchloc > 4/5);

Or alternatively:

branchloc = rand(92,1);
ngroup = 5;
for g = 1:ngroup
    simcounts(g) = sum(branchloc > (g-1)/ngroup & branchloc < g/ngroup);
end

That's it! We now have simulated data from an experiment in which chance alone distributes the branches among cells, given 92 total branchings. To do a Monte Carlo simulation of this "world" in which chance alone is operating, we need merely repeat this several times over. What we need now is a test statistic that we can compute on each simulated experiment (to develop a sampling distribution) and also on our real data, so that we can see where the statistic for the data falls relative to the sampling distribution.

We know that the expected value in each cell is E = N*1/5 = 18.4, so a reasonable statistic can be formed by computing the difference between each cell count (real or simulated) and E, squaring it (so the negative differences don't cancel out the positive ones), and summing to obtain a sum-squared-error (which by this time should be very familiar to us!).

An implementation in MATLAB is:

ngroup = 5;
totCount = 92;
E = totCount / ngroup;
nrep = 1000;
simcounts = zeros(ngroup,1);
simerrs = zeros(nrep,1);
for i = 1:nrep
    branchloc = rand(totCount,1);
    for g = 1:ngroup
        simcounts(g) = sum(branchloc > (g-1)/ngroup & branchloc < g/ngroup);
    end
    simerrs(i) = sum((simcounts - E).^2);
end
realcounts = [12, 15, 20, 19, 26];
realerr = sum((realcounts - E).^2);
hist(simerrs);
line([realerr realerr], [0 300]);
p = sum(simerrs >= realerr)/nrep;
disp(p);

The resulting plot and p-value are shown below.

Figure 7.1. Sampling distribution of the sum-squared-error metric under the null assumption of equally supportive growth media. The observed error (113) is well within the distribution.

Let's return to our error metric, which is simply a sum-squared-difference. If we give it the arbitrary name ζ (zeta), it can be written mathematically as:

ζ = Σ_{i=1}^{k} (O_i – N/k)²

where O_i is the number of observed events (branchings, in this case) in group i, N is the total number of events, and k is the number of groups. Clearly, the distribution of this metric is going to change with both the number of groups, k (the more groups we sum over, the bigger we expect the metric to be overall), and the total number of things we are counting, N. Intuitively, if we are counting how many times a neuron spiked in one of two conditions over a 10 sec interval, and the total number of spikes was ~3000, then a discrepancy of 10 spikes between observed and expected would be relatively small. If, on the other hand, we were counting the number of times a rat chose to press each of two possible levers across 20 total trials, then a discrepancy of 10 between observed and expected would be the maximum discrepancy possible.

We can modify our metric very easily to standardize across all possible values of N by dividing each squared difference by the expected value, and in so doing we obtain the traditional chi-squared "goodness-of-fit" test:

χ² = Σ_{i=1}^{k} (O_i – E_i)² / E_i,   where E_i = N/k

The reason that we divide by the expected value, and not by its square, the binomial variance N(1/k)(1 – 1/k), etc., is shown for the special case of k = 2 in the appendix.
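To see this standardized metric in action, the Monte Carlo loop above can be rerun with the chi-squared statistic in place of the raw sum-squared-error. The following is a sketch only, not part of the original implementation; it simply reuses the variable names and the (arbitrary) 1000 repetitions from the earlier code:

ngroup = 5;
totCount = 92;
E = totCount / ngroup;                        % expected count per group under chance
nrep = 1000;
simcounts = zeros(ngroup,1);
simchi2 = zeros(nrep,1);
for i = 1:nrep
    branchloc = rand(totCount,1);             % scatter 92 branches at random
    for g = 1:ngroup
        simcounts(g) = sum(branchloc > (g-1)/ngroup & branchloc < g/ngroup);
    end
    simchi2(i) = sum((simcounts - E).^2 / E); % standardized (chi-squared) metric
end
realcounts = [12, 15, 20, 19, 26];
realchi2 = sum((realcounts - E).^2 / E);      % = 113.2/18.4, about 6.15
p = sum(simchi2 >= realchi2)/nrep;
disp(p);

If the Statistics Toolbox is available, this Monte Carlo p-value can also be checked against the theoretical one from a chi-squared distribution with k – 1 = 4 degrees of freedom, i.e., 1 - chi2cdf(realchi2, ngroup-1).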
Another common use of the chi-squared test is to assess the independence of two variables. Let us introduce this use with a somewhat whimsical example. Suppose a street entertainer approaches you and offers you the following wager. He has two decks of 40 cards each. One deck consists of 20 Kings and 20 Jokers, while the other consists of 20 Queens and 20 Jokers. You will pay him a dollar to play, and he will deal out 40 2-card hands, one card from each deck. If there are more than 10 royal couples (a King and a Queen), he keeps the dollar. If, on the other hand, there are 10 or fewer royal couples, he gives you two dollars.

Assuming the game is fair, we can easily figure out what we should expect to happen (on average). For each hand, there should be a 0.5 probability of getting a King from the first deck, and a 0.5 probability of getting a Queen from the second deck. Thus, if the two cards for a given hand are truly drawn independently, then the probability of getting the hand (K,Q) is p(K)*p(Q) = 0.5*0.5 = 0.25. Over 40 hands, then, the expected value for the number of occurrences of (K,Q) is 0.25*40 = 10.

This situation is summarized below in a contingency table. The first deck is represented in the rows and the second deck is represented in the columns, such that each possible hand (or "contingency")
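As a quick check on the expected count of 10 royal couples computed above, here is a minimal Monte Carlo sketch in the spirit of the earlier simulations (a sketch only, not from the notes themselves; the 1/0 encoding of the decks and the 10,000-game repetition count are arbitrary choices):

nrep = 10000;                          % number of simulated games
nhands = 40;
deck1 = [ones(1,20), zeros(1,20)];     % 1 = King,  0 = Joker
deck2 = [ones(1,20), zeros(1,20)];     % 1 = Queen, 0 = Joker
ncouples = zeros(nrep,1);
for i = 1:nrep
    d1 = deck1(randperm(nhands));      % shuffle deck 1
    d2 = deck2(randperm(nhands));      % shuffle deck 2
    ncouples(i) = sum(d1 & d2);        % hands that pair a King with a Queen
end
disp(mean(ncouples));                  % should be close to the expected value of 10
disp(sum(ncouples > 10)/nrep);         % how often the entertainer would keep the dollar

Because each 40-card deck is dealt out in full rather than sampled with replacement, the number of royal couples in a game is not exactly binomial, but its expected value is still 0.25*40 = 10, matching the calculation above.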

