Genes and MS in Tasmania, completed
Lecture 7, Statistics 246, February 12, 2004

Towards a sharing statistic

Our aim was to come up with a statistic that effectively describes haplotype sharing differences between case and control haplotypes. The sharing statistic should be largest at markers closest to a disease locus, as
- haplotype sharing there should extend the furthest;
- the association of disease with particular haplotypes should be strongest.

Nonparametric haplotype sharing analysis

Why nonparametric rather than likelihood-based methods?
- Likelihood methods make assumptions regarding the genealogy of the population, and we don't know how robust the methods are to violations of these assumptions.
- Likelihood methods are computationally intensive, especially for genome-wide scans, where there is a need to maximize over the very large state space of possible ancestral haplotypes (MCMC).
- Likelihood methods have a hard time at the HLA region, because the LD there is extremely high and non-uniform (block-like structure). Simpler statistics will probably do better here, unless we can model background LD.

Haplotype sharing statistics for genome-wide scan data (cf. fine mapping)

Previous (usually likelihood-based) statistics have concentrated on fine mapping and the exact localization of a variant allele. They assume a signal exists. For us, localization was not the primary interest. Rather, detection was our main interest, using a genome-wide scan. We needed something that was not as computationally intensive as DHSMAP (McPeek & Strahs, 1999), BLADE (Liu et al., 2001), DMLE (Rannala & Reeve, 2001), or the shattered coalescent (Morris et al., 2002).

Haplo clusters (Melanie Bahlo)

- Calculates a sharing statistic at every marker.
- Obtains a p-value at every marker using a permutation test.
- Allows for several clusters of ancestral haplotypes (allelic heterogeneity).

Testing for shared haplotypes

[Figure: score for haplotype sharing (-log p) along the chromosome from Pter to Qter, shown with the marker alleles of the case and control haplotypes.]

Sharing drop-off, allelic heterogeneity

[Figure: proportions of cases and proportions of controls at markers 1 to 4, split into cluster 1 haplotypes, cluster 2 haplotypes, and haplotypes in neither cluster 1 nor 2.]

Haplo cluster in action

Example: sorting on marker 1 for a sample of 3 case and 4 control haplotypes.

Cases:     213  213  214
Controls:  112  123  133  312

Haplotype   1   2   3
Controls    3   0   1
Cases       0   3   0

After sorting on the haplotype consisting only of marker 1, calculate a chi-square statistic and move on.

Haplotype   11  12  13  21  31
Controls     1   1   1   0   1
Cases        0   0   0   3   0

After sorting on the haplotype consisting of marker 1 and marker 2, calculate a chi-square statistic, and so on. Eventually stop, and sum the chi-square statistics. Then repeat for a suitably large number of random permutations of cases and controls.

Statistic to evaluate haplotype sharing

- The sharing statistic is chi-square based, using the idea of multiple ancestral haplotypes (clusters), which are grown starting at each marker examined in the scan.
- Significance is evaluated via a permutation test: choose a random permutation of the pooled cases and controls and recalculate the statistic; repeat 20,000 times.

$$ S_i = \sum_{j=1} \sum_{k=1}^{K} \chi^2_{i,j,k} $$

Here $K$ is the number of clusters, and $\chi^2_{i,j,k}$ is the $\chi^2_1$ test for association between the number of case and control haplotypes still sharing the ancestral haplotype of cluster $k$ at marker $j$, after starting at marker $i$.

A recursive form for the estimator of the p-value and its SD was used to enable early termination of the program.

The permutation test

The idea is this. We have 170 cases and 105 controls, and at any particular locus we calculate the value of our statistic, calling it $S$. Now pool our cases and controls into 275 individuals and sample 170 to be cases at random from the 275, calling the remainder controls. For this first artificial set of cases and controls, calculate the value of our statistic, $S_1$ say. Next we repeat this procedure 9,999 more times, say, obtaining values $S_2, S_3, S_4, \ldots, S_{10\,000}$. As long as 10,000 is sufficiently many random permutations, we can get a good estimate of the p-value of our initial statistic relative to our empirically estimated null distribution as

$$ p = \frac{\#\{\, i : S_i \ge S \,\}}{10\,000} $$

Exercises

1. How should we decide what number of resamplings is large enough?
2. Explain, in the simple case of a 2 x 2 table of cases and controls cross-classified as diseased and healthy, how using all possible resamplings rather than a fixed-size random sample leads to the p-value for the exact test.
3. To avoid carrying out an unnecessarily large number of permutations, the proportion of resampled values of our statistic exceeding the value $S$ can be monitored. Can you describe a stopping rule for the random resamplings that should lead to accurate enough p-values without going to the full number each time?

Haplo clusters: Output

opt = 1: Genetic distances used to decide order of markers to sort on
c = 1: The number of clusters of haplotypes to look for = 1
miss = 1: The missing data is replaced randomly using the 2-marker haplotype information
share = 5: The number of haplotypes needed to share = 5
The standard deviation p-values are calculated to = 0.01 * phat
Marker names have been provided and will be used in the output files
# of case haplotypes = 338
# of control haplotypes = 208
# of markers = 11
# of perms = 100000

Marker     Mapdistance  Chi-Square  p         sd(p)     -log(p)  #perms
D21S1911    0.00        5.34        4.44e-01  4.44e-03  0.35     12510
D21S1904    0.85        6.17        3.63e-01  3.63e-03  0.44     17577
D21S1899   10.36        5.89        4.37e-01  4.37e-03  0.36     12876
D21S1922   16.46        2.97        6.83e-01  6.83e-03  0.17      4636
D21S1884   17.26        4.74        4.14e-01  4.14e-03  0.38     14135
D21S1914   20.82        6.49        3.38e-01  3.38e-03  0.47     19571
D21S263    28.97        4.06        5.24e-01  5.24e-03  0.28      9077
D21S1252   39.41        1.18        8.66e-01  8.65e-03  0.06      1553
D21S1919   42.51        1.38        8.51e-01  8.51e-03  0.07      1751
D21S1255   43.81        2.24        7.24e-01  7.24e-03  0.14      3805
D21S266    51.51        3.86        5.70e-01  5.70e-03  0.24      7557

Haplo clusters: Output II

Table of haplotypes

Marker  Cluster  Haplotype Length  Haplotype
D21S1911 D21S1904 D21S1899 D21S1884 1 6 # of haplos 5 Chi square 0 2 D21S1922 D21S1884 D21S1914 D21S263 D21S125 3 82 0 0 3 163 3 2 8 22 0 1 11 2 1 …
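
The "Haplo cluster in action" example above can be traced in a few lines of Python. This is only a sketch under stated assumptions, not the Haplo clusters program itself: the slides do not say how the ancestral haplotype of a cluster is chosen, so here it is assumed to be grown by taking, at each added marker, the allele most common among the case haplotypes still sharing. The names `chi2_2x2`, `sharing_statistic` and the `order` argument are hypothetical, a single cluster is used, and the `share` threshold from the program output is ignored.

```python
from collections import Counter


def chi2_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # an empty margin carries no evidence of association
    return n * (a * d - b * c) ** 2 / denom


def sharing_statistic(cases, controls, order):
    """Sum 1-df chi-squares while one ancestral haplotype is grown.

    order: marker indices in the order they are added (hypothetical argument;
    the real program decides this order from genetic distances).
    """
    sharing_cases, sharing_controls = list(cases), list(controls)
    total = 0.0
    for m in order:
        if not sharing_cases:
            break
        # Assumption: extend with the allele most common among sharing cases.
        allele = Counter(h[m] for h in sharing_cases).most_common(1)[0][0]
        sharing_cases = [h for h in sharing_cases if h[m] == allele]
        sharing_controls = [h for h in sharing_controls if h[m] == allele]
        a, c = len(sharing_cases), len(sharing_controls)
        # 2 x 2 table: (case, control) by (still sharing, no longer sharing).
        total += chi2_2x2(a, len(cases) - a, c, len(controls) - c)
    return total


cases = ["213", "213", "214"]
controls = ["112", "123", "133", "312"]
print(sharing_statistic(cases, controls, order=[0]))     # marker 1 only
print(sharing_statistic(cases, controls, order=[0, 1]))  # markers 1 and 2
```

With these toy data all three cases and none of the controls share the haplotype at markers 1 and 2, so each step contributes the largest chi-square possible for a 3-versus-4 table; sharing among the cases only starts to drop off at marker 3.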
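
The permutation test described above can be sketched as follows. This is a minimal illustration rather than the program's actual code: `permutation_pvalue`, its `stat` callback, and the `rel_sd` and `min_perms` arguments are hypothetical names. The optional early stop, which halts once the binomial standard deviation of the estimated p-value falls below a fixed fraction of the estimate, is one possible reading of the "0.01 * phat" line in the program output and should be treated as an assumption. The toy statistic simply counts cases carrying allele 2 at marker 1.

```python
import math
import random


def permutation_pvalue(stat, cases, controls, max_perms=10_000,
                       rel_sd=None, min_perms=100, seed=0):
    """Estimate P(S_perm >= S_obs) by relabelling pooled cases and controls.

    stat(cases, controls) -> float is any statistic (hypothetical callback).
    If rel_sd is given, stop once sd(p_hat) <= rel_sd * p_hat (assumed rule).
    Returns (p_hat, number_of_permutations_used).
    """
    rng = random.Random(seed)
    observed = stat(cases, controls)
    pooled = list(cases) + list(controls)
    n_cases = len(cases)
    exceed = 0
    for n in range(1, max_perms + 1):
        rng.shuffle(pooled)  # random relabelling of the pooled sample
        if stat(pooled[:n_cases], pooled[n_cases:]) >= observed:
            exceed += 1
        if rel_sd is not None and n >= min_perms and exceed > 0:
            p_hat = exceed / n
            if math.sqrt(p_hat * (1 - p_hat) / n) <= rel_sd * p_hat:
                return p_hat, n
    return exceed / max_perms, max_perms


def toy_stat(cases, controls):
    """Toy statistic: number of case haplotypes with allele 2 at marker 1."""
    return sum(h[0] == "2" for h in cases)


cases = ["213", "213", "214"]
controls = ["112", "123", "133", "312"]
p_full, n_full = permutation_pvalue(toy_stat, cases, controls)
p_early, n_early = permutation_pvalue(toy_stat, cases, controls, rel_sd=0.1)
print(p_full, n_full)
print(p_early, n_early)
```

For this toy sample the exact answer is available by counting: only 1 of the 35 ways of choosing 3 cases from 7 haplotypes puts all three allele-2 haplotypes among the cases, so the estimate should be close to 1/35. Note that the sketch permutes whatever units it is given, whereas the lecture permutes individuals.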