At first glance, replicate #2 looks different. Is this a problem?
Classifying genes as expressed if $\Pr\{E_g \mid Y_{gj} = y\} > 0.5$, the number of expressed genes for each replicate is 55, 36, and 58, respectively. Since we expected only 32 “expressed” genes, this indicates a large number of false positives in replicates 1 and 3.

Statistical Design and the Analysis of Gene Expression Microarray Data
M. Kathleen Kerr and Gary A. Churchill

The authors contend that statistical design and inference should play an important role in microarray studies.
They are particularly concerned with getting the maximum information out of microarray data, and say that statistics is equipped to handle such data.
They compare a gene’s expression measured from a microarray spot to an agricultural variety grown in a field. Just as R. A. Fisher estimated both the yield of varieties and the effect of blocks (fields), the authors suggest simultaneous estimation of gene expression and the effect of the spots on the array.

Use a two-way ANOVA (analysis of variance).
ANOVA can be used to test the null hypothesis that a group of population means are equal, determine which are different, and estimate those differences.
A general form of a two-way ANOVA would be:
$y_{ij} = \mu + B_i + V_j + \epsilon_{ij}$
$y_{ij}$ : measured yield for variety j grown on block i
$\mu$ : overall mean
$B_i$ : effect of block i
$V_j$ : effect of variety j
$\epsilon_{ij}$ : error

For microarray data, varieties are mRNA samples (from different individuals, tissues, timepoints, etc.).
We’re interested in testing whether expression level differs across samples for the same gene.
Can’t do ANOVA if data are reduced to dye ratios.

Experimental design for microarrays:
The authors recommend not using reference samples.
They say that references are an inefficient use of microarray resources, and that more information could be gained by running more samples of interest instead.
(?): Biologists think of references as a set of controls. Are we as confident in an experiment w/o controls?
A two-dye microarray system can be thought of as blocks of size two. Compared to putting a reference in each block, a “balanced design” with no references doubles the amount of data produced.
(?): Is the two-block “reference design” scenario a fair analogy to the way biologists use reference samples in microarrays? What about using just a small # of references with each print group?

The authors advocate replication, in order to
(1) increase the precision of estimation, and
(2) supply an estimate of error.
Replication can be done at several different levels: on the same array, on different arrays, or using distinct (but theoretically equivalent) mRNA samples.
Lastly, the authors suggest a balanced design with respect to dyes. Research has indicated that some genes incorporate one dye better than the other, beyond the overall dye effect usually observed. To account for this possibility, a sample should be run with both red and green dyes (in different spots).

ANOVA model for microarray data:
They explicitly model as many of the interactions and sources of experimental variation as they can:
$y_{ijkgr} = \mu + A_i + D_j + (AD)_{ij} + G_g + (AG)_{igr} + (DG)_{jg} + (VG)_{kg} + \epsilon_{ijkgr}$
$y_{ijkgr}$ is the signal measured on spot r for gene g on array i for dye j and variety (mRNA sample) k.
$\mu$ is the overall average signal in the experiment.
$A_i$, $D_j$, and $(AD)_{ij}$ account for overall variation in dyes and arrays (data normalization).
$G_g$ is the gene effect: the overall average signal for gene g.
$(AG)_{igr}$ is the spot effect.
$(DG)_{jg}$ is the gene-specific dye effect.
$(VG)_{kg}$ is the expression attributable to this variety.
$\epsilon_{ijkgr}$ is the error.
Since we are looking for changes in expression that are due to the variety (the particular mRNA sample), $(VG)_{kg}$ is the quantity of interest.

Assumptions and potential problems mentioned:
ANOVA assumes that effects are additive.
In this case, a logarithmic scale might be more realistic.
“Truncation at the high end of the data”, either due to saturation of the probe or the limits of the scanner.
Treatment of error as homoscedastic – but there might be a different degree of error for expressed vs. unexpressed mRNAs (addressed in Lee et al.).
Treating effects as fixed or random. In particular, the authors advocate treating $(VG)_{kg}$ as random.
(?): Is there a compelling argument for doing so?

Importance of replication in microarray gene expression studies: Statistical methods and evidence from repetitive cDNA hybridizations
Mei-Ling Ting Lee, Frank C. Kuo, G. A. Whitmore, and Jeffrey Sklar

In order to demonstrate the need for replication in microarray studies, the researchers set up a simple controlled experiment. For each of the 288 genes in this experiment, three replicates were printed onto the same slide. They expect that 32 of these genes will appear to be highly expressed because they contain repetitive sequences which should cross-hybridize to other sequences.
(?): They are using RNA samples from human tissue, so why don’t they expect some genes to actually be expressed (rather than giving a false indication of it)?
Just one dye (Cy3) is used for each spot.
The authors’ interest is the variation in signal between replicates.
(?): Which types of variation will this experimental setup detect, and which types will be missed?
They should see variation due to the location of spots within a slide, like print tip effects and slide surface.
Since all three replicates of a gene are on one slide, they will not see between-slide variation.
Since they are only using one dye, they will not be able to observe any dye effects.
(?): If a researcher compares two samples in the same spot (rather than using a reference), is the type of variation observed in this study likely to affect the conclusions of such a microarray study?

(2.1) Statistical approach – definitions:
$X_{gj}$ is a ratio of fluorescence for gene g and replicate j.
$Y_{gj}$ is the log ratio: $\ln(X_{gj})$.
$E_g$ (script E in paper) is the event of mRNA for gene g being present in the sample (gene is expressed).
p is the prior probability of observing $E_g$ for each g.
Recall that Kerr and Churchill had mentioned the assumption of homoscedasticity. In contrast, Lee et al. propose two distinct distributions for $Y_{gj}$ depending on whether a gene is expressed or not.
Both are assumed to have normal distributions.

(2.2) Mixed normal probability density function is:
$f_j(y) = p f_{E_j}(y) + (1 - p) f_{U_j}(y)$
$E_j$: anticipated outcome of being expressed ($U_j$: unexpressed).
From this is derived a posterior
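
The mixture density in (2.2) and the classification rule Pr{Eg | Ygj = y} > 0.5 mentioned in these notes can be connected through Bayes' rule. The following is a minimal sketch of that calculation, not the authors' code. The function names are hypothetical, and every parameter value (the prior p = 32/288, the two means, the two standard deviations) is an illustrative assumption rather than an estimate reported by Lee et al.

```python
import math


def normal_pdf(y, mean, sd):
    """Density of a normal distribution with the given mean and sd at y."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_expressed(y, p, mean_e, sd_e, mean_u, sd_u):
    """Pr{E | Y = y} under the mixture f(y) = p*f_E(y) + (1 - p)*f_U(y).

    Bayes' rule: the expressed component's share of the mixture density.
    """
    expressed_part = p * normal_pdf(y, mean_e, sd_e)
    unexpressed_part = (1.0 - p) * normal_pdf(y, mean_u, sd_u)
    return expressed_part / (expressed_part + unexpressed_part)


# Assumed values for illustration only (not taken from the paper):
# prior = 32 expected "expressed" genes out of 288; component means and
# standard deviations of the log ratio are made up.
params = dict(p=32 / 288, mean_e=2.0, sd_e=1.0, mean_u=0.0, sd_u=0.5)

for y in (0.0, 1.0, 2.0, 3.0):
    post = posterior_expressed(y, **params)
    # A gene is classified as expressed when the posterior exceeds 0.5.
    print(f"y={y:.1f}  Pr(E|y)={post:.3f}  expressed={post > 0.5}")
```

Because the two components are allowed different variances, the 0.5 cutoff on the posterior does not correspond to a single fixed cutoff on the prior-free likelihood ratio; changing p or either standard deviation moves the log-ratio value at which a gene switches from "unexpressed" to "expressed".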


CORNELL CS 726 - Study Notes
