Identifying expression differences in cDNA microarray experiments
Lecture 20, Statistics 246, April 6, 2004

Introduction

Many microarray experiments are carried out to find genes which are differentially expressed between two or more samples of cells. Examples abound:

cells from the liver, say, in a mouse with a gene knocked out, compared with liver cells in a normal mouse of the same strain;
cells in one region of the brain, say the cerebellum, compared with cells in a different region, say the anterior cingulate region;
tumor cells in some organ, say the liver, compared with normal cells from the same organ;
cells from an organism, say yeast, after a treatment, say by heat or cold or a drug, compared with cells of the same kind in the untreated state;
cells from some part of a developing organ or organism at one time, compared with cells of the same kind at a later time;
and so on.

It should be clear that the number of such comparisons is limited only by the imagination of the biologist, at least at the moment, when details of so many genetic programs (drug response, development, tumorigenesis) are incomplete.

Preliminary remarks

Initially, comparative microarray experiments were done with few, if any, replicates, and statistical criteria were not used for identifying differentially expressed genes. Instead, simple criteria were used, such as fold change, with 2-fold being a popular cut-off.

This was sometimes done without regard to the variability present in the experiment and, depending on the experiment, could be too liberal, naming genes that were not differentially expressed (false positives, or errors of the first kind), or too conservative, failing to identify genes that were differentially expressed (false negatives, or errors of the second kind).

The relative importance of false positives and false negatives depends on the context: the aims of the experiment (e.g., were the investigators seeking broad patterns or specific genes?) and the follow-up experiments envisaged, e.g., validation of findings by a more precise
technique.

It did not take long for people to want to assign statistical significance to their findings concerning differentially expressed genes. Could p-values be attached, confidence statements be made, and so on?

These questions raised a number of issues which were unfamiliar to the molecular biologists doing the experiments: replication, systematic versus random differences, multiplicity of tests, etc.

Preliminary remarks, cont.

It was eventually realized that biological replicates are important for reaching statistically sound conclusions, but even here the story is not simple, as many systematic features remain even in experiments with independent replicates.

It also became clear that, with many thousands of comparisons being carried out, the use of traditional cut-offs (2 SD or p < 0.05) was inappropriate.

Strict control of type 1 error rates (we'll be more precise in the next lecture) turned out to be asking too much in this microarray context, and different criteria for controlling error rates have come to the fore, most notably false discovery rates (FDR). However, there still remains a need for appropriate theory. Modelling large-scale microarray experiments is not a simple task: the difference between assuming and actually having independence can be great.

Where are we now? Depending on the context, some researchers can make use of traditional multiple testing procedures. Others make use of FDR notions, which are more widely applicable but lead to weaker conclusions. Yet others have realized that it is unreasonable, given their sample sizes, to expect statistical significance for all their results, and instead seek evidence of biological significance and validation by suitable follow-up.

Preliminary remarks, concluded

Where are we now, cont.

Two issues have come to the fore in recent years.

The first is the interpretation of the lists of genes determined, by whatever means, to be differentially expressed. What kinds of genes are they, i.e., what is their function (DNA binding, protease)?
What cellular pathways or processes are involved (replication, cell death)? Where in the cell are these genes operating (in ribosomes, mitochondria)? The question becomes: are genes of a given class over-represented in the list of differentially expressed genes? As there are many classes, there are many such questions, and so the issue of multiplicity comes up once more.

The second is the determination of significance for sets of genes given a priori, attempting to answer some form of the question: is this specific set of genes differentially expressed? The idea here is that a particular set of genes, say those involved in oxidative phosphorylation, might have all or many changed a little, and that this pattern of change might be significant although the individual changes are not. Of course, we won't begin with just one set of genes, but many, so multiplicity questions arise here too.

In both of these questions, how we rank the genes will be important, and the cut-offs less so.

Some motherhood statements

Important aspects of a statistical analysis include:

tentatively separating systematic from random sources of variation;
removing the systematic and quantifying the random, when the system is in control;
identifying and dealing with the most relevant source of variation in subsequent analyses.

Only if this is done can we hope to make more or less valid probability statements.

The simplest cDNA microarray data analysis problem is identifying differentially expressed genes using one slide

This is a common enough hope.
Efforts are frequently successful.
It is not hard to do by eye.
The problem is probably beyond formal statistical inference (valid p-values, etc.) for the foreseeable future. Why?

In the next two panels, genes found to be up- or down-regulated in an 8 treatment (Srb1 over-expression) versus 8 control comparison are indicated in red and green, respectively, on plots of the data from single hybridizations. Also depicted are confidence lines determined by different people, which claim to be able to
delineate differentially expressed genes using just one hybridization (slide), the different lines corresponding to different methods and/or different confidence levels. What do we see?

Matt Callow's Srb1 dataset (5): Newton's and Chen's single slide method

Matt Callow's Srb1 dataset (8): Newton's, Sapir & Churchill's and Chen's single slide method

The second simplest cDNA microarray data analysis problem is identifying differentially expressed genes using replicated hybridizations

There are a number of
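As an aside on the multiplicity remarks in the preliminary slides: the sketch below illustrates why a per-gene cut-off of p < 0.05 is too liberal when thousands of genes are tested, and how a false discovery rate criterion behaves instead. It is a minimal sketch, not the lecture's own method. It uses the Benjamini-Hochberg step-up rule as one common FDR procedure, and it assumes independent tests with uniformly distributed null p-values, which is exactly the kind of assumption the slides warn may not hold in practice. The function names, the gene counts (5000 null, 100 changed) and the simulated p-values are all hypothetical.

```python
import random


def expected_false_positives(n_null_genes, alpha):
    """Expected number of null genes with p < alpha, assuming uniform null p-values."""
    return n_null_genes * alpha


def benjamini_hochberg(pvalues, q=0.05):
    """Indices rejected by the Benjamini-Hochberg step-up rule at FDR level q.

    Rejects the k smallest p-values, where k is the largest rank
    (1-based) such that p_(k) <= k * q / m.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * q / m:
            k_max = rank
    return sorted(order[:k_max])


def simulate_pvalues(n_null=5000, n_changed=100, seed=0):
    """Hypothetical data: uniform p-values for null genes, tiny ones for changed genes."""
    rng = random.Random(seed)
    null_p = [rng.random() for _ in range(n_null)]
    changed_p = [rng.random() * 1e-4 for _ in range(n_changed)]
    return null_p + changed_p


pvals = simulate_pvalues()
naive_hits = [i for i, p in enumerate(pvals) if p < 0.05]
fdr_hits = benjamini_hochberg(pvals, q=0.05)

print("expected false positives at p < 0.05:", expected_false_positives(5000, 0.05))
print("genes passing p < 0.05:", len(naive_hits))
print("genes passing Benjamini-Hochberg at q = 0.05:", len(fdr_hits))
```

Under these assumptions the per-gene cut-off is expected to pass about 250 null genes alongside the 100 changed ones, whereas the step-up rule ties its threshold to the rank of each p-value. With dependent genes, the nominal FDR level need not be achieved.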