U of I CS 466 - Statistical Testing with Genes - D1722694

Course Cs 466- Advanced Computer Architecture

Pages 26


Slide titles:
•Statistical Testing with Genes
•Hypergeometric test
•Hypergeometric Distribution
•“Enrichment” analysis
•Gene Ontology
•Motivation for GO
•Consistent and Structured organization
•Gene Ontology database
•Structure of GO
•GO database
•Accessing GO
•Statistical Tests
•Statistical Testing with genes: another example
•Multiple genes
•Complete Null Hypothesis
•What we’d like
•To be more specific
•Controlling false positives
•False Discovery Rate
•Some definitions
•A difference between p-values and FDRs
•The “Procedure”
•Benjamini-Hochberg procedure

Statistical Testing with Genes
Saurabh Sinha
CS 466

Hypergeometric test
•A population of N genes
•Microarray data identified a subset of n genes as being up-regulated in cancer
•We also know a set of m genes involved in cell division
•We want to test for association between cancer and cell division

Hypergeometric test
[Venn diagram: Population (N); Cancer (n); Cell Div (m); intersection, Cancer and Cell Div (k)]
Is the intersection (size “k”) significantly large, given N, m, n?

Hypergeometric Distribution
Given that m of the N genes are labeled “cell division”, if we picked a random sample of n genes, how likely is an intersection of size exactly k?

Hypergeometric Distribution
Given that m of the N genes are labeled “cell division”, if we picked a random sample of n genes, how likely is an intersection of size k or greater?
$P = \sum_{j \ge k} f(j; N, m, n)$

“Enrichment” analysis
•If P calculated this way is below some threshold $\alpha$, e.g., 0.05, we say that the association between the cancer set and the cell division set is statistically significant.
•In general, such “enrichment analysis”
–has a set of genes with known function (or other property, e.g., “fast evolving”)
–has a set of genes identified by the particular study (e.g., microarray study of cancer)
–tests if the identified genes are enriched for the function or property

Gene Ontology
•Gene Ontology (GO) is a highly popular source of “known gene sets”, e.g., gene sets with common known function
Source: “Bioinformatics” - Polanski and Kimmel

Motivation for GO
•Large numbers of genes in different organisms, with large numbers of protein products and functions
•Several databases (often species specific) store such information (e.g., FlyBase)
•Need a way to organize all this information from all these databases

Consistent and Structured organization
•Browsing through genome databases like FlyBase is powerful, but …
•… there is a lack of consistency from one database to another. (The “interface” is not the same.)
•Annotations (e.g., functions of genes) could be more “structured”, for much easier and automated access
–e.g., Google Scholar/PubMed versus bibliography software/databases

Gene Ontology database
•Created in response to these needs
•Aims at structured and consistent terms to describe gene function
•A “standardized” and “structured” vocabulary to describe genes and their products

Structure of GO
•Tree structure, with three main branches:
–“molecular function”: activities of gene product at molecular level, e.g., “DNA binding”
–“biological process”: process (series of events), e.g., “metabolism” or “oxygen transport”
–“cellular component”: e.g., “cell nucleus”
•Every GO term is a descendant of one of these three branches
•Tree structure captures natural hierarchy of terms, e.g., “metabolism” is an ancestor node of “amino acid metabolism”

GO database
•GO is not only a hierarchy of descriptive terms; it is also the assignment of one or more of these terms to genes
•Large numbers of genes in different organisms have been manually annotated with GO terms

Accessing GO
•Can download the entire hierarchy of terms, as well as the assignment of these terms to genes in a species.
•Can access through specialized interfaces, e.g.,
–GeneMerge for enrichment analysis of sets of genes
–AmiGO: Search by gene name (all terms for that gene) or by term name (all genes for that term)
http://amigo.geneontology.org/

Statistical Tests
•Enrichment analysis is a statistical test:
•We calculate the random chance of seeing something “as good” (e.g., intersection as large) as what is observed
•If this is very small, we “reject the null hypothesis” that the observation happened “by chance”

Statistical Testing with genes: another example
•Consider a single gene g. Take its expression values in experimental condition 1, and in experimental condition 2.
•Compare the two samples, e.g., calculate the difference of means.
•Null hypothesis: both samples come from populations with equal means.
•Is the difference of means significantly different from zero?
•A statistical test about gene g

Multiple genes
•Null hypothesis about equality of means is the “gene-wise” null hypothesis
•How do we extend this framework to test a set of genes (for difference of expression between the two conditions)?
•One natural extension: hypothesize equal means in both conditions for every gene. This is the complete null hypothesis.
Source: Stat. Methods in Bioinformatics; Ewens & Grant.

Complete Null Hypothesis
•But this may not be the hypothesis we are interested in accepting or rejecting: rejecting it doesn’t tell us which (subset of) genes are differentially expressed
•Rejecting the complete null hypothesis only tells us “some gene is not equally expressed in the two conditions”.

What we’d like
•A procedure that predicts which genes are differentially expressed
•Remember that “prediction” in this context is not a definitive prediction; there is some margin of error
•We’d like our procedure to provide us with an overall margin of error

To be more specific
•Let’s go back to the gene-wise null hypothesis H0
•$\Pr(X \ge k \mid H_0) < \alpha$
•If we observe random variable X to have a value of k or more, we reject H0, i.e., we predict that H0 is not true.
•However, this prediction has a margin of error:
•It is possible that H0 is true, yet we rejected it; in fact the probability of this “false positive” error is less than $\alpha$!
•So we are able to control the “false positive” rate in the single gene test.
•“Positive” because rejecting the null hypothesis usually incriminates the gene as being “interesting” in some way. “False” because H0 being true means that the prediction is false.

Controlling false positives
•In the multiple gene test procedure (not defined as yet), we would like to predict a set of genes as being interesting, i.e., as violating the null hypothesis.
•But we would like to have some
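
The tail probability P from the “Hypergeometric Distribution” slides can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: it assumes f(j; N, m, n) is the standard hypergeometric probability C(m, j) C(N-m, n-j) / C(N, n), and the function names `hypergeom_pmf` and `enrichment_p_value` are hypothetical.

```python
from math import comb


def hypergeom_pmf(j, N, m, n):
    """Probability that a random sample of n genes out of N contains
    exactly j of the m labeled genes (assumed standard hypergeometric pmf)."""
    # Outside this range the overlap size j is impossible.
    if j < max(0, n + m - N) or j > min(m, n):
        return 0.0
    return comb(m, j) * comb(N - m, n - j) / comb(N, n)


def enrichment_p_value(k, N, m, n):
    """P = sum over j >= k of f(j; N, m, n): the chance of an overlap
    of size k or greater under random sampling."""
    return sum(hypergeom_pmf(j, N, m, n) for j in range(k, min(m, n) + 1))


if __name__ == "__main__":
    # Made-up numbers for illustration only: 20000 genes in total,
    # 300 up-regulated in cancer, 500 labeled "cell division", overlap of 20.
    p = enrichment_p_value(k=20, N=20000, m=500, n=300)
    print(f"P = {p:.3g}")
```

If the printed P falls below the chosen threshold (e.g., 0.05), the overlap would be called statistically significant in the sense of the “Enrichment” analysis slide.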
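
The single-gene test from “Statistical Testing with genes: another example” can be sketched as follows. The slides do not say which test is used to judge the difference of means, so this sketch swaps in a permutation test as one possible choice; strictly, its null hypothesis is that the two conditions are exchangeable, which is somewhat stronger than equal means. The function name `permutation_p_value` and the expression values are hypothetical.

```python
import random
from statistics import mean


def permutation_p_value(sample1, sample2, num_permutations=10000, seed=0):
    """Two-sided permutation p-value for the difference of means of
    one gene's expression values in two conditions."""
    rng = random.Random(seed)
    observed = abs(mean(sample1) - mean(sample2))
    pooled = list(sample1) + list(sample2)
    n1 = len(sample1)
    at_least_as_extreme = 0
    for _ in range(num_permutations):
        # Randomly reassign the pooled values to the two conditions.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n1]) - mean(pooled[n1:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (num_permutations + 1)


if __name__ == "__main__":
    # Made-up expression values for a single gene g.
    condition1 = [2.1, 2.4, 1.9, 2.2, 2.0]
    condition2 = [3.0, 3.3, 2.8, 3.1, 2.9]
    p = permutation_p_value(condition1, condition2)
    print(f"p-value for gene g: {p:.4f}")
```

Rejecting the gene-wise null hypothesis when this p-value is below $\alpha$ keeps the false positive probability for that one gene near $\alpha$; it says nothing yet about the overall margin of error across many genes.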
