U of I CS 498 - Statistical Testing with Genes

Statistical Testing with Genes
Saurabh Sinha
CS 498 SS

Hypergeometric test
• A population of N genes
• Microarray data identified a subset of n genes as being up-regulated in cancer
• Also know a set of m genes involved in cell division
• Want to test for association between cancer and cell division

Hypergeometric test
[Figure: Venn diagram of the population (N), the cancer set (n), the cell division set (m), and their intersection, cancer and cell division (k)]
Is the intersection (size “k”) significantly large, given N, m, n?

Hypergeometric Distribution
Given that m of N genes are labeled “cell division”. If we picked a random sample of n genes, how likely is an intersection equal to k?

Hypergeometric Distribution
Given that m of N genes are labeled “cell division”. If we picked a random sample of n genes, how likely is an intersection equal to or greater than k?
$$P = \sum_{j \ge k} f(j; N, m, n)$$
where f(j; N, m, n) is the probability that the intersection has size exactly j.

“Enrichment” analysis
• If P calculated this way is below some threshold α, e.g., 0.05, we say that the association between the cancer set and the cell division set is statistically significant.
• In general, such “enrichment analysis”
– has a set of genes with known function (or other property, e.g., “fast evolving”)
– has a set of genes identified by the particular study (e.g., microarray study of cancer)
– tests if the identified genes are enriched for the function or property

Gene Ontology
• Gene Ontology (GO) is a highly popular source of “known gene sets”, e.g., gene sets with common known function
Source: “Bioinformatics” - Polanski and Kimmel

Motivation for GO
• Large numbers of genes in different organisms, with large numbers of protein products and functions
• Several databases (often species specific) store such information (e.g., FlyBase)
• Need a way to organize all this information from all these databases

Consistent and Structured organization
• Browsing through genome databases like FlyBase is powerful, but …
• … there is a lack of consistency from one database to another. (The “interface” is not the same.)
• Annotations (e.g., functions of genes) could be more “structured”, for much easier and automated access
– e.g., GoogleScholar/Pubmed versus Bibliography software/databases

Gene Ontology database
• In response to these needs.
• Aims at structured and consistent terms to describe gene function
• A “standardized” and “structured” vocabulary to describe genes and their products

Structure of GO
• Tree structure, with three main branches:
– “molecular function”: activities of gene product at molecular level, e.g., “DNA binding”
– “biological process”: process (series of events), e.g., “metabolism” or “oxygen transport”
– “cellular component”: e.g., “cell nucleus”
• Every GO term is a descendant of one of these three branches
• Tree structure captures natural hierarchy of terms, e.g., “metabolism” is an ancestor node of “amino acid metabolism”

GO database
• GO is not only a hierarchy of descriptive terms, it is also the assignment of one or more of these terms to genes
• Large numbers of genes in different organisms have been manually annotated with GO terms

Accessing GO
• Can download entire hierarchy of terms, as well as assignment of these terms to genes in a species.
• Can access through specialized interfaces, e.g.,
– GeneMerge (Assignment 5) for enrichment analysis of sets of genes
– AmiGO: Search by gene name (all terms for that gene) or by term name (all genes for that term)
http://www.genedb.org/amigo/perl/go.cgi

Statistical Tests
• Enrichment analysis is a statistical test:
• We calculate the random chance of seeing something “as good” (e.g., intersection as large) as what is observed
• If this is very small, we “reject the null hypothesis” that the observation happened “by chance”

Statistical Testing with genes: another example
• Consider a single gene g. Take its expression values in experimental condition 1, and in experimental condition 2.
• Compare the two samples, e.g., calculate the difference of means.
• Null hypothesis: both samples come from populations with equal means.
• Is the difference of means significantly different from zero?
• A statistical test about gene g

Multiple genes
• Null hypothesis about equality of means is the “gene-wise” null hypothesis
• How do we extend this framework to test a set of genes (for difference of expression between the two conditions)?
• One natural extension: Declare equality in both conditions for every gene. This is the complete null hypothesis.
Source: Stat. Methods in Bioinformatics; Ewens & Grant.

Complete Null Hypothesis
• But this may not be the hypothesis we are interested in accepting or rejecting: rejecting it doesn’t tell us which (subset of) genes are differentially expressed
• Rejecting the complete null hypothesis only tells us “some gene is not equally expressed in the two conditions”.

What we’d like
• A procedure that predicts which genes are differentially expressed
• Remember that “prediction” in this context is not a definitive prediction; there is some margin of error
• We’d like our procedure to provide us with an overall margin of error

To be more specific
• Let’s go back to the gene-wise null hypothesis H0
• Pr(X ≥ k | H0) < α
• If we observe random variable X to have a value of k or more, we reject H0, i.e., we predict that H0 is not true.
• However, this prediction has a margin of error: α
• It is possible that H0 is true, yet we rejected it; in fact the probability of this “false positive” error is at most α!
• So we are able to control the “false positive” rate in the single gene test.
• “Positive” because rejecting the null hypothesis usually incriminates the gene as being “interesting” in some way. “False” because H0 being true means that the prediction is false.

Controlling false positives
• In the multiple gene test procedure (not defined as yet), we would like to predict a set of genes as being interesting, i.e., as violating the null hypothesis.
• But we would like to have some control over (i.e., some idea of) the false positive rate

False Discovery Rate
• Is one such procedure
• The final outcome will be a set of genes predicted to be differentially expressed
• We will have some control on the proportion of false positives among these predicted genes
• Say there are 10,000 genes and 100 are truly differentially expressed. It may be OK to predict some set of 100 genes, with the disclaimer that 50 of these may be false positives.
– False Discovery Rate of 50% may be OK!

Some definitions
• Consider the genes for which null
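
Returning to the hypergeometric test from the first slides, the tail probability P (an intersection of size k or more) can be sketched in a few lines. This is a minimal illustration, not the course's implementation: the function names `hypergeom_pmf` and `enrichment_pvalue` are hypothetical, and f(j; N, m, n) is taken to be the standard hypergeometric probability C(m, j) C(N-m, n-j) / C(N, n).

```python
from math import comb


def hypergeom_pmf(j, N, m, n):
    """Probability that a random sample of n genes out of N contains
    exactly j of the m labeled genes (hypothetical helper name)."""
    # Outside the feasible range the probability is zero.
    if j < max(0, n + m - N) or j > min(m, n):
        return 0.0
    return comb(m, j) * comb(N - m, n - j) / comb(N, n)


def enrichment_pvalue(k, N, m, n):
    """Tail probability P: chance of an intersection of size k or more."""
    return sum(hypergeom_pmf(j, N, m, n) for j in range(k, min(m, n) + 1))


if __name__ == "__main__":
    # Made-up numbers for illustration only: 10,000 genes, 200 labeled
    # "cell division", 100 up-regulated in cancer, 8 in the intersection.
    p = enrichment_pvalue(k=8, N=10_000, m=200, n=100)
    print(f"P = {p:.3g}; significant at alpha = 0.05: {p < 0.05}")
```

Exact integer binomials are used so that large N does not cause rounding problems before the final division; a statistics library would normally be used in practice.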

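
For the single-gene example (expression of gene g in two conditions, compared by the difference of means), the slides do not say which test is used. The sketch below swaps in an exact permutation test as one possible choice: it relabels the pooled values in every possible way and asks how often the absolute difference of means is at least as large as the observed one. Strictly, this tests whether the condition labels are exchangeable, which is a slightly stronger null than equal means. The function name `permutation_pvalue` and the sample values are hypothetical.

```python
from itertools import combinations
from statistics import mean


def permutation_pvalue(cond1, cond2):
    """Two-sided exact permutation p-value for the difference of means
    (hypothetical helper; practical only for small samples)."""
    pooled = list(cond1) + list(cond2)
    n1 = len(cond1)
    observed = abs(mean(cond1) - mean(cond2))
    indices = range(len(pooled))
    total = 0
    extreme = 0
    # Enumerate every way of assigning n1 of the pooled values to condition 1.
    for chosen in combinations(indices, n1):
        chosen_set = set(chosen)
        group1 = [pooled[i] for i in chosen]
        group2 = [pooled[i] for i in indices if i not in chosen_set]
        diff = abs(mean(group1) - mean(group2))
        total += 1
        # Small tolerance guards against floating-point ties.
        if diff >= observed - 1e-12:
            extreme += 1
    return extreme / total


if __name__ == "__main__":
    # Made-up expression values for one gene in two conditions.
    print(permutation_pvalue([1.0, 2.0, 3.0], [10.0, 11.0, 12.0]))
```

With only three values per condition there are 20 relabelings, so the smallest achievable p-value is 2/20; more replicates are needed before a threshold such as 0.05 can be reached.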

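
The arithmetic behind the multiple-gene slides can be made concrete. The sketch below assumes that each gene-wise test has false positive rate exactly α and, for the "at least one" calculation, that the tests are independent; neither assumption comes from the slides. It uses the slides' own numbers (10,000 genes, 100 truly differentially expressed, 100 predicted with 50 false positives) and the example threshold α = 0.05. The function names are hypothetical.

```python
def expected_false_positives(num_true_null, alpha):
    """Expected number of true-null genes rejected when each gene-wise
    test has false positive rate alpha."""
    return num_true_null * alpha


def prob_at_least_one_false_positive(num_true_null, alpha):
    """Chance of one or more false positives, ASSUMING independent tests."""
    return 1.0 - (1.0 - alpha) ** num_true_null


def false_discovery_proportion(num_predicted, num_false_positives):
    """Fraction of the predicted genes that are false positives."""
    if num_predicted == 0:
        return 0.0
    return num_false_positives / num_predicted


if __name__ == "__main__":
    total_genes, truly_different, alpha = 10_000, 100, 0.05
    true_null = total_genes - truly_different
    print(expected_false_positives(true_null, alpha))
    print(prob_at_least_one_false_positive(true_null, alpha))
    print(false_discovery_proportion(num_predicted=100, num_false_positives=50))
```

Under these assumptions, testing each gene at α = 0.05 is expected to flag several hundred true-null genes, far more than the 100 truly different ones, which is why a set-level error measure such as the false discovery rate is useful.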