Identifying differentially expressed sets of genes in microarray experiments
Lecture 23, Statistics 246, April 15, 2004

A cartoon version of microarrays

List of differentially expressed genes (a long list of d.e. genes):

t      ID        Name
19.83  AA495790  ras homolog gene family
16.83  AA598794  connective tissue growth factor
15.22  AA488676  membrane attached signal protein 1
14.2   AI014487  insulin-like growth factor binding protein 102
13.62  R77252    microtubule-associated protein 7
13.6   AA598601  insulin-like growth factor binding protein 31
13.57  R09561    decay accelerating factor for complement (CD55)
13.38  AA875933  EGF-containing fibulin-like extracellular matrix protein 1
12.79  AA777187  cysteine-rich angiogenic inducer
12.63  AA598601  insulin-like growth factor binding protein 3
12.01  AA055835  caveolin 1, caveolae protein, 22kD
11.88  AA012944  insulin-like growth factor binding protein 102
10.86  AA936757  heparin-binding growth factor binding protein
10.86  AA995282  four and a half LIM domains 2
10.35  AA677403  glycoprotein hormones, alpha polypeptide
9.88   AA430032  pituitary tumor-transforming 1
9.32   AI935290  cysteine and glycine-rich protein 1
9.18   AA936757  heparin-binding growth factor binding protein2
9.06   AA424833  bone morphogenetic protein 6
9.02   AI985398  natriuretic peptide receptor C
8.51   AA630794  solute carrier family 3
8.38   H29897    phospholipase C beta 42
8.22   W72207    cystatin A (stefin A)
7.99   H45668    Kruppel-like factor 4 (gut)
7.95   AA600217  activating transcription factor 4
7.8    AA149095  dual specificity phosphatase 1
7.68   W73874    cathepsin L
7.61   R09561    decay accelerating factor for complement
7.53   AW028846  trefoil factor 2 (spasmolytic protein 1)
7.16   N66177    microphthalmia-associated transcription factor
7.14   H03346    protease serine 22
7.06   AA169469  pyruvate dehydrogenase kinase isoenzyme 4
6.96   AI989348  protein disulfide isomerase-related protein
6.94   H63077    annexin A1
6.92   AA610004  Homo sapiens putative oncogene protein
6.84   AA599145  ZW10 (Drosophila) homolog
6.78   AA521434  B-cell CLL/lymphoma 6
6.77   AA400128  general transcription factor II
6.68   T53298    insulin-like growth factor binding protein 7
6.67   T86983    complement component 1
6.6    AA027240  eukaryotic translation initiation factor 2
6.57   AA482117  Ras homolog enriched in brain 2
6.55   AA464849  thioredoxin reductase 1
6.55   AA400893  phosphodiesterase 1A, calmodulin-dependent
6.5    R91550    arginine-rich, mutated in early stage tumors
6.45   AA620433  dihydropyrimidinase-like 3
6.45   AA625628  accessory proteins BAP31/BAP29
6.41   T77733    tubulin gamma 1

From long lists of d.e. genes to biological understanding
What happens next?
- Select some genes for validation?
- Do follow-up experiments on some genes?
- Publish a huge table with the results?
- Try to learn about all the genes on the list (read 100s of papers)?
Usually some or all of the above will be done, and more. Can we help further at this stage?

Sets of genes
There are usually many sets of genes that might be of interest in a given microarray experiment. Examples include:
- genes in biological (e.g. biochemical, metabolic and signalling) pathways,
- genes associated with a particular location in the cell, or
- genes having a particular function or being involved in a particular process.
We could even include sets of genes for which all of the preceding are unknown, but which we have reason to believe could be of interest, typically from previous experiments.
In thinking like this, it is important to remember that many genes (that is, their protein products) can have multiple functions, or be involved in many processes, etc.
There are many databases of pathways (e.g. EcoCyc, KEGG), and it is not my intention to review them here. We will focus on the most important related concept: the GO.

The Gene Ontology Consortium
Ashburner et al., Nature Genetics 25: 25-29. http://www.geneontology.org
The goal of the Gene Ontology(TM) (GO) Consortium is to produce a controlled vocabulary that can be applied to all organisms, even as knowledge of gene and protein roles in cells is accumulating and changing. GO provides three structured networks of defined
terms to describe gene product attributes:
- Molecular Function Ontology (7304 terms as of April 5, 2004): the tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity.
- Biological Process Ontology (8517 terms): broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions.
- Cellular Component Ontology (1394 terms): subcellular structures, locations and macromolecular complexes; examples include nucleus, telomere and origin recognition complex.
(From the GO web site.)

The path back to each ontology from a gene
We will call each term in a path a split.

Structure of a GO annotation
Each gene can have several annotated GOs, and each GO can have several splits. E.g. DNA topoisomerase II alpha has 8 GO annotations and 11 splits.

Annotation of genes to a node in the ontology
Each node is also connected to many other related nodes.

Are sets of genes differentially expressed?
The sets we refer to here are all the outcomes of analyses. Later we discuss sets specified a priori.
Examples of sets:
- They could be the list of all genes whose differential expression (e.g. average M-value) exceeds a given threshold, typically a liberal one which would not correspond to any real significance (e.g. 1.5-fold).
- They might be clusters.
What do we mean by a set being differentially expressed? Here it is a convenient shorthand for being unusual in relation to all the genes represented on the array, for example by being functionally enriched, in the sense of having more genes of a given category than one would expect by chance.

GO and microarray gene sets
Hypothesis: functionally related, differentially expressed genes should accumulate in the corresponding GO group.
Problem: to find a method which scores accumulation of differential gene expression in a node of the GO.
We describe the calculation from the program Gostat. For all the genes analysed, it determines the annotated GO terms and all splits. It then counts the number of
appearances of each GO term for the genes in the set, as well as the number in the reference set, which is typically all genes on the array. Then a 2 x 2 table is formed (see over page) and a p-value calculated.

Is a GO term specific for a set?

Contingency table

                                                    genes with GO term   genes without GO term   total
count in set (e.g. differentially expressed genes)                  51                     416     467
count in reference set (e.g. all genes on array)                   125                    8588    8713
total                                                              173                    9004    9177

P value: 8 x 10^-52 (Fisher's exact test or chi-square test)
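
The counting and testing steps described for Gostat can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Gostat's actual code: the gene names, term labels and helper functions are hypothetical, the second row of the table is taken to be the genes outside the set so that the two rows are disjoint, and the test is a one-sided Fisher's exact test. The four cell counts from the slide are used as a second input; the margins are recomputed from those cells, and no p-value is quoted here because the result depends on which test and which reference counts are used.

```python
from math import comb


def count_with_term(genes, annotations, term):
    """Count the genes annotated with `term` (directly or through a split)."""
    return sum(1 for gene in genes if term in annotations.get(gene, set()))


def two_by_two(gene_set, other_genes, annotations, term):
    """Rows: genes in the set, other genes. Columns: with term, without term."""
    a = count_with_term(gene_set, annotations, term)
    c = count_with_term(other_genes, annotations, term)
    return [[a, len(gene_set) - a], [c, len(other_genes) - c]]


def fisher_upper_tail(table):
    """One-sided Fisher exact p-value: chance of at least `a` term genes in the set."""
    (a, b), (c, d) = table
    n_set, n_term, n_all = a + b, a + c, a + b + c + d
    largest = min(n_set, n_term)
    # Sum hypergeometric probabilities for a, a + 1, ..., largest (exact integers).
    favourable = sum(
        comb(n_term, k) * comb(n_all - n_term, n_set - k)
        for k in range(a, largest + 1)
    )
    return favourable / comb(n_all, n_set)


# Hypothetical toy annotation: gene -> set of GO terms (including splits).
annotations = {
    "g1": {"term_A", "term_B"},
    "g2": {"term_A"},
    "g3": {"term_B"},
    "g4": set(),
    "g5": {"term_A"},
    "g6": set(),
}
gene_set = ["g1", "g2"]                 # e.g. the differentially expressed genes
other_genes = ["g3", "g4", "g5", "g6"]  # assumed: remaining genes on the array

toy_table = two_by_two(gene_set, other_genes, annotations, "term_A")
toy_p = fisher_upper_tail(toy_table)

# The four cell counts shown in the contingency table on the slide.
slide_table = [[51, 416], [125, 8588]]
slide_p = fisher_upper_tail(slide_table)

print(toy_table, toy_p)
print(slide_table, slide_p)
```

Exact integer arithmetic via math.comb avoids underflow for very small tail probabilities; for large arrays and many GO terms a log-scale or library implementation would be a more practical choice.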