1. Computational Approaches to Functional Genomics
Olga Troyanskaya
Assistant Professor
Lewis-Sigler Institute for Integrative Genomics & Department of Computer Science
Princeton University
Laboratory of Bioinformatics & Functional Genomics

2. A primer: Molecular biology 101

3. Cells are fundamental working units of all organisms

4. Yeast are unicellular organisms
Humans are multi-cellular organisms
Understanding how a cell works is critical to understanding how the organism functions

5. DNA
• Uses an alphabet of 4 letters {A, T, C, G}, called bases
• Encodes genetic information in a triplet code
• Structure: a double helix

6. Proteins
• A sequence of amino acids (alphabet of 20)
• Each amino acid is encoded by 3 DNA bases
• Perform most of the actual work in the cell
• Fold into complex 3D structures
[Figure: courtesy of the Zhou Laboratory, The State University of New York at Buffalo]

7. How does a cell function?
• DNA is a sequence of bases {A, T, C, G}: TAT-CGT-AGT
• Proteins consist of amino acids, whose sequence is encoded in DNA: Tyr-Arg-Ser
• Each 3 bases of DNA encode 1 amino acid
[Figure: courtesy U.S. Department of Energy Genomes to Life program]

8. DNA-RNA-protein

9. Genes vs. proteins
• Genes are units of inheritance
• They are static blueprints
• It's proteins (dynamic) that do most of the work
• The process of making mRNA, and then protein, from a gene (or genes) is called GENE EXPRESSION
• It's the control of gene expression that causes most phenotypic differences in organisms

10. Gene Regulatory Circuit
• Genes =? wires
• Motifs =? gates
[Figure: regulatory circuit. Gene D has inputs A, B, C with rules "If C then D", "If B then NOT D", "If A and B then D" and output "Make D". Gene B has input D with rule "If D then B" and output "Make B".]

11. The "greatness" of genomics…
• Biological systems are complex
• Many biological processes & diseases result from complex changes on the molecular level
• Need to observe & model cellular processes on a systems level
High-throughput technologies have led to an explosion of data in biology in hopes of understanding biological systems

12. … And its "downfall"
Explosion of functional genomic DATA
KNOWLEDGE of components and inter-relationships that lead to function?

13. Why have genomic data not been utilized fully?
Challenges:
• Genomic data are noisy
• Genomic data are heterogeneous
• Coverage/accuracy varies by biological process

14. Computation is a tool for functional genomics
Computational methods (and targeted experiments) can greatly aid in extracting knowledge from biological data, but several challenges must be addressed:
Our approach:
(1) Integrated analysis of diverse data
(2) Probabilistic methods to battle noise in data
(3) Integrating computation and experiments
(4) Accessibility and usefulness to community (bringing experts into the analysis loop and feedback to experimental biology)

15. Story #1: predicting function of unknown proteins

16. Predicting gene function using the Gene Ontology hierarchy
• There are a number of previous approaches to function prediction from diverse data; most use GO biological process terms
• However, GO is a hierarchy
• Could improve accuracy by enforcing hierarchical consistency
[Figure: GO fragment with nodes Biological Process, Regulation, Cellular Process, Regulation of Cellular Process, Cell Differentiation, Unknown]

17. Hierarchical Consistency
[Figure: TRAINING and EVALUATION panels over "All genes"; example predictions: cytokinesis NO, bud site selection YES, cell proliferation YES]

18. Our Method
• Individual classifiers for each class
  • Inconsistent predictions allowed
  • Any classification algorithm can be used
  • Parallel evaluation
• Bayesian combination of predictions
  • Inconsistencies resolved globally
  • Any inference algorithm can be used

19. A Bayesian Framework
[Figure: Bayesian network with label nodes y1...y4 and classifier-output nodes g1...g4 over the GO terms mRNA processing, mRNA metabolism, RNA processing, RNA metabolism]
Given predictions $g_1, \dots, g_N \in \mathbb{R}$, find true labels $y_1, \dots, y_N \in \{0, 1\}$ that maximize
$P(y_1, \dots, y_N \mid g_1, \dots, g_N) = \alpha \, P(g_1, \dots, g_N \mid y_1, \dots, y_N) \, P(y_1, \dots, y_N)$

20. Data Types (for Saccharomyces cerevisiae)
• The Gene Ontology
  • 105 "meaningful" nodes selected
• Pairwise Interaction (GRID)
  • Affinity Precipitation
  • Affinity Chromatography
  • Two-Hybrid
  • Purified Complex
  • Biochemical Assay
  • Synthetic Lethality
  • Synthetic Rescue
  • Dosage Lethality
• Colocalization
  • O'Shea
• Curated Complexes
(152 features)
• Transcription Factor Binding Sites
  • PROSPECT
(39 features)
• Microarrays (SMD)
  • Spellman et al., 1998
  • Gasch et al., 2000, 2001
  • Sudarsanam et al., 2000
  • Yoshimoto et al., 2002
  • Chu et al., 1998
  • Shakoury-Elizeh et al., 2003
  • Ogawa et al., 2000
(342 features)

21. Does hierarchical consistency help?
• For each class, 10 linear SVMs trained by bootstrapping
• Median of unthresholded outputs used (bagging)
• Area under the ROC curve (AUC) for evaluation
• 93 of 105 nodes (86%) are improved by Bayesian correction
• Best ΔAUC = +0.346 (+63% of old AUC)
• Worst ΔAUC = -0.031 (-3% of old AUC)
• Average ΔAUC = +0.033 (+4% of old AUC)

22. Most processes improve in accuracy (AUC Scatter Plot)

23. AUC Changes

24. Held-out Example: YNL261W
[Figure panels: Raw SVM outputs (Raw SVM Predictions); Bayes-marginal probabilities (Bayes Net Probabilities)]

25. Verification: New Data
• GO since our April 2004 snapshot
  • 105 new annotations for 88 genes
• Predictions over the 88 genes on our data
  • Independent SVMs
    • 32% precision, 7% recall
  • Bayesian correction
    • 32% precision, 20% recall
    • 51% precision, 7% recall

26. Predictions of novel proteins involved in mitosis
Lab testing of some predictions for mitosis:
• YMR144W - "mitotic chromosome segregation"
  • Large-budded YMR144WΔ cells -> frequent nuclear defects
• YOR315W - "mitotic spindle assembly"
  • Large-budded YOR315WΔ cells (fixed; anti-α-tubulin antibody) -> frequent misaligned spindles and nuclear defects
• YMR299C - "mitotic cell cycle"
  • Lee et al. (2005) showed that the YMR299C protein is part of a dynein pathway
Independent SVMs miss these.

27. Experimental validation
[Figure panels: YMR144WΔ, YOR315WΔ, Wild Type]

28. Summary
• Using multiple information sources helps prediction accuracy
  • Multiple diverse data sources
  • Using the Gene Ontology hierarchy
• Probabilistic and machine learning approaches can generate experimentally testable predictions
• Our hierarchical consistency approach increases accuracy and generates novel predictions

29. Story #2: predicting biological networks

30. Functional genomic DATA
KNOWLEDGE of components and inter-relationships that lead to function?
Specific goal: building biological networks from experimental data
• Gene expression
• Physical protein-protein interactions
• Genetic interactions
• Cellular localization
• Sequence
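
Worked example for the Bayesian framework of slide 19: the sketch below shows, on a toy scale, how per-term classifier outputs g can be combined into hierarchy-consistent labels y by maximizing P(y | g) = α P(g | y) P(y). Several pieces are assumptions made only for illustration and are not taken from the lecture: the four-term hierarchy and its parent links, a prior P(y) that is uniform over consistent labelings and zero otherwise, a likelihood that factorizes per term as a Gaussian centered at +1 (y = 1) or -1 (y = 0), the made-up raw outputs, and brute-force enumeration in place of a real inference algorithm.

```python
from itertools import product
from math import exp

# Toy hierarchy (child -> parents). Term names come from the slide-19 figure;
# the parent links are an assumption made for this illustration.
PARENTS = {
    "RNA metabolism": [],
    "RNA processing": ["RNA metabolism"],
    "mRNA metabolism": ["RNA metabolism"],
    "mRNA processing": ["RNA processing", "mRNA metabolism"],
}
TERMS = list(PARENTS)


def is_consistent(labels):
    """A labeling is consistent if every positive term has all parents positive."""
    return all(
        not labels[child] or all(labels[p] for p in parents)
        for child, parents in PARENTS.items()
    )


def likelihood(g, y, sigma=1.0):
    """Assumed P(g | y): unnormalized Gaussian around +1 (y=1) or -1 (y=0)."""
    mean = 1.0 if y else -1.0
    return exp(-((g - mean) ** 2) / (2 * sigma ** 2))


def posterior(outputs):
    """Enumerate consistent labelings; return per-term marginals and the MAP labeling."""
    weights = {}
    for bits in product([0, 1], repeat=len(TERMS)):
        labels = dict(zip(TERMS, bits))
        if not is_consistent(labels):
            continue  # prior probability 0 for inconsistent labelings
        w = 1.0
        for term in TERMS:
            w *= likelihood(outputs[term], labels[term])
        weights[bits] = w
    total = sum(weights.values())  # 1 / alpha, the normalizing constant
    marginals = {
        term: sum(w for bits, w in weights.items() if bits[i]) / total
        for i, term in enumerate(TERMS)
    }
    best = max(weights, key=weights.get)
    return marginals, dict(zip(TERMS, best))


# Hypothetical raw classifier outputs: the child term scores high while the
# root scores slightly negative, which is inconsistent if thresholded at 0.
raw = {
    "RNA metabolism": -0.2,
    "RNA processing": 0.9,
    "mRNA metabolism": 0.1,
    "mRNA processing": 1.4,
}
marginals, map_labels = posterior(raw)

for term in TERMS:
    print(f"{term:16s} raw={raw[term]:+.1f} P(y=1|g)={marginals[term]:.3f} MAP={map_labels[term]}")
```

Because inconsistent labelings receive zero prior mass, a parent's marginal probability can never fall below that of its child, so strong evidence for a specific term propagates up to its ancestors. Enumeration costs 2^N and is only practical for a handful of terms; for a hierarchy of 105 nodes a proper Bayesian network inference routine would be needed.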



Princeton COS 424 - Lecture
