Computational Molecular Biology and Genomics
Dannie Durand
Fall 2006
Lecture 2

Panagiotis’ office hours
Mondays
– Noon – 1pm
– 1pm – 2pm
– 4pm – 5pm
– 5pm – 6pm
Conflict
Conflict

I speak quickly.
My handwriting is terrible.
Please interrupt and ask questions.
http://www.cs.cmu.edu/~durand/03-711/

Emerging problems in genomics
• The origins of computational biology
• Genomics: from genes to cells
–High-throughput functional genomics
–Computational genomics

Genes Encode Proteins
GTGCACCTGACTCCTGAG...
V H L T P E...
A gene is a DNA sequence
A protein is an amino acid sequence
A protein folds into a 3D structure

…aggaggcctcgcctctcccagcatgggctggggctcctgtcccccactgtgtgtgcctggggcctggccaggactcccagtga…
[Figure labels: chromosome, DNA, cell]
A gene is a locus on a chromosome

Genomes: The complete instruction set
Neisseria gonorrhoeae: A prokaryotic genome
Homo sapiens: A eukaryotic genome

Emerging problems in genomics
• The origins of computational biology
• Genomics: from genes to cells
–High-throughput functional genomics
–Computational genomics

The Origins of Computational Biology
[Timeline figure; axis: 1970, 1980, 1990, 1995]
Sequencing and genomics events:
• Sanger-Coulson sequencing
• Maxam-Gilbert sequencing
• Gilbert, Sanger win Nobel Prize
• Congress establishes GenBank
• Human Genome Project begins
• GenBank goes online
• First whole genome sequence
Internet events:
• ARPANET
• World Wide Web, Gopher
• NCSA Mosaic
• Pizza Hut goes online
• First royal email
• USENET newsgroups
• TCP/IP
• Internet

Whole Genome Sequencing Highlights
(A Eukarya-centric View)
1995 H. influenzae – 1st whole genome sequence
1997 Yeast – 1st eukaryotic sequence
1998 Caenorhabditis elegans – 1st multicellular organism
2000 Fly, Arabidopsis thaliana – 1st plant
2001 Human
2002 Mouse, Ciona intestinalis
2003 Caenorhabditis briggsae, Neurospora crassa
2004 Five more yeasts, silkworm, rat, C. 
merolae, Tetraodon
2005 Dictyostelium, zebrafish
In the pipeline: Chicken, fugu, rice, dog, cat, chimpanzee, more fruit flies…
413 whole genome sequences: 41 eukarya, 28 archaea, 344 bacteria
In progress: 1034 prokaryotic genomes, 629 eukaryotic genomes
www.genomesonline.org

Whole Genome Sequencing
ATGCACCTGACTCCTGAG...
Gene sequences
Computational Molecular Biology
Computational analysis of a few genes:
– Sequence analysis
• Pairwise alignment, database searching
• Multiple alignment
• Motifs, HMMs
– Reconstruct evolutionary history
– Structure prediction and modeling

GTGCACCTGACTCCTGAG...
Gene sequences
Genomic sequences
Computational Genomics
Computational implications:
– Need algorithms that scale up
– New biological questions → new computational problems

Outline
• The origins of computational biology
• Genomics: from genes to cells
–High-throughput functional genomics
–Computational genomics

The Fantasy
TGAAATAAACAACCAGGCAGCAGTTATTAACACGGGAACATGGCGGCCGCAGCCTGGGCTCCCGCGGCGGCGGCGG…
[Figure labels: Whole genome sequence, Cell Simulator Compiler, Cell Function Simulator]

From genes to cells
• Predict
– all genes
– all gene products (protein, RNA)
– regulatory motifs
• Predict structure and function of individual components
• Reconstruct the cellular networks
– Regulatory pathways
– Metabolic pathways
– Signaling pathways …
• Model cellular behavior

From genes to cells
New computational methods
• New, better algorithms
• Use data in new ways
Comparative genomics
• Genomic sequence
• Gene content
• Gene order
Combine different types of data
New high-throughput data
• mRNA expression
• Splice variants
• Protein expression
• Sub-cellular localization
• Protein-protein interactions
• Protein-DNA interactions

From Genes to Cells
Computational Functional Genomics
High-throughput functional assays
Computational support for
• data acquisition
• data analysis
High-throughput sequencing
Computational support for
• data acquisition
• data analysis

Emerging problems in genomics
• The origins of computational biology
• Genomics: from genes to 
cells
–High-throughput functional genomics
–Computational genomics

High-throughput Functional Genomics
• mRNA expression
• Splice variants
• Protein expression
• Sub-cellular localization
• Protein-protein interactions
• Protein-DNA interactions

[Figure labels: genes, mRNAs]
Determine the set of all genes being transcribed in a given cell type under particular conditions

Alternate splice forms:
DNA: exon1 exon2 exon3 exon4 exon5 exon6
RNA transcript: exon1 exon2 exon3 exon4 exon5 exon6
[Figure: splice forms; exon labels shown: exon1 exon2 exon3 exon5; exon1 exon2 exon3 exon4; exon6]
Determine the set of splice variants in a given cell type under particular conditions

Assays for detecting RNA transcripts
→ Expressed Sequence Tags (ESTs)
– small pieces of DNA sequence (usually 200 to 500 nucleotides long)
– generated by sequencing either one or both ends of an expressed gene.
• DNA microarrays

Expressed Sequence Tags (ESTs)
mRNA: CAUGACUCCUUGGCUAC...CCGAGUGCGGCAAAAAAA
Capture an expressed gene

Expressed Sequence Tags (ESTs)
mRNA: CAUGACUCCUUGGCUAC...CCGAGUGCGGCAAAAAAA
cDNA: GTACTGAGGAACCGATG...GGCTCACGCCGTTTTTTT (reverse transcriptase)
complementary DNA
No introns, shows splice variants

Expressed Sequence Tags (ESTs)
mRNA: CAUGACUCCUUGGCUAC...CCGAGUGCGGCAAAAAAA
cDNA: GTACTGAGGAACCGATG...GGCTCACGCCGTTTTTTT (reverse transcriptase)
degradation of mRNA, synthesis of second DNA strand
dsDNA (double-stranded DNA):
CATGACTCCTTGGCTAC...CCGAGTGCGGCAAAAAAA
GTACTGAGGAACCGATG...GGCTCACGCCGTTTTTTT
forward primer: 5’ EST
reverse primer: 3’ EST
Sequence fragments from 5’ and/or 3’ end

Expressed Sequence Tags
– Single-pass sequencing of “random” transcripts
– Relatively low quality sequence
– 5’ or 3’ end
– Tissue specific
– No guarantee
• that all genes are represented
• that all splice forms are represented
[Figure labels: 5’ ESTs, mRNA, 3’ ESTs]
ESTs: molecular tags for genes.

ESTs
– A fast way to capture the coding portion of the genome. (In eukaryotes, most of the genome does not contain protein coding genes.)
– Provide a crude measure of transcript abundance. 
However, rare transcripts may be missed.
– Provide a crude measure of splice variants (if at the 3’ or 5’ end of the gene).

Assays for detecting RNA transcripts
• Expressed Sequence Tags (ESTs)
– small pieces of DNA sequence (usually 200 to 500 nucleotides long)
– generated by sequencing either one or both ends of an expressed gene.
→ DNA microarrays
– use DNA hybridization (physical pattern matching) to capture transcripts that contain a known sequence

DNA microarrays
Targets: Each well contains a cDNA oligonucleotide corresponding to a unique subsequence of a gene
[Figure labels: cgtaacgctat, DNA]
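
The slide describes hybridization as physical pattern matching. A minimal sketch of that idea follows, as a toy model rather than a description of any real array platform. It assumes a sample strand binds a spot only when it contains the exact reverse complement of the spot's oligonucleotide; real hybridization tolerates mismatches and depends on experimental conditions. The function names, the gene names, the geneB oligo, and the sample strands are all hypothetical; only the oligo cgtaacgctat comes from the slide.

```python
# Toy model of microarray hybridization as exact string matching.
# Assumption: a strand binds a spot if it contains the exact
# reverse complement of the spot's oligonucleotide.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def hybridizes(spot_oligo: str, sample_strand: str) -> bool:
    """True if the sample strand contains the oligo's reverse complement."""
    return reverse_complement(spot_oligo) in sample_strand


def read_array(spots: dict, sample: list) -> dict:
    """Count, for each spot, how many sample strands bind to it."""
    return {
        gene: sum(hybridizes(oligo, strand) for strand in sample)
        for gene, oligo in spots.items()
    }


# Hypothetical array: one oligo per gene (geneA uses the slide's example).
SPOTS = {
    "geneA": "CGTAACGCTAT",
    "geneB": "GGCATTCAGGA",
}

# Hypothetical labeled sample strands.
SAMPLE = [
    "GGATAGCGTTACGTT",  # contains the reverse complement of geneA's oligo
    "ATAGCGTTACG",      # exactly the reverse complement of geneA's oligo
    "TTTTGGGGCCCC",     # binds neither spot
]

print(read_array(SPOTS, SAMPLE))  # {'geneA': 2, 'geneB': 0}
```

The per-spot counts play the role of spot intensity: a spot with more bound strands would appear brighter, which is how an array gives a rough readout of which known sequences are present in the sample.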