Gene Recognition
CS262 Lecture 9, Win07, Batzoglou

Slide titles: Gene structure; Needles in a Haystack; Gene Finding; Signals for Gene Finding; Exon and Intron Lengths; Nucleotide Composition; Splice Sites; HMMs for Gene Recognition; Duration HMMs for Gene Recognition; Genscan; Using Comparative Information; Patterns of Conservation; Comparison-based Gene Finders; Twinscan; SLAM - Generalized Pair HMM; NSCAN - Multiple Species Gene Prediction; Performance Comparison; CONTRAST; CONTRAST - Features; CONTRAST - SVM accuracies; CONTRAST - Decoding; CONTRAST - Training

Gene structure
[Figure labels: exon1, intron1, exon2, intron2, exon3; transcription, splicing, translation]
exon = protein-coding
intron = non-coding
Codon: A triplet of nucleotides that is converted to one amino acid

Needles in a Haystack
[Figure]

Gene Finding
•Classes of Gene predictors
  Ab initio
  •Only look at the genomic DNA of target genome
  De novo
  •Target genome + aligned informant genome(s)
  EST/cDNA-based & combined approaches
  •Use aligned ESTs or cDNAs + any other kind of evidence

[Figure: gene track with five EXON blocks above a multiple alignment; uppercase = exon, lowercase = flanking intron]
Human     tttcttagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta
Macaque   tttcttagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta
Mouse     ttgcttagACTTTAAAGTTGTCAAGCCGCGTTCTTGATAAAATAAGTATTGGACAACTTGTTAGTCTTCTTTCCAACAACCTGAACAAATTTGATGAAgtatgta-cca
Rat       ttgcttagACTTTAAAGTTGTCAAGCCGTGTTCTTGATAAAATAAGTATTGGACAACTTATTAGTCTTCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccca
Rabbit    t--attagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGGCAACTTATTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta
Dog       t-cattagACTTTAAAGCTGTCAAGCCGTGTTCTGGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTCGATGAAgtatgtaccta
Cow       t-cattagACTTTGAAGCTATCAAGCCGTGTTCTGGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgta-cta
Armadillo gca--tagACCTTAAAACTGTCAAGCCGTGTTTTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtgccta
Elephant  gct-ttagACTTTAAAACTGTCCAGCCGTGTTCTTGATAAAATAAGTATTGGACAACTTGTCAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtatcta
Tenrec    tc-cttagACTTTAAAACTTTCGAGCCGGGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtatcta
Opossum   ---tttagACCTTAAAACTGTCAAGCCGTGTTCTAGATAAAATAAGCACTGGACAGCTTATCAGTCTCCTTTCCAACAATCTGAACAAGTTTGATGAAgtatgtagctg
Chicken   ----ttagACCTTAAAACTGTCAAGCAAAGTTCTAGATAAAATAAGTACTGGACAATTGGTCAGCCTTCTTTCCAACAATCTGAACAAATTCGATGAGgtatgtt--tg

Signals for Gene Finding
1. Regular gene structure
2. Exon/intron lengths
3. Codon composition
4. Motifs at the boundaries of exons, introns, etc.
   Start codon, stop codon, splice sites
5. Patterns of conservation
6. Sequenced mRNAs
7. (PCR for verification)

[Figure labels: Next Exon: Frame 0; Next Exon: Frame 1]

Exon and Intron Lengths
[Figure]

Nucleotide Composition
•Base composition in exons is characteristic due to the genetic code

Amino Acid      SLC   DNA Codons
Isoleucine      I     ATT, ATC, ATA
Leucine         L     CTT, CTC, CTA, CTG, TTA, TTG
Valine          V     GTT, GTC, GTA, GTG
Phenylalanine   F     TTT, TTC
Methionine      M     ATG
Cysteine        C     TGT, TGC
Alanine         A     GCT, GCC, GCA, GCG
Glycine         G     GGT, GGC, GGA, GGG
Proline         P     CCT, CCC, CCA, CCG
Threonine       T     ACT, ACC, ACA, ACG
Serine          S     TCT, TCC, TCA, TCG, AGT, AGC
Tyrosine        Y     TAT, TAC
Tryptophan      W     TGG
Glutamine       Q     CAA, CAG
Asparagine      N     AAT, AAC
Histidine       H     CAT, CAC
Glutamic acid   E     GAA, GAG
Aspartic acid   D     GAT, GAC
Lysine          K     AAA, AAG
Arginine        R     CGT, CGC, CGA, CGG, AGA, AGG

[Figure text, as extracted: atgatgtgatgaggtgagggtgagggtgagggtgagggtgagggtgagcaggtgcaggtgcagatgcagatgcagttgcagttgcaggcccaggccggtgagggtgag]

Splice Sites
[Figure: sequence logos] (http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)

HMMs for Gene Recognition
GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA
[Figure: the sequence is annotated with the regions intergene, exon, intron, exon, intron, exon, intergene; HMM states shown: Intergene State, First Exon State, Intron State]

Duration HMMs for Gene Recognition
[Figure: sequence with Exon1, Exon2, Exon3; an exon spans positions j+2 to i and has Duration d]
$P_{\text{INTRON}}(x_i \mid x_{i-1} \ldots x_{i-w})$
$P_{\text{EXON\_DUR}}(d)$
$P_{\text{EXON}((i - j + 2) \bmod 3)}(x_i \mid x_{i-1} \ldots x_{i-w})$
$P_{5'\text{SS}}(x_{i-3} \ldots x_{i+4})$
$P_{\text{STOP}}(x_{i-4} \ldots x_{i+3})$

Genscan
•Burge, 1997
•First competitive HMM-based gene finder, huge accuracy jump
•Only gene finder at the time to predict partial genes and genes in both strands
Features
–Duration HMM
–Four different parameter sets
  •Very low, low, med, high GC-content

Using Comparative Information
[Figure]
•Hox cluster is an example where everything is conserved

Patterns of Conservation
             Mutations   Gaps      Frameshifts
Genes        30%         1.3%      0.14%
Intergenic   58%         14%       10.2%
Separation   2-fold      10-fold   75-fold

Comparison-based Gene Finders
•Rosetta, 2000
•CEM, 2000
  –First methods to apply comparative genomics (human-mouse) to improve gene prediction
•Twinscan, 2001
  –First HMM for comparative gene prediction in two genomes
•SLAM, 2002
  –Generalized pair-HMM for simultaneous alignment and gene prediction in two genomes
•NSCAN, 2006
  –Best method to-date based on a phylo-HMM for multiple genome gene prediction

Twinscan
1. Align the two sequences (e.g. from human and mouse)
2. Mark each human
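Two of the signals listed under "Signals for Gene Finding" (splice-site motifs and stop codons) can be checked directly on the human row of the alignment shown on the "Gene Finding" slide. The following is a minimal sketch, not how any of the gene finders above work internally. It assumes that uppercase marks the exon and lowercase the flanking intron, as in that alignment; the function names `exon_with_flanks` and `in_frame_stops` are hypothetical helpers introduced here for illustration.

```python
import re

# Standard stop codons (DNA alphabet).
STOP_CODONS = {"TAA", "TAG", "TGA"}

# Human row of the alignment; uppercase = exon, lowercase = intronic flank
# (assumption based on how the slide's alignment is drawn).
HUMAN = "tttcttagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta"


def exon_with_flanks(seq):
    """Return (acceptor dinucleotide, exon, donor dinucleotide).

    The exon is taken to be the first run of uppercase letters; the
    acceptor is the 2 bases just before it, the donor the 2 bases after.
    """
    match = re.search(r"[A-Z]+", seq)
    if match is None:
        raise ValueError("no uppercase (exon) run found")
    acceptor = seq[max(match.start() - 2, 0):match.start()]
    donor = seq[match.end():match.end() + 2]
    return acceptor, match.group(0), donor


def in_frame_stops(exon, frame):
    """Count stop codons when the exon is read starting at offset `frame`."""
    codons = (exon[i:i + 3] for i in range(frame, len(exon) - 2, 3))
    return sum(codon in STOP_CODONS for codon in codons)


if __name__ == "__main__":
    acceptor, exon, donor = exon_with_flanks(HUMAN)
    print("acceptor:", acceptor, "donor:", donor, "exon length:", len(exon))
    for frame in range(3):
        print("frame", frame, "in-frame stops:", in_frame_stops(exon, frame))
```

A reading frame with no in-frame stop codons is a candidate coding frame for an internal exon. An HMM-based finder would score such signals probabilistically rather than apply them as hard rules.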