Gene Recognition
Credits for slides: Serafim Batzoglou, Marina Alexandersson, Lior Pachter, Serge Saxonov

The Central Dogma
DNA (transcription) → RNA (translation) → Protein
DNA:     CCTGAGCCAACTATTGATGAA
RNA:     CCUGAGCCAACUAUUGAUGAA
Protein: PEPTIDE

Gene structure
[Figure: gene with exon1, intron1, exon2, intron2, exon3; steps shown are transcription, splicing, translation]
exon = protein-coding
intron = non-coding
Codon: A triplet of nucleotides that is converted to one amino acid

Locating Genes
• We have a genome sequence, maybe with related genomes aligned to it… where are the genes?
• Yeast genome is about 70% protein coding
  • About 6000 genes
• Human genome is about 1.5% protein coding
  • About 22,000 genes

Finding Genes in Yeast
[Figure: transcript drawn 5’ to 3’ as Intergenic, Coding, Intergenic; start codon ATG; stop codon TAG/TGA/TAA]
Mean coding length about 1500 bp (500 codons)

Finding Genes in Yeast
• ORF Scanning
  Look for long open reading frames (ORFs)
  ORFs start with ATG and contain no in-frame stop codons
  Long ORFs unlikely to occur by chance (i.e., they are probably genes)

Finding Genes in Yeast
[Figure: Yeast ORF distribution]

Introns: The Bane of ORF Scanning
[Figure: transcript drawn 5’ to 3’ as Intergenic, Exon, Intron, Exon, Intron, Exon, Intergenic; start codon ATG; stop codon TAG/TGA/TAA; splice sites at exon/intron boundaries]

Introns: The Bane of ORF Scanning
• Drosophila:
  • 3.4 introns per gene on average
  • mean intron length 475, mean exon length 397
• Human:
  • 8.8 introns per gene on average
  • mean intron length 4400, mean exon length 165
• ORF scanning is defeated

Where are the genes?

Needles in a Haystack

Now What?
• We need to use more information to help recognize genes
  Regular structure
  Exon/intron lengths
  Nucleotide composition
  Biological signals
    • Start codon, stop codon, splice sites
  Patterns of conservation

Regular Gene Structure
• Protein coding region starts with ATG, ends with TAA/TAG/TGA
• Exons alternate with introns
• Introns start with GT/GC, end with AG
• Each exon has a reading frame determined by the codon position at the end of the last exon
[Figure: Next Exon: Frame 0; Next Exon: Frame 1]

Exon/Intron Lengths

Nucleotide Composition
• Base composition in exons is characteristic due to the genetic code

Amino Acid      SLC  DNA Codons
Isoleucine      I    ATT, ATC, ATA
Leucine         L    CTT, CTC, CTA, CTG, TTA, TTG
Valine          V    GTT, GTC, GTA, GTG
Phenylalanine   F    TTT, TTC
Methionine      M    ATG
Cysteine        C    TGT, TGC
Alanine         A    GCT, GCC, GCA, GCG
Glycine         G    GGT, GGC, GGA, GGG
Proline         P    CCT, CCC, CCA, CCG
Threonine       T    ACT, ACC, ACA, ACG
Serine          S    TCT, TCC, TCA, TCG, AGT, AGC
Tyrosine        Y    TAT, TAC
Tryptophan      W    TGG
Glutamine       Q    CAA, CAG
Asparagine      N    AAT, AAC
Histidine       H    CAT, CAC
Glutamic acid   E    GAA, GAG
Aspartic acid   D    GAT, GAC
Lysine          K    AAA, AAG
Arginine        R    CGT, CGC, CGA, CGG, AGA, AGG

Biological Signals
• How does the cell recognize start/stop codons and splice sites?
  In part, from characteristic base composition
• Donor site (start of intron) is recognized by a section of U1 snRNA
  U1 snRNA: GUCCAUUCA
  Donor site consensus: MAGGTRAGT
  M means “A or C”, R means “A or G”

[Figure: donor site marked on the sequence below, drawn 5’ to 3’]
atgtgaggtgagggtgagggtgagcaggtgcagatgcagttgcaggccggtgag

Donor site base frequencies (%) by position:

Position   -8   …   -2   -1    0    1    2   …   17
A          26   …   60    9    0    0   54   …   21
C          26   …   15    5    0    1    2   …   27
G          25   …   12   78  100    0   41   …   27
T          23   …   13    8    0   99    3   …   25

Splice Sites
(http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)

Splice Sites
• WMM: weight matrix model = PSSM (Staden 1984)
• WAM: weight array model = 1st order Markov (Zhang & Marr 1993)
• MDD: maximal dependence decomposition (Burge & Karlin 1997)
  Decision-tree algorithm to take pairwise dependencies into account
  • For each position i, calculate $S_i = \sum_{j \neq i} \chi^2(C_i, X_j)$
  • Choose i* such that $S_{i^*}$ is maximal and partition into two subsets, until
    • No significant dependencies left, or
    • Not enough sequences in subset
  Train separate WMM models for each subset

[Figure: MDD decision tree over all donor splice sites]
All donor splice sites → G5 | not G5
G5 → G5G-1 | G5 not G-1
G5G-1 → G5G-1A2 | G5G-1 not A2
G5G-1A2 → G5G-1A2U6 | G5G-1A2 not U6

Splice Sites

Patterns of Conservation
• Functional sequences are much more conserved than nonfunctional sequences
• Signal sequences show compensatory mutations
  If one position mutates away from consensus, often a different one will mutate to consensus
• Coding sequence shows three-periodic pattern of conservation

Three Periodicity
• Most amino acids can be coded for by more than one DNA triplet
• Usually, the degeneracy is in the last position

Human    CCTGTT  (Proline, Valine)
Mouse    CCAGTC  (Proline, Valine)
Rat      CCAGTC  (Proline, Valine)
Dog      CCGGTA  (Proline, Valine)
Chicken  CCCGTG  (Proline, Valine)

Hidden Markov Models for Gene Finding
[Figure: HMM states (Intergene State, First Exon State, Intron State) over the sequence below, with regions labeled Intergenic, Exon, Intron, Exon, Intron, Exon, Intergenic]
GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA

GENSCAN
• Burge and Karlin, Stanford, 1997
• Before The Human Genome Project
  No alignments available
  Estimated human gene count was 100,000
• Explicit state duration HMM (with tricks)
  Intergenic and intronic regions have geometric length distribution
  Exons are only possible when correct flanking sequences are present

GENSCAN
• Output probabilities for NC and CDS depend on previous 5 bases (5th-order)
  $P(X_i \mid X_{i-1}, X_{i-2}, X_{i-3}, X_{i-4}, X_{i-5})$
• Each CDS frame has its own model
• WAM models for start/stop codons and acceptor sites
• MDD model for donor sites
• Separate parameters for regions of different GC content

GENSCAN Performance
• First program to do well on realistic sequences
  Long, multiple genes in both orientations
• Pretty good sensitivity, poor specificity
  70% exon Sn, 40% exon Sp
• Not enough exons per gene
• Was the best gene predictor for about 4 years

TWINSCAN
• Korf, Flicek, Duan, Brent, Washington University in St. Louis, 2001
• Uses an informant sequence to help predict genes
  For
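
The exon Sn and Sp figures quoted on the GENSCAN Performance slide can be illustrated with a minimal sketch. It assumes the usual gene-finding convention in which Sn is the fraction of annotated exons predicted exactly and Sp is the fraction of predicted exons that are exactly right. The function name `exon_accuracy` and the coordinates are made up for illustration and are not taken from the slides.

```python
def exon_accuracy(predicted, annotated):
    """Exact-match exon sensitivity and specificity.

    Exons are (start, end) tuples. A predicted exon counts as correct
    only if both boundaries match an annotated exon exactly.
    Assumed convention: Sn = correct / annotated, Sp = correct / predicted.
    """
    predicted, annotated = set(predicted), set(annotated)
    correct = len(predicted & annotated)
    sn = correct / len(annotated) if annotated else 0.0
    sp = correct / len(predicted) if predicted else 0.0
    return sn, sp


# Hypothetical coordinates: 4 annotated exons, 5 predictions, 2 exact matches.
annotated = [(100, 200), (300, 400), (500, 650), (800, 900)]
predicted = [(100, 200), (300, 410), (500, 650), (700, 750), (950, 990)]
sn, sp = exon_accuracy(predicted, annotated)
print(f"exon Sn = {sn:.2f}, exon Sp = {sp:.2f}")
```

Under this convention, a predictor with good Sn and poor Sp finds most real exons but also reports many exons that do not match the annotation.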
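
The ORF scanning procedure from the Finding Genes in Yeast slides (start at ATG, continue while there is no in-frame stop codon, keep only long ORFs) can be sketched as follows. This is a simplified illustration, not the method of any particular tool: it scans the forward strand only, and the function name `find_orfs` and the `min_codons` threshold are assumptions introduced here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=100):
    """Return (start, end, frame) for forward-strand ORFs.

    start is the index of the A in ATG, end is one past the stop codon,
    frame is start % 3. An ORF is kept only if it has at least
    min_codons codons before the stop (an assumed length cutoff).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon == "ATG":
                    start = i
            elif codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


# Toy sequence: ATG, five GCT codons, then TAA, with flanking bases.
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
print(find_orfs(toy, min_codons=3))
```

A real scan would also cover the reverse complement, and, as the intron slides point out, this approach breaks down once coding sequence is split across short exons.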