Gene Finding
CMSC 423

Finding Signals in DNA

We just have a long string of A, C, G, Ts. How can we find the signals encoded in it?

Suppose you encountered a language you didn't know. How would you decipher it?

- Idea 1: Based on some external information, build a model (like an HMM) for how particular features are encoded.
- Idea 2: Find patterns that appear more often than you expect by chance ("the" occurs a lot in English, so it may be a word).

Today we explore methods based mostly on idea 1. Next time we will explore idea 2.

Central Dogma of Biology

[Figure: Genome (DNA) -> Transcription (T -> U) -> mRNA -> Translation -> proteins]

DNA:
- double-stranded, linear molecule
- strands are complements of each other (A <-> T, C <-> G)
- each strand is a string over {A, C, G, T}
- substrings encode genes, most of which encode proteins

The Genetic Code

- There are 20 different amino acids and 64 different codons.
- Most amino acids can be encoded by several different codons.
- The 3rd base is typically less important for determining the amino acid.
- Three different stop codons signal the end of the gene.
- Start codons differ depending on the organism, but AUG is often used.

The Gene Finding Problem

Genes are subsequences of DNA that (generally) tell the cell how to make specific proteins. How can we find which subsequences of DNA are genes?

Start codon: ATG
Stop codons: TGA, TAG, TAA

ATAGAGGGTATGGGGGACCCGGACACGATGGCAGATGACGATGACGATGACGATGACGGGTGAAGTGAGTCAACACATGAC

Challenges:
- The start codon can occur in the middle of a gene.
- The stop codon can occur in "nonsense" DNA between genes.
- The stop codon can occur out of frame inside a gene.
- We don't know what phase (reading frame) the gene starts in.

A Simple Gene Finder

1. Find all stop codons in the genome.
2. For each stop codon, find the in-frame start codon farthest upstream of the stop codon, without crossing another in-frame stop codon.

   GGC TAG ATG AGG GCT CTA ACT ATG GGC GCG TAA

   Each substring between the start and stop codons is called an ORF ("open reading frame").
3. Return the long ORFs as predicted genes.

3 out of the 64 possible codons are stop codons, so about every
22nd codon is expected to be a stop in random DNA (64/3 ≈ 21.3), which is why long ORFs are unlikely to arise by chance.

Gene Finding as a Machine Learning Problem

Given training examples of some known genes, can we distinguish ORFs that are genes from those that are not?

Idea: we can use the distribution of codons to find genes:
- every codon should be about equally likely in non-gene DNA
- we could also use frequencies of longer strings (k-mers)
- every organism has a slightly different bias about how often certain codons are preferred

Bacillus anthracis (anthrax) codon usage

Each entry is codon, amino acid, frequency. Frequencies are relative to the other codons for the same amino acid; * marks a stop codon.

UUU F 0.76 | UUC F 0.24 | UUA L 0.49 | UUG L 0.13
UCU S 0.27 | UCC S 0.08 | UCA S 0.23 | UCG S 0.06
UAU Y 0.77 | UAC Y 0.23 | UAA * 0.66 | UAG * 0.20
UGU C 0.73 | UGC C 0.27 | UGA * 0.14 | UGG W 1.00
CUU L 0.16 | CUC L 0.04 | CUA L 0.14 | CUG L 0.05
CCU P 0.28 | CCC P 0.07 | CCA P 0.49 | CCG P 0.16
CAU H 0.79 | CAC H 0.21 | CAA Q 0.78 | CAG Q 0.22
CGU R 0.26 | CGC R 0.06 | CGA R 0.16 | CGG R 0.05
AUU I 0.57 | AUC I 0.15 | AUA I 0.28 | AUG M 1.00
ACU T 0.36 | ACC T 0.08 | ACA T 0.42 | ACG T 0.15
AAU N 0.76 | AAC N 0.24 | AAA K 0.74 | AAG K 0.26
AGU S 0.28 | AGC S 0.08 | AGA R 0.36 | AGG R 0.11
GUU V 0.32 | GUC V 0.07 | GUA V 0.43 | GUG V 0.18
GCU A 0.34 | GCC A 0.07 | GCA A 0.44 | GCG A 0.15
GAU D 0.81 | GAC D 0.19 | GAA E 0.75 | GAG E 0.25
GGU G 0.30 | GGC G 0.09 | GGA G 0.41 | GGG G 0.20

An Improved Simple Gene Finder

- Score each ORF using the product of the probability of each codon:
  $\mathrm{GFScore}(g) = \Pr(\mathrm{codon}_1) \times \Pr(\mathrm{codon}_2) \times \Pr(\mathrm{codon}_3) \times \cdots \times \Pr(\mathrm{codon}_n)$
- But as genes get longer, $\mathrm{GFScore}(g)$ will decrease.
- So we should calculate $\mathrm{GFScore}(g[i..i+k])$ for some window size $k$.
- The final $\mathrm{GFScore}(g)$ is the average of the scores of the windows in it.

Eukaryotic Genes & Exon Splicing

Prokaryotic (bacterial) genes look like this:

ATG ... TAG

Eukaryotic genes usually look like this:

ATG [exon] intron [exon] intron [exon] intron [exon] TAG

- Introns are thrown away.
- Exons are concatenated together to form the mRNA: AUG ... UAG
- This spliced RNA is what is translated into a protein.

A (Bad) HMM Eukaryotic Gene Finder

[Figure: HMM state diagram; arrows show transitions with nonzero probabilities. States: START; Start 1, Start 2, Start 3 (emitting the start codon with Pr(A) = 1, Pr(T) = 1, Pr(G) = 1); pos 1, pos 2, pos 3; donor 1, donor 2; intron; acceptor 1, acceptor 2; Stop 1, Stop 2, Stop 3; END.]
What are some reasons this HMM gene finder is likely to do poorly?

Bad Eukaryotic Gene Finder

- The positions in the codons are treated independently: the probability of emitting a base can't depend on which previous base was emitted.
- Only one strand of the DNA is considered at once.
- Length distributions of introns and exons are not considered.

A Generalized HMM-based Gene Finder

[Figure: GlimmerHMM model, with one set of states for the + strand and one for the - strand]

GlimmerHMM Performance

[Table values not recovered. Measures reported:]
- % of predicted in-gene nucleotides that are correct
- % of true gene nucleotides that GlimmerHMM predicts as part of genes
- % of predicted exons that are true exons
- % of true exons that GlimmerHMM found
- % of genes perfectly found

Compare with GENSCAN, on 963 human genes.

Note that overall accuracy is pretty low.

Recap

- Simple gene finding approaches use codon bias and long ORFs to identify genes.
- Many top gene finding programs are based on generalizations of Hidden Markov Models.
- Basic HMMs must be generalized to emit variable-sized strings.
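
Sketch: the simple gene finder and the windowed codon score

As a rough illustration of the two simple approaches in the recap (long ORFs and codon bias), here is a minimal Python sketch. It is not the lecture's code. The names `find_orfs` and `gf_score`, the `min_codons` cutoff for what counts as a "long" ORF, the fallback probability of 1/64 for codons missing from the table, and the toy probabilities in the usage lines are all assumptions made for illustration. Only the forward strand is scanned, and codons are written in DNA letters (T rather than U).

```python
import math

STOP_CODONS = {"TGA", "TAG", "TAA"}
START_CODON = "ATG"


def find_orfs(dna, min_codons=1):
    """Return (start, end) index pairs (end exclusive) of ORFs on one strand.

    Each ORF runs from the farthest-upstream in-frame ATG to the next
    in-frame stop codon, inclusive of both. min_codons is a hypothetical
    length cutoff standing in for "return the long ORFs".
    """
    orfs = []
    for frame in range(3):
        start = None  # first ATG seen since the last in-frame stop
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if codon in STOP_CODONS:
                if start is not None and (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
            elif codon == START_CODON and start is None:
                start = i
    return sorted(orfs)


def gf_score(orf, codon_prob, k=3, default=1 / 64):
    """Average, over all windows of k codons, of the product of codon probabilities.

    codon_prob maps codon -> probability; codons not in the table get
    `default` (an assumption, not something the slides specify).
    """
    codons = [orf[i:i + 3] for i in range(0, len(orf) - 2, 3)]
    if not codons:
        raise ValueError("ORF must contain at least one codon")
    probs = [codon_prob.get(c, default) for c in codons]
    k = min(k, len(probs))  # short ORFs get a single window
    windows = [math.prod(probs[i:i + k]) for i in range(len(probs) - k + 1)]
    return sum(windows) / len(windows)


# Usage: the example sequence from the "A Simple Gene Finder" slide.
dna = "GGCTAGATGAGGGCTCTAACTATGGGCGCGTAA"
orfs = find_orfs(dna)  # [(6, 33)]: ATG AGG GCT CTA ACT ATG GGC GCG TAA

# Toy probabilities, made up for illustration only.
toy_prob = {"ATG": 1.0, "AAA": 0.5, "TAA": 0.5}
score = gf_score("ATGAAATAA", toy_prob, k=2)  # (1.0*0.5 + 0.5*0.5) / 2 = 0.375
```

Scanning each frame left to right and remembering only the first ATG since the last in-frame stop gives the same ORFs as starting from each stop codon and walking upstream. For realistic window sizes, summing log probabilities instead of multiplying raw probabilities would avoid numerical underflow.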