UMD CMSC 423 - Gene Finding

School: University of Maryland, College Park

Course: CMSC 423 - Bioinformatic Algorithms, Databases, and Tools

Gene Finding
CMSC 423

Finding Signals in DNA
• We just have a long string of As, Cs, Gs, and Ts. How can we find the “signals” encoded in it?
• Suppose you encountered a language you didn’t know. How would you decipher it?
• Idea #1: Based on some external information, build a model (like an HMM) for how particular features are encoded.
• Idea #2: Find patterns that appear more often than you expect by chance. (“the” occurs a lot in English, so it may be a word.)
• Today: we explore methods based mostly on idea #1. Next time, we will explore idea #2.

“Central Dogma” of Biology
Genome ➝ Transcription (T ➝ U) ➝ mRNA ➝ Translation ➝ proteins
DNA =
● double-stranded, linear molecule
● strands are complements of each other (A ↔ T; C ↔ G)
● each strand is a string over {A,C,G,T}
● substrings encode genes, most of which encode proteins

The Genetic Code
• There are 20 different amino acids & 64 different codons.
• There are lots of different ways to encode each amino acid.
• The 3rd base is typically less important for determining the amino acid.
• Three different “stop” codons signal the end of the gene.
• Start codons differ depending on the organism, but AUG is often used.

The Gene Finding Problem
• Genes are subsequences of DNA that (generally) tell the cell how to make specific proteins.
• How can we find which subsequences of DNA are genes?
Start Codon: ATG
Stop Codons: TGA, TAG, TAA
ATAGAGGGTATGGGGGACCCGGACACGATGGCAGATGACGATGACGATGACGATGACGGGTGAAGTGAGTCAACACATGAC
Challenges:
• The start codon can occur in the middle of a gene.
• The stop codon can occur in nonsense DNA between genes.
• The stop codon can occur “out of frame” inside a gene.
• We don’t know what “phase” the gene starts in.

A Simple Gene Finder
1. Find all stop codons in the genome.
2. For each stop codon, find the in-frame start codon farthest upstream of the stop codon, without crossing another in-frame stop codon.
3. Return the “long” ORFs as predicted genes.
Each substring between a start codon and a stop codon is called an ORF (“open reading frame”).
GGC TAG ATG AGG GCT CTA ACT ATG GGC GCG TAA
3 out of the 64 possible codons are stop codons ⇒ in random DNA, a stop is expected about once every 64/3 ≈ 21.3 codons.

Gene Finding as a Machine Learning Problem
• Given training examples of some known genes, can we distinguish ORFs that are genes from those that are not?
• Idea: we can use the distribution of codons to find genes.
  • Every codon should be about equally likely in non-gene DNA.
  • Every organism has a slightly different bias about how often certain codons are preferred.
  • We could also use frequencies of longer strings (k-mers).

Bacillus anthracis (anthrax) codon usage
(each entry: codon, amino acid or * for stop, fraction of that amino acid's codons)
UUU F 0.76    UCU S 0.27    UAU Y 0.77    UGU C 0.73
UUC F 0.24    UCC S 0.08    UAC Y 0.23    UGC C 0.27
UUA L 0.49    UCA S 0.23    UAA * 0.66    UGA * 0.14
UUG L 0.13    UCG S 0.06    UAG * 0.20    UGG W 1.00
CUU L 0.16    CCU P 0.28    CAU H 0.79    CGU R 0.26
CUC L 0.04    CCC P 0.07    CAC H 0.21    CGC R 0.06
CUA L 0.14    CCA P 0.49    CAA Q 0.78    CGA R 0.16
CUG L 0.05    CCG P 0.16    CAG Q 0.22    CGG R 0.05
AUU I 0.57    ACU T 0.36    AAU N 0.76    AGU S 0.28
AUC I 0.15    ACC T 0.08    AAC N 0.24    AGC S 0.08
AUA I 0.28    ACA T 0.42    AAA K 0.74    AGA R 0.36
AUG M 1.00    ACG T 0.15    AAG K 0.26    AGG R 0.11
GUU V 0.32    GCU A 0.34    GAU D 0.81    GGU G 0.30
GUC V 0.07    GCC A 0.07    GAC D 0.19    GGC G 0.09
GUA V 0.43    GCA A 0.44    GAA E 0.75    GGA G 0.41
GUG V 0.18    GCG A 0.15    GAG E 0.25    GGG G 0.20

An Improved Simple Gene Finder
• Score each ORF using the product of the probabilities of its codons:
GFScore(g) = Pr(codon_1) x Pr(codon_2) x Pr(codon_3) x ... x Pr(codon_n)
But: as genes get longer, GFScore(g) will decrease, because every additional factor is at most 1.
So: we should calculate GFScore(g[i...i+k]) for some window size k.
The final GFScore(g) is the average of the scores of the windows in it.

Eukaryotic Genes & Exon Splicing
Prokaryotic (bacterial) genes look like this:
[Figure: a single coding region running from ATG to TAG]
Eukaryotic genes usually look like this:
[Figure: ATG, then exons alternating with introns (exon, intron, exon, intron, exon, intron, exon), then TAG]
mRNA: AUG ... UAG
• Exons are concatenated together.
• Introns are thrown away.
• This spliced RNA is what is translated into a protein.

A (Bad) HMM Eukaryotic Gene Finder
[Figure: HMM state diagram with states START, Start 1, Start 2, Start 3, pos1, pos2, pos3, donor1, donor2, intron, acceptor1, acceptor2, Stop 1, Stop 2, Stop 3, END; emission labels Pr(A) = 1, Pr(T) = 1, Pr(G) = 1]
Arrows show transitions with non-zero probabilities.
What are some reasons this HMM gene finder is likely to do poorly?

Bad Eukaryotic Gene Finder
• The positions in the codons are treated independently: the probability of emitting a base can’t depend on which previous base was emitted.
• Only one strand of the DNA is considered at once.
• Length distributions of introns and exons are not considered.

A Generalized HMM-based Gene Finder
[Figure: GlimmerHMM model, with + strand and - strand submodels]

GlimmerHMM Performance
[Table: GlimmerHMM accuracy, reported with the following measures]
• % of true gene nucleotides that GlimmerHMM predicts as part of genes.
• % of predicted in-gene nucleotides that are correct.
• % of true exons that GlimmerHMM found.
• % of predicted exons that are true exons.
• % of genes perfectly found.

Compare with GENSCAN
• On 963 human genes:
[Table: GlimmerHMM vs. GENSCAN results]
• Note that overall accuracy is pretty low.

Recap
• Simple gene finding approaches use codon bias and long ORFs to identify genes.
• Many top gene finding programs are based on generalizations of Hidden Markov Models.
• Basic HMMs must be generalized to emit variable sized
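The three steps of the simple gene finder from the "A Simple Gene Finder" slide can be sketched in Python as below. This is only an illustrative sketch, not the course's reference implementation: the function name `find_orfs` and the `min_codons` cutoff for what counts as a "long" ORF are assumptions, and only the forward strand is scanned. Scanning each reading frame left to right and remembering the first ATG since the last in-frame stop gives the same result as looking upstream from each stop codon.

```python
STOP_CODONS = {"TGA", "TAG", "TAA"}
START_CODON = "ATG"


def find_orfs(dna, min_codons=0):
    """Return sorted (start, end) index pairs of ORFs on the given strand.

    `start` is the index of the farthest-upstream in-frame ATG and `end`
    is one past the stop codon. `min_codons` is a hypothetical threshold
    (codons before the stop) for what counts as a "long" ORF.
    """
    orfs = []
    for frame in range(3):  # the phase of the gene is unknown, so try all three
        first_start = None  # first in-frame ATG seen since the last in-frame stop
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if codon == START_CODON and first_start is None:
                first_start = i
            elif codon in STOP_CODONS:
                if first_start is not None:
                    n_codons = (i - first_start) // 3
                    if n_codons >= min_codons:
                        orfs.append((first_start, i + 3))
                first_start = None  # never cross an in-frame stop codon
    return sorted(orfs)


if __name__ == "__main__":
    # Example sequence from the slide: GGC TAG ATG AGG GCT CTA ACT ATG GGC GCG TAA
    seq = "GGCTAGATGAGGGCTCTAACTATGGGCGCGTAA"
    for start, end in find_orfs(seq):
        print(start, end, seq[start:end])
```

On the slide's example sequence, the second ATG lies inside the ORF that begins at the first ATG, so a single ORF is reported, starting at the first ATG and ending at TAA.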

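The windowed score from the "An Improved Simple Gene Finder" slide might look roughly like the following. It is a sketch under stated assumptions: the names `gf_score` and `window_score` are made up here, a window is taken to be `k` consecutive codons, codons missing from the table are given probability 0, and the probabilities in the usage example are toy values rather than real codon frequencies.

```python
def split_codons(orf):
    """Split an ORF into consecutive codons, ignoring any trailing partial codon."""
    usable = len(orf) - len(orf) % 3
    return [orf[i:i + 3] for i in range(0, usable, 3)]


def window_score(window, codon_prob):
    """Product of the codon probabilities in one window."""
    score = 1.0
    for codon in window:
        # Assumption: a codon absent from the table has probability 0.
        score *= codon_prob.get(codon, 0.0)
    return score


def gf_score(orf, codon_prob, k):
    """Average of the window scores over all windows of k consecutive codons."""
    codons = split_codons(orf)
    if k <= 0 or len(codons) < k:
        raise ValueError("ORF must contain at least k codons, with k > 0")
    windows = [codons[i:i + k] for i in range(len(codons) - k + 1)]
    return sum(window_score(w, codon_prob) for w in windows) / len(windows)


if __name__ == "__main__":
    toy_prob = {"ATG": 0.5, "GGC": 0.25, "GCG": 0.25}  # made-up values
    print(gf_score("ATGGGCGCG", toy_prob, k=2))
```

Because every window contains the same number of codons, the averaged score no longer shrinks just because an ORF is long. For real data, summing log-probabilities inside each window would avoid floating-point underflow for large `k`.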

