UMD CMSC 423 - Lecture 17 Gene finding

CMSC423: Bioinformatic Algorithms, Databases and Tools
Lecture 17
Gene finding

Signals in DNA
• We have the genome sequence... now what?
• ...see chapter 9...
• Motifs are a kind of “signal”: a pattern of DNA that is “unexpected” in the genome of an organism.
• Uncovering new motifs: already did this with Gibbs sampling (local multiple alignment).
• Given a motif, how do we find where it occurs in a genome?
• Remember? Motif =
  – k consecutive positions
  – frequency of occurrence of each base at these positions

Finding/scoring motifs
• A motif M of length k can be represented as a Position Weight Matrix (PWM), the same thing as a multiple alignment profile:
  $$\mathrm{pwm}(M) = \{\, p_{c,i} \mid 1 \le i \le k,\ c \in \{A, C, G, T\} \,\}$$
• Scoring a region of the genome according to the motif: given consecutive characters $s_1, \ldots, s_k$,
  $$p(M \mid s_1, \ldots, s_k) = \prod_{1 \le i \le k} p_{s_i, i}$$
• How surprising is this? Need to compare to background probabilities:
  $$p(M \mid s_1, \ldots, s_k) = \prod_{1 \le i \le k} \frac{p_{s_i, i}}{q_{s_i}}$$
  where $q_{s_i}$ is the background probability of character $s_i$ in the genome.

Scoring motifs
• Note: the score is usually presented as a log-likelihood, $\log p(M \mid s_1, \ldots, s_k)$.
• The p/q ratios in the motif are often called a Position Specific Scoring Matrix (PSSM).
• The program PSI-BLAST builds a PSSM from a query and searches a sequence database with it; the related RPS-BLAST searches a sequence against a database of PSSMs.
• Motifs are just one piece of the puzzle.
• How do we handle more complex “signals”?

Gene finding/prediction
• Given a string of DNA, identify regions that might be genes.
• Question: what does a gene look like?
• Start codon: ATG
• Stop codons: TGA, TAG, TAA
• Splicing: GT...intron...AG
• Also, DNA composition is different in genes: mutations are more likely in the third position of codons.

Simple gene finder (in bacteria)
• Find all stop codons in the genome.
• For each stop codon, identify an in-frame start codon upstream of it.
• Each section between a start and a stop is called an ORF (open reading frame).
• Long ORFs are likely genes: evolution prevented stop codons from occurring.
• 3 stop codons out of 64 possible codons => in random DNA, roughly every 22nd codon is a stop (64/3 ≈ 21.3).
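
The simple gene finder above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the lecture's implementation: the function name `find_orfs` and the `min_codons` parameter are made up here, only the forward strand is scanned (a real finder would also scan the reverse complement), and when several in-frame ATGs precede a stop, the one farthest upstream is chosen, which the slides do not specify.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_orfs(dna, min_codons=0):
    """Return (start, end, frame) for each forward-strand ORF.

    start is the index of the ATG, end is one past the stop codon,
    and frame is 0, 1, or 2. Hypothetical helper for illustration.
    """
    dna = dna.upper().replace(" ", "")
    orfs = []
    for frame in range(3):
        start = None  # first in-frame ATG seen since the previous stop
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if codon == START_CODON and start is None:
                start = i
            elif codon in STOP_CODONS:
                # Close the ORF if a start was seen and it is long enough.
                if start is not None and (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # a stop resets the search in this frame
    return orfs


# The example sequence from the slides, written as codons in frame 0.
example = "GGC TAG ATG AGG GCT CTA ACT ATG GGC GCG TAA"
for start, end, frame in find_orfs(example):
    print(f"frame {frame}: ORF at [{start}, {end}), {(end - start) // 3} codons")
```

On the slide's example this reports a single ORF in frame 0 that begins at the first ATG (the one right after TAG) and ends at the final TAA. Raising `min_codons` keeps only long ORFs, which is where the "long ORFs are likely genes" heuristic would come in.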
Example sequence, split into codons:
GGC TAG ATG AGG GCT CTA ACT ATG GGC GCG TAA

Gene finding as machine learning
• Main question: does the ORF look like a gene?
• Given a set of examples (genes we already know)
• and a string of DNA (e.g. an ORF),
• compute the likelihood that the ORF is a gene.
• Note: more complex than motif finding
• Codon usage bias: not all codons for the same amino acid are equally likely.
• K-mer (e.g. 6-mer) frequencies (instead of single-base frequencies in motif finding)

Bacillus anthracis codon usage
Each entry: codon, amino acid (* = stop), fraction of that amino acid's codons.

UUU F 0.76    UCU S 0.27    UAU Y 0.77    UGU C 0.73
UUC F 0.24    UCC S 0.08    UAC Y 0.23    UGC C 0.27
UUA L 0.49    UCA S 0.23    UAA * 0.66    UGA * 0.14
UUG L 0.13    UCG S 0.06    UAG * 0.20    UGG W 1.00
CUU L 0.16    CCU P 0.28    CAU H 0.79    CGU R 0.26
CUC L 0.04    CCC P 0.07    CAC H 0.21    CGC R 0.06
CUA L 0.14    CCA P 0.49    CAA Q 0.78    CGA R 0.16
CUG L 0.05    CCG P 0.16    CAG Q 0.22    CGG R 0.05
AUU I 0.57    ACU T 0.36    AAU N 0.76    AGU S 0.28
AUC I 0.15    ACC T 0.08    AAC N 0.24    AGC S 0.08
AUA I 0.28    ACA T 0.42    AAA K 0.74    AGA R 0.36
AUG M 1.00    ACG T 0.15    AAG K 0.26    AGG R 0.11
GUU V 0.32    GCU A 0.34    GAU D 0.81    GGU G 0.30
GUC V 0.07    GCC A 0.07    GAC D 0.19    GGC G 0.09
GUA V 0.43    GCA A 0.44    GAA E 0.75    GGA G 0.41
GUG V 0.18    GCG A 0.15    GAG E 0.25    GGG G 0.20

Questions
• Given the G/C content of a genome (the fraction of letters in the genome that are G or C), what is the expected distance between two stop codons? (Requires Poisson.)
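
One way to start on the G/C-content question is a rough numerical sketch. The model below is an assumption, not something given in the slides: bases are drawn independently, with P(G) = P(C) = gc/2 and P(A) = P(T) = (1 - gc)/2, and the gap between stops in one reading frame is treated as a geometric waiting time with mean 1/p. The slide says the full answer requires the Poisson distribution; this sketch only estimates the mean spacing. Both function names are hypothetical.

```python
def stop_codon_probability(gc):
    """P(a random codon is TAA, TAG, or TGA) for a given G+C fraction.

    Assumes independent bases with P(G) = P(C) and P(A) = P(T).
    """
    if not 0.0 <= gc < 1.0:
        raise ValueError("gc must be in [0, 1)")
    at = (1.0 - gc) / 2.0  # probability of A (and of T)
    g = gc / 2.0           # probability of G (and of C)
    p_taa = at * at * at
    p_tag = at * at * g
    p_tga = at * g * at
    return p_taa + p_tag + p_tga


def expected_codons_between_stops(gc):
    """Mean number of codons from one stop to the next (geometric model)."""
    return 1.0 / stop_codon_probability(gc)


for gc in (0.35, 0.50, 0.65):
    codons = expected_codons_between_stops(gc)
    print(f"G+C = {gc:.2f}: about {codons:.1f} codons between stops")
```

At 50% G+C the stop probability is 3/64, so the expected spacing is 64/3 ≈ 21.3 codons, in line with the "every 22nd codon" figure for random DNA. Under this model, A/T-rich genomes have stops closer together and G/C-rich genomes have them farther apart, since all three stop codons are A/T-rich.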

