CMSC423: Bioinformatic Algorithms, Databases and Tools
Lecture 17: Gene finding

Signals in DNA
- We have the genome sequence. Now what? (See chapter 9.)
- Motifs are a kind of signal: a pattern of DNA that is unexpected in the genome of an organism.
- Uncovering new motifs: already did this (Gibbs sampling, local multiple alignment).
- Given a motif, how do we find where it occurs in a genome?
- Remember: Motif = k consecutive positions + frequency of occurrence of each base at these positions.

Finding/scoring motifs
- Given: motif M of length k, which can be represented as a Position Weight Matrix (PWM), the same thing as a multiple alignment profile:
  $\mathrm{pwm}_M = \{\, p_c(i) : 1 \le i \le k,\ c \in \{A, C, G, T\} \,\}$
- Scoring a region of the genome according to the motif: given consecutive characters $s_1 \ldots s_k$,
  $p(M \mid s_1 \ldots s_k) = \prod_{1 \le i \le k} p_{s_i}(i)$
- How surprising is this? Need to compare to background probabilities:
  $p(M \mid s_1 \ldots s_k) = \prod_{1 \le i \le k} \frac{p_{s_i}(i)}{q(s_i)}$
  where $q(s_i)$ is the background probability of character $s_i$ in the genome.

Scoring motifs
- Note: the score is usually presented as a log-likelihood, $\log p(M \mid s_1 \ldots s_k)$.
- The p/q ratios in the motif are often called a Position Specific Scoring Matrix (PSSM).
- The program PSI-BLAST can search a sequence against a database of PSSMs.
- Motifs are just one piece of the puzzle. How do we handle more complex signals?

Gene finding/prediction
- Given a string of DNA, identify regions that might be genes.
- Question: What does a gene look like?
- Start codon: ATG
- Stop codons: TGA, TAG, TAA
- Splicing: an intron starts with GT and ends with AG.
- Also, DNA composition is different in genes: mutations are more likely in the third position of codons.

Simple gene finder (in bacteria)
- Find all stop codons in the genome.
- For each stop codon, identify an in-frame start codon upstream of it.
- Each section between a start and a stop is called an ORF (open reading frame).
- The long ORFs are likely genes: evolution prevented stop codons from occurring.
- 3 stop codons out of 64 possible codons: in random DNA, roughly every 22nd codon is a stop (64/3 ≈ 21.3).
- Example: GGC TAG ATG AGG GCT CTA ACT ATG GGC GCG TAA

Gene finding as machine learning
- Main question: does the ORF look like a gene?
- Given a set of examples (genes we already know) and a string of DNA (e.g. an ORF), compute the likelihood that the ORF is a gene.
- Note: more complex than motif finding.
- Codon usage bias: not all codons for the same amino acid are equally likely.
- K-mer (e.g. 6-mer) frequencies, instead of the single-base frequencies used in motif finding.

Bacillus anthracis codon usage
Each entry is: codon, amino acid, fraction of that amino acid's codons (* = stop).

UUU F 0.76   UUC F 0.24   UUA L 0.49   UUG L 0.13
UCU S 0.27   UCC S 0.08   UCA S 0.23   UCG S 0.06
UAU Y 0.77   UAC Y 0.23   UAA * 0.66   UAG * 0.20
UGU C 0.73   UGC C 0.27   UGA * 0.14   UGG W 1.00
CUU L 0.16   CUC L 0.04   CUA L 0.14   CUG L 0.05
CCU P 0.28   CCC P 0.07   CCA P 0.49   CCG P 0.16
CAU H 0.79   CAC H 0.21   CAA Q 0.78   CAG Q 0.22
CGU R 0.26   CGC R 0.06   CGA R 0.16   CGG R 0.05
AUU I 0.57   AUC I 0.15   AUA I 0.28   AUG M 1.00
ACU T 0.36   ACC T 0.08   ACA T 0.42   ACG T 0.15
AAU N 0.76   AAC N 0.24   AAA K 0.74   AAG K 0.26
AGU S 0.28   AGC S 0.08   AGA R 0.36   AGG R 0.11
GUU V 0.32   GUC V 0.07   GUA V 0.43   GUG V 0.18
GCU A 0.34   GCC A 0.07   GCA A 0.44   GCG A 0.15
GAU D 0.81   GAC D 0.19   GAA E 0.75   GAG E 0.25
GGU G 0.30   GGC G 0.09   GGA G 0.41   GGG G 0.20

Questions
- Given the G+C content for a genome (fraction of letters in the genome that are G or C), what is the expected distance between two stop codons? (Requires Poisson statistics.)
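The motif-scoring slides (PWM, background probabilities, log-likelihood) can be illustrated with a minimal Python sketch. It is not the course's implementation. The function names `log_odds_score` and `scan_genome`, the toy 3-column PWM, and the uniform background are all assumptions made for illustration. The PWM is stored as a list of per-position dictionaries, and every entry is assumed to be nonzero (e.g. after adding pseudocounts) so the logarithm is defined.

```python
import math

def log_odds_score(window, pwm, background):
    """Sum over positions of log(p_{s_i}(i) / q(s_i)) for one k-long window."""
    return sum(math.log(pwm[i][c] / background[c]) for i, c in enumerate(window))

def scan_genome(genome, pwm, background):
    """Score every k-long window of the genome; returns (start, score) pairs."""
    k = len(pwm)
    return [(i, log_odds_score(genome[i:i + k], pwm, background))
            for i in range(len(genome) - k + 1)]

# Hypothetical toy motif of length k = 3 that prefers "ATG".
toy_pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
# Assumed uniform background; in practice q would be estimated from the genome.
uniform_background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

hits = scan_genome("CCATGCC", toy_pwm, uniform_background)
best_start, best_score = max(hits, key=lambda hit: hit[1])
```

Positive scores mean the window looks more like the motif than like background sequence; a real scan would also need a score threshold, which the slides do not specify.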
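The simple bacterial gene finder from the slides (find stop codons, pair each with an in-frame upstream start codon, prefer long ORFs) could be sketched as follows. This is a minimal sketch under stated assumptions: it scans only the forward strand, it pairs each stop with the earliest in-frame ATG since the previous stop, and the name `find_orfs` and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"

def find_orfs(dna, min_codons=0):
    """Return (start, end) pairs, end exclusive and including the stop codon,
    sorted from longest to shortest. Forward strand only."""
    orfs = []
    for frame in range(3):
        start = None  # earliest ATG seen since the last in-frame stop
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if codon == START_CODON and start is None:
                start = i
            elif codon in STOP_CODONS:
                if start is not None and (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # a stop closes any open ORF in this frame
    return sorted(orfs, key=lambda orf: orf[1] - orf[0], reverse=True)

# The example sequence from the slide, with spaces removed.
slide_example = "GGCTAGATGAGGGCTCTAACTATGGGCGCGTAA"
slide_orfs = find_orfs(slide_example)
```

On the slide's example the only ORF runs from the first ATG to the final TAA; the second ATG lies inside it in the same frame. A fuller gene finder would also scan the reverse complement.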
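For the question about G+C content and the distance between stop codons, here is one possible back-of-the-envelope calculation in Python. It is a sketch, not the intended solution: it assumes bases are independent, that P(G) = P(C) and P(A) = P(T), and it uses a geometric waiting-time argument (mean 1/p codons) in place of the Poisson statistics the slide mentions. The function names are hypothetical.

```python
def stop_codon_probability(gc_content):
    """Probability that a random codon is TAA, TAG or TGA, assuming
    independent bases with P(G) = P(C) and P(A) = P(T)."""
    p_at = (1.0 - gc_content) / 2.0  # P(A) = P(T)
    p_gc = gc_content / 2.0          # P(G) = P(C)
    p_taa = p_at * p_at * p_at
    p_tag = p_at * p_at * p_gc
    p_tga = p_at * p_gc * p_at
    return p_taa + p_tag + p_tga

def expected_codons_between_stops(gc_content):
    """Mean of the geometric waiting time: 1 / P(stop), in codons."""
    return 1.0 / stop_codon_probability(gc_content)

uniform_case = expected_codons_between_stops(0.5)   # all bases equally likely
at_rich_case = expected_codons_between_stops(0.35)  # hypothetical AT-rich genome
```

With 50% G+C the stop probability is 3/64, which recovers the 64/3 figure from the simple gene finder slide. Because all three stop codons are AT-rich, lowering the G+C content makes stops more frequent, so long ORFs are even less likely to arise by chance in AT-rich genomes.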