Unformatted text preview:

Motif Finding CMSC 423 Motif Finding Transcription factor 1 2 3 4 5 6 ttgccacaaaataatccgccttcgcaaattgaccTACCTCAATAGCGGTAgaaaaacgcaccactgcctgacag gtaagtacctgaaagttacggtctgcgaacgctattccacTGCTCCTTTATAGGTAcaacagtatagtctgatgga ccacacggcaaataaggagTAACTCTTTCCGGGTAtgggtatacttcagccaatagccgagaatactgccattccag ccatacccggaaagagttactccttatttgccgtgtggttagtcgcttTACATCGGTAAGGGTAgggattttacagca aaactattaagatttttatgcagatgggtattaaggaGTATTCCCCATGGGTAacatattaatggctctta ttacagtctgttatgtggtggctgttaaTTATCCTAAAGGGGTAtcttaggaatttactt Given p sequences find the most mutually similar length k subsequences one from each sequence argmin s1 sp dist si sj i j dist si sj Hamming distance between si and sj Hundreds of papers many formulations Tompa05 Motif finding by Gibbs Sampling Problem Given p strings and a length k find the most mutually similar length k substring from each string Gibbs sampling is the basis behind a general class of algorithms that is a type of local search It doesn t guarantee good performance but often works well in practice Assumes 1 we know the length k of the motif we are looking for 2 each input sequence contains exactly 1 real instance of the motif Gibbs Sampling Profiles If we knew the starting point of the motif in each sequence we could construct a Sequence Profile PSSM for the motif x1 1 ttgccacaaaataatccgccttcgcaaattgaccTACCTCAATAGCGGTAgaaaaacgcaccactgcctgacag x2 2 gtaagtacctgaaagttacggtctgcgaacgctattccacTGCTCCTTTATAGGTAcaacagtatagtctga x3 3 ccacacggcaaataaggagTAACTCTTTCCGGGTAtgggtatacttcagccaatagccgagaatactgccatt x4 4 ccatacccggaaagagttactccttatttgccgtgtggttagtcgcttTACATCGGTAAGGGTAgggatttt x5 5 aaactattaagatttttatgcagatgggtattaaggaGTATTCCCCATGGGTAacatattaatggctctta x6 6 ttacagtctgttatgtggtggctgttaaTTATCCTAAAGGGGTAtcttaggaatttactt TACCTCAATAGCGGTA TGCTCCTTTATAGGTA TAACTCTTTCCGGGTA TACATCGGTAAGGGTA GTATTCCCCATGGGTA TTATCCTAAAGGGGTA Gibbs Sampling Version 1 Pseudocode Set x1 x2 xp to random positions in each input string repeat until the answer x1 x2 xp doesn t change for i 1 p Build a profile Q using sequences at x1 x2 xp except xi Set xi to where the profile Q matches best in string i def gibbs Seqs k Seqs is a list of strings Find the best motif start with random indices I random randint 0 len x k for x in Seqs LastI None while I LastI LastI list I repeat until nothing changes iterate through every string for i in xrange len Seqs compute the profile for the sequences except i P profile for x j j k for q x j in enumerate zip Seqs I if q i find the place the profile matches best best None for j in xrange len Seqs i k 1 score profile score P Seqs i j j k if score best or best is None best score bestpos j update the ith position with the best I i bestpos return I x j j k for x j in zip Seqs I Gibbs Example gibbs thequickdog browndog dogwood k 3 1 8 1 2 dog row gwo random starting positions 2 8 5 0 dog dog dog F 8 5 0 dog dog dog Small bias toward o in the middle is correct gibbs thequickdog browndog dogwood k 3 1 4 3 1 uic wnd ogw 2 6 2 4 ckd own ood 3 8 5 0 dog dog dog F 8 5 0 dog dog dog gibbs thequickdog browndog dogwood k 3 1 2 0 1 equ bro ogw 2 7 4 2 kdo ndo gwo Might not find F 7 4 2 kdo ndo gwo the optimal Another Example gibbs aaa123 678aaa45 9a7aaab 32aa19a8aaa 3 1 0 5 0 2 aaa a45 9a7 aa1 2 1 3 3 8 aa1 aaa aaa aaa 3 0 3 3 8 aaa aaa aaa aaa F 0 3 3 8 aaa aaa aaa aaa Can be multiple optimal answers Bias toward a in the profile quickly leads to finding the implanted aaa gibbs aaabbb bbbaaabb babaaab ababacaaabac abbbababaaabbbaba 3 1 1 4 0 4 11 aab aab bab aca bbb 2 1 4 4 7 9 aab aab aab aab aab F 1 4 4 7 9 aab aab aab aab aab gibbs aaabbb bbbaaabb babaaab ababacaaabac abbbababaaabbbaba 3 1 0 3 3 3 8 aaa aaa aaa bac aaa 2 0 3 3 6 8 aaa aaa aaa aaa aaa F 0 3 3 6 8 aaa aaa aaa aaa aaa Randomness Gibbs Sampling Run the Gibbs sampling multiple times to make it more likely you find the global optimal Can increase the use of randomness to further avoid getting stuck in local optima by choosing new xi randomly Set x1 x2 xp to random positions in each input string repeat until the best x1 x2 xp doesn t change too often for i 1 p Build a profile Q using sequences at x1 x2 xp except xi Choose xi according to the profile probability distribution of Q in string i Profile Probability Distribution New xi position chosen by previous version of algorithm left out sequence Score Aj of substring starting at each position j ttgccacaaaataatccgccttcgcaaattgacctacctcaatagcggtaccttccctaattactgcctgacag Current Profile Instead of choosing the position with the best match choose a position randomly such that Aj Probability of choosing position j i Ai Lawrence et al Science 1994 Recap Motif finding is the problem of finding a set of common substrings within a set of strings Useful for finding transcription factor binding sites Gibbs sampling repeatedly leave one sequence out and optimize the motif location in the left out sequence Doesn t guarantee finding a good solution but often works


View Full Document

UMD CMSC 423 - Motif Finding

Documents in this Course
Midterm

Midterm

8 pages

Lecture 7

Lecture 7

15 pages

Load more
Loading Unlocking...
Login

Join to view Motif Finding and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Motif Finding and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?