New version page

UMD CMSC 423 - Motif Finding

Documents in this Course
Midterm

Midterm

8 pages

Lecture 7

Lecture 7

15 pages

Load more
Upgrade to remove ads

This preview shows page 1-2-3-4 out of 11 pages.

Save
View Full Document
Premium Document
Do you want full access? Go Premium and unlock all 11 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 11 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 11 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 11 pages.
Access to all documents
Download any document
Ad free experience

Upgrade to remove ads
Unformatted text preview:

Motif FindingCMSC 423Motif FindingGiven p sequences, find the most mutually similar length-k subsequences, one from each sequence:dist(si,sj) = Hamming distance between si and sj.1. ttgccacaaaataatccgccttcgcaaattgaccTACCTCAATAGCGGTAgaaaaacgcaccactgcctgacag2. gtaagtacctgaaagttacggtctgcgaacgctattccacTGCTCCTTTATAGGTAcaacagtatagtctgatgga3. ccacacggcaaataaggagTAACTCTTTCCGGGTAtgggtatacttcagccaatagccgagaatactgccattccag4. ccatacccggaaagagttactccttatttgccgtgtggttagtcgcttTACATCGGTAAGGGTAgggattttacagca5. aaactattaagatttttatgcagatgggtattaaggaGTATTCCCCATGGGTAacatattaatggctctta6. ttacagtctgttatgtggtggctgttaaTTATCCTAAAGGGGTAtcttaggaatttacttTranscription factorargmins1,...,sp!i<jdist(si,sj)Hundreds of papers, many formulations (Tompa05)Motif-finding by Gibbs Sampling“Gibbs sampling” is the basis behind a general class of algorithms that is a type of local search.It doesn’t guarantee good performance, but often works well in practice.Assumes:1. we know the length k of the motif we are looking for.2. each input sequence contains exactly 1 real instance of the motif.Problem. Given p strings and a length k, find the most “mutually similar” length-k substring from each string.Gibbs Sampling: Profiles1. ttgccacaaaataatccgccttcgcaaattgaccTACCTCAATAGCGGTAgaaaaacgcaccactgcctgacag2. gtaagtacctgaaagttacggtctgcgaacgctattccacTGCTCCTTTATAGGTAcaacagtatagtctga3. ccacacggcaaataaggagTAACTCTTTCCGGGTAtgggtatacttcagccaatagccgagaatactgccatt4. ccatacccggaaagagttactccttatttgccgtgtggttagtcgcttTACATCGGTAAGGGTAgggatttt5. aaactattaagatttttatgcagatgggtattaaggaGTATTCCCCATGGGTAacatattaatggctctta6. ttacagtctgttatgtggtggctgttaaTTATCCTAAAGGGGTAtcttaggaatttacttIf we knew the starting point of the motif in each sequence, we could construct a Sequence Profile (PSSM) for the motif:x1x2x3x4x5x6TACCTCAATAGCGGTATGCTCCTTTATAGGTATAACTCTTTCCGGGTATACATCGGTAAGGGTAGTATTCCCCATGGGTATTATCCTAAAGGGGTAGibbs Sampling, Version 1: PseudocodeSet (x1, x2, ..., xp) to random positions in each input string.repeat until the answer (x1, x2, ..., xp) doesn’t changefor i = 1 ... p:Build a profile Q using sequences at (x1, x2, ..., xp) except xiSet xi to where the profile Q matches best in string i.def gibbs(Seqs, k): """Seqs is a list of strings. Find the best motif.""" # start with random indices I = [random.randint(0, len(x) - k) for x in Seqs] LastI = None while I != LastI: # repeat until nothing changes LastI = list(I) # iterate through every string for i in xrange(len(Seqs)): # compute the profile for the sequences except i P = profile_for([ x[j : j + k] for q, (x, j) in enumerate(zip(Seqs, I)) if q != i ]) # find the place the profile matches best best = None for j in xrange(len(Seqs[i]) - k + 1): score = profile_score(P, Seqs[i][j : j + k]) if score > best or best is None: best = score bestpos = j # update the ith position with the best I[i] = bestpos return I, [x[j : j + k] for x, j in zip(Seqs, I)]Gibbs Examplegibbs(["thequickdog", "browndog", "dogwood"], k=3)1: [8, 1, 2] ['dog', 'row', 'gwo']2: [8, 5, 0] ['dog', 'dog', 'dog']F: [8, 5, 0] ['dog', 'dog', 'dog']gibbs(["thequickdog", "browndog", "dogwood"], k=3)1: [4, 3, 1] ['uic', 'wnd', 'ogw']2: [6, 2, 4] ['ckd', 'own', 'ood']3: [8, 5, 0] ['dog', 'dog', 'dog']F: [8, 5, 0] ['dog', 'dog', 'dog']gibbs(["thequickdog", "browndog", "dogwood"], k=3)1: [2, 0, 1] ['equ', 'bro', 'ogw']2: [7, 4, 2] ['kdo', 'ndo', 'gwo']F: [7, 4, 2] ['kdo', 'ndo', 'gwo']random starting positionsSmall bias toward “o” in the middle is correct.Might not find the optimal.Another Examplegibbs(["aaa123", "678aaa45", "9a7aaab", "32aa19a8aaa"], 3)1: [0, 5, 0, 2] ['aaa', 'a45', '9a7', 'aa1']2: [1, 3, 3, 8] ['aa1', 'aaa', 'aaa', 'aaa']3: [0, 3, 3, 8] ['aaa', 'aaa', 'aaa', 'aaa']F: [0, 3, 3, 8] ['aaa', 'aaa', 'aaa', 'aaa']Bias toward “a” in the profile quickly leads to finding the implanted “aaa”gibbs(["aaabbb", "bbbaaabb", 'babaaab', 'ababacaaabac', 'abbbababaaabbbaba'], 3)1: [1, 4, 0, 4, 11] ['aab', 'aab', 'bab', 'aca', 'bbb']2: [1, 4, 4, 7, 9] ['aab', 'aab', 'aab', 'aab', 'aab']F: [1, 4, 4, 7, 9] ['aab', 'aab', 'aab', 'aab', 'aab']gibbs(["aaabbb", "bbbaaabb", 'babaaab', 'ababacaaabac', 'abbbababaaabbbaba'], 3)1: [0, 3, 3, 3, 8] ['aaa', 'aaa', 'aaa', 'bac', 'aaa']2: [0, 3, 3, 6, 8] ['aaa', 'aaa', 'aaa', 'aaa', 'aaa']F: [0, 3, 3, 6, 8] ['aaa', 'aaa', 'aaa', 'aaa', 'aaa']Can be multiple optimal answersRandomness: Gibbs Sampling•Run the Gibbs sampling multiple times to make it more likely you find the global optimal.•Can increase the use of randomness to further avoid getting stuck in local optima by choosing new xi randomly.Set (x1, x2, ..., xp) to random positions in each input string.repeat until the best (x1, x2, ..., xp) doesn’t change too oftenfor i = 1 ... p:Build a profile Q using sequences at (x1, x2, ..., xp) except xiChoose xi according to the profile probability distribution of Q in string i.Profile Probability DistributionttgccacaaaataatccgccttcgcaaattgacctacctcaatagcggtaccttccctaattactgcctgacagCurrent Profileleft-out sequenceScore Aj of substring starting at each position j.New xi position chosen by previous version of algorithmInstead of choosing the position with the best match, choose a position randomly such that:Probability of choosing position j =Aj�iAi(Lawrence, et al., Science, 1994)Recap•“Motif finding” is the problem of finding a set of common substrings within a set of strings.•Useful for finding transcription factor binding sites.-Gibbs sampling: repeatedly leave one sequence out and optimize the motif location in the left-out sequence.-Doesn’t guarantee finding a good solution, but often


View Full Document
Download Motif Finding
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Motif Finding and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Motif Finding 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?