New version page

# UMD CMSC 423 - Motif Finding

Pages: 11
Documents in this Course

15 pages

3 pages

26 pages

5 pages

13 pages

18 pages

21 pages

16 pages

8 pages

3 pages

20 pages

17 pages

2 pages

32 pages

18 pages

31 pages

31 pages

16 pages

20 pages

28 pages

9 pages

2 pages

17 pages

9 pages

22 pages

15 pages

18 pages

14 pages

13 pages

2 pages

16 pages

2 pages

14 pages

20 pages

## This preview shows page 1-2-3-4 out of 11 pages.

View Full Document
Do you want full access? Go Premium and unlock all 11 pages.
Do you want full access? Go Premium and unlock all 11 pages.
Do you want full access? Go Premium and unlock all 11 pages.
Do you want full access? Go Premium and unlock all 11 pages.

Unformatted text preview:

Motif FindingCMSC 423Motif FindingGiven p sequences, find the most mutually similar length-k subsequences, one from each sequence:dist(si,sj) = Hamming distance between si and sj.1. ttgccacaaaataatccgccttcgcaaattgaccTACCTCAATAGCGGTAgaaaaacgcaccactgcctgacag2. gtaagtacctgaaagttacggtctgcgaacgctattccacTGCTCCTTTATAGGTAcaacagtatagtctgatgga3. ccacacggcaaataaggagTAACTCTTTCCGGGTAtgggtatacttcagccaatagccgagaatactgccattccag4. ccatacccggaaagagttactccttatttgccgtgtggttagtcgcttTACATCGGTAAGGGTAgggattttacagca5. aaactattaagatttttatgcagatgggtattaaggaGTATTCCCCATGGGTAacatattaatggctctta6. ttacagtctgttatgtggtggctgttaaTTATCCTAAAGGGGTAtcttaggaatttacttTranscription factorargmins1,...,sp!i<jdist(si,sj)Hundreds of papers, many formulations (Tompa05)Motif-finding by Gibbs Sampling“Gibbs sampling” is the basis behind a general class of algorithms that is a type of local search.It doesn’t guarantee good performance, but often works well in practice.Assumes:1. we know the length k of the motif we are looking for.2. each input sequence contains exactly 1 real instance of the motif.Problem. Given p strings and a length k, find the most “mutually similar” length-k substring from each string.Gibbs Sampling: Profiles1. ttgccacaaaataatccgccttcgcaaattgaccTACCTCAATAGCGGTAgaaaaacgcaccactgcctgacag2. gtaagtacctgaaagttacggtctgcgaacgctattccacTGCTCCTTTATAGGTAcaacagtatagtctga3. ccacacggcaaataaggagTAACTCTTTCCGGGTAtgggtatacttcagccaatagccgagaatactgccatt4. ccatacccggaaagagttactccttatttgccgtgtggttagtcgcttTACATCGGTAAGGGTAgggatttt5. aaactattaagatttttatgcagatgggtattaaggaGTATTCCCCATGGGTAacatattaatggctctta6. ttacagtctgttatgtggtggctgttaaTTATCCTAAAGGGGTAtcttaggaatttacttIf we knew the starting point of the motif in each sequence, we could construct a Sequence Profile (PSSM) for the motif:x1x2x3x4x5x6TACCTCAATAGCGGTATGCTCCTTTATAGGTATAACTCTTTCCGGGTATACATCGGTAAGGGTAGTATTCCCCATGGGTATTATCCTAAAGGGGTAGibbs Sampling, Version 1: PseudocodeSet (x1, x2, ..., xp) to random positions in each input string.repeat until the answer (x1, x2, ..., xp) doesn’t changefor i = 1 ... p:Build a profile Q using sequences at (x1, x2, ..., xp) except xiSet xi to where the profile Q matches best in string i.def gibbs(Seqs, k): """Seqs is a list of strings. Find the best motif.""" # start with random indices I = [random.randint(0, len(x) - k) for x in Seqs] LastI = None while I != LastI: # repeat until nothing changes LastI = list(I) # iterate through every string for i in xrange(len(Seqs)): # compute the profile for the sequences except i P = profile_for([ x[j : j + k] for q, (x, j) in enumerate(zip(Seqs, I)) if q != i ]) # find the place the profile matches best best = None for j in xrange(len(Seqs[i]) - k + 1): score = profile_score(P, Seqs[i][j : j + k]) if score > best or best is None: best = score bestpos = j # update the ith position with the best I[i] = bestpos return I, [x[j : j + k] for x, j in zip(Seqs, I)]Gibbs Examplegibbs(["thequickdog", "browndog", "dogwood"], k=3)1: [8, 1, 2] ['dog', 'row', 'gwo']2: [8, 5, 0] ['dog', 'dog', 'dog']F: [8, 5, 0] ['dog', 'dog', 'dog']gibbs(["thequickdog", "browndog", "dogwood"], k=3)1: [4, 3, 1] ['uic', 'wnd', 'ogw']2: [6, 2, 4] ['ckd', 'own', 'ood']3: [8, 5, 0] ['dog', 'dog', 'dog']F: [8, 5, 0] ['dog', 'dog', 'dog']gibbs(["thequickdog", "browndog", "dogwood"], k=3)1: [2, 0, 1] ['equ', 'bro', 'ogw']2: [7, 4, 2] ['kdo', 'ndo', 'gwo']F: [7, 4, 2] ['kdo', 'ndo', 'gwo']random starting positionsSmall bias toward “o” in the middle is correct.Might not find the optimal.Another Examplegibbs(["aaa123", "678aaa45", "9a7aaab", "32aa19a8aaa"], 3)1: [0, 5, 0, 2] ['aaa', 'a45', '9a7', 'aa1']2: [1, 3, 3, 8] ['aa1', 'aaa', 'aaa', 'aaa']3: [0, 3, 3, 8] ['aaa', 'aaa', 'aaa', 'aaa']F: [0, 3, 3, 8] ['aaa', 'aaa', 'aaa', 'aaa']Bias toward “a” in the profile quickly leads to finding the implanted “aaa”gibbs(["aaabbb", "bbbaaabb", 'babaaab', 'ababacaaabac', 'abbbababaaabbbaba'], 3)1: [1, 4, 0, 4, 11] ['aab', 'aab', 'bab', 'aca', 'bbb']2: [1, 4, 4, 7, 9] ['aab', 'aab', 'aab', 'aab', 'aab']F: [1, 4, 4, 7, 9] ['aab', 'aab', 'aab', 'aab', 'aab']gibbs(["aaabbb", "bbbaaabb", 'babaaab', 'ababacaaabac', 'abbbababaaabbbaba'], 3)1: [0, 3, 3, 3, 8] ['aaa', 'aaa', 'aaa', 'bac', 'aaa']2: [0, 3, 3, 6, 8] ['aaa', 'aaa', 'aaa', 'aaa', 'aaa']F: [0, 3, 3, 6, 8] ['aaa', 'aaa', 'aaa', 'aaa', 'aaa']Can be multiple optimal answersRandomness: Gibbs Sampling•Run the Gibbs sampling multiple times to make it more likely you find the global optimal.•Can increase the use of randomness to further avoid getting stuck in local optima by choosing new xi randomly.Set (x1, x2, ..., xp) to random positions in each input string.repeat until the best (x1, x2, ..., xp) doesn’t change too oftenfor i = 1 ... p:Build a profile Q using sequences at (x1, x2, ..., xp) except xiChoose xi according to the profile probability distribution of Q in string i.Profile Probability DistributionttgccacaaaataatccgccttcgcaaattgacctacctcaatagcggtaccttccctaattactgcctgacagCurrent Profileleft-out sequenceScore Aj of substring starting at each position j.New xi position chosen by previous version of algorithmInstead of choosing the position with the best match, choose a position randomly such that:Probability of choosing position j =Aj�iAi(Lawrence, et al., Science, 1994)Recap•“Motif finding” is the problem of finding a set of common substrings within a set of strings.•Useful for finding transcription factor binding sites.-Gibbs sampling: repeatedly leave one sequence out and optimize the motif location in the left-out sequence.-Doesn’t guarantee finding a good solution, but often

View Full Document