DOC PREVIEW
UMD CMSC 423 - Sequence Alignment

This preview shows page 1-2-3-24-25-26 out of 26 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 26 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 26 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 26 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 26 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 26 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 26 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 26 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

Sequence AlignmentCMSC 423: Sequence AlignmentSlides By: Carl KingsfordDepartment of Computer ScienceUniversity of Maryland, College ParkBased on Section 6.6 of Algorithm Design by Kleinberg & Tardos.The ProblemGiven: Two stringsa = a1a2a3a4. . . amb = b1b2b3b4. . . bnai, bi∈ L for some alphabet L like {A, C , G , T }.Compute how “similar” the two strings are.What do we mean by similarity between two strings?Alignment Examplesprin-ciple|||| |||XXprinncipal(1 gap, 2 mm)misspell||| ||||mis-pell(1 gap)aa-bb-ccaabb|X || | | |ababbbc-a-b-(5 gaps, 1 mm)prin-cip-le|||| ||| |prinncipal-(3 gaps, 0 mm)prehistoric||||||||---historic(3 gaps, 0 mm)al-go-rithm-|| XX ||X |alKhwariz-mi(4 gaps, 3 mm)Motivation•Alignment is used extensively in molecular biology, where aand b are the DNA sequences of two genes(see NCBI BLAST)•Spell checkers•Inexact search of web pagesNCBI BLASTNCBI BLAST Alignment>gb|AC115706.7| Mus musculus chromosome 8, clone RP23-382B3, complete sequenceQuery 1650 gtgtgtgtgggtgcacatttgtgtgtgtgtgcgcctgtgtgtgtgggtgcctgtgtgtgt 1709|||||||||| | || | ||||||||| | |||||||| ||| || |||||Sbjct 56838 GTGTGTGTGGAAGTGAGTTCATCTGTGTGTGCACATGTGTGTGCA--TGCATGCATGTGT 56895Query 1710 gtg-gggcacatttgtgtgtgtgtgtgtgcctgtgtgtgggtgcacatttgtgtgtgtgc 1768|| ||||| || ||| ||||||| |||||||| ||| ||| ||||| || |Sbjct 56896 GTCCGGGCA------TGCATGTCTGTGTGCATGTGTGTGTGTGTGCAT--GTGTGAGTAC 56947Query 1769 ctgtgtgtgtgtgcctgtgtgtgggggtgcacatttgtgtgtgtgtgtgcctgtgtgtgg 1828|||||||||| ||| ||| |||| | ||| ||| ||||| |||||| ||||| |Sbjct 56948 CTGTGTGTGTATGCTTGTATGTGTGTGTGTGCATGTGTGTAGGTGTGTATATGTGTAAGT 57007Query 1829 gggtgcacatttgtgtgtgtgtgtgcctgtgtgtgtgggtgcacatttgtgtgtgtgtgt 1888||| ||||||| |||||| |||| | ||| |||| |||||||||| ||Sbjct 57008 T------CATCTGTGTGTATGTGTG--TGTGAGAGTGCATGCA----TGTGTGTGTGAGT 57055Query 1889 gcctgtgtgt--gtgggtgcacatttgtgtgtgtgtgcctgtg--tgtgt--gggtgcac 1942| | ||||| ||| ||| || || | | | ||||| ||||| | ||| |Sbjct 57056 TCATCTGTGTCAGTGTATGCTTATGGGTATAACT-TAACTGTGCATGTGTAAGTGTGTTC 57114Query 1943 atttgtgtgtgtgtgtgcctgtgtgtgtgggtgcacatttgtgtgtgtgcctgtgtgtgg 2002|| ||||| |||||||| |||||| || || || | |||||||| |||||||Sbjct 57115 ATCTGTGTATGTGTGTG--TGTGTGAGTTAGTTCA----TCTGTGTGTGAGAGTGTGTGA 57168Query 2003 gtgcacatttgtgtgtgtgtgcctgtgtgtgtgtgcctgtgtgtgtgggtgcacatttgt 2062| | ||| |||||||| || | | ||| || ||| ||||| |||| ||| ||| ||Sbjct 57169 G--CTCATCTGTGTGTGAGTTCATCTGTATGAGTG--TGTGTATGTGTGTGTACAAATGA 57224Query 2063 gtgtgtgtgtgcctgtgtgtgtgggtgcacatttgtgtgtgtgtgtgtgcctgtgtgtgt 2122|| | ||||| |||||||||| ||| |||||| | || |||| ||||Sbjct 57225 GTTCATCTGTGCATGTGTGTGTG--------TTTAAGTGTGTTCATCTG--TGTGCGTGT 57274Query 2123 gtTGCAGGCCCTGGATGCCAGACACTGA 2150|| | |||| ||||||||||||Sbjct 57275 GTGGA------TGGAAGCCAGACACTGA 57296Cost FunctionParameters:•“gap” is the cost of leaving a gap.•cost(x, y) is the cost of matchingcharacter x with character y.Objective function:•Cost of an alignment is sum of the costs of the matches plusthe number of gaps.The Problem as a MatchingEach sequence is a set of nodes (one for each character).Looking for a low-cost matching between the two sequences:x x y z x xy y x x yxa=b=Cost of the matching is:gap × #unmatched +X(ai,bj)cost(ai, bj)“No-crossing” rule!Consider the last character in each stringConsider the last character in each string:a = a1a2a3a4. . . amb = b1b2b3b4. . . bnConsider the 4 possibilities:1 (am, bn) are matched to each other2 amis not matched at all3 bnis not matched at all4 amis matched to some bj(j 6= n), and bnis matched to someak(k 6= m).The Three PossibilitiesThis last possibility can’t happen:amis matched to some bj, and bnis matched to some ak.If that occurred, then there would be a crossing in the matching:ambnakbkSo, only possibilities are:1 (am, bn) is in the matching2 amis not matched3 bnis not matchedRecurrenceSuppose, for now, we aren’t interested in the actual alignment, butonly the cost of the optimal alignment.Turn the 3 possibilities into a recurrence:OPT (i, j) = mincost(ai, bj) + OPT (i − 1, j − 1) match ai, bjgap + OPT (i − 1, j) aiis not matchedgap + OPT (i, j − 1) bjis not matchedBase case: OPT (i, 0) = i × gap and OPT (0, j) = j × gap.(matching i characters to 0 characters must use i gaps.)Matrix0 1 2 3 4 5 6 7 8 9 10 11 1298765432109g8g7g6g5g4g3g2g1g0 1g 2g 3g 4g 5g 6g 7g 8g 9g10g 11g 12gOPT(i, j)OPT(i, j-1)OPT(i-1, j)OPT(i-1, j-1)ijOrder to Fill in Cells0 1 2 3 4 5 6 7 8 9 10 11 1298765432109g8g7g6g5g4g3g2g1g0 1g 2g 3g 4g 5g 6g 7g 8g 9g 10g 11g 12gCodeAlign(X,Y):For i = 1,...,m: A[i,0] = i*gapFor j = 1,...,n: A[0,j] = j*gapFor i = 1,...,m:For j = 1,...,n:A[i,j] = min(cost(a[i],b[j]) + A[i-1,j-1],gap + A[i-1,j],gap + A[i,j-1])EndForEndForReturn A[m,n]Running TimeNumber of subproblems = m × n.Each subproblem takes constant time to solve.Total running time = O(mn).Finding the Actual Alignment0 1 2 3 4 5 6 7 8 9 10 11 1298765432109g8g7g6g5g4g3g2g1g0 1g 2g 3g 4g 5g 6g 7g 8g 9g10g 11g 12gOPT(i, j)OPT(i, j-1)OPT(i-1, j)OPT(i-1, j-1)ijTraceback0 1 2 3 4 5 6 7 8 9 10 11 1298765432101g 2g 3g 4g 5g 6g 7g 8g 9g 10g 11g 12g9g8g7g6g5g4g3g2g1g0How do you output the result?ACGTA-GAFollow the backtrack pointers starting from entry (n, m):•If you follow a diagonal pointer, add both characters to thealignment.•If you follow a left pointer, add a gap to the y-axis string andthe x-axis character.•If you follow a down pointer, add a gap to the x-axis stringand the y-axis character.Recasting as a Graph(m,n)(0,0)a1a2a3a4a5b1b2b3b4gapgapedge from (i-1,j-1) to (i,j)has weight cost(ai,bj)Recasting as a GraphTheoremLet f (i, j) be the cost of the shortest path from (0, 0) to (i, j).Then OPT (i, j) = f (i, j) for all i, j.General DP PrinciplesMain Idea of Dynamic Programming: solve the subproblems inan order that makes sure when you need an answer, it’s alreadybeen computed.1 Optimal value of the original problem can be computed easilyfrom some subproblems.2 There are only a polynomial # of subproblems.3 There is a “natural” ordering of the subproblems fromsmallest to largest such that you can obtain the solution for asubproblem by only looking at smaller subproblems.Longest Common Subsequence(LCS)Longest Common Subsequence (LCS)Another example of the use of dynamic programming.Problem: Given 2 strings of DNA a1, . . . , anand b1, . . . , bm, findthe longest subsequence that is shared by both strings.Example:aact tc gga a| || | | | => ttcgaatatctg acacNote: the letters need not be consecutive in either string.LCS


View Full Document

UMD CMSC 423 - Sequence Alignment

Documents in this Course
Midterm

Midterm

8 pages

Lecture 7

Lecture 7

15 pages

Load more
Download Sequence Alignment
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Sequence Alignment and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Sequence Alignment 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?