CMSC423 Fall 2008 1CMSC423: Bioinformatic Algorithms, Databases and ToolsLecture 9inexact alignmentdynamic programming, gapped alignmentCMSC423 Fall 2008 2Recap3Global alignment recapC--AGACTGA G GATGCAGCGTAGGTCAGACValue(A,A) = 10Value(A,G) = -5Value(A,-) = -2Score[i,j] is the maximum of:1. Score[i-1, j-1] + Value[S1[i-1],S2[j-1]] (S1[i-1], S2[j-1] aligned)2. Score[i – 1, j] + Value[S1[i], -] (S1[i] aligned to gap)3. Score[i, j – 1] + Value[-, S2[j]] (S2[j] aligned to gap)4Global alignment recap1920910140-14-28C-28-24-20-16-12-8-40--24-20-16-12-8-4-2024131434-10AGACTG2410141848-61014378-6-2-13-9-5A-226G-134812048-31-14-10-6-22GATGCAG-C-GTAG-GTCAG-ACValue(A,A) = 10Value(A,G) = -5Value(A,-) = -4Score[i,j] is the maximum of:1. Score[i-1, j-1] + Value[S1[i-1],S2[j-1]] (S1[i-1], S2[j-1] aligned)2. Score[i – 1, j] + Value[S1[i], -] (S1[i] aligned to gap)3. Score[i, j – 1] + Value[-, S2[j]] (S2[j] aligned to gap)5Local alignment recapC--AGACTGA G GATGCAGCGTAGGTCAGACValue(A,A) = 10Value(A,G) = -5Value(A,-) = -2Score[i,j] is the maximum of:0. 01. Score[i-1, j-1] + Value[S1[i-1],S2[j-1]] (S1[i-1], S2[j-1] aligned)2. Score[i – 1, j] + Value[S1[i], -] (S1[i] aligned to gap)3. Score[i, j – 1] + Value[-, S2[j]] (S2[j] aligned to gap)CMSC423 Fall 2008 6Alignment scoresCMSC423 Fall 2008 7Where do the alignment scores come from?• PAM matrices– PAM1 – based on frequency of mutations between closely related proteins (within 1 "evolutionary step")– PAM 2 - ... within 2 evolutionary steps–... PAM 250 – commonly used•BLOSUM matrices–Frequency of mutations between proteins that are x% similar– BLOSUM100 – based on proteins that are exactly the same (e.g. score(A,A) is defined but not score(A,G) )– BLOSUM62 – commonly used• gap scores usually determined empiricallyCMSC423 Fall 2008 8BLOSUM62CMSC423 Fall 2008 9HeuristicsCMSC423 Fall 2008 10Heuristics• What if limit the # of differences allowed? E.g. we expect the sequences to be very similar.• Compute 'banded' alignment – stay within # of differences (k) from the diagonal.•Optimal alignment cannot stray too far from diagonal•What if we do not know k? Do binary search to find itkkO(km) runningtime and spaceCMSC423 Fall 2008 11Exclusion methods• Assume P must match T with at most k errors. Find places in T where P cannot match.•Split P into floor(n/k+1)-sized chunks.• If P matches T with less than k errors => at least one chunk matches with no errors•Use any exact matching algorithm to find places where a chunk matches T, then run dynamic programming in that vicinity. • Running time, on average O(m)CMSC423 Fall 2008 12Exclusion methodsExact matchPutative alignmentTextPatternCMSC423 Fall 2008 13"Famous" approaches• FASTA (Pearson et al.)– Take all k-mers (substrings of length k) from Pattern and identify whether and where they match in the Text– Assume the k-mer starting at pos'n i in Pattern matches at position j in Text, remember (j – i) – the diagonal on which the match occured–Identify "heavy" diagonals – diagonals where many k-mers match, then refine the diagonals with Smith Waterman–Also look for off-diagonal matches to account for gapsCMSC423 Fall 2008 14"Famous" approaches• BLAST (Altschul et al.)– Find short k-mer matches–Also search for possible inexact matches, e.g. all k-mers within 1 difference from current one.–Extend exact matches with Smith-Waterman algorithm– Assign probabilistic scores to matches: what is the probability of finding a match with the same S-W alignment score just by chance (e.g. matching a random string)?CMSC423 Fall 2008 15Chaining approach• Extends the FASTA idea• Search for exact matches•Find the longest consistent chain of exact matches•Fill in the gaps in the chain using Smith-Waterman• This is the approach used by MUMmer (Delcher et al.)•MUM – maximally unique match (see mummer.sourceforge.net)CMSC423 Fall 2008 16Chaining in 1-D • Input: multiple overlapping intervals on a line•Output: highest weight set of non-overlapping intervals• Weight could be length of interval, or Smith-Waterman score, etc.•Sort the endpoints (starts, ends) of the intervals• For every interval j, store V[j] – best score of a chain ending in j• MAX – store highest V[j] seen sofar•Process endpoints in increasing order of x coordinate• If we encounter left end (start) of interval j–V[j] = weight(j) + MAX• If we encounter right end (end) of interval j– MAX = max{V[j], MAX}• Running
View Full Document