Pairwise Sequence Comparison
Stat 246, Spring 2002, Week 5

Sequence comparison: topics
General concepts
Dot plots
Global alignments
Scoring matrices
Gap penalties
Dynamic programming
Chance or common ancestry

Dot Plot
This is the earliest, simplest and most complete method for comparing two sequences. It is possible to filter the plot to minimise noise whilst preserving the obvious relationship. This plot can identify regions of similarity, internal repeats and rearrangement events.

Sequence 1 (a) runs down the side and sequence 2 (b) runs along the top.
Add a guard row and column.
A dot goes where the two sequences match.
Connect the dots along diagonals.
[Figure: dot plot grid for a = AGCACACA (down) and b = ACACACTA (along).]

Extensions to dot plots
Modern dot plots are more sophisticated, using two notions:
window: the size of the diagonal strip, centered on an entry, over which matching is accumulated;
stringency: the extent of agreement required over the window before a dot is placed at the central entry.
For example, for a window of size 5 we might require at least 3 matches before putting a dot in the central spot. More complex scoring rules can be used.

[Figure: Human globin vs human myoglobin. beta human (residues 1 to 146) against myo human (residues 1 to 153). COMPARE window 30, stringency 9.0; 1,097 points.]
[Figure: Human LDL receptor vs itself (ldlrecep, residues 1 to 860). COMPARE window 30, stringency 9.0; 32,253 points.]
[Figure: Human LDL receptor vs itself. COMPARE window 40, stringency 15.0; 5,287 points.]
[Figure: Human LDL receptor vs itself. COMPARE window 40, stringency 17.5; 3,079 points.]
[Figure: Human LDL receptor vs itself. COMPARE window 40, stringency 20.0; 2,295 points.]
[Figure: Plasmodium falciparum MSP3 vs itself, window 30, stringency 9.]
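The window and stringency idea can be sketched in a few lines of Python. This is a minimal illustration, not the COMPARE program used for the figures: it counts exact character matches along the diagonal (COMPARE-style tools typically accumulate scores from a comparison table instead), it assumes an odd window, and it treats positions that fall off the end of a sequence as non-matches. The function names `dot_plot` and `render` are hypothetical.

```python
def dot_plot(a, b, window=1, stringency=1):
    """Return the set of (i, j) cells that receive a dot.

    a runs down the side (rows), b along the top (columns).
    A dot is placed at (i, j) when at least `stringency` of the
    `window` positions on the diagonal centered there are matches.
    Assumption: window is odd; off-sequence positions never match.
    """
    half = window // 2
    dots = set()
    for i in range(len(a)):
        for j in range(len(b)):
            matches = 0
            for k in range(-half, half + 1):
                p, q = i + k, j + k
                if 0 <= p < len(a) and 0 <= q < len(b) and a[p] == b[q]:
                    matches += 1
            if matches >= stringency:
                dots.add((i, j))
    return dots


def render(a, b, dots):
    """Draw the plot as text: '*' for a dot, '.' for an empty cell."""
    lines = ["  " + " ".join(b)]
    for i, ch in enumerate(a):
        row = " ".join("*" if (i, j) in dots else "." for j in range(len(b)))
        lines.append(ch + " " + row)
    return "\n".join(lines)


a = "AGCACACA"  # sequence 1, down
b = "ACACACTA"  # sequence 2, along

plain = dot_plot(a, b)                             # one dot per matching pair
filtered = dot_plot(a, b, window=5, stringency=3)  # the slide's example setting

print(render(a, b, plain))
print()
print(render(a, b, filtered))
```

With window 1 and stringency 1 every matching pair gets a dot. Raising the window and stringency removes isolated dots, such as the lone match in the top right corner, while keeping the diagonal runs.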
[Figures: Plasmodium falciparum MSP3 vs itself (msp3, residues 1 to 380), COMPARE stringency 9.0. Window 30: 45,071 points. Window 20: 15,619 points. Window 10: 1,263 points.]

Global alignment
An alignment of two sequences a and b is an arrangement of a and b by position, where a and b can be padded with gap symbols to achieve the same length:

a: AGCACAC-A        a: AG-CACACA
b: A-CACACTA   or   b: ACACACT-A

If we read the alignment column-wise, we have a protocol of edit operations that lead from a to b.

Left alignment      Right alignment
Match   (A, A)      Match   (A, A)
Delete  (G, -)      Replace (G, C)
Match   (C, C)      Insert  (-, A)
Match   (A, A)      Match   (C, C)
Match   (C, C)      Match   (A, A)
Match   (A, A)      Match   (C, C)
Match   (C, C)      Replace (A, T)
Insert  (-, T)      Delete  (C, -)
Match   (A, A)      Match   (A, A)

The left-hand alignment shows one Delete, one Insert and seven Matches. The right-hand alignment shows one Insert, one Delete, two Replaces and five Matches.

Scoring (cost) of global alignments; optimal global alignments
Next we turn the edit protocol into a measure of distance by assigning a cost or weight S to each operation. For example, for arbitrary characters u, v we may define

S(u, u) = 0
S(u, v) = 1 for u ≠ v
S(u, -) = S(-, v) = 1

This scheme is known as the Levenshtein distance, also called the unit cost model. Its predominant virtue is its simplicity. In general, more sophisticated cost models must be used. For example, replacing an amino acid by a biochemically similar one should weigh less than a replacement by an amino acid with totally different properties. Details shortly.

Now we are ready to define the most important notion for sequence analysis.
The cost of an alignment of two sequences a and b is the sum of the costs of all the edit operations that lead from a to b.
An optimal alignment of a and b is an alignment which has minimal cost among all possible alignments.
The edit distance of a and b is the cost of an optimal alignment of a and b under a cost function S. We denote it by d(a, b).

Using the unit cost model for S in our previous example, we obtain the following costs:

a: AGCACAC-A        a: AG-CACACA
b: A-CACACTA        b: ACACACT-A
cost: 2             cost: 4

Here it is easily seen that the left-hand alignment is optimal under the unit cost model, and hence the edit distance is d(a, b) = 2.

More general scores/costs: see later.

134 LQQGELDLVMTSDILPRSELHYSPMFDFEVRLVLAPDHPLASKTQITPEDLASETLLI
137 LDSNSVDLVLMGVPPRNVEVEAEAFMDNPLVVIAPPDHPLAGERAISLARLAEETFVM

[Table: amino acid substitution score matrix, rows and columns C S T P A G N D E Q H R K M I L V F Y W. From Henikoff 1996.]

Scoring Matrices …
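The unit cost definitions for alignment cost and edit distance given earlier can be checked with a short Python sketch. The alignment cost is just a sum over columns. For the edit distance, the sketch uses the standard dynamic-programming recurrence over prefixes, a technique this deck lists as a topic but does not spell out in this section, so treat it as an assumption about the method rather than the course's own code. The names `unit_cost`, `alignment_cost` and `edit_distance` are hypothetical.

```python
GAP = "-"


def unit_cost(u, v):
    """Unit cost S: 0 for a match, 1 for a replace, insert or delete."""
    return 0 if u == v else 1


def alignment_cost(row_a, row_b, cost=unit_cost):
    """Sum the column costs of a gapped alignment given as two padded rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return sum(cost(u, v) for u, v in zip(row_a, row_b))


def edit_distance(a, b, cost=unit_cost):
    """Cost of an optimal global alignment of a and b.

    Standard prefix recurrence (assumed here, not taken from the slides):
    D[i][j] = min(D[i-1][j-1] + S(a_i, b_j),   # match or replace
                  D[i-1][j]   + S(a_i, -),     # delete
                  D[i][j-1]   + S(-, b_j))     # insert
    Only the previous row is kept in memory.
    """
    prev = [0]
    for j in range(1, len(b) + 1):
        prev.append(prev[j - 1] + cost(GAP, b[j - 1]))
    for i in range(1, len(a) + 1):
        curr = [prev[0] + cost(a[i - 1], GAP)]
        for j in range(1, len(b) + 1):
            curr.append(min(
                prev[j - 1] + cost(a[i - 1], b[j - 1]),
                prev[j] + cost(a[i - 1], GAP),
                curr[j - 1] + cost(GAP, b[j - 1]),
            ))
        prev = curr
    return prev[-1]


left = alignment_cost("AGCACAC-A", "A-CACACTA")
right = alignment_cost("AG-CACACA", "ACACACT-A")
distance = edit_distance("AGCACACA", "ACACACTA")

print(left, right, distance)  # prints: 2 4 2
```

Passing a different `cost` function, for example one backed by a substitution score matrix converted to costs, would give the more general models the slides defer to later.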