Berkeley STATISTICS 246 - Pairwise Sequence Comparison - D2701627

School name University of California, Berkeley

Course Statistics 246- Statistical Genetics

Pages 82

This preview shows page 1-2-3-4-5-39-40-41-42-43-44-78-79-80-81-82 out of 82 pages.

Pairwise Sequence Comparison
Stat 246, Spring 2002, Week 5

Sequence comparison: topics
• General concepts
• Dot plots
• Global alignments
• Scoring matrices
• Gap penalties
• Dynamic programming
• Chance or common ancestry?

Dot Plot
This is the earliest, simplest and most complete method for comparing two sequences. It is possible to filter the plot to minimise noise whilst preserving the obvious relationship. This plot can identify
• regions of similarity
• internal repeats
• rearrangement events

[Figure: dot plot of a = AGCACACA (sequence 1, down) against b = ACACACTA (sequence 2, along).]
A dot goes where the two sequences match. (Add a "guard" row and column.) Connect the dots along diagonals.

Extensions to dot plots
Modern dot plots are more sophisticated, using the notions of
• window: size of the diagonal strip centered on an entry, over which matching is accumulated, and
• stringency: the extent of agreement required over the window before a dot is placed at the central entry.
E.g. for a window of size 5, we might require at least 3 matches, and then we put a dot in the central spot. More complex scoring rules can be used.

[Figure: Human β globin vs. human myoglobin. COMPARE, window 30, stringency 9.0, 1,097 points; myo-human.pep (1 to 153) against beta-human.pep (1 to 146).]
[Figure: Human LDL receptor vs. itself (w=30, s=9). COMPARE, window 30, stringency 9.0, 32,253 points; ldlrecep.pep (1 to 860) on both axes.]
[Figure: Human LDL receptor vs. itself (40, 15). COMPARE, window 40, stringency 15.0, 5,287 points.]
[Figure: Human LDL receptor vs. itself (40, 17.5). COMPARE, window 40, stringency 17.5, 3,079 points.]
[Figure: Human LDL receptor vs. itself (40, 20). COMPARE, window 40, stringency 20.0, 2,295 points.]
[Figure: Plasmodium falciparum MSP3 vs. itself (30, 9). COMPARE, window 30, stringency 9.0, 45,071 points; msp3.pep (1 to 380) on both axes.]
[Figure: Plasmodium falciparum MSP3 vs. itself (20, 9). COMPARE, window 20, stringency 9.0, 15,619 points.]
[Figure: Plasmodium falciparum MSP3 vs. itself (10, 9). COMPARE, window 10, stringency 9.0, 1,263 points.]

Global alignment
An alignment of two sequences a and b is an arrangement of a and b by position, where a and b can be padded with gap symbols to achieve the same length:

a: AGCACAC-A    or    AG-CACACA
b: A-CACACTA          ACACACT-A

If we read the alignment column-wise, we have a protocol of edit operations that lead from a to b.

Left:            Right:
Match (A,A)      Match (A,A)
Delete (G,-)     Replace (G,C)
Match (C,C)      Insert (-,A)
Match (A,A)      Match (C,C)
Match (C,C)      Match (A,A)
Match (A,A)      Match (C,C)
Match (C,C)      Replace (A,T)
Insert (-,T)     Delete (C,-)
Match (A,A)      Match (A,A)

The left-hand alignment shows one Delete and one Insert; the other edit operations are Matches.
The right-hand alignment shows one Insert, one Delete, two Replaces, and otherwise only trivial operations (Matches).

Cost (scoring) of global alignments; optimal global alignments
Next we turn the edit protocol into a measure of distance by assigning a "cost" or "weight" S to each operation. For example, for arbitrary characters u, v from the alphabet A we may define

S(u,u) = 0; S(u,v) = 1 for u ≠ v; S(u,-) = S(-,v) = 1. (Unit Cost)

This scheme is known as the Levenshtein distance, also called the unit cost model. Its predominant virtue is its simplicity. In general, more sophisticated cost models must be used. For example, replacing an amino acid by a biochemically similar one should weigh less than a replacement by an amino acid with totally different properties. Details shortly.
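The column-wise reading of an alignment under the unit cost model can be sketched in a few lines of Python. This is only an illustration, not code from the course: the names `unit_cost` and `alignment_cost` are hypothetical, and the gap symbol is assumed to be `-` as in the alignments above.

```python
def unit_cost(u: str, v: str) -> int:
    """Unit cost S: 0 for a Match, 1 for a Replace, Insert or Delete."""
    return 0 if u == v else 1


def alignment_cost(row_a: str, row_b: str, gap: str = "-") -> int:
    """Sum the unit costs of the edit operations read column by column."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for u, v in zip(row_a, row_b):
        if u == gap and v == gap:
            raise ValueError("a column cannot pair two gap symbols")
        # (u, gap) is a Delete, (gap, v) an Insert; both cost 1 here.
        total += unit_cost(u, v)
    return total


# The two example alignments of a = AGCACACA and b = ACACACTA.
left = alignment_cost("AGCACAC-A", "A-CACACTA")
right = alignment_cost("AG-CACACA", "ACACACT-A")
print(left, right)  # 2 4
```

Swapping `unit_cost` for a function that looks up a residue-specific cost would give the more sophisticated cost models mentioned above without changing the summation.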
Now we are ready to define the most important notion for sequence analysis:
• The cost of an alignment of two sequences a and b is the sum of the costs of all the edit operations that lead from a to b.
• An optimal alignment of a and b is an alignment which has minimal cost among all possible alignments.
• The edit distance of a and b is the cost of an optimal alignment of a and b under a cost function S. We denote it by d(a,b).

Using the unit cost model for S in our previous example, we obtain the following costs:

a: AGCACAC-A    or    AG-CACACA
b: A-CACACTA          ACACACT-A
cost: 2               cost: 4

Here it is easily seen that the left-hand alignment is optimal under the unit cost model, and hence the edit distance d(a,b) = 2.
More general scores = - costs: see later.

C   9
S  -1  4
T  -1  1  5
P  -3 -1 -1  7
A   0  1  0 -1  4
G  -3  0 -2 -2  0  6
N  -3  1  0 -2 -2  0  6
D  -3  0 -1 -1 -2 -1  1  6
E  -4  0 -1 -1 -1 -2  0  2  5
Q  -3  0 -1 -1 -1 -2  0  0  2  5
H  -3 -1 -2 -2 -2 -2  1 -1  0  0  8
R  -3 -1 -1 -2 -1 -2  0 -2  0  1  0  5
K  -3  0 -1 -1 -1 -2  0 -1  1  1 -1  2  5
M  -1 -1 -1 -2 -1 -3 -2 -3 -2  0 -2 -1 -1  5
I  -1 -2 -1 -3 -1 -4 -3 -3 -3 -3 -3 -3 -3  1  4
L  -1 -2 -1 -3 -1 -4 -3 -4 -3 -2 -3 -2 -2  2  2  4
V  -1 -2  0 -2  0 -3 -3 -3 -2 -2 -3 -3 -2  1  3  1  4
F  -2 -2 -2 -4 -2 -3 -3 -3 -3 -3 -1 -3 -3  0  0  0 -1  6
Y  -2 -2 -2 -3 -2 -3 -2 -3 -2 -1  2 -2 -2 -1 -1 -1 -1  3  7
W  -2 -3 -2 -4 -3 -2 -4 -4 -3 -2 -2 -3 -3 -1 -3 -2 -3  1  2 11
    C  S  T  P  A  G  N  D  E  Q  H  R  K  M  I  L  V  F  Y  W

134 LQQGELDLVMTSDILPRSELHYSPMFDFEVRLVLAPDHPLASKTQITPEDLASETLLI
    |     |||         |       |        ||||||    |    || ||
137 LDSNSVDLVLMGVPPRNVEVEAEAFMDNPLVVIAPPDHPLAGERAISLARLAEETFVM

D:D = +6
D:R = -2
From Henikoff 1996

Scoring Matrices
• Physical/chemical similarities: comparing two sequences according to the properties of their residues may highlight regions of structural similarity.
• Identity matrices: by stressing only identities in the alignment, stretches of sequence that may have diverged will not penalise any remaining common features.

Scoring Matrices (ctd)
• As the direct source of residue-by-residue comparison scores, the scoring matrix you choose will have a major impact on the alignment calculated.
• The most commonly used will be one of the mutation matrices PAM or BLOSUM.
• Von Bing will explain the derivation of these and other mutation matrices next Tuesday.
• The matrix that performs best will be the matrix that best reflects the evolutionary separation of the sequences being aligned.

Statistical motivation for alignment scores
pr(data|H) = pr( |H) = pr( |H) × ... = (1-p)^a p^d
d = # disagreements, a = # agreements, p = (1 - e^{-8αt})
pr(data|R) = pr(
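Returning to the edit distance d(a,b) defined earlier under the unit cost model: it can be computed by dynamic programming, one of the topics listed at the start. The preview does not show the course's own formulation, so the following is a minimal Python sketch of the standard Levenshtein recurrence; the function name `edit_distance` is hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost (Levenshtein) edit distance d(a, b) by dynamic programming."""
    # prev[j] holds d(a[:i-1], b[:j]); the guard row is d("", b[:j]) = j Inserts.
    prev = list(range(len(b) + 1))
    for i, u in enumerate(a, start=1):
        curr = [i]  # guard column: d(a[:i], "") = i Deletes
        for j, v in enumerate(b, start=1):
            curr.append(min(
                prev[j - 1] + (0 if u == v else 1),  # Match or Replace
                prev[j] + 1,                         # Delete u
                curr[j - 1] + 1,                     # Insert v
            ))
        prev = curr
    return prev[-1]


print(edit_distance("AGCACACA", "ACACACTA"))  # 2
```

Only two rows of the table are kept, which is enough for the distance itself; recovering an optimal alignment as well would require keeping the full table (or traceback pointers).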
