Columbia CS 4761 - Sequence Alignment - D2944808

Home> Schools> Columbia University> (CS) > CS 4761> Sequence Alignment

Columbia CS 4761 - Sequence Alignment

Pages 10

Download Save

Unformatted text preview:

1COMS4761-- 2007Prof. Yechiam Yemini (YY)Computer Science DepartmentColumbia UniversityChapter 2: Sequence Alignment2.3 Searching Sequence Databases; FASTA, BLASTCOMS4761-- 2007 2 How to search a sequence database (DB) for local alignments of a query sequence? E.g., Search a promoter sequence in a DB of 105 sequencesDynamic Programming is prohibitively complexNeed techniques that are: Fast: focus search on likely solutions (trade speed for completeness) Tunable: retrieve meaningful alignments (ones with sufficiently high score)FASTA & BLASTThe ProblemTutorial: W.R. Pearson "Protein sequence comparison and Protein evolution Tutorial - ISMB2000" (October, 2001)http://www.people.virginia.edu/%7Ewrp/papers/ismb2000.pdfq=TACGAAT..ATAAGAATATACGAATCCACGAT..TCGATACGTTAGCAATACTAG…CGAAATATAGGTTAGCAATAC..ACGACATCGAAGAATAAATAT..……………..???acACGAATaTACGAATccACGA-T.._tACGAAT-TACGAAT-tACGAaT__2COMS4761-- 2007 3Reconsider DP Geometry Diagonal matching segments provide the basis for alignments Alignment may be viewed as connecting matching diagonals Using mismatched diagonals or horizontal/vertical gapped segments Scoring is additive contributions of matching diagonal and connectors Mismatched diagonals or vertical/horizontal gapped segments may reduce the score It is best to focus on high scoring diagonals and use connectors with positive scoreCOMS4761-- 2007 4Dot Matrix HeuristicsRule 1: Find high-scoring diagonals Search small diagonal segments Extend to max diagonal matches Connect diagonals to max scoreRule 2: Focus on meaningful alignments Filter low-scoring diagnonals* ** **** * *** * ** ** * ** * ** * ** ** ** * *** ** **3COMS4761-- 2007 5Tradeoff: Time vs. OptimalitySmith-Waterman FASTA Blast10 min 2 min 20 secCOMS4761-- 2007 6FASTAKey idea (Pearson & Lipman 88): Find short diagonals by indexing the DB Extend these to high scoring diagonals Use DP to connect themA 4 steps process4COMS4761-- 2007 7Step (a): Find Diagonal Matches by Indexing Key idea: create an index of k-tuples of the DB Scan database to index k-tuples [k=1..5] Scan query to index k-tuples Find all diagonal matches of length k by comparing the hash tables Merge these short diagonals into maximal diagonal matches Example:Database d: TATCGATCGAPosition:1 2 3 4 5 6 7 8 9 10Query q: GATCGPosition:1 2 3 4 5TATCGATCGAGATCGTATATCTCGCGAGAT12,63,74,85dKtup=31. Extract index231q0,-40,-4-4Offset=q-d2. Find matches 3. Merge diagonal matchesCOMS4761-- 2007 8FASTA Steps (b-d): Optimize Scoreb) Filter low-score diagonalsc) Extend diagonals to maxscore; keep high-scoringsegmentsd) Use DP for a narrow bandaround the high scoringsegments5COMS4761-- 2007 9ExampleCOMS4761-- 2007 10BLAST: Basic Local Alignment SearchAltschul & Karlin [1990]; a family of algorithmsIdea: find matches with significant score statistics Find maximal segment pairs (MSP): segments with significant score Based on extensive statistical theory (summarized soon)Base Algorithm: Step 1: index DB for words of size W (W-mers); index query sequence for W-mers with score >Thresholdo W= 3 for protein, 11 for nucleotides Step 2: search for matches with high score (HSP=high scoring pairs) Step 3: extend hits to maximal score segments Step 4: report matches with score above S6COMS4761-- 2007 11BLAST Step 1-3: Finding Short High-Scoring Pairs (HSP) Create an Index of W-mers for database & query For proteins W=3 means a dictionary of 203=8000 words Match W-mers that score above a threshold T FASTA searches for exact matches of ktuples BLAST, in contrast, searches for high scoring pairs (HSP) Key idea: exploit the fast part of the search to max the scorerather than push the maximization for later, slower, phasesFrom A. Baxevanis: “Nucleotide and Protein Sequence Analysis I”via Kellis & Indyk, MIT, “BLAST & Database Search, Lecture 2”COMS4761-- 2007 12Blast Steps 3-4: Extending Short HSPsThe short HSPs are extended to increase the scoreMax score extensionScoreReport above threshold HSPs and their scores7COMS4761-- 2007 13ExampleCOMS4761-- 2007 14Statistics Background How do we distinguish “meaningful” alignment from a random one? E.g., suppose an alignment of a query q with a sequence d scores s, is it “meaningful”? If s is much higher than the average score of a random alignment, the answer is positive Key idea: use statistics of alignment scores to distinguish “meaningful” Basic probability: Given a sample space (e.g., possible alignments) A random variable is a real-valued function of samples (e.g., alignment score) The statistics of a random variable is described by a distribution function:F(x)=P[S <x ] with a density function f(x)=dF/dx…. f(x)Δx~p[x-Δx≤S≤x +Δx]x µσσf(x)(Score)Standard deviation, σ=√[E(S2)-E(S)2]Mean, µ=E(S)= sf(x)dx1 F(x)8COMS4761-- 2007 15Distinguishing Meaningful From Random “meaningful” ~ score is at the tail of the distribution Z-Score: Z(s)=(s-µ)/σ [z(s) >7  d is meaningful] P-Score: p=p[S>s]=1-F(s) [p<0.02 d is meaningful]s µσσf(s)F(s)s Z=(s-µ)/σ1 p=1-F(s)Meaningful ~ tail scoreCOMS4761-- 2007 16How Significant is An Alignment? Key idea: Consider the highest matching score S as a random variableand use Z-score to determine whether an alignment is meaningfulStrategy: Define the highest matching score S as a random variableo Let q be a query and d a sequence from a database Do Define S(q,d)=Max{ s(q,q’)| where s(q,q’) is the score of an alignment of q and a subsequence q’ of d}o S is a random variable defined over the space of local alignments {(q,d)} Suppose S has mean µ and standard deviation σ Use Z-score, Z(s)=(s-µ)/σ to determine significance of a local alignment scoring s For protein sequences Z(s) >7 is considered significantKey challenge: how do we determine the distribution of S ?Karlin, S., and Altschul, S. F. (1990) ``Method for assessing the statistical significance of molecular sequence features by using general scoring schemes,'' Proceedings of the National Academy of Science, USA 87, 2264-2268.9COMS4761-- 2007 17What Is The Distribution of S ? Consider the scores S’(q,q’) for N random sequences q’ The score S’(q,q’)=Σk s(qk,q’k) is the sum of independent random variables For sufficiently long sequences

View Full Document


School:
Email:
New Password:
Confirm Password:

Columbia CS 4761 - Sequence Alignment

Sign up for free to view:

Please select your school