Columbia CS 4761 - Sequence Alignment

Unformatted text preview:

1COMS4761-- 2007Prof. Yechiam Yemini (YY)Computer Science DepartmentColumbia UniversityChapter 2: Sequence Alignment2.3 Searching Sequence Databases; FASTA, BLASTCOMS4761-- 2007 2 How to search a sequence database (DB) for local alignments of a query sequence? E.g., Search a promoter sequence in a DB of 105 sequencesDynamic Programming is prohibitively complexNeed techniques that are: Fast: focus search on likely solutions (trade speed for completeness) Tunable: retrieve meaningful alignments (ones with sufficiently high score)FASTA & BLASTThe ProblemTutorial: W.R. Pearson "Protein sequence comparison and Protein evolution Tutorial - ISMB2000" (October, 2001)http://www.people.virginia.edu/%7Ewrp/papers/ismb2000.pdfq=TACGAAT..ATAAGAATATACGAATCCACGAT..TCGATACGTTAGCAATACTAG…CGAAATATAGGTTAGCAATAC..ACGACATCGAAGAATAAATAT..……………..???acACGAATaTACGAATccACGA-T.._tACGAAT-TACGAAT-tACGAaT__2COMS4761-- 2007 3Reconsider DP Geometry Diagonal matching segments provide the basis for alignments Alignment may be viewed as connecting matching diagonals Using mismatched diagonals or horizontal/vertical gapped segments Scoring is additive contributions of matching diagonal and connectors Mismatched diagonals or vertical/horizontal gapped segments may reduce the score It is best to focus on high scoring diagonals and use connectors with positive scoreCOMS4761-- 2007 4Dot Matrix HeuristicsRule 1: Find high-scoring diagonals Search small diagonal segments Extend to max diagonal matches Connect diagonals to max scoreRule 2: Focus on meaningful alignments Filter low-scoring diagnonals* ** **** * *** * ** ** * ** * ** * ** ** ** * *** ** **3COMS4761-- 2007 5Tradeoff: Time vs. OptimalitySmith-Waterman FASTA Blast10 min 2 min 20 secCOMS4761-- 2007 6FASTAKey idea (Pearson & Lipman 88): Find short diagonals by indexing the DB Extend these to high scoring diagonals Use DP to connect themA 4 steps process4COMS4761-- 2007 7Step (a): Find Diagonal Matches by Indexing Key idea: create an index of k-tuples of the DB Scan database to index k-tuples [k=1..5] Scan query to index k-tuples Find all diagonal matches of length k by comparing the hash tables Merge these short diagonals into maximal diagonal matches Example:Database d: TATCGATCGAPosition:1 2 3 4 5 6 7 8 9 10Query q: GATCGPosition:1 2 3 4 5TATCGATCGAGATCGTATATCTCGCGAGAT12,63,74,85dKtup=31. Extract index231q0,-40,-4-4Offset=q-d2. Find matches 3. Merge diagonal matchesCOMS4761-- 2007 8FASTA Steps (b-d): Optimize Scoreb) Filter low-score diagonalsc) Extend diagonals to maxscore; keep high-scoringsegmentsd) Use DP for a narrow bandaround the high scoringsegments5COMS4761-- 2007 9ExampleCOMS4761-- 2007 10BLAST: Basic Local Alignment SearchAltschul & Karlin [1990]; a family of algorithmsIdea: find matches with significant score statistics Find maximal segment pairs (MSP): segments with significant score Based on extensive statistical theory (summarized soon)Base Algorithm: Step 1: index DB for words of size W (W-mers); index query sequence for W-mers with score >Thresholdo W= 3 for protein, 11 for nucleotides Step 2: search for matches with high score (HSP=high scoring pairs) Step 3: extend hits to maximal score segments Step 4: report matches with score above S6COMS4761-- 2007 11BLAST Step 1-3: Finding Short High-Scoring Pairs (HSP) Create an Index of W-mers for database & query For proteins W=3 means a dictionary of 203=8000 words Match W-mers that score above a threshold T FASTA searches for exact matches of ktuples BLAST, in contrast, searches for high scoring pairs (HSP) Key idea: exploit the fast part of the search to max the scorerather than push the maximization for later, slower, phasesFrom A. Baxevanis: “Nucleotide and Protein Sequence Analysis I”via Kellis & Indyk, MIT, “BLAST & Database Search, Lecture 2”COMS4761-- 2007 12Blast Steps 3-4: Extending Short HSPsThe short HSPs are extended to increase the scoreMax score extensionScoreReport above threshold HSPs and their scores7COMS4761-- 2007 13ExampleCOMS4761-- 2007 14Statistics Background How do we distinguish “meaningful” alignment from a random one? E.g., suppose an alignment of a query q with a sequence d scores s, is it “meaningful”? If s is much higher than the average score of a random alignment, the answer is positive Key idea: use statistics of alignment scores to distinguish “meaningful” Basic probability: Given a sample space (e.g., possible alignments) A random variable is a real-valued function of samples (e.g., alignment score) The statistics of a random variable is described by a distribution function:F(x)=P[S <x ] with a density function f(x)=dF/dx…. f(x)Δx~p[x-Δx≤S≤x +Δx]x µσσf(x)(Score)Standard deviation, σ=√[E(S2)-E(S)2]Mean, µ=E(S)= sf(x)dx1 F(x)8COMS4761-- 2007 15Distinguishing Meaningful From Random “meaningful” ~ score is at the tail of the distribution Z-Score: Z(s)=(s-µ)/σ [z(s) >7  d is meaningful] P-Score: p=p[S>s]=1-F(s) [p<0.02 d is meaningful]s µσσf(s)F(s)s Z=(s-µ)/σ1 p=1-F(s)Meaningful ~ tail scoreCOMS4761-- 2007 16How Significant is An Alignment? Key idea: Consider the highest matching score S as a random variableand use Z-score to determine whether an alignment is meaningfulStrategy: Define the highest matching score S as a random variableo Let q be a query and d a sequence from a database Do Define S(q,d)=Max{ s(q,q’)| where s(q,q’) is the score of an alignment of q and a subsequence q’ of d}o S is a random variable defined over the space of local alignments {(q,d)} Suppose S has mean µ and standard deviation σ Use Z-score, Z(s)=(s-µ)/σ to determine significance of a local alignment scoring s For protein sequences Z(s) >7 is considered significantKey challenge: how do we determine the distribution of S ?Karlin, S., and Altschul, S. F. (1990) ``Method for assessing the statistical significance of molecular sequence features by using general scoring schemes,'' Proceedings of the National Academy of Science, USA 87, 2264-2268.9COMS4761-- 2007 17What Is The Distribution of S ? Consider the scores S’(q,q’) for N random sequences q’ The score S’(q,q’)=Σk s(qk,q’k) is the sum of independent random variables For sufficiently long sequences


View Full Document

Columbia CS 4761 - Sequence Alignment

Download Sequence Alignment
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Sequence Alignment and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Sequence Alignment 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?