Stanford CS 262 - Lecture 4

CS262 Lecture 4 (Jan 13, 2005)
Scribed by Shrikanth Shankar

This lecture completes sequence alignment. The last lecture covered the algorithm to compute an optimal alignment under the affine gap model through a simple extension of the DP algorithm for the constant gap model: it uses additional matrices to track alignments ending with a gap in sequence X or in sequence Y. It also covered the linear-space alignment algorithm, which recovers the point where the traceback crosses the middle column by running the DP from both the start and the end. We then compute the full path by repeating this on the smaller subproblems to the left and right of the portion of the path already found (a divide-and-conquer strategy).

[Figure: the optimal path crosses column M/2 at some row k, splitting the matrix into a k x M/2 subproblem and an (N-k) x M/2 subproblem.]

We also covered the Four Russians algorithm, which divides the alignment matrix into blocks and fills it in using precomputed block results; these are precomputed for every possible offset vector and subsequence.

This material on dynamic programming is tested in the assignment handed out today, which is due two Thursdays from today. There is a late-submission policy listed on the web, but no assignments will be accepted more than a week after the due date.

Today we cover heuristic local alignment tools, which are the most popular tools in practice. The algorithms we have covered so far are too slow to apply directly to today's genome databases. The NIH is planning to sequence a number of genomes in the next year or two; a list of ongoing and completed sequencing projects can be found at <http://www.genome.gov/1005141>, and the sizes of various genomes can be found at <http://www.cbs.dtu.dk/databases/DOGS/>. Every year we get a lot more data: the amount of sequence data is doubling about every 3 years, and this will speed up. The NIH is investing in technologies that will make sequencing much cheaper (perhaps $1000 for a human genome in 5 to 10 years).
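The divide-and-conquer idea above can be sketched in code. This is a minimal illustration rather than the lecture's exact pseudocode: `last_row` computes one row of the global-alignment DP table in linear space, and `hirschberg_split` finds the row where an optimal path crosses the middle column. The function names and the scoring values (match +1, mismatch and gap -1) are my own choices for the example.

```python
def last_row(x, y, match=1, mismatch=-1, gap=-1):
    """Final row of the global-alignment DP table for x vs y, in O(len(y)) space."""
    prev = [j * gap for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [i * gap]
        for j in range(1, len(y) + 1):
            diag = prev[j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev

def hirschberg_split(x, y):
    """Position k in y where an optimal path crosses the middle column of x."""
    mid = len(x) // 2
    fwd = last_row(x[:mid], y)                    # scores entering column mid
    bwd = last_row(x[mid:][::-1], y[::-1])[::-1]  # scores leaving column mid
    return max(range(len(y) + 1), key=lambda k: fwd[k] + bwd[k])
```

Recursing on the two subproblems (x[:mid] vs y[:k] and x[mid:] vs y[k:]) yields the full path in linear space, at roughly twice the time of the plain DP.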
One interesting fact about genome sizes is that the longest genome known on earth actually belongs to an amoeba (about 600 gigabases). These genomes are mostly repetitive, and every few generations the genome doubles.

The state of biological databases as of today, in approximate gene counts:

Mammals: ~25,000
Insects: ~14,000
Worms: ~17,000
Fungi: ~6,000-10,000
Small organisms: 100s-1,000s

Each known or predicted gene has one or more associated protein sequences, giving more than 1,000,000 known or predicted protein sequences. Basically, there are a lot of genes, and many of them produce multiple proteins. We have millions of proteins, and we would like to align them all against each other. Given a new gene, we may wish to align it against the entire database; with Smith-Waterman this can take on the order of 10^16 operations. Similarly, given a complete genome, we may wish to align it against our database, which is an even bigger problem.

Indexing-based local alignment

To solve this we use the simple idea of index-based alignment. The original, and one of the most popular, algorithms to do this is BLAST. The idea is simple: build a dictionary of all the constant-sized words that occur in the database. To do this, build a table of all possible words; then, in each entry of this table (i.e., for a particular word), store all the start positions of that word in the database. Given a query, scan it position by position, compute the word at each position, and look up all occurrences of that word in the database. This scan takes time linear in the length of the query. Alternatively (as in the original BLAST), the query is indexed and the database is scanned; this saves some memory (the query is smaller), and scanning the database in linear time is not so bad. Once a match is found in the database, it is extended to the left and right. This can take a huge amount of time if there are a large number of matches. In real databases some words are very common while others are rare, and the technique can be modified to ignore matches on common words.
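The word dictionary described above can be sketched in a few lines. This is an illustration of the idea, not BLAST's actual data structure (BLAST uses a table over all possible words of length k, whereas this sketch uses a hash map); the function names are mine.

```python
from collections import defaultdict

def build_index(db, k):
    """Map each length-k word to the list of its start positions in db."""
    index = defaultdict(list)
    for i in range(len(db) - k + 1):
        index[db[i:i + k]].append(i)
    return index

def find_seeds(index, query, k):
    """Scan the query position by position; yield (query_pos, db_pos) seed hits."""
    for q in range(len(query) - k + 1):
        for d in index.get(query[q:q + k], []):
            yield q, d
```

Building the index is linear in the database size, and the query scan is linear in the query length plus the number of hits, which is why very common words can dominate the cost.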
The typical algorithm builds a dictionary of words of size k and then scans the query. In the original BLAST, an extension is started when a query position aligns with a word in the database with score > T. This is easy to do, since for a given word length there is only a small set of words that align with any particular word with score > T, and this set can be precomputed. Once an extension is initiated, we extend to the left and right to see whether this is a good alignment or a spurious match. This technique can miss good alignments if we never find a seed match because the chosen word length is too large for the matches present. For example, with word length 10, an alignment of length 500 in which every 5th letter differs contains no exact 10-letter match.

Extension algorithms

For extensions we disallow gaps and extend the match left and right until the score falls a threshold below the best score seen so far. This is very fast (linear time).

[Figure: ungapped extension of a seed match between
A C G A A G T A A G G T C C A G T
C C C T T C C T G G A T T G C G A]

Another possibility is that, instead of doing exact (ungapped) extensions, we run the DP in a narrow band around the diagonal until the score falls below the threshold.

[Figure: banded extension around the seed diagonal between
A C G A A G T A A G G T C C A G T
C T G A T C C T G G A T T G C G A]

The most common extension technique is to extend left and right without a fixed band: we keep extending the DP matrix until the score falls a threshold below the current best cell, where the threshold is usually a constant. These are the main techniques for extending once a match is found.

[Figure: unbanded gapped extension between
A C G A A G T A A G G T C C A G T
C T G A T C C T G G A T T G C G A]

Improving indexing-based techniques

There has been research in recent years on improving indexing-based techniques. The main issue with these techniques is the tradeoff between sensitivity (finding a good alignment if it exists) and speed. Increasing the word length speeds up the alignment process but reduces the chance of finding a good alignment.
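The ungapped "fall a threshold below the best" rule can be sketched as follows. This is a minimal one-directional illustration (the same loop run leftward gives the other half); the parameter name `x_drop` and the scoring values are my own choices, not the lecture's.

```python
def extend_right(x, y, i, j, x_drop=5, match=1, mismatch=-1):
    """Ungapped extension of a seed starting at x[i], y[j]: grow to the right
    until the running score falls more than x_drop below the best seen.
    Returns the best score and the extension length that achieved it."""
    score = best = best_len = n = 0
    while i + n < len(x) and j + n < len(y):
        score += match if x[i + n] == y[j + n] else mismatch
        n += 1
        if score > best:
            best, best_len = score, n
        if best - score > x_drop:
            break
    return best, best_len
```

Because each position is visited at most once and the loop stops as soon as the score drops too far, the extension runs in time linear in its own length rather than in the sequence lengths.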
Dropping the word size to, say, 7 improves our chances of finding a match, but increases the number of spurious matches and thus reduces speed. The numbers quantifying this speed/sensitivity tradeoff (presented in lecture) consider a homologous region of length 100 and analyze how often it contains an exact match of length 7, 8, and so on up to 14; they also analyze how often a random query of length 500 contains a match to the database. Given this, we wish to improve sensitivity while keeping speed.
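The sensitivity side of this tradeoff can be estimated by simulation. The sketch below is my own illustration, not the numbers referred to above: it assumes a homologous region of length 100 whose positions match independently with probability `p_match`, and estimates how often such a region contains an exact run of at least k consecutive matches (i.e., a seed that a word size of k would find).

```python
import random

def has_run(length, p_match, k, rng):
    """Does a region of `length` i.i.d. positions (match prob p_match)
    contain at least k consecutive matches?"""
    run = 0
    for _ in range(length):
        run = run + 1 if rng.random() < p_match else 0
        if run >= k:
            return True
    return False

def sensitivity(k, length=100, p_match=0.9, trials=2000, seed=0):
    """Monte Carlo estimate of the chance a homologous region is seeded."""
    rng = random.Random(seed)
    return sum(has_run(length, p_match, k, rng) for _ in range(trials)) / trials
```

Running this for k from 7 up to 14 shows the estimated seeding probability falling as the word size grows, which is exactly the sensitivity loss that motivates the improved seeding techniques discussed next.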

