Stanford CS 262 - Sequence Alignment Continued - D2410145

Home> Schools> Stanford University> Computer Science (CS) > CS 262> Sequence Alignment Continued

DOC PREVIEW

Stanford CS 262 - Sequence Alignment Continued

School name Stanford University

Course Cs 262- Computational Genomics

Pages 10

This preview shows page 1-2-3 out of 10 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 10 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 10 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 10 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 10 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

Sequence Alignment Continued CS262 Winter 2005 Lecture 5, 1/13/05 Lecturer: Serafim Batzoglou Scribe: Daniel Woods 1Sequence Alignment Continued 1 Previous Lecture Review First, several topics were reviewed from the preview lecture. 1.1 Needleman-Wunsch with Affine Gaps The technique of affine gaps allows calculations to perform alignment scoring without a strict linear score or gaps. Finding the best scoring alignment with this technique requires multiple arrays to be kept because the maximum score at any midpoint is not purely deterministic on the predecessors (which would require only one array) rather, multiple possible scores must be kept for each location which will then be resolved by later information. The affine gap method specifically has two parameters for scoring gaps. There is a gap “open” penalty incurred once for each gap, regardless of the gap length. There is also a gap “extension” penalty which is multiplied by the length of each gap. This generates a 2-part piecewise model for gap scoring. This method can be easily extended to handle more complicated piecewise models, although it will require even more memory space. 1.2 Linear-space Alignment Algorithm This is a divided-and-conquer method of performing the Needleman-Wunsch algorithm. The basic premise is to first map only the midpoint of one sequence to its place in the other, and then to treat each half as a similar problem to repeat recursively. Figure 1 depicts this process graphically and it can clearly be seen that the problem greatly reduces in size with this method because large portions of the problem do not need to be calculated.Sequence Alignment Continued CS262 Winter 2005 Lecture 5, 1/13/05 Lecturer: Serafim Batzoglou Scribe: Daniel Woods 2 Figure 1 – Graphical depiction of the Linear-space Alignment Algorithm, which greatly reduces the computational requirements of the Needleman-Wunsch algorithm 1.3 The Four-Russian Algorithm This is a divided-and-conquer method of performing the Needleman-Wunsch algorithm. The basic The Four-Russian Algorithm was covered only conceptually in the previous lecture, but summarized here in review. Basically, the entire Needleman-Wunsch grid is divided into “t-blocks” (named for the variable corresponding to the length of a side). The key observation between the algorithm is that the rightmost column and bottom row of each t-block are deterministic given the leftmost column and top row of the t-block. This allows the edges of all t-blocks in an entire grid to be calculated very quickly using a lookup table of possible t-blocks. This way, no calculations need to be done involving the spaces inside of the t-blocks at this stage. Once that is complete, it can quickly be found where the best alignment enters and exists each t-block is crosses, and those t-blocks can then be better analyzed to find the exact alignment.Sequence Alignment Continued CS262 Winter 2005 Lecture 5, 1/13/05 Lecturer: Serafim Batzoglou Scribe: Daniel Woods 3 Figure 2 - Graphical depiction of the Four-Russian Algorithm which further reduces the computational requirements of an allignment 2 Heuristic Local Aligners 2.1 Background The algorithms covered so far are unable to be applied as discussed to today’s large genomic databases. The amount of known genomic data is roughly doubling each year, and this trend has been going steadily for 15 years. Enormous efforts are underway to severely reduce the cost of sequencing so this rate maySequence Alignment Continued CS262 Winter 2005 Lecture 5, 1/13/05 Lecturer: Serafim Batzoglou Scribe: Daniel Woods 4even accelerate in coming years. More heuristic approaches are therefore necessary to handle the enormous amounts of data. When a new genome is sequenced, what would be extremely useful to a biologist would be the ability to compare the new genome against the entire genome database in order to find similar genes. The numbers of genes typically found in some classifications of life forms are as follows: • Mammals: ~25,000 • Insects: ~14,000 • Worms: ~17,000 • Fungi: ~6,000 – 10,000 • Small Organisms: 100s – 1,000s Finding the best mappings between genes throughout the entire genome database requires some new techniques. 2.1 Indexing-based Local Alignment with BLAST (Basic Local Asignment Search Tool) 2.1.1 Dictionary The basic premise behind this methodology is to create a dictionary of “words” from across the entire genome database. This calculation is performed one time for a particular database, and can then be used for many searches. Here, a “word” refers to a very short subsequence. With such a dictionary, a lookup can be performed on any word and it will immediately return all locations in all genomes of the database where this word is found. Some immediate observations can be made about word length as it affects the usefulness of this algorithm. If the word length is too short, the dictionary will show many hits that are not actually well aligned sequences and sifting through them to find the correct one will require significant computation. If words are too long, there may be good alignments that are missed because the word does not appear without a substitution occurring within it. Another important issue to consider is that not all words occur with equal regularity. Some words are far more common than others with a very predictable distribution. This make it much more difficult to “even out” the dictionary so that each word contains roughly the same number of matches in order to smooth out and maximize the worth length parameter. One approach is to simply exclude from the dictionary words which are particularly repetitive so that matches returned by the dictionary (from those that ARE still in it) are more likely to correspond to real matches. 2.1.2 Alignment For a particular match returned by the dictionary, the goal is to score the alignment using the subsequences located just before and after the query word in both the query sequence and the returned position in the genome database. Using the Needleman-Wunsch scoring methods, there are still several ways to find the best scoring matches.Sequence Alignment Continued CS262 Winter 2005 Lecture 5, 1/13/05 Lecturer: Serafim Batzoglou Scribe: Daniel Woods 5 Figure 3 - A naive approach to extending to

View Full Document