Stanford CS 374 - Lecture 8 - Index-based Search of Single Sequences

CS374 Fall 2006
Lecture 8, 10/17/06
Lecturer: Omkar Mate
Scribe: Abhita Chugh

Index-based Search of Single Sequences

Based on the following papers:
1. “BLAT—The BLAST-Like Alignment Tool”, W. James Kent
2. “Designing Seeds for Similarity Search in Genomic DNA”, Jeremy Buhler, Uri Keich, Yanni Sun

Motivation

The discovery of new genes is of particular interest to biologists, since new genes may provide more clues about the evolution of humans. One of the questions they try to answer is: does a given gene occur in other species? This is not a straightforward problem, because genes evolve over time and undergo mutations. This motivates sequence alignment, a technique for arranging sequences of DNA, RNA or protein to identify regions of similarity between them.

AGGCTATCACCTGACCTCCAGGCCGATGCCCTAGCTATCACGACCGCGGTCGATTTGCCCGAC
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Local Alignments

Computational approaches to sequence alignment fall into two categories: local alignment and global alignment. Global alignment is a form of global optimization that forces the alignment to span the entire length of all query sequences. Local alignments, on the other hand, identify regions of similarity within long sequences that are often widely divergent overall. Local alignments can be difficult to calculate because of the additional challenge of identifying the regions of similarity. They are more useful for dissimilar sequences that are suspected to contain regions of similarity, or similar sequence motifs, within their larger sequence context. The Smith-Waterman algorithm is a general local alignment method based on dynamic programming.
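To make the dynamic-programming idea concrete, here is a minimal sketch of the Smith-Waterman score computation. The function name and the scoring values (match = 2, mismatch = -1, linear gap penalty = -2) are illustrative assumptions and are not taken from the lecture. The sketch returns only the best local score, not the alignment itself.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Scoring values are illustrative assumptions (linear gap penalty).
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1] and b[j-1].
    # Row 0 and column 0 stay at 0: a local alignment may start anywhere.
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap      # gap in b
            left = H[i][j - 1] + gap    # gap in a
            # Clamping at 0 lets the alignment restart, which makes it local.
            H[i][j] = max(0, diag, up, left)
            best = max(best, H[i][j])
    return best


if __name__ == "__main__":
    print(smith_waterman_score("GTAAGGTCC", "GTTAGGTCC"))
```

The table H has one cell for every pair of positions in the two sequences, and every cell is visited once.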
However, the Smith-Waterman algorithm is fairly demanding of time and memory: aligning two sequences of lengths m and n requires O(mn) time and space.

Let's say we have a gene that we want to compare against an existing genomic database, to check whether it occurs there. Using the Smith-Waterman algorithm, this would cost on the order of 10^15 operations. The GenBank database has seen exponential growth in total sequence data. In 2002, the number of base pairs exceeded 10^11. Considering the rate at which the total sequence data has been growing, we would like to do better.

BLAST: Basic Local Alignment Search Tool

As the Smith-Waterman algorithm is fairly demanding of resources, it has largely been replaced in practical use by the BLAST algorithm, which is much more efficient. The main idea behind BLAST is to construct a dictionary of all the words in the query. We then search through the database and initiate a local alignment for each word match between the query and the database. The running time is O(mn) in the worst case. In practice, however, it is orders of magnitude faster than Smith-Waterman. The steps involved are as follows:

Step 1: A dictionary of query words is constructed. In the figure below, the query has been indexed by all words of size k = 4.

Step 2: All the ‘relatives’ of a query word are generated. A relative of a query word is a word whose alignment score against it is greater than a threshold, T. A single word can be a relative of multiple words. For instance, in the figure above, ATGC, ATGG and ACGC are identified as relatives of ATGC. The index is then updated with this information, i.e. all 3 relatives are updated to point to ATGC.

Step 3: The database is searched linearly, one word at a time.
Alignment is initiated with all occurrences of that word in the query. The search through the database can also be optimized by using B-trees.

Step 4: The alignment initiated in the previous step is extended to the left and right, with no gaps, until the alignment score falls a certain threshold below the best score seen so far. In the figure below, the matching word GGT initiates an alignment. This alignment is extended on both sides, and the final output is GTAAGGTCC and GTTAGGTCC. The red box corresponds to the mismatch between A and T at the 3rd position.

Sensitivity-Speed Tradeoff

For the problem of sequence alignment, sensitivity is defined as the fraction of true alignments that are actually detected. In Step 1 of the BLAST algorithm, the query is indexed by all words of a certain length k. The value of k has to be chosen carefully, as it determines a sensitivity-speed tradeoff. For large values of k, the algorithm is fast but detects fewer alignments. For small values of k, more alignments are detected but the algorithm is slower. So, as k increases, sensitivity gets worse and speed gets better.

BLAT: BLAST-like Alignment Tool

BLAT is a variation of the BLAST algorithm. Both algorithms perform rapid scans for relatively short matches. Recall that BLAST builds an index of the query sequence and scans through the database. BLAT instead builds an index of the database and scans linearly through the query, so the index needs to be built only once. Also, unlike BLAST, BLAT allows gaps in alignment extensions. Because of this, the alignments returned by BLAT are also longer.

BLAT Strategies:

(i) Single Perfect Matches
In this model, no mismatch is allowed, i.e. no relatives are generated. As expected, fewer matches will result for longer words.
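The single-perfect-match strategy can be sketched roughly as follows: index the database once by its words of size k, then scan the query linearly and report every exact word hit. The function names are hypothetical, and indexing every overlapping word of the database is a simplifying assumption of this sketch, not something stated in these notes.

```python
from collections import defaultdict


def build_kmer_index(database, k):
    """Map each word of size k to the positions where it starts in the database.

    Assumption of this sketch: every overlapping k-mer is indexed.
    """
    index = defaultdict(list)
    for pos in range(len(database) - k + 1):
        index[database[pos:pos + k]].append(pos)
    return index


def perfect_match_hits(index, query, k):
    """Scan the query linearly and return (query_pos, database_pos) pairs
    for every word of size k that matches the database exactly."""
    hits = []
    for qpos in range(len(query) - k + 1):
        word = query[qpos:qpos + k]
        # .get() avoids inserting missing words into the defaultdict.
        for dpos in index.get(word, ()):
            hits.append((qpos, dpos))
    return hits


if __name__ == "__main__":
    database = "GTTAGGTCC"
    query = "AAGGTC"
    for k in (3, 4):
        index = build_kmer_index(database, k)  # built once per k, reusable for any query
        print(k, perfect_match_hits(index, query, k))
```

On this toy input the longer word size yields fewer hits, which is the behaviour described above.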
In the figure below, k represents the size of the perfect match. The y-axis shows how many perfect matches of this size are expected to occur by chance (specificity). The results are based on a genome of 3 billion bases and a query of 500 bases.

So, the number of chance matches decreases as k increases. In the next figure, the y-axis denotes the fraction of homologies detected, i.e. the sensitivity. The larger the value of k, the lower the sensitivity.
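Both trends can be reproduced with a back-of-the-envelope model. The sketch below is not the computation behind the figures. It assumes uniformly random, independent bases over a 4-letter alphabet, a database indexed by non-overlapping words of size k, and a homologous region described by two hypothetical parameters, `identity` (per-base match probability) and `region_len`. Only the genome size (3 billion bases) and query size (500 bases) come from the notes.

```python
def expected_chance_matches(k, genome_size=3_000_000_000, query_size=500, alphabet=4):
    """Expected number of perfect k-mer matches that occur purely by chance.

    Model assumptions: uniform independent bases; database indexed by
    non-overlapping k-mers; every k-mer of the query is looked up.
    """
    query_kmers = query_size - k + 1
    database_kmers = genome_size / k
    chance_per_pair = (1 / alphabet) ** k
    return query_kmers * database_kmers * chance_per_pair


def detection_probability(k, identity, region_len):
    """Probability that at least one indexed k-mer inside a homologous region
    matches the query perfectly (the sensitivity under this model).

    identity and region_len are hypothetical parameters of the sketch.
    """
    tiles = region_len // k               # non-overlapping k-mers in the region
    p_tile = identity ** k                # all k bases of one tile must match
    return 1 - (1 - p_tile) ** tiles


if __name__ == "__main__":
    for k in (8, 10, 12, 14, 16):
        chance = expected_chance_matches(k)
        sens = detection_probability(k, identity=0.9, region_len=100)
        print(f"k={k:2d}  chance matches={chance:12.1f}  sensitivity={sens:.3f}")
```

Under these assumptions, both columns shrink as k grows: fewer chance matches, but also a smaller fraction of homologies detected.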

