CMU BSC 03510 - Lecture
Computational Biology, Part 6
Sequence Database Searching

Robert F. Murphy
Copyright © 1996-2009. All rights reserved.

Outline
- Sequence Analysis Tasks
- Efficient database searching methods
- "Hit and extend" sequence searching
- Word searching - hashing
- Hashing
- FASTA
- FASTA method
- Comments on FASTA
- Limitations of FASTA
- BLAST (Basic Local Alignment Search Tool)
- BLAST basics
- BLAST method for proteins
- Generating word list
- BLAST Method for DNA
- BLAST Statistical significance
- Summary of Database Search Methods
- Reading for next class

Sequence Analysis Tasks
- Given a query sequence, search for similar sequences in a database
- Global or Local? Both local and global alignment methods may be applied to database scanning, but local alignment methods are more useful since they do not assume that the query protein and the database sequence are of similar length.

Efficient database searching methods
- Dynamic programming requires order N^2 L computations (where N is the size of the query sequence and L is the size of the database)
- Given the size of databases, more efficient methods are needed

"Hit and extend" sequence searching
- Problem: Too many calculations are "wasted" by comparing regions that have nothing in common
- Initial insight: Regions that are similar between two sequences are likely to share short stretches that are identical
- Basic method: Look for similar regions only near short stretches that match exactly

"Hit and extend" sequence searching (continued)
- We define a word (or k-tuple) size that is the minimum number of exact "letter" matches that must occur before we do any further comparison or alignment
- How do we find all of the occurrences of matching words between a sequence and a database?
  - Could scan the sequence a word at a time, but this is order L (size of database)

Word searching - hashing
- Solution: Use a precomputed table that lists where in the database each possible word occurs
- Generation of the table is of order L (size of database), but use of the table is of order N (size of query sequence)
- The computer science term for this approach is hashing

Hashing
- Hashing table of size 10
- Hashing function H(x) = x mod 10
- Applet: http://www.engin.umd.umich.edu/CIS/course.des/cis350/hashing/WEB/HashApplet.htm
- Insertion & Search

Hashing (Demonstration A10)

Demonstration: Hashing algorithm for sequence searching
Author: R.F. Murphy, Feb. 6, 1995 (revised Feb. 15, 1996)

This demonstration takes a piece of database sequence, calculates hash values for each ktuple, builds a hash table (listing the positions in the database of the occurrence of each hash value), and uses a simplified version of the hash table to find the positions in the database sequence of the first occurrence of each ktuple in a query sequence.

Database sequence

This section converts each base to a number from 0 to 3 and combines those numbers three at a time to form an integer from 0 to 63 that is unique for each three base sequence. Each three base sequence is called a "ktuple."

 i   seq(i) as char   seq(i) as int   hash value
 1   a                0                6
 2   c                1               27
 3   g                2               47
 4   t                3               63
 5   t                3               63
 6   t                3               60
 7   t                3               48
 8   a                0                0
 9   a                0                0
10   a                0                1
11   a                0                6
12   c                1               24
13   g                2               33
14   a                0                4
15   c                1               17
16   a                0                5
17   c                1
18   c                1

Hash table for the database sequence

hash value   ktuple   pos1        pos2   pos3   first hit
 0           a a a    8           9             8
 1           a a c    10                        10
 2           a a g    not found
 3           a a t    not found
 4           a c a    14                        14
 5           a c c    16                        16
 6           a c g    1           11            1
 7           a c t    not found
 8           a g a    not found
 9           a g c    not found
10           a g g    not found
11           a g t    not found
12           a t a    not found
13           a t c    not found
14           a t g    not found
15           a t t    not found
16           c a a    not found
17           c a c    15                        15
18           c a g    not found
19           c a t    not found
20           c c a    not found
21           c c c    not found
22           c c g    not found
23           c c t    not found
24           c g a    12                        12
25           c g c    not found
26           c g g    not found
27           c g t    2                         2
28           c t a    not found
29           c t c
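
The demonstration above can be approximated in a few lines of Python. This is a minimal sketch, not the original spreadsheet: the function names (`ktuple_hash`, `build_hash_table`, `first_hits`) and the query string `"aaacg"` are hypothetical, chosen for illustration. It assumes the encoding shown in the demonstration (a=0, c=1, g=2, t=3), reads each three-base ktuple as a base-4 number, and reports 1-based positions.

```python
from collections import defaultdict

# Encoding used in the demonstration: each base becomes a number from 0 to 3.
BASE_CODE = {"a": 0, "c": 1, "g": 2, "t": 3}


def ktuple_hash(word):
    """Read a ktuple as a base-4 number (0 to 63 when the word has 3 bases)."""
    value = 0
    for base in word:
        value = 4 * value + BASE_CODE[base]
    return value


def build_hash_table(database, k=3):
    """Map each hash value to the 1-based positions where that ktuple starts.

    One pass over the database, so the cost is of order L.
    """
    table = defaultdict(list)
    for start in range(len(database) - k + 1):
        table[ktuple_hash(database[start:start + k])].append(start + 1)
    return table


def first_hits(query, table, k=3):
    """For each ktuple of the query, give its first database position or None.

    One table lookup per query ktuple, so the cost is of order N.
    """
    hits = []
    for start in range(len(query) - k + 1):
        word = query[start:start + k]
        positions = table.get(ktuple_hash(word), [])
        hits.append((word, positions[0] if positions else None))
    return hits


# The 18-base database sequence from the demonstration.
database = "acgttttaaaacgacacc"
table = build_hash_table(database)

print(table[0])  # positions of "aaa": [8, 9]
print(table[6])  # positions of "acg": [1, 11]

# Hypothetical query, not part of the original demonstration.
print(first_hits("aaacg", table))  # [('aaa', 8), ('aac', 10), ('acg', 1)]
```

Storing every position per hash value corresponds to the pos1, pos2, pos3 columns; taking only the first entry corresponds to the simplified "first hit" column.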

