CSE182-L4: Scoring matrices, Dictionary Matching
Fa05 CSE 182

Outline
•CSE182-L4: Class Mailing List; Protein Sequence Analysis; Protein sequence motifs; Prosite; Basic idea; Zinc Finger domain; Proteins containing zf domains; From alignment to regular expressions; The sequence analysis perspective; Profiles; Scoring Profiles; Psi-BLAST idea; Psi-BLAST speed
•CSE182-L6: Regular Expressions; Regular Expression; Regular Expression & Automata; Examples: Regular Expression & Automata; Constructing automata from R.E; Regular Expression Matching; Alg. for matching R.E.; D.P. to match regular expression; Algorithm; The final step; A structural view of proteins; CS view of a protein; Protein structure basics; Side chains determine amino-acid type; Bond angles form structural constraints; Various constraints determine 3d structure; Alpha-helix; Beta-sheet; Domains; 3D structure; Protein Domains; Domain review; Databases of protein domains; PROSITE; http://hmmer.wustl.edu; HMMER programs; Post-translational modification; Protein targeting; Multiple alignment tools; Tools for secondary structure prediction; Next topic: Gene finding

Class Mailing List
•[email protected]
•To subscribe, send email to
–[email protected]
•You can subscribe from the course web page
•Use the list for all course-related queries, discussions, …

Protein Sequence Analysis
•What can you do if BLAST does not return a hit?
–Sometimes, homology (evolutionary similarity) exists at very low levels of sequence similarity.
•A: Accept hits at a higher P-value.
–This increases the probability that the sequence similarity is a chance event.
–How can we get around this paradox?
–Reformulated Q: suppose two sequences B, C have the same level of sequence similarity to sequence A. If A and B are related in function, can we assume that A and C are?
If not, how can we distinguish?

Protein sequence motifs
•Premise:
•The sequence of a protein gives clues about its structure and function.
•Not all residues are equally important in determining function.
•How can we identify these key residues?

Prosite
•In some cases the sequence of an unknown protein is too distantly related to any protein of known structure to detect its resemblance by overall sequence alignment. However, relationships can be revealed by the occurrence in its sequence of a particular cluster of residue types, which is variously known as a pattern, motif, signature or fingerprint. These motifs arise because specific region(s) of a protein which may be important, for example, for their binding properties or for their enzymatic activity are conserved in both structure and sequence. These structural requirements impose very tight constraints on the evolution of this small but important portion(s) of a protein sequence. The use of protein sequence patterns or profiles to determine the function of proteins is becoming very rapidly one of the essential tools of sequence analysis. Many authors (3,4) have recognized this reality. Based on these observations, we decided in 1988, to actively pursue the development of a database of regular expression-like patterns, which would be used to search against sequences of unknown function.
Kay Hofmann, Philipp Bucher, Laurent Falquet and Amos Bairoch, The PROSITE database, its status in 1999

Basic idea
•It is a heuristic approach. Start with the following:
–A collection of sequences with the same function.
–Region/residues known to be significant for maintaining structure and function.
•Develop a pattern of conserved residues around the residues of interest
•Iterate for appropriate sensitivity and specificity

Zinc Finger domain

Proteins containing zf domains
How can we find a motif corresponding to a zf domain?

From alignment to regular expressions
*
ALRDFATHDDF
SMTAEATHDSI
ECDQAATHEAS
ATH-[DE]
•Search Swissprot with the resulting pattern
•Refine pattern to eliminate false positives
•Iterate

The sequence analysis perspective
•Zinc Finger motif
–C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H
–2 conserved C, and 2 conserved H
•How can we search a database using these motifs?
–The motif is described using a regular expression. What is a regular expression?
–How can we search for a match to a regular expression? Not allowed to use Perl :-)
•The 'regular expression' motif is weak. How can we make it stronger?

Profiles
•Start with an alignment of strings of length m, over an alphabet A
•Build an |A| × m matrix $F = (f_{ki})$
•Each entry $f_{ki}$ represents the frequency of symbol k in position i
(Figure: example profile; the recoverable entries are 0.71, 0.14, 0.71, 0.28.)

Scoring Profiles
•$S(i, j) = \sum_{k} f_{ki} \, M[r_k, s_j]$
•Here M is the scoring matrix, $r_k$ is symbol k of the alphabet, and $s_j$ is residue j of the sequence s being scored against profile column i.

Psi-BLAST idea
•Multiple alignments are important for capturing remote homology.
•Profile-based scores are a natural way to handle this.
•Q: What if the query is a single sequence?
•A: Iterate:
–Find homologs using Blast on query
–Discard very similar homologs
–Align, make a profile, search with profile.

Psi-BLAST speed
•Two time-consuming steps.
1. Multiple alignment of homologs
2. Searching with profiles.
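The profile definition and the scoring formula S(i, j) from the Profiles and Scoring Profiles slides can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the course's implementation: the function names (`build_profile`, `column_score`, `score_ungapped`) are hypothetical, and the scoring matrix M is a toy identity matrix (1 for a match, 0 otherwise) standing in for a real substitution matrix. The alignment is the three-sequence example from the "From alignment to regular expressions" slide.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, alphabet=AMINO_ACIDS):
    """Return F as a list of columns; column i maps symbol k to frequency f_ki."""
    m = len(alignment[0])
    if any(len(row) != m for row in alignment):
        raise ValueError("all aligned strings must have the same length m")
    n = len(alignment)
    profile = []
    for i in range(m):
        counts = Counter(row[i] for row in alignment)
        profile.append({k: counts[k] / n for k in alphabet})
    return profile


def column_score(profile, i, residue, M):
    """S(i, j) = sum over k of f_ki * M[r_k, s_j], with s_j given as `residue`."""
    return sum(f * M[(k, residue)] for k, f in profile[i].items())


def score_ungapped(profile, seq, start, M):
    """Score the window of seq beginning at `start` against every profile column."""
    return sum(
        column_score(profile, i, seq[start + i], M) for i in range(len(profile))
    )


# Toy scoring matrix (an assumption): identity, not a real substitution matrix.
M = {(a, b): 1.0 if a == b else 0.0 for a in AMINO_ACIDS for b in AMINO_ACIDS}

alignment = ["ALRDFATHDDF", "SMTAEATHDSI", "ECDQAATHEAS"]
profile = build_profile(alignment)

print(column_score(profile, 7, "H", M))  # column 7 holds H in all three rows
print(column_score(profile, 8, "D", M))  # column 8 holds D, D, E
print(score_ungapped(profile, "ALRDFATHDDF", 0, M))
```

With an identity matrix the column score reduces to the frequency of the query residue in that column, which makes the output easy to check by hand; substituting a real scoring matrix changes only the contents of `M`.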
•Searching with profiles: does the keyword search idea work?
•Pigeonhole principle again:
–If a profile of length m must score >= T
–Then some sub-profile of length l must score >= lT/m
–Generate all l-mers that score at least lT/m
–Search using an automaton
•Multiple alignment:
–Use ungapped multiple alignments only

CSE182-L6
Regular Expression Matching
Protein structure basics

Zinc Finger domain

The sequence analysis perspective
•Zinc Finger motif
–C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H
–2 conserved C, and 2 conserved H
•How can we search a database using these motifs?
–The motif is described using a regular expression. What is a regular expression?

Regular Expressions
•Concise representation of a set of strings over alphabet $\Sigma$.
•Described by a string over $\{\Sigma, \cdot, *, +\}$
•R is a r.e. if and only if
–$R = \{\varepsilon\}$ (base case)
–$R = \{\sigma\}, \sigma \in \Sigma$
–$R = R_1 + R_2$ (union of strings)
–$R = R_1 \cdot R_2$ (concatenation)
–$R = R_1^{*}$ (0 or more repetitions)

Regular Expression
•Q: Let $\Sigma = \{A, C, E\}$
–Is (A+C)*EEC* a regular expression?
–*(A+C)?
–AC*..E?
•Q: When is a string s in a regular expression?
–R
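The two questions on the last slide (is a given string a well-formed r.e., and is a string s in the set it describes?) can be explored with a small sketch built directly on the recursive definition of R. Several things here are assumptions rather than part of the slides: concatenation is written by juxtaposition, parentheses group sub-expressions, there is no literal for the empty string, and the names `parse`, `ends`, `is_regular_expression` and `in_language` are hypothetical. Matching is done by tracking the set of positions reachable in s, which is one simple approach and not necessarily the algorithm the lecture goes on to present.

```python
def parse(pattern, alphabet):
    """Parse a pattern into nested tuples; raise ValueError if it is not well-formed."""
    symbols = set(alphabet)
    pos = 0

    def peek():
        return pattern[pos] if pos < len(pattern) else None

    def expr():  # expr := term ('+' term)*
        nonlocal pos
        node = term()
        while peek() == "+":
            pos += 1
            node = ("union", node, term())
        return node

    def term():  # term := factor factor ...   (concatenation by juxtaposition)
        node = factor()
        while peek() is not None and peek() not in "+)":
            node = ("concat", node, factor())
        return node

    def factor():  # factor := atom '*'*
        nonlocal pos
        node = atom()
        while peek() == "*":
            pos += 1
            node = ("star", node)
        return node

    def atom():  # atom := symbol | '(' expr ')'
        nonlocal pos
        ch = peek()
        if ch == "(":
            pos += 1
            node = expr()
            if peek() != ")":
                raise ValueError("missing ')'")
            pos += 1
            return node
        if ch is None or ch not in symbols:
            raise ValueError(f"unexpected {ch!r} at position {pos}")
        pos += 1
        return ("sym", ch)

    tree = expr()
    if pos != len(pattern):
        raise ValueError(f"unexpected {pattern[pos]!r} at position {pos}")
    return tree


def ends(node, s, starts):
    """Positions in s reachable by matching `node` from any position in `starts`."""
    kind = node[0]
    if kind == "sym":
        return {i + 1 for i in starts if i < len(s) and s[i] == node[1]}
    if kind == "union":
        return ends(node[1], s, starts) | ends(node[2], s, starts)
    if kind == "concat":
        return ends(node[2], s, ends(node[1], s, starts))
    # star: zero repetitions keep the start positions; repeat until nothing new appears
    reached = set(starts)
    frontier = set(starts)
    while frontier:
        frontier = ends(node[1], s, frontier) - reached
        reached |= frontier
    return reached


def is_regular_expression(pattern, alphabet):
    try:
        parse(pattern, alphabet)
        return True
    except ValueError:
        return False


def in_language(pattern, s, alphabet):
    """True if the whole string s is matched by the pattern."""
    return len(s) in ends(parse(pattern, alphabet), s, {0})


for candidate in ["(A+C)*EEC*", "*(A+C)", "AC*..E"]:
    print(candidate, is_regular_expression(candidate, "ACE"))
print(in_language("(A+C)*EEC*", "ACEECC", "ACE"))
```

Under this sketch's grammar, a leading `*` is rejected because it has no operand, and `.` is rejected because it is not in the alphabet {A, C, E}.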


UCSD CSE 182 - Lecture
