Berkeley COMPSCI 294 - Lecture Notes

CS 294-8 Computational Biology for Computer Scientists, Spring 2003
Lecture 14: March 6
Lecturer: Kimmen Sjölander    Scribe: Jonathan Carter

Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the Instructor.

14.1 Function Prediction in Protein Superfamilies

The structure of proteins can be thought of as a hierarchy:

• a primary structure, which is simply the sequence of amino acid residues;
• a secondary structure of α-helices and β-sheets, which describes the local structure;
• a tertiary structure composed of loops and turns that connect the underlying α and β elements.

Proteins can have similar function yet show quite different structures due to small mutations in the sequence. Predicting protein function is a difficult problem: typically 40% of the genes in a given genome cannot be classified with current algorithms.

Most genes and proteins have been assigned a function by homology. In the simplest case, the query protein is run through BLAST and checked for significant hits in a protein database; the query is then assigned the same function as the best hit. The accuracy of this method is poor in many cases.

14.2 Homology Recognition and Prediction by Homology

The fundamental hypothesis of prediction by homology is that conserved structure implies conserved function. Both structure and function are conserved by common sequence at key positions, but in general it is unknown which pieces of the sequence are most important. This presents a problem for simple pairwise sequence comparison.

In searching for members of a protein family (or members of the tiger family, in Prof. Sjölander's slides), all features must be assumed a priori to be equally informative. Without knowing what the important features are, relatives might not be recognized.
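The simple best-hit annotation transfer described above can be sketched as follows. This is a hypothetical illustration: the (function, e-value) hit list stands in for parsed BLAST output, and the cutoff value is made up, not a recommendation.

```python
# Hypothetical sketch of annotation transfer by best BLAST hit.
# The (function, e_value) pairs stand in for parsed BLAST output.

def transfer_function(hits, evalue_cutoff=1e-5):
    """Assign the query the function of its most significant hit,
    or None when no hit clears the significance cutoff."""
    significant = [(func, e) for func, e in hits if e <= evalue_cutoff]
    if not significant:
        return None            # the ~40% of genes that stay unclassified
    return min(significant, key=lambda h: h[1])[0]

hits = [("trypsin-like serine protease", 3e-40),
        ("hypothetical protein", 0.2)]
print(transfer_function(hits))   # -> trypsin-like serine protease
```

The weakness the lecture points out shows up directly: the transfer is only as good as the single best hit, with no notion of which sequence positions carry the functional signal.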
By gathering together known family members and creating a profile, remote relatives can be identified.

14.3 Including Prior Information in Profile or HMM Construction

Given a small training set, estimating the probabilities of events not present in the training set is a problem. For example, if an amino acid is absent at position 1 in every training sequence, what is the probability that it will occur there? The position may be critical and may not allow any substitutions. However, if we have a large training set and a position that is very well conserved, the probability of finding that particular amino acid at that location should be fixed very high, since we have considerable evidence. A good model would interpolate between reliance on prior data and confidence in new data. Such a model is given by Dirichlet mixture densities.

A Dirichlet density ρ is a probability density over the set of all probability vectors p, given by

    ρ(p) = (1/Z) ∏_i p_i^(α_i − 1)

where α are the Dirichlet parameters and Z is a normalizing constant. Proteins have a 20-letter alphabet, with p_i = Prob(amino acid i). The mean value of p_i is α_i / |α|. A Dirichlet mixture with l components has the form

    ρ = q_1 ρ_1 + q_2 ρ_2 + … + q_l ρ_l

where each ρ_i is a Dirichlet density and the parameters q_i sum to 1.

As observations are added to the model, the estimated probability vector changes according to

    p̂_i = Σ_{j=1..l} Prob(α_j | n, Θ) · (n_i + α_{j,i}) / (|n| + |α_j|)

where n is the vector of observation counts, and the term Prob(α_j | n, Θ) is interpreted as how likely each individual component of the Dirichlet mixture is to have produced the data. This formula has the limiting form p̂_i = n_i / |n| when a large amount of data has been collected.

For example, a nine-component mixture was optimized using the Expectation-Maximization (EM) algorithm on the BLOCKS alignment database in [S96].
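The update formula above can be checked numerically on a toy alphabet. This is a minimal sketch with made-up component weights and parameters (a 3-letter alphabet rather than 20); the posterior component weights Prob(α_j | n, Θ) are computed from the Dirichlet marginal likelihood of the counts.

```python
import math

def log_marginal(n, alpha):
    """log Prob(n | alpha) for count vector n under one Dirichlet
    component (the multinomial coefficient is omitted; it cancels
    when the component weights are normalized)."""
    out = math.lgamma(sum(alpha)) - math.lgamma(sum(n) + sum(alpha))
    for ni, ai in zip(n, alpha):
        out += math.lgamma(ni + ai) - math.lgamma(ai)
    return out

def posterior_mean(n, components):
    """p-hat_i = sum_j Prob(alpha_j | n) (n_i + alpha_ji) / (|n| + |alpha_j|)
    for a mixture given as (q_j, alpha_j) pairs."""
    logw = [math.log(q) + log_marginal(n, a) for q, a in components]
    top = max(logw)                       # subtract max for stability
    w = [math.exp(x - top) for x in logw]
    Z = sum(w)
    w = [x / Z for x in w]                # Prob(alpha_j | n, Theta)
    return [sum(wj * (n[i] + a[i]) / (sum(n) + sum(a))
                for wj, (_, a) in zip(w, components))
            for i in range(len(n))]

# Toy 3-letter alphabet, two made-up components.
mixture = [(0.5, [5.0, 1.0, 1.0]),        # favors letter 0
           (0.5, [1.0, 1.0, 5.0])]        # favors letter 2
print(posterior_mean([8, 1, 0], mixture))
```

With small counts the estimate leans on the component that best explains the data; scaling the counts up by 100 pushes p̂_i toward n_i / |n|, matching the limiting form in the text.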
The nine components can be thought of as modeling the distributions of amino acids in nine different contexts. These need not correspond to actual physical environments, or subfamilies, of proteins, but they provide the flexibility to describe them. Studies using this method of estimating amino acid distributions report fewer false positives and false negatives.

For example, trypsin in Fusarium oxysporum (a fungus) has only 16% identity with exfoliative toxin A in Staphylococcus aureus (a bacterium), yet the structures are very similar. For this kind of distant-relative detection, the critical positions must be known, as well as the positions where deletions and mutations are allowed.

14.4 Phylogenetic Analysis

After detecting a homolog, it is important to determine the phylogenetic context. This is because, over evolutionary timescales, gene duplications have produced families of related genes and proteins that can have potentially different functions.

Phylogenomic analysis consists of predicting a function for a protein in the context of a protein family. Typically, a multiple sequence alignment is computed, then a phylogenetic tree is constructed and overlaid with the available information.

In the past, phylogenetic trees were constructed from observable characteristics of organisms; now they are based on conserved genes and proteins. Comparing tree construction for organisms with that for protein families: for organisms, the genes used are typically well conserved and easily aligned, but within protein families there may be duplications, and protein structure and function may have diverged considerably.

Bayesian evolutionary tree estimation (BETE) is a method of constructing a phylogenetic tree based on a multiple alignment of a protein family. It is similar to a standard clustering algorithm that recursively joins the 'closest' classes. In this case, distance is defined as the relative entropy between distributions.
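The distance computation at the heart of a BETE-style agglomerative step can be sketched as below. The function names are illustrative, and symmetrizing the relative entropy is a choice made here for the sketch (plain D(p‖q) is asymmetric); BETE itself does considerably more bookkeeping around the tree.

```python
import math

def relative_entropy(p, q):
    """D(p || q) between two amino-acid distributions (natural log)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def closest_pair(dists):
    """Index pair with the smallest symmetrized relative entropy --
    the pair an agglomerative BETE-style step would join next."""
    best, pair = float("inf"), None
    for i in range(len(dists)):
        for j in range(i + 1, len(dists)):
            d = (relative_entropy(dists[i], dists[j])
                 + relative_entropy(dists[j], dists[i]))
            if d < best:
                best, pair = d, (i, j)
    return pair

# Toy 3-letter distributions: the first two are similar.
profiles = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.2, 0.7]]
print(closest_pair(profiles))   # -> (0, 1)
```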
The amino acid distributions are obtained using HMMs and the method of Dirichlet mixture priors.

14.5 Hidden Markov Model Parameter Estimation

HMMs were originally used in speech recognition, and have since been used in computational biology for DNA modeling, multiple sequence alignment, and identification of related family members.

Typically, HMM parameter estimation is done from unaligned training sequences using Expectation-Maximization (EM):

1. Align the sequences to the HMM.
2. Gather statistics on the paths taken through the HMM.
3. Modify the HMM parameters to maximize Prob(data | model).

Steps 1–3 are then iterated until the parameters converge.

For remote homolog detection, need
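The iterate-until-convergence loop in steps 1–3 can be sketched with a minimal Baum–Welch update for a plain discrete HMM. This is a toy under stated assumptions: it handles a single short observation sequence, does no rescaling (so it underflows on long sequences), and omits the match/insert/delete state structure that profile HMMs for protein families add on top.

```python
import random

def baum_welch(obs, n_states, n_symbols, iters=20, seed=0):
    """One-sequence Baum-Welch (EM) re-estimation of transitions A,
    emissions B, and initial distribution pi, so that Prob(obs | model)
    increases each iteration.  Unscaled: toy sequences only."""
    rng = random.Random(seed)
    norm = lambda v: [x / sum(v) for x in v]
    A = [norm([rng.random() + 0.1 for _ in range(n_states)]) for _ in range(n_states)]
    B = [norm([rng.random() + 0.1 for _ in range(n_symbols)]) for _ in range(n_states)]
    pi = norm([rng.random() + 0.1 for _ in range(n_states)])
    T, S = len(obs), range(n_states)
    for _ in range(iters):
        # E-step (steps 1-2): forward/backward give expected path statistics.
        f = [[pi[i] * B[i][obs[0]] for i in S]]
        for t in range(1, T):
            f.append([B[j][obs[t]] * sum(f[t - 1][i] * A[i][j] for i in S) for j in S])
        b = [[1.0] * n_states for _ in range(T)]
        for t in range(T - 2, -1, -1):
            b[t] = [sum(A[i][j] * B[j][obs[t + 1]] * b[t + 1][j] for j in S) for i in S]
        Z = sum(f[T - 1])                 # Prob(obs | current model)
        gamma = [[f[t][i] * b[t][i] / Z for i in S] for t in range(T)]
        # M-step (step 3): re-estimate all parameters from the statistics.
        newA = [[sum(f[t][i] * A[i][j] * B[j][obs[t + 1]] * b[t + 1][j]
                     for t in range(T - 1)) for j in S] for i in S]
        newB = [[sum(gamma[t][i] for t in range(T) if obs[t] == k)
                 for k in range(n_symbols)] for i in S]
        pi, A, B = gamma[0][:], [norm(r) for r in newA], [norm(r) for r in newB]
    return A, B, pi

A, B, pi = baum_welch([0, 1, 0, 1, 0, 1], n_states=2, n_symbols=2)
```

The correspondence to the lecture's steps: the forward/backward pass plays the role of aligning sequences to the HMM and gathering path statistics, and the normalized re-estimates are the parameter update that maximizes Prob(data | model) given those statistics.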

