
MIT OpenCourseWare  http://ocw.mit.edu
6.047 / 6.878 Computational Biology: Genomes, Networks, Evolution, Fall 2008
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

6.047/6.878 Computational Biology                                      Nov. 4, 2008
Lecture 18: CRFs for Computational Gene Prediction
Lecturer: James E. Galagan

1 Overview of gene prediction

One of the fundamental problems in computational biology is to identify genes in very long genome sequences. As we know, DNA is a sequence of nucleotide molecules (a.k.a. bases) which encode instructions for the generation of proteins. However, not all of these bases are responsible for protein generation. As shown in the 4th slide on page 1 of [2], in the eukaryotic gene structure only exons contribute to protein manufacturing; other segments, such as introns, intergenic regions, start sites, and stop sites, are not directly responsible for protein production. Therefore our task in gene prediction (or genome annotation) is: given a DNA sequence X with zero or more genes and some “evidence” associated with it, determine a labeling Y that assigns to each base a label according to the functionality of that part of the gene. For example, the labels can be intergenic, start, exon, acceptor, intron, donor, stop, etc.

How do we do gene prediction? It turns out that we can make use of several types of evidence. For example, some gene parts are associated with short fixed sequences, called signals. However, since such short sequences also appear at random throughout the DNA sequence, we cannot rely on these signals alone for gene prediction. Other evidence relies on the tendency of certain genomic regions to have a specific base composition (a.k.a. content measures). For example, there are usually multiple codons for each amino acid and not all codons are used equally often, which gives rise to different ratios of nucleotides in coding sequences. Apart from this evidence, which stems from DNA properties, there is also evidence like direct experimental evidence, BLAST hits, HMMer hits, etc. The main challenge for gene prediction algorithms is how to take advantage of all these types of evidence together.

The most popular method so far for combining evidence is to use Hidden Markov Models (HMMs). In an HMM, we model the labels as hidden states and assume that the hidden states constitute a Markov chain. We assign an emission probability to every pair (x ∈ X, y ∈ Y) to model the probability of observing base x when the hidden state is y. This is also called a generative model, since a convenient way to think of HMMs is to imagine a machine with a button: by pushing the button, one can generate a genome sequence according to some probability distribution. In gene prediction, we use maximum-likelihood training to compute state transition probabilities and emission probabilities, and then find the most likely hidden state sequence Y = y1 · · · yn. Note that HMMs in fact model a joint distribution over bases and hidden states, namely P(X, Y) = P(Sequence, Labels). However, in gene prediction problems we are given X and only need to predict Y. In other words, we want the conditional probability distribution P(Y | X).
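As a minimal sketch of the HMM decoding step described above, the following Python fragment runs the Viterbi algorithm over a toy two-state model (intergenic vs. exon). The states, transition probabilities, and emission probabilities are invented for illustration and are not trained parameters from any real gene predictor.

```python
import numpy as np

# Toy HMM for gene prediction: hidden states are coarse labels, observations
# are nucleotides. All probabilities below are made-up illustrative values.
states = ["intergenic", "exon"]
obs_index = {"A": 0, "C": 1, "G": 2, "T": 3}

# transition[i][j] = P(state j at position t+1 | state i at position t)
transition = np.array([[0.9, 0.1],
                       [0.2, 0.8]])
# emission[i][k] = P(base k | state i); exons are assumed GC-rich here
emission = np.array([[0.30, 0.20, 0.20, 0.30],   # intergenic
                     [0.15, 0.35, 0.35, 0.15]])  # exon
initial = np.array([0.5, 0.5])

def viterbi(seq):
    """Return the most likely hidden label sequence Y for the DNA string X."""
    x = [obs_index[b] for b in seq]
    n, k = len(x), len(states)
    # work in log space to avoid underflow on long genomic sequences
    logv = np.full((n, k), -np.inf)
    backptr = np.zeros((n, k), dtype=int)
    logv[0] = np.log(initial) + np.log(emission[:, x[0]])
    for t in range(1, n):
        for j in range(k):
            scores = logv[t - 1] + np.log(transition[:, j])
            backptr[t, j] = np.argmax(scores)
            logv[t, j] = scores[backptr[t, j]] + np.log(emission[j, x[t]])
    # trace back the best path from the highest-scoring final state
    path = [int(np.argmax(logv[-1]))]
    for t in range(n - 1, 0, -1):
        path.append(backptr[t, path[-1]])
    return [states[s] for s in reversed(path)]

print(viterbi("ATGCGCGCTAAT"))
```

In a real predictor the parameters would come from maximum-likelihood training on annotated sequences, and the state space would include the start, donor, acceptor, intron, and stop labels listed above rather than just two states.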
There are also so-called generalized hidden Markov models (GHMMs). In these models, each state may emit more than one symbol, emission probabilities may be modeled by arbitrary probabilistic models, and sometimes feature lengths are explicitly modeled according to experimental data (instead of being restricted to the geometric distributions required by HMMs). As an example, the 2nd slide on page 2 of [2] demonstrates Genscan, a gene prediction program based on GHMMs. The 3rd slide shows a comparison of the performance (in terms of both Sensitivity and Specificity) of four different GHMM-based gene prediction programs. One can see that there is still room for improvement, and this is mostly due to the limitations of HMMs.

2 Why Conditional Random Fields (CRFs)

There are certain intrinsic limitations to the HMM approach. First, as we mentioned earlier, the most natural distribution to model is the conditional distribution rather than the joint distribution. HMMs expend unnecessary effort to model the joint distribution P(X, Y) = P(Y | X) · P(X), when our real goal is just P(Y | X). What is even worse is that during the learning process we might have to trade away accuracy in modeling P(Y | X) in order to model P(X) more accurately, which leads to less accurate gene prediction.

Another limitation is that since HMMs are directed graphical models, each component of an HMM has strict probabilistic semantics. Therefore it is hard for HMMs to model dependent evidence. However, as we know, in gene prediction there are many dependent pieces of evidence that need to be incorporated. For example, HMMer protein domain predictions come from models based on known protein sequences, and these sequences are the same proteins searched by BLAST; therefore evidence coming from HMMer hits and BLAST hits is by no means independent. In fact, dependence is the rule for most evidence available for gene prediction. One major drawback of HMMs is that it is very cumbersome to model arbitrarily dependent evidence about the input sequence. One may try to modify HMMs by adding more hidden states to model such dependency, but this approach usually leads to training and inference that are computationally intractable and/or suffer from data sparsity problems. One common strategy is to simply assume independence (the naive Bayes assumption) among the conditional probabilities. In other words, we would assume that P(X | Yi Yi−1 . . . Y1) = P(X | Yi), but this assumption is typically false in practice.

All these difficulties with HMMs motivated people to propose an alternative approach, known as discriminative models, which model the conditional distribution P(Y | X) directly. In contrast to the directed graphical model of an HMM, this is an undirected graphical model (i.e., a Markov random field). In undirected graphical models, edges between nodes no longer bear probabilistic semantics. Instead, an edge simply represents some “compatibility function” (the feature functions).
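To make the contrast with the generative HMM concrete, here is a minimal sketch of how a linear-chain CRF scores a labeling Y for a given sequence X: each edge carries arbitrary, possibly overlapping feature functions weighted by learned parameters, and the model normalizes only over labelings Y, never over X. The feature functions, weights, and the per-base BLAST-hit track below are hypothetical, chosen only to illustrate that dependent evidence can be mixed freely.

```python
import math
from itertools import product

labels = ["intergenic", "exon"]

# Feature functions f_k(y_prev, y_curr, X, t). They may inspect arbitrary,
# overlapping evidence about X (base composition, a hypothetical per-base
# BLAST-hit indicator, ...) without any independence assumptions.
def f_gc_in_exon(y_prev, y_curr, x, t):
    return 1.0 if y_curr == "exon" and x["seq"][t] in "GC" else 0.0

def f_blast_in_exon(y_prev, y_curr, x, t):
    # x["blast"][t] is a hypothetical indicator that base t lies in a BLAST hit
    return 1.0 if y_curr == "exon" and x["blast"][t] else 0.0

def f_stay(y_prev, y_curr, x, t):
    return 1.0 if y_prev == y_curr else 0.0

features = [f_gc_in_exon, f_blast_in_exon, f_stay]
weights = [0.8, 1.5, 0.6]   # lambda_k; normally learned by maximizing P(Y | X)

def score(y, x):
    """Unnormalized log-score: sum over t and k of lambda_k * f_k(y_{t-1}, y_t, X, t)."""
    s = 0.0
    for t in range(len(y)):
        y_prev = y[t - 1] if t > 0 else None
        s += sum(w * f(y_prev, y[t], x, t) for w, f in zip(weights, features))
    return s

def conditional_prob(y, x):
    """P(Y | X) = exp(score(Y, X)) / Z(X), where Z(X) sums over labelings only.

    Brute-force enumeration here; a real CRF uses dynamic programming.
    """
    z = sum(math.exp(score(list(yy), x))
            for yy in product(labels, repeat=len(x["seq"])))
    return math.exp(score(y, x)) / z

x = {"seq": "ATGGC", "blast": [0, 0, 1, 1, 1]}
y = ["intergenic", "intergenic", "exon", "exon", "exon"]
print(conditional_prob(y, x))
```

Note that no probability is ever assigned to the sequence X itself: the denominator Z(X) sums only over candidate labelings, which is exactly the sense in which a CRF models P(Y | X) directly rather than the joint P(X, Y).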

