MIT OpenCourseWare
http://ocw.mit.edu

6.047 / 6.878 Computational Biology: Genomes, Networks, Evolution
Fall 2008

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

6.047/6.878 Fall 2008 Lecture #5 – Scribe Notes

1. Overview

In the previous lecture we looked at examples of unsupervised learning techniques, such as clustering. In this lecture we focus on supervised learning techniques, which use labeled training data to predict the state or "class" of a new object. There are two broad approaches to classification: (1) generative models, which build probabilistic models of how the data in each class is generated, and (2) discriminative models, which learn a function that directly teases the classes apart. Naïve Bayes classifiers are an example of generative models, and Support Vector Machines (SVMs) are an example of discriminative models. We will discuss a biological application of each: the use of Naïve Bayes classifiers to predict mitochondrial proteins across the genome, and the use of SVMs to classify cancers based on gene expression monitoring by DNA microarrays. The salient features of both techniques and the caveats of using each will also be discussed.

2. Classification – Bayesian Techniques

We will discuss classification in the context of the problem of identifying mitochondrial proteins. If we look across the genome, how do we determine which proteins are involved in mitochondrial processes, or more generally, which proteins are targeted to the mitochondria?¹ This is particularly useful because once we know the mitochondrial proteins, we can start asking interesting questions about how these proteins mediate disease processes and metabolic functions.

¹ Mitochondria play an important role in metabolic processes and are known as the "powerhouse" of the cell. They represent an ancient bacterial invader that has been successfully subsumed by eukaryotic cells. Mitochondria have their own genetic material, commonly referred to as mtDNA. Interestingly, a large fraction of inherited mitochondrial disorders are due to mutations in nuclear genes encoding proteins targeted to the mitochondria, not to mutations in mtDNA itself. Hence it is imperative to identify these mitochondrial proteins to shed light on the molecular basis of these disorders.

The classification of mitochondrial proteins uses 7 features computed for all human proteins: (1) targeting signal, (2) protein domains, (3) co-expression, (4) mass spectrometry, (5) sequence homology, (6) induction, and (7) motifs. In general, our approach will be to determine how these features are distributed for objects of different classes; classification decisions are then made using probability calculus. With 7 features per protein (or more generally, per object), each object is typically associated with more than one feature. To simplify things, let us consider just one feature and introduce the two key concepts.

(1) We want a model for the probability of a feature given each class. The features from each class are drawn from what are known as class-conditional probability distributions (CCPDs):

    P(feature | Class)

(2) We model prior probabilities, P(Class), to quantify the expected a priori chance of seeing each class.
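To make these two quantities concrete, here is a minimal sketch in Python. It assumes a single binary feature (say, presence of a targeting signal); the class names and all probability values are illustrative placeholders, not numbers from the lecture.

    # A minimal sketch of the two key concepts, assuming one binary
    # feature (1 = the protein has a targeting signal, 0 = it does not).
    # All probability values are illustrative placeholders.

    # (1) Class-conditional probability distributions (CCPDs):
    #     one distribution P(feature | Class) per class.
    ccpd = {
        "mito":     {1: 0.60, 0: 0.40},   # P(feature | Class = mito)
        "not_mito": {1: 0.05, 0: 0.95},   # P(feature | Class = not_mito)
    }

    # (2) Priors: the a priori chance of seeing each class. Mitochondrial
    # proteins are a small fraction of the proteome, so their prior is low.
    prior = {"mito": 0.05, "not_mito": 0.95}

    def likelihood(feature_value, cls):
        """P(feature = feature_value | Class = cls), read off the CCPD."""
        return ccpd[cls][feature_value]

    print(likelihood(1, "mito"), likelihood(1, "not_mito"))  # 0.6 0.05

The same structure extends to the 7 real features by keeping one CCPD per (feature, class) pair.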
For the case of identifying mitochondrial proteins, only a small fraction of all proteins are mitochondrial, so we would like a low prior probability for the "mitochondrial" class. In this way we bias ourselves against classifying a protein as mitochondrial unless the evidence is strong. (But how do we actually determine these prior probabilities? We will cover this in section 2.2.)

A good example of how this second concept affects classification was discussed in lecture: the example of Eugene. Eugene likes to read books on philosophy and has a penchant for tweed jackets. The question posed was whether Eugene is more likely (in probabilistic terms) to be an ivy-league dean or a truck driver. The answer was that he is more likely to be a truck driver, because there are many more truck drivers in the world than there are ivy-league deans. This overwhelming prior outweighs the fact that ivy-league deans are more likely than truck drivers to enjoy books on philosophy and tweed jackets. Such prior knowledge influences our classification decisions, and we encode it as the probability of belonging to a certain class, P(Class).

Now that we have the prior, P(Class), and the likelihood of a given feature, P(feature | Class), we can apply Bayes' rule to obtain the posterior, P(Class | feature), since we want to classify based on features:

    P(Class | feature) = P(feature | Class) P(Class) / P(feature)

Here P(Class | feature) is the posterior (our belief after seeing the evidence), P(feature | Class) is the likelihood (the evaluation of the evidence), P(Class) is the prior (our belief before the evidence), and P(feature) is the evidence. This is the application of Bayes' rule to our problem. More generally, we can compare two classes by taking the ratio of their posteriors:

    P(Class1 | feature) / P(Class2 | feature)
        = [ P(feature | Class1) P(Class1) ] / [ P(feature | Class2) P(Class2) ]

(the evidence term P(feature) cancels in the ratio). We can also apply Bayes' rule recursively as evidence accumulates, using the "belief after evidence" from one piece of evidence as the "belief before evidence" for the next. Once we have determined the "belief after evidence" for each class, we use a decision rule to choose a particular class. Given two classes, Bayes' decision rule is to choose Class1 whenever

    P(feature | Class1) P(Class1) > P(feature | Class2) P(Class2)

Taking log probabilities, we can build a discriminant function:

    G(feature) = log [ P(feature | Class1) P(Class1) ] - log [ P(feature | Class2) P(Class2) ]

and classify an object as Class1 exactly when G(feature) > 0.
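Continuing the sketch above (same hypothetical class names and probability values), the following shows Bayes' rule and the log-odds discriminant in code; the posterior, the evidence, and the G > 0 decision rule map directly onto the formulas:

    import math

    # Hypothetical CCPDs and priors carried over from the earlier sketch.
    ccpd  = {"mito": {1: 0.60, 0: 0.40}, "not_mito": {1: 0.05, 0: 0.95}}
    prior = {"mito": 0.05, "not_mito": 0.95}

    def posterior(cls, f):
        """P(cls | f) = P(f | cls) P(cls) / P(f), where the evidence
        P(f) is the same numerator summed over all classes."""
        evidence = sum(ccpd[c][f] * prior[c] for c in prior)
        return ccpd[cls][f] * prior[cls] / evidence

    def discriminant(f):
        """G(f) = log [P(f | mito) P(mito)] - log [P(f | not_mito) P(not_mito)].
        Bayes' decision rule: classify as mitochondrial when G(f) > 0."""
        num = ccpd["mito"][f] * prior["mito"]
        den = ccpd["not_mito"][f] * prior["not_mito"]
        return math.log(num / den)

    f = 1  # the protein does have a targeting signal
    print(posterior("mito", f))                            # ~0.39
    print("mito" if discriminant(f) > 0 else "not_mito")   # not_mito

Note that even with the feature present, the low prior wins here, just as the abundance of truck drivers wins in the Eugene example; with several features, the per-feature log-likelihood ratios add up and can overcome the prior.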

