Stanford CS 374 - Lecture 14 - Biological Data Mining

School: Stanford University

Course: CS 374 - Algorithms in Biology

Biological Data Mining
CS374 Fall 2006
Lecture 14, 10/31/06
Lecturer: Rashmi Raj
Scribe: Siddharth Jonathan J.B.

Based on the following paper:
1. Predicting Post-synaptic Activity in Proteins with Data Mining. Gisele L. Pappa, Antony J. Baines and Alex A. Freitas.

Additional references:
1. Fayyad U.M., Piatetsky-Shapiro G. and Smyth P. (1996) “From Data Mining to Knowledge Discovery: An Overview”.
2. Devos D. and Valencia A. (2000) Practical Limits of Functional Prediction. Proteins, 41, 98-107.
3. Gerlt J.A. and Babbitt P.C. (2000) Can sequence determine function? Genome Biol., 1(5), reviews 0005.1-reviews 0005.10.
4. Quinlan J.R. (1993) C4.5: Programs for Machine Learning. Morgan Kaufmann.
5. Nagl S. (2003) Function Prediction from Protein Sequence. In Orengo C.A., Jones D.T. and Thornton J.M. (eds), Bioinformatics: Genes, Proteins and Computers, Bios, pp. 65-79.
6. Hulo N. et al. (2004) Recent improvements to the PROSITE database. Nucleic Acids Res., 32, D134-D137.

1. Introduction

Predicting the function of a protein from its sequence has long been an area of active research. Accurate prediction would help in understanding diseases, designing more effective drugs, and so on. Although a vast amount of data is available in protein databases, there is still a long way to go in understanding the mapping between sequence and function. This paper focuses on one instance of the problem: predicting whether or not a protein has post-synaptic activity. One of the primary challenges is to identify the features that determine whether a protein has post-synaptic activity. Although predictive accuracy is important, the focus here is also on how easily biologists can interpret the learned rules, and on identifying novel, previously unknown rules.

2. Post Synaptic Activity – Background

A synapse is a point where two nerve cells communicate with each other through the transmission of a chemical known as a neurotransmitter.
As shown in Figure 1, multiple types of proteins are expected to be found at these sites for the reception and propagation of signals. The cells are held together by adhesion. The neurotransmitters are stored in bags called synaptic vesicles. The synaptic vesicles fuse with the pre-synaptic membrane and release their contents into the synaptic cleft. The post-synaptic receptors recognize this as a signal and become activated, then transmit the signal on to other signaling components. Within the post-synaptic cell, the signaling apparatus is organized by various scaffolding proteins.

[Figure 1: An Overview of Post-synaptic Activity]

3. The Problem

The problem being tackled here is that of predicting whether or not a protein has post-synaptic activity. This problem attracts great interest because proteins with post-synaptic activity are connected with the functioning of the nervous system. Identifying them would help us better understand diseases such as Alzheimer’s disease and Wilson’s disease.

4. Data Mining – A Background

Data mining is the science of discovering interesting patterns in large amounts of data.

[Figure 2: The Data Mining Pipeline (Database, Data Mining, Learnt Patterns, Decision Support)]

Data mining finds application in many diverse fields such as science, the Web, and government. The knowledge gained from data mining helps in making complex decisions, both in recognizing when such decisions need to be made and in validating the rationale behind them.

5. Data Mining – Example

The most common example of data mining as applied in the world today is the market basket problem. The data analyzed here are the market baskets in transactions.
The buying patterns of users, in terms of which products are bought together, are what we are interested in here. One can imagine the commercial utility of such an exercise. For example, if bread and butter are found to go together in many market baskets, one can imagine placing them side by side in an aisle, or perhaps farther apart to induce the customer to walk through more of the store and possibly purchase more items. This sort of pattern finding can be done with a common data mining algorithm called the Apriori algorithm. (As an aside, one interesting pattern revealed by data mining was the occurrence of beer and diapers together in many baskets!)

6. Data Mining – Why Important?

With the increase in the volume of data being generated, it has become necessary to perform data mining to identify interesting patterns and trends and to support decisions. There are myriad application domains where data mining has become indispensable, namely the Web, banks, e-commerce, astronomy, biology, etc. The human brain cannot process such large volumes of data or search them for complex multi-factor dependencies.

7. Some Common Applications of Data Mining

The following are some popular types of data mining approaches.
1. Learning Association Rules
2. Learning Sequential Patterns
3. Classification
4. Numeric Prediction
5. Clustering

8. Decision Trees

A decision tree is a commonly used data mining technique that serves as a predictive model. In its simplest form, it takes a set of examples as input and attempts to classify each of them into one of two categories. The inputs are represented by a set of predictor attributes. A standard set of example inputs is shown in the table below.

Outlook    Temperature  Humidity  Windy  Play?
sunny      hot          high      false  No
sunny      hot          high      true   No
overcast   hot          high      false  Yes
rain       mild         high      false  Yes
rain       cool         normal    false  Yes
rain       cool         normal    true   No
overcast   cool         normal    true   Yes
sunny      mild         high      false  No
sunny      cool         normal    false  Yes
rain       mild         normal    false  Yes
sunny      mild         normal    true   Yes
overcast   mild         high      true   Yes
overcast   hot          normal    false  Yes
rain       mild         high      true   No
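
To make the table concrete, here is a minimal sketch of how a decision tree learner might choose its first split on these 14 examples. The preview of the notes does not say which splitting criterion the lecture used; information gain (the entropy-based criterion used by ID3, the predecessor of Quinlan's C4.5) is an assumption here, and the function names entropy, information_gain and best_attribute are hypothetical.

```python
from collections import Counter
from math import log2

# Predictor attributes, in the column order of the table above.
ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Windy"]

# The 14 examples from the table; the last field is the class label (Play?).
EXAMPLES = [
    ("sunny", "hot", "high", "false", "No"),
    ("sunny", "hot", "high", "true", "No"),
    ("overcast", "hot", "high", "false", "Yes"),
    ("rain", "mild", "high", "false", "Yes"),
    ("rain", "cool", "normal", "false", "Yes"),
    ("rain", "cool", "normal", "true", "No"),
    ("overcast", "cool", "normal", "true", "Yes"),
    ("sunny", "mild", "high", "false", "No"),
    ("sunny", "cool", "normal", "false", "Yes"),
    ("rain", "mild", "normal", "false", "Yes"),
    ("sunny", "mild", "normal", "true", "Yes"),
    ("overcast", "mild", "high", "true", "Yes"),
    ("overcast", "hot", "normal", "false", "Yes"),
    ("rain", "mild", "high", "true", "No"),
]


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(examples, attr_index):
    """Entropy reduction obtained by splitting the examples on one attribute."""
    labels = [row[-1] for row in examples]
    remainder = 0.0
    for value in {row[attr_index] for row in examples}:
        subset = [row[-1] for row in examples if row[attr_index] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - remainder


def best_attribute(examples):
    """Return the attribute with the highest gain, plus all gains by name."""
    gains = {name: information_gain(examples, i) for i, name in enumerate(ATTRIBUTES)}
    return max(gains, key=gains.get), gains


if __name__ == "__main__":
    best, gains = best_attribute(EXAMPLES)
    for name, gain in gains.items():
        print(f"{name}: {gain:.3f} bits")
    print("Attribute chosen for the root split:", best)
```

Under this criterion the root split falls on Outlook, whose gain is about 0.247 bits: every overcast example is labeled Yes, so that branch is already pure. A full tree learner would repeat the same selection recursively on the sunny and rain subsets.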
