Mining Medical Literature
Outline
Slide 3
Introduction
The Problem
What is Data Mining?
Example Data!
Amazon.com
Google News
More Applications
Information Retrieval (IR)
Simple flow of Retrieval Process
IR System Evaluation
Precision and Recall
Problems with Precision and Recall
Sensitivity and Specificity
Slide 17
Slide 18
HOVERGEN: a Database of Homologous Vertebrate Genes
Why identify functional gene groups?
Existing Approaches
Statistical NLP approach
Neighbor Divergence Approach
Challenges in the Problem
Neighbor Divergence Intuition
Neighbor Divergence Algorithm
ND – Article Representation
ND – Identifying Semantic Neighbors
ND – Scoring articles
ND – Difference in Distributions
Observed and Expected Distribution of Article Scores
Results
Other methods
Other methods
Evaluation
Corrupting Functional Groups
Slide 37
Slide 38
Advantage
Existing approaches
Information Extraction and Machine Learning
ML techniques
Approach Used here
Unsupervised Learning – Contextual Similarity
Slide 45
Contextual Similarity
Partially supervised Learning – Snowball
Snowball
Supervised Learning – Text classification
Hand Crafted Extraction System – GPE system
Combined System
Final parameters used for the different systems
Running Times
Results and Evaluation
Slide 55
Slide 56
Conclusion and Future Work
Slide 58
Slide 59

1 Mining Medical Literature
Vignesh Ganapathy
(CS 374: Algorithms in Biology)
(Fall 2005)

2 Outline
- Introduction and Background
- Mining Technique 1: Identifying Functionally Coherent Gene Groups
- Mining Technique 2: Extracting Synonymous gene and protein terms
- Conclusions

3 Outline
- Introduction and Background
- Mining Technique 1: Identifying Functionally Coherent Gene Groups
- Mining Technique 2: Extracting Synonymous gene and protein terms
- Conclusions

4 Introduction
- Medical literature has vast amounts of knowledge and information
- PubMed Central (PMC) (the U.S.
  National Institutes of Health (NIH) free digital archive of biomedical and life sciences journal literature)
- Amedeo.com (The Medical Literature Guide)
- Journals such as Science, Nature, Cell, EMBO, Cell Biology, PNAS (and many more)

5 The Problem
- The major task is finding ways to extract useful information from these resources.

6 What is Data Mining?
“Data Mining is the Process of discovering meaningful, new correlation patterns and trends by sifting through large amount of data stored in repositories, using pattern recognition techniques as well as statistical and mathematical techniques.”

7 Example Data!
- Large amounts of data but no information
- Daily transactions at a supermarket
- Daily website visit histories
- Books/videos rented at a library
- Newspaper and journal archives

8 Amazon.com

9 Google News
- Clustering news items (Google News)

10 More Applications
- Improving sales strategy
- Finding items that sell together (a common example relates beer and diapers: a supermarket found that 50% of the time beer was purchased with diapers)
- Anomaly detection and many more

11 Information Retrieval (IR)
- Collecting information from text data (unstructured data)
- Applications
  - Searching web documents
  - Natural language processing
- The term also extends to multimedia and other forms of unstructured data

12 Simple flow of Retrieval Process

13 IR System Evaluation
Some measures are:
- Precision
- Recall
- F1 measure: a combined measure, the harmonic mean of precision and recall
- Sensitivity
- Specificity

14 Precision and Recall
- How are Precision and Recall related?

15 Problems with Precision and Recall
- Deciding which documents are relevant and which are not is not easy
- For recall, it is difficult to measure the number of relevant documents in the database
- Creating a pool of relevant records is one solution
- In practice, these are still good measures

16 Sensitivity and Specificity
- Sensitivity: probability that a positive example is identified as positive
- Specificity: probability that a negative example is identified as negative
- What is the relation between Sensitivity, Specificity, Precision and Recall?
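The evaluation measures named on the IR System Evaluation slides can all be computed from the four counts of a binary confusion table. The following is a minimal sketch in Python, assuming retrieved/relevant decisions have already been tallied into true/false positives and negatives; the `Confusion` class and the function names are illustrative and do not come from the original slides.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Counts for a binary retrieval or classification task (hypothetical helper)."""
    tp: int  # relevant items that were retrieved
    fp: int  # non-relevant items that were retrieved
    fn: int  # relevant items that were missed
    tn: int  # non-relevant items that were correctly left out


def _ratio(numerator: float, denominator: float) -> float:
    # Return 0.0 instead of raising when a measure is undefined.
    return numerator / denominator if denominator else 0.0


def precision(c: Confusion) -> float:
    # Fraction of retrieved items that are relevant.
    return _ratio(c.tp, c.tp + c.fp)


def recall(c: Confusion) -> float:
    # Fraction of relevant items that were retrieved.
    return _ratio(c.tp, c.tp + c.fn)


def sensitivity(c: Confusion) -> float:
    # Same quantity as recall: positives identified as positive.
    return recall(c)


def specificity(c: Confusion) -> float:
    # Fraction of negatives identified as negative.
    return _ratio(c.tn, c.tn + c.fp)


def f1(c: Confusion) -> float:
    # Harmonic mean of precision and recall.
    p, r = precision(c), recall(c)
    return _ratio(2 * p * r, p + r)


if __name__ == "__main__":
    counts = Confusion(tp=8, fp=2, fn=4, tn=86)
    for measure in (precision, recall, f1, sensitivity, specificity):
        print(f"{measure.__name__}: {measure(counts):.3f}")
```

Written this way, one part of the relation asked about on the slides becomes visible: sensitivity and recall are the same ratio, while precision and specificity both depend on the false-positive count but use different denominators.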
17 Outline
- Introduction and Background
- Mining Technique 1: Identifying Functionally Coherent Gene Groups
- Mining Technique 2: Extracting Synonymous gene and protein terms
- Conclusion

18 Introduction
- Analysis is shifting from single genes to families of genes
- Examples of these are:
  - Sequence data
  - Gene expression clustering
  - Deletion phenotypes
  - Yeast-2-Hybrid screens

19 HOVERGEN: a Database of Homologous Vertebrate Genes
- Useful for comparative sequence analysis or molecular evolution studies
- 10 biggest gene families

20 Why identify functional gene groups?
- Interesting to know functionally relevant groups for large gene group sets
- Helps to assess the significance of experimentally derived gene sets
- Refine gene groups to find more functionally relevant groups
- Existing algorithms can make use of this information in finding gene groups

21 Existing Approaches
- Use co-occurrence of gene names in abstracts to create networks of related genes automatically
- Use an existing vocabulary of gene functions and assigned genes to decide on a functionally relevant group (Gene Ontology (GO) consortium and Munich Information Center for Protein Sequences (MIPS))

22 Statistical NLP approach
- Used for annotating individual genes
- Determining gene and protein interactions
- Assigning keywords to genes or groups of genes

23 Neighbor Divergence Approach
- Statistical NLP technique
- Will always be up to date if provided with a current literature base
- Cannot specify what the actual function is!

24 Challenges in the Problem
- Large number of genes
- Genes have multiple functions
- Some genes have been extensively studied, others only recently discovered
- So the literature about genes reflects these differences

25 Neighbor Divergence Intuition

26 Neighbor Divergence Algorithm
- Representation of articles
- Identifying semantic neighbors for corpus articles
- Scoring articles relative to gene group
- Calculating a theoretical distribution of scores
- Calculating the difference between empirical and theoretical distribution

27 ND – Article Representation
- Words in articles represented by
  their inverse-document-frequency-weighted counts (to reduce the impact of common words)

$$
W_{i,j} =
\begin{cases}
(1 + \log_2 tf_{i,j}) \, \log_2 (N / df_i) & \text{if } tf_{i,j} > 0 \\
0 & \text{if } tf_{i,j} = 0
\end{cases}
$$

where
- W_{i,j}: the weighted count of word i in document j
- tf_{i,j}: the number of times word i occurs in document j
- df_i: the number of documents containing word i
- N: the total number of documents

28 ND – Identifying Semantic Neighbors
- For each article, K most similar articles
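The word weighting on the Article Representation slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: documents are taken to be pre-tokenized lists of words, the formula is read with the grouping (1 + log2 tf) * log2(N / df), and the function name `weighted_counts` is hypothetical rather than taken from the slides.

```python
import math
from collections import Counter


def weighted_counts(documents):
    """Return one {word: weight} dict per document.

    Assumed reading of the slide formula:
        W[i, j] = (1 + log2(tf[i, j])) * log2(N / df[i])  if tf[i, j] > 0
        W[i, j] = 0                                       otherwise
    Words with tf = 0 are simply left out of the dict (weight 0).
    """
    n_docs = len(documents)

    # df[word] = number of documents that contain the word at least once.
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))

    weights = []
    for tokens in documents:
        tf = Counter(tokens)  # raw count of each word in this document
        weights.append({
            word: (1 + math.log2(count)) * math.log2(n_docs / df[word])
            for word, count in tf.items()
        })
    return weights


if __name__ == "__main__":
    corpus = [
        ["gene", "gene", "kinase"],
        ["gene", "protein"],
        ["protein", "protein", "protein", "protein"],
        ["cell"],
    ]
    for doc_weights in weighted_counts(corpus):
        print(doc_weights)
```

Under this reading, a word that occurs in every document gets weight zero because log2(N / df) is zero, which matches the stated goal of reducing the impact of common words.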