• CS276B Text Information Retrieval, Mining, and Exploitation
• Bioinformatics Topics
• Text-Enhanced Homology Search (Chang, Raychaudhuri, Altman)
• Sequence Homology Detection
• PSI-BLAST
• PSI-BLAST Problem: Profile Drift
• Addressing Profile Drift
• Modification to PSI-BLAST
• Evaluation
• Results
• Discussion
• Mining Text in Biological Databases
• Where is the Information? What is the Data?
• Genetic Information in GenBank
• Species represented in GENBANK
• Complete Genomes
• Protein Sequences
• Three-Dimensional Structures
• Complete yeast genome (6000 genes) on a chip.
• Online access to DNA chip Data
• A Reaction in EcoCYC
• Signaling Pathways
• Where’s the Information?
• PubMed
• SwissProt
• Abstracts Referenced in SP37
• MESH = Medical Entity Subject Headings
• MESH
• UMLS: Semantic Model of Biomedical Language
• UMLS Elements
• Gene Ontology (http://www.geneontology.org/)
• Molecular Function
• Current Genome Annotations http://www.geneontology.org
• KDD Cup 2002: Information Extraction for Biological Text
• Task Background: Flybase
• FlyBase: Example of Data Curation
• Curators Cannot Keep Up with the Literature!
• Task Rationale and Description
• Some Data (Text) Preparation Challenges
• Some Data (Text) Preparation Challenges (Continued)
• Information Extraction Task
• Task is Harder Than It First Appears
• Training Data in Flybase
• Typical NLP Training Data: More Detailed
• Task Details
• Some Numbers
• Summary
• Curated Databases
• Curated Databases: Uses
• E-Cell (http://e-cell.org/)
• Curated Databases: Uses (cont.)
• Combining Text Mining and Data Mining
• Combining Text and Links
• Clustering: Example (Eisen et al.)
• Combining Gene Expression & Text
• Comments
• Literature as a guide
• Goal of algorithm
• Projections in Linear Discriminant Analysis
• Our approach
• Challenges
• Resources
• Links to Today’s Topics

CS276B
Text Information Retrieval, Mining, and Exploitation
Lecture 16
Bioinformatics II
March 13, 2003
(includes slides borrowed from J. Chang, R. Altman, L. Hirschman, A. Yeh, S. Raychaudhuri)

Bioinformatics Topics
• Last week
  • Basic biology
  • Why text about biology is special
  • Text mining case studies
  • Microarray analysis, abbreviation mining
• Today
  • Combined text mining and data mining I
  • Text-enhanced homology search
  • Text mining in biological databases
  • KDD Cup: information extraction for bio-journals
  • Combining text mining and data mining II

Text-Enhanced Homology Search
(Chang, Raychaudhuri, Altman)

Sequence Homology Detection
• Obtaining sequence information is easy; characterizing sequences is hard.
• Organisms share a common basis of genes and pathways.
• Information can be predicted for a novel sequence based on sequence similarity:
  • Function
  • Cellular role
  • Structure

PSI-BLAST
• Used to detect protein sequence homology. (Iterated version of the universally used BLAST program.)
• Searches a database for sequences with high sequence similarity to a query sequence.
• Creates a profile from similar sequences and iterates the search to improve sensitivity.

PSI-BLAST Problem: Profile Drift
• At each iteration, the search could find non-homologous (false positive) proteins.
• False positives create a poor profile, leading to more false positives.

Addressing Profile Drift
• PROBLEM: Sequence similarity is only one indicator of homology.
• More clues, e.g. protein functional role, exist in the literature.
• SOLUTION: We incorporate MEDLINE text into PSI-BLAST.

Modification to PSI-BLAST
• Before including a sequence, measure the similarity of its literature.
• Throw away sequences with the least similar literature to avoid drift.
• Literature for each sequence is obtained by following SWISS-PROT gene annotations to MEDLINE (text, keywords).
• Define domain-specific “stop” words: words that occur for fewer than 3 sequences or more than 85,000 sequences. This covers 80,479 out of 147,639 words.
• Use a similarity metric between literatures (for genes) based on the cosine of their word vectors.

Evaluation
• Created families of homologous proteins based on SCOP (gold-standard site for homologous proteins: http://scop.berkeley.edu/).
• Select one sequence per protein family:
  • Families must have at least five members
  • Associated with at least four references
  • Select the sequence with the worst performance on a non-iterated BLAST search

Evaluation (continued)
• Compared homology search results from the original PSI-BLAST and our modified PSI-BLAST.
• Dropped the 5%, 10%, or 20% of genes with the least similar literature during PSI-BLAST iterations.

Results
• 46/54 families had identical performance.
• 2 families suffered from PSI-BLAST drift, which was avoided with text-PSI-BLAST.
• 3 families did not converge for PSI-BLAST, but converged well with text-PSI-BLAST.
• 2 families converged for both, with slightly better performance by regular PSI-BLAST.

Discussion
• Profile drift is rare in this test set and can sometimes be alleviated when it occurs.
• Overall PSI-BLAST precision can be increased using text information.

Mining Text in Biological Databases

Where is the Information? What is the Data?
• GenBank: genetic sequences
• SWISS-PROT: protein sequences
• DNA chips / microarrays
• Metabolic pathways
• Signaling pathways / regulatory networks
• MEDLINE: biomedical literature
• Taxonomies / ontologies

Genetic Information in GenBank
• Numbers are for all species.
• Biology is fundamentally an information science.

Species represented in GENBANK

Entries    Bases         Species
4323294    7028540140    Homo sapiens
2595599    1385749133    Mus musculus
166778     488340565     Drosophila melanogaster
182124     247830592     Arabidopsis thaliana
114669     203787073     Caenorhabditis elegans
189000     165542107     Tetraodon nigroviridis
159412     136005048     Oryza sativa
219183     107771966     Rattus norvegicus
166688     75404535      Bos taurus
155647     68679866      Glycine max
109941     56390403      Lycopersicon esculentum
70448      51527034      Hordeum vulgare
104773     51202716      Medicago truncatula
91352      50512383      Trypanosoma brucei
56416      49410018      Giardia intestinalis
77536      47598841      Strongylocentrotus purpuratus
49939      44524589      Entamoeba histolytica
86706      42479448      Danio rerio
79696      37899117      Zea mays
71318      37381894      Xenopus laevis

Complete Genomes
• Aquifex aeolicus
• Archaeoglobus fulgidus
• Bacillus subtilis
• Borrelia burgdorferi
• Chlamydia trachomatis
• Escherichia coli
• Haemophilus influenzae
• Methanobacterium thermoautotrophicum
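
Illustration for the "Modification to PSI-BLAST" slide: the literature filter described there (stop-word removal, word-vector cosine, dropping the least similar candidates) might be sketched in Python roughly as follows. This is a minimal sketch, not the authors' implementation. All function names are hypothetical, raw word counts are assumed as the vector weights (the slides do not say which weighting was used), and the step that feeds the surviving sequences back into the PSI-BLAST profile is omitted.

```python
import math
from collections import Counter


def doc_frequency_stop_words(literatures, min_df=3, max_df=85000):
    """Return words seen for fewer than min_df or more than max_df sequences.

    `literatures` maps a sequence id to the list of words in its literature.
    The default thresholds are the ones quoted on the slide.
    """
    df = Counter()
    for words in literatures.values():
        df.update(set(words))  # count each word once per sequence
    return {w for w, n in df.items() if n < min_df or n > max_df}


def word_vector(words, stop_words):
    """Raw word counts (an assumed weighting) with stop words removed."""
    return Counter(w for w in words if w not in stop_words)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_least_similar(query_words, candidates, stop_words, drop_fraction=0.10):
    """Keep candidate ids after discarding the least literature-similar ones.

    `candidates` maps a candidate sequence id to the words in its literature.
    The result is ordered from most to least similar to the query.
    """
    q = word_vector(query_words, stop_words)
    ranked = sorted(
        candidates,
        key=lambda cid: cosine(q, word_vector(candidates[cid], stop_words)),
        reverse=True,
    )
    n_drop = int(len(ranked) * drop_fraction)
    return ranked[: len(ranked) - n_drop]


if __name__ == "__main__":
    query = ["kinase", "atp", "binding"]
    hits = {
        "a": ["kinase", "atp"],
        "b": ["kinase", "binding", "atp"],
        "c": ["ribosome", "rna"],
        "d": ["atp", "binding"],
        "e": ["kinase"],
    }
    # With 5 candidates and drop_fraction=0.2, one candidate is discarded.
    print(drop_least_similar(query, hits, stop_words=set(), drop_fraction=0.2))
```

In the toy run, candidate "c" shares no words with the query literature and is the one discarded. The evaluation slides report trying drop fractions of 5%, 10%, and 20%, which would correspond to `drop_fraction` here.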