Stanford CS 276B - Text Information Retrieval, Mining, and Exploitation - D3020386

School: Stanford University
Course: CS 276B - Text Information Retrieval, Mining, and Exploitation
Pages: 82

Slide titles:
CS276B Text Information Retrieval, Mining, and Exploitation
Bioinformatics Topics
Text-Enhanced Homology Search (Chang, Raychaudhuri, Altman)
Sequence Homology Detection
PSI-BLAST
Slide 6
PSI-BLAST Problem: Profile Drift
Addressing Profile Drift
Slide 9
Modification to PSI-BLAST
Evaluation
Slide 12
Slide 13
Results
Slide 15
Discussion
Mining Text in Biological Databases
Where is the Information? What is the Data?
Genetic Information in GenBank
Species represented in GENBANK
Complete Genomes
Slide 22
Protein Sequences
Three-Dimensional Structures
Slide 25
Complete yeast genome (6000 genes) on a chip.
Online access to DNA chip Data
Slide 28
A Reaction in EcoCYC
Slide 30
Slide 31
Slide 32
Signaling Pathways
Slide 34
Where’s the Information?
PubMed
SwissProt
Abstracts Referenced in SP37
Slide 39
MESH = Medical Subject Headings
MESH
UMLS: Semantic Model of Biomedical Language
UMLS Elements
Gene Ontology (http://www.geneontology.org/)
Molecular Function
Current Genome Annotations http://www.geneontology.org
Slide 47
KDD Cup 2002: Information Extraction for Biological Text
Task Background: Flybase
FlyBase: Example of Data Curation
Curators Cannot Keep Up with the Literature!
Task Rationale and Description
Some Data (Text) Preparation Challenges
Slide 54
Some Data (Text) Preparation Challenges (Continued)
Slide 56
Information Extraction Task
Task is Harder Than It First Appears
Training Data in Flybase
Typical NLP Training Data: More Detailed
Task Details
Some Numbers
Slide 63
Summary
Curated Databases
Slide 66
Curated Databases: Uses
E-Cell (http://e-cell.org/)
Curated Databases: Uses (cont.)
Combining Text Mining and Data Mining
Combining Text and Links
Clustering: Example (Eisen et al.)
Combining Gene Expression & Text
Comments
Literature as a guide
Goal of algorithm
Projections in Linear Discriminant Analysis
Our approach
Challenges
Resources
Links to Today’s Topics
Slide 82

CS276B
Text Information Retrieval, Mining, and Exploitation
Lecture 16
Bioinformatics II
March 13, 2003
(includes slides borrowed from J. Chang, R. Altman, L. Hirschman, A. Yeh, S. Raychaudhuri)

Bioinformatics Topics
• Last week
  • Basic biology
  • Why text about biology is special
  • Text mining case studies
    • Microarray analysis, Abbreviation mining
• Today
  • Combined text mining and data mining I
    • Text-enhanced homology search
  • Text mining in biological databases
    • KDD cup: Information extraction for bio-journals
  • Combining text mining and data mining II

Text-Enhanced Homology Search
(Chang, Raychaudhuri, Altman)

Sequence Homology Detection
• Obtaining sequence information is easy; characterizing sequences is hard.
• Organisms share a common basis of genes and pathways.
• Information can be predicted for a novel sequence based on sequence similarity:
  • Function
  • Cellular role
  • Structure

PSI-BLAST
• Used to detect protein sequence homology. (Iterated version of the universally used BLAST program.)
• Searches a database for sequences with high sequence similarity to a query sequence.
• Creates a profile from similar sequences and iterates the search to improve sensitivity.

PSI-BLAST Problem: Profile Drift
• At each iteration, the search could find non-homologous (false positive) proteins.
• False positives create a poor profile, leading to more false positives.

Addressing Profile Drift
• PROBLEM: Sequence similarity is only one indicator of homology.
• More clues, e.g. protein functional role, exist in the literature.
• SOLUTION: we incorporate MEDLINE text into PSI-BLAST.

Modification to PSI-BLAST
• Before including a sequence, measure similarity of literature.
• Throw away sequences with least similar literatures to avoid drift.
• Literature is obtained from SWISS-PROT gene annotations to MEDLINE (text, keywords).
• Define domain-specific “stop” words (< 3 sequences or > 85,000 sequences) = 80,479 out of 147,639.
• Use similarity metric between literatures (for genes) based on word vector cosine.

Evaluation
• Created families of homologous proteins based on SCOP (gold standard site for homologous proteins: http://scop.berkeley.edu/)
• Select one sequence per protein family:
  • Families must have >= five members
  • Associated with at least four references
  • Select sequence with worst performance on a non-iterated BLAST search

Evaluation
• Compared homology search results from original and our modified PSI-BLAST.
• Dropped lowest 5%, 10% and 20% of literature-similar genes during PSI-BLAST iterations

Results
• 46/54 families had identical performance
• 2 families suffered from PSI-BLAST drift, avoided with text-PSI-BLAST.
• 3 families did not converge for PSI-BLAST, but converged well with text-PSI-BLAST
• 2 families converged for both, with slightly better performance by regular PSI-BLAST.

Discussion
• Profile drift is rare in this test set and can sometimes be alleviated when it occurs.
• Overall PSI-BLAST precision can be increased using text information.

Mining Text in Biological Databases

Where is the Information? What is the Data?
• GenBank – genetic sequences
• Swiss-prot – protein sequences
• DNA chips / microarrays
• Metabolic pathways
• Signaling pathways / regulatory networks
• Medline – biomedical literature
• Taxonomies / Ontologies

Genetic Information in GenBank
• Numbers are for all species.
• Biology is fundamentally an information science.

Species represented in GENBANK

Entries    Bases         Species
4323294    7028540140    Homo sapiens
2595599    1385749133    Mus musculus
166778     488340565     Drosophila melanogaster
182124     247830592     Arabidopsis thaliana
114669     203787073     Caenorhabditis elegans
189000     165542107     Tetraodon nigroviridis
159412     136005048     Oryza sativa
219183     107771966     Rattus norvegicus
166688     75404535      Bos taurus
155647     68679866      Glycine max
109941     56390403      Lycopersicon esculentum
70448      51527034      Hordeum vulgare
104773     51202716      Medicago truncatula
91352      50512383      Trypanosoma brucei
56416      49410018      Giardia intestinalis
77536      47598841      Strongylocentrotus purpuratus
49939      44524589      Entamoeba histolytica
86706      42479448      Danio rerio
79696      37899117      Zea mays
71318      37381894      Xenopus laevis

Complete Genomes
• Aquifex aeolicus
• Archaeoglobus fulgidus
• Bacillus subtilis
• Borrelia burgdorferi
• Chlamydia trachomatis
• Escherichia coli
• Haemophilus influenzae
• Methanobacterium thermoautotrophicum
