U of I CS 466 - Biological literature mining

Slide titles: Biological literature mining; Example sentence; Information Retrieval: finding the papers; Ad hoc IR; Ad hoc IR: tricks; Entity recognition (ER); ER goals; ER approaches: rule based; ER approaches: dictionary based; Why ER is difficult; Information Extraction (IE); IE approaches: co-occurrence; IE approaches: NLP; Summary; Automatically Generating Gene Summaries from Biomedical Literature; Outline; Motivation; An Ideal Gene Summary; Problem with Manual Procedure; The solution; System Overview: 2-stage; Keyword Retrieval Module (IR); KR module; Gene SynSet Construction & Keyword Retrieval; Information Extraction Module; IE module; Training Data Generation; Sentence Extraction; Scoring strategies; Summary generation; Experiments; Evaluation; Precision of the top-k sentences; Discussion; Summary example (Abl); Summary example (Camo|Sod); Conclusion and future work; References; Vector Space Model

Biological literature mining
• Information retrieval (IR): retrieve papers relevant to specific keywords
• Entity recognition (ER): identify specific biological entities (e.g., genes) in papers
• Information extraction (IE): enable specific facts to be automatically pulled out of papers

Example sentence
“Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation”
Its context is the cell cycle of the yeast Saccharomyces cerevisiae, and it allows us to demonstrate the powers and pitfalls of current literature-mining approaches.

Information Retrieval: finding the papers
• Aim is to identify text segments pertaining to a particular topic (here, “yeast cell cycle”)
• Topic may be a user-provided query
– ad hoc IR
• Topic may be a set of papers
– text categorization

Ad hoc IR
• PubMed is an example
• Supports “boolean model” as well as “vector model”
• Boolean model: combination of terms using logical operations (OR, AND)
• Vector model: We’ll see more of this later

Ad hoc IR: tricks
• Lessons learned from regular IR are also applicable to biomedical literature
• Removal of “stop words” such as the, it, etc.
• Truncating common word endings such as -ing, -s
• Use of thesaurus to automatically “expand” query
– e.g., “yeast AND cell cycle” => “(yeast OR Saccharomyces cerevisiae) AND cell cycle”

Ad hoc IR
“Even with these improvements, current ad hoc IR systems are not able to retrieve our example sentence when they are given the query ‘yeast cell cycle’. Instead, this could be achieved by realizing that ‘yeast’ is a synonym for S. cerevisiae, that ‘cell cycle’ is a Gene Ontology term, that the word ‘Cdc28’ refers to an S. cerevisiae protein and finally, by looking up the Gene Ontology terms that relate to Cdc28 to connect it to the yeast cell cycle.”

Entity recognition (ER)
• Goal: to identify biological entities (e.g., genes, proteins) in text
• Two sub-goals:
– recognition of the words in text that represent these entities
– unique identification of these entities (the synonym problem)

ER goals
• In our example, Clb2, Cdc28, Cdk1, Swe1, Cdc5 should be recognized as gene or protein names
• Additionally, they should be identified by their respective “Saccharomyces Genome Database” accession numbers
• Perhaps the most difficult task in biomedical text mining

ER approaches: rule based
• Manually built rules that look for typical features of names, e.g., names followed by numbers, the ending “-ase”, occurrences of the words “gene”, “receptor”, etc. in proximity
• Automatically built rules using machine learning techniques

ER approaches: dictionary based
• Comprehensive list of gene names and their synonyms
• Matching algorithms that allow variations in those names, e.g., ‘CDC28’, ‘Cdc28’, ‘Cdc28p’ or ‘cdc-28’
• Advantage: they can also associate the recognized entity with its unique identifier

Why ER is difficult
• Each gene has several names and abbreviations, e.g., ‘Cdc28’ is also called ‘Cyclin-dependent kinase 1’ or ‘Cdk1’
• Gene names may also be
– common English names, e.g., hairy
– biological terms, e.g., SDS
– names of other genes, e.g., ‘Cdc2’ refers to two different genes in budding yeast and in fission yeast

Information Extraction (IE)
• IR extracts texts on particular topics
• IE extracts facts about relationships between biological entities
• e.g., deduce that
– Cdc28 binds Clb2
– Swe1 is phosphorylated by the Cdc28–Clb2 complex
– Cdc5 is involved in Swe1 phosphorylation

IE approaches: co-occurrence
• Identify entities that co-occur in a sentence, abstract, etc.
• Two co-occurring entities may be unrelated, but if they co-occur repeatedly, they are likely related. Therefore, some statistical analysis is used
• Finds related entities but not necessarily the type of relationship

IE approaches: NLP
• Natural Language Processing (NLP)
• Tokenize text and identify word and sentence boundaries
• Part-of-speech tag (e.g., noun/verb) for each word
• Syntax tree for each sentence, delineating noun phrases and their interrelationships
• ER used to assign semantic tags for biological entities (e.g., gene/protein names)
• Rules applied to syntax tree and semantic labels to extract relationships between entities

Summary
• Information retrieval: getting the texts
• Entity recognition: identifying genes, proteins, etc.
• Information extraction: recovering reported relationships between entities

Automatically Generating Gene Summaries from Biomedical Literature
(Ling et al. PSB 2006)
CS 466

Outline
• Introduction
– Motivation
• System
– Keyword Retrieval Module
– Information Extraction Module
• Experiments and Evaluations
• Conclusion and Future Work

Motivation
• Finding all the information we know about a gene from the literature is a critical task in biology research
• Reading all the relevant articles about a gene is time consuming
• A summary of what we know about a gene would help biologists to access the already-discovered knowledge

An Ideal Gene Summary
• http://flybase.org/reports/FBgn0000017.html
GP EL SI GI MP WFPI
Above summary is from ca. 2006

Problem with Manual Procedure
• Labor-intensive
• Hard to keep updated with the rapid growth of the literature information
How can we generate such summaries automatically?

The solution
• Structured summary on 6 aspects
1. Gene products (GP)
2. Expression location (EL)
3. Sequence information (SI)
4. Wild-type function and phenotypic information (WFPI)
5. Mutant phenotype
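
The tricks on the "Ad hoc IR: tricks" slide (stop-word removal, truncating endings, thesaurus expansion) combined with the Boolean model can be sketched roughly as follows. This is a minimal illustration, not how PubMed works: the stop-word list, the one-entry thesaurus, and the function names `stem`, `normalize`, `expand`, and `matches` are all assumptions made for the example.

```python
import re

# Toy resources, assumed for illustration; a real system would use much
# larger stop-word lists and a curated thesaurus.
STOP_WORDS = {"the", "it", "a", "an", "of", "is", "in", "this"}
THESAURUS = {"yeast": ["Saccharomyces cerevisiae"]}


def stem(word):
    """Crudely truncate the common endings -ing and -s."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Lowercase, drop stop words, stem, and pad with spaces for phrase matching."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    stems = [stem(w) for w in words if w not in STOP_WORDS]
    return " " + " ".join(stems) + " "


def expand(terms):
    """Replace each query term by an OR-group: the term plus its synonyms."""
    return [[term] + THESAURUS.get(term, []) for term in terms]


def matches(document, query):
    """Boolean model: AND across groups, OR within each group."""
    text = normalize(document)
    return all(any(normalize(alt) in text for alt in group) for group in query)


# "yeast AND cell cycle" becomes
# "(yeast OR Saccharomyces cerevisiae) AND cell cycle".
query = expand(["yeast", "cell cycle"])
```

Consistent with the quoted passage, the example sentence itself still does not match the expanded query, because it mentions neither yeast nor the cell cycle; connecting ‘Cdc28’ to the yeast cell cycle needs entity recognition and Gene Ontology lookups on top of this.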

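The dictionary-based ER approach (a list of names and synonyms, plus matching that tolerates variants such as ‘CDC28’, ‘Cdc28p’ or ‘cdc-28’) can be sketched as below. The identifiers are placeholders, not real Saccharomyces Genome Database accession numbers, and the normalization rule (ignore case and punctuation, drop a trailing "p" after a digit) is one assumed way of allowing name variations.

```python
import re

# Hypothetical dictionary: placeholder identifiers (NOT real SGD accession
# numbers) mapped to names and synonyms taken from the slides.
GENE_DICTIONARY = {
    "ID-CDC28": ["Cdc28", "Cdk1", "Cyclin-dependent kinase 1"],
    "ID-CLB2": ["Clb2"],
    "ID-SWE1": ["Swe1"],
    "ID-CDC5": ["Cdc5"],
}


def normalize_name(name):
    """Ignore case and punctuation, and drop a protein suffix 'p' after a digit."""
    key = re.sub(r"[^a-z0-9]", "", name.lower())
    return re.sub(r"(?<=\d)p$", "", key)


# Normalized synonym -> identifier, built once from the dictionary.
LOOKUP = {
    normalize_name(synonym): gene_id
    for gene_id, synonyms in GENE_DICTIONARY.items()
    for synonym in synonyms
}


def recognize(text, max_tokens=4):
    """Return (surface form, identifier) pairs, preferring the longest match."""
    tokens = list(re.finditer(r"[A-Za-z0-9]+", text))
    hits = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_tokens, len(tokens) - i), 0, -1):
            surface = text[tokens[i].start():tokens[i + n - 1].end()]
            gene_id = LOOKUP.get(normalize_name(surface))
            if gene_id is not None:
                hits.append((surface, gene_id))
                i += n
                break
        else:
            i += 1  # no dictionary entry starts at this token
    return hits
```

Because every hit carries an identifier, ‘Cdc28’ and ‘Cdk1’ in the example sentence resolve to the same entry, which is the advantage the slide attributes to dictionaries. The sketch does nothing about the ambiguity problems listed under "Why ER is difficult" (common English words, names shared across species).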

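The co-occurrence approach to IE counts how often two entities appear in the same sentence and applies a statistic to decide whether the pairing is more frequent than chance. The slides do not say which statistic is used; the sketch below assumes pointwise mutual information (PMI) as one common choice, and the function name `cooccurrence_scores` and the toy input are made up for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_scores(sentences):
    """Score entity pairs by sentence-level co-occurrence.

    `sentences` is a list of sets, each holding the entity names that an ER
    step found in one sentence. Returns {(a, b): (count, pmi)}.
    """
    n = len(sentences)
    single = Counter()
    pair = Counter()
    for entities in sentences:
        single.update(entities)
        pair.update(combinations(sorted(entities), 2))

    scores = {}
    for (a, b), count in pair.items():
        # PMI = log2( P(a, b) / (P(a) * P(b)) ), with probabilities
        # estimated as fractions of sentences.
        pmi = math.log2((count / n) / ((single[a] / n) * (single[b] / n)))
        scores[(a, b)] = (count, pmi)
    return scores


# Toy input (invented for the example, not real literature counts).
toy_sentences = [
    {"Cdc28", "Clb2"},
    {"Cdc28", "Clb2", "Swe1"},
    {"Cdc28", "Clb2"},
    {"Swe1", "Cdc5"},
    {"Cdc5"},
]
toy_scores = cooccurrence_scores(toy_sentences)
```

On this toy input the pair (Cdc28, Clb2) co-occurs three times and gets a positive PMI, while (Cdc28, Swe1) co-occurs once and gets a negative one. As the slide notes, a high score only suggests that two entities are related; it says nothing about whether the relation is binding, phosphorylation, or something else.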