Stanford CS 224 - Lecture Notes

Text Classification and Information Extraction from Abstracts of Randomized Clinical Trials: One step closer to personalized semantic medical evidence search

Rong Xu
Yael Garten
Stanford Biomedical Informatics Training Program
Final Project: CS224N
Spring, 2006

Table of Contents

Abstract
Introduction
Methods
    Step 1: Classification of abstract into 5 sections
    Step 2: Classification of sentences into population-related vs. other
    Step 3: Information Extraction from population-related sentences
        Type 1: Extracting total number of participants
        Type 2: Extracting participant descriptors
        Type 3: Extracting disease
Results and Discussion
    Step 1: Classification of abstract into 5 sections
    Step 2: Classification of sentences into population-related vs. other
    Step 3: Information Extraction from population-related sentences
        Type 1: Extracting total number of participants
        Type 2: Extracting participant descriptors
        Type 3: Extracting disease
Conclusions
Future Work
References
Appendix A – Example abstracts (structured and unstructured)
Appendix B – Grammar
Appendix C – Closed sets
Appendix D – Table of semantic types used by MetaMap
Appendix E – Example of abstract with “Participants” section

ABSTRACT

Patients desire medical information, such as the efficacy of various treatments or the side effects of drugs administered in clinical trials, that is specifically relevant to their demographic group and medical condition. Current search methods are usually keyword-based and, as such, do not incorporate semantics into the search algorithm.
The results are either a wealth of information containing a high proportion of irrelevant hits or, at the other extreme, no results at all, because the specific query words were not found even though their semantic equivalents appear quite often in the database. Thus, the solution is to perform semantic search. To do this, information extraction must be performed beforehand.

The current study extracts the desired information from randomized clinical trial (RCT) papers. RCTs are one of the most important sources of scientific evidence on the safety and effectiveness of health interventions. Patients and the general public need such information to help make treatment decisions. Ideally, a person will be able to enter a query about the disease and population groups of interest and receive synthesized information from all RCT papers that describe a study performed on a similar population or disease. The current work takes a step in that direction.

We developed a system for extracting specific fields of interest from abstracts of RCT papers: patient demographic descriptors (“males”, “elderly”), medical conditions and descriptors (“multiple sclerosis”, “severe diabetes”), and the total number of patients in the study (which is indicative of the quality of the study). To perform information extraction, we divided our task into three parts. First, we separated unstructured abstracts into five sections (Background, Objective, Methods, Results, Conclusions) using text classification and a Hidden Markov Model, achieving an average accuracy of about 94%. Secondly, we classified the sentences in the Methods sections into two classes, those that are trial-participant-related and those that are not, and achieved an overall performance of 91%.
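
To illustrate the first step, the following is a minimal sketch of how a Hidden Markov Model could combine per-sentence classifier scores with the usual order of sections in an abstract. It is not the authors' implementation: the function names, the toy probabilities, and the left-to-right transition structure (stay in a section or move to the next one) are all assumptions made for illustration.

```python
import math

SECTIONS = ["Background", "Objective", "Methods", "Results", "Conclusions"]


def left_to_right_transitions(stay=0.6):
    """Hypothetical transition table: stay in a section or advance to the next."""
    trans = {(p, s): 0.0 for p in SECTIONS for s in SECTIONS}
    for i, p in enumerate(SECTIONS):
        if i + 1 < len(SECTIONS):
            trans[(p, p)] = stay
            trans[(p, SECTIONS[i + 1])] = 1.0 - stay
        else:
            trans[(p, p)] = 1.0  # the last section can only repeat
    return trans


def viterbi(emission_probs, transition_probs, start_probs):
    """Return the most likely section sequence for a list of sentences.

    emission_probs: one dict per sentence, section -> classifier score
    transition_probs: dict (previous section, section) -> probability
    start_probs: dict section -> probability of opening the abstract
    """
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[i][s]: log-score of the best path ending in section s at sentence i
    best = [{s: log(start_probs[s]) + log(emission_probs[0][s]) for s in SECTIONS}]
    back = []
    for probs in emission_probs[1:]:
        scores, pointers = {}, {}
        for s in SECTIONS:
            prev = max(SECTIONS,
                       key=lambda p: best[-1][p] + log(transition_probs[(p, s)]))
            scores[s] = (best[-1][prev] + log(transition_probs[(prev, s)])
                         + log(probs[s]))
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)

    # Trace back from the best final section.
    state = max(SECTIONS, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def peaked(section, p=0.6):
    """Toy classifier output: probability p on one section, the rest uniform."""
    rest = (1.0 - p) / (len(SECTIONS) - 1)
    return {s: (p if s == section else rest) for s in SECTIONS}


# Toy scores for a five-sentence abstract. The third sentence is ambiguous:
# taken alone, its scores slightly favor Background over Methods.
emissions = [
    peaked("Background"),
    peaked("Objective"),
    {"Background": 0.40, "Objective": 0.10, "Methods": 0.35,
     "Results": 0.10, "Conclusions": 0.05},
    peaked("Results"),
    peaked("Conclusions"),
]
start = {"Background": 0.8, "Objective": 0.2, "Methods": 0.0,
         "Results": 0.0, "Conclusions": 0.0}

greedy = [max(SECTIONS, key=lambda s: e[s]) for e in emissions]
decoded = viterbi(emissions, left_to_right_transitions(), start)
print(greedy)
print(decoded)
```

In this toy input, labeling each sentence independently assigns the third sentence to Background, whereas decoding with the order constraint assigns it to Methods, because an abstract that has reached Objective cannot return to Background under the assumed transitions.
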
Thirdly, we extracted specific types of information from participant-related sentences: total number of participants, demographic information related to the participants, and medical information (disease, symptoms). Accuracy is 92.5% for extraction of the number of participants and 82.5% for the demographic information.

INTRODUCTION

Today, patients seeking medical information relevant to a particular medical condition and demographic group often turn to Google and PubMed, two keyword-based search engines. For example, when an elderly Caucasian diabetic male uses the search query “diabetes Caucasian 70 year old male”, he receives 166,000 search hits in Google and zero in PubMed. The Google results may include pages upon pages of hits that are completely irrelevant. For example, one of the top-scoring hits is a document that discusses a 30-year-old African American female with diabetes whose 70-year-old father has Parkinson’s disease. At the other extreme, PubMed is so specific a search engine that it does not map “70 year old” to “elderly” and thus retrieves zero hits (whereas a search on “diabetes Caucasian elderly male” does retrieve 25 hits).

There is a very real and urgent need for the development of an authoritative, personalized, semantic medical-evidence search engine. This study takes a step in that direction.

PubMed, run by the National Library of Medicine, is the most reliable source of medical information today, and within it lies a subset of 204,000 papers called Randomized Clinical Trial papers, or RCTs, which provide reliable medical evidence. RCT papers usually report on the results of a treatment or intervention that was carried out on a specific small group of participants, usually as treatment for a particular disease.

Authors of RCT papers usually include a few sentences on each of five general topics in the abstract of the paper: Background, Objective, Methods, Results,

