Stanford CS 224 - Selecting a Biomedical Funding Source

Selecting a Biomedical Funding Source Based on Grant Abstracts
Yi Liu, Ahmet Altay
CS224N Final Project, March 9, 2011

Cover 1. Cloud of Biomedical Grants' Eigen-words. Subject to stemming, the words above are sampled from those corresponding to extreme components of the 100 eigenvectors associated with the top singular values of our TF-IDF matrix. To a large extent, they determine where a biomedical grant application should go for funding. We used Wordle to generate this figure [6].

Introduction

In biomedical research there are many sources of federal funding, and for a given research idea it is particularly important to choose the right institution to apply to. Many biomedical funding agencies have missions aimed at similar problems, so the ability to suggest a funding agency for a given abstract has practical merit. Using data from past funding records, we demonstrate in leave-one-out cross-validation that term-frequency-inverse-document-frequency (TF-IDF) and singular value decomposition (SVD), combined with k-nearest-neighbors (KNN), provide excellent performance in correctly identifying the funding source given the contents of an abstract. These results are consistent with previous attempts to suggest journals for a given abstract [4].

Data Cleaning

We acquired biomedical grant summaries from over 30 institutions between 1972 and 2009. Most of these grants were biomedical in nature and came from an institution under the National Institutes of Health umbrella. Unfortunately, not all our raw data were meaningful: there were many cases of "null", "data not available", "<br/>", "Abstract/Summary", and so forth, and some descriptions were excessively brief. Ultimately, we decided to restrict our attention to summaries containing at least 500 characters.
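The filtering step described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the placeholder strings are the ones the text mentions, and the helper name `is_meaningful` is our own.

```python
# Hypothetical sketch of the summary-filtering step: drop placeholder
# values and anything shorter than 500 characters.
PLACEHOLDERS = {"null", "data not available", "abstract/summary"}

def is_meaningful(summary: str, min_chars: int = 500) -> bool:
    """Keep only summaries that are real text and at least min_chars long."""
    text = summary.strip()
    if text.lower() in PLACEHOLDERS:
        return False
    return len(text) >= min_chars

# Usage: keep only the summaries that pass the filter.
summaries = ["null", "x" * 600, "data not available"]
kept = [s for s in summaries if is_meaningful(s)]
```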
Another issue was that people would frequently reuse the same grant text from year to year as grants were renewed; the only differences, if any, between these texts tended to be numerical (e.g. grant ids). To prevent double counting, we removed duplicates of a grant whenever there were no alphabetic differences between its copies. Finally, some of the grants were littered with HTML tags, typos, and fused words (e.g. "thebindingsiteof"); we suspect that the dataset went through multiple iterations of manual transcription. To get around these free-text issues, we ultimately decided to favor a bag-of-words approach using a pre-determined dictionary, rather than the set of all tokens ever seen. For our analysis, we only considered institutions for which at least 5000 grants were available, and we randomly sampled 5000 grants from each of the 19 institutions satisfying this criterion. The list of institutions is given in Table 1.
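The deduplication rule above, comparing grants only on their alphabetic content so that renewals differing merely in grant ids or years collapse to one copy, can be sketched like this (an illustrative reconstruction; `alpha_key` and `dedupe` are our own names, not the authors'):

```python
import re

def alpha_key(text: str) -> str:
    """Reduce a grant summary to its alphabetic characters only, so that
    copies differing merely in numbers (grant ids, years) compare equal."""
    return re.sub(r"[^a-z]", "", text.lower())

def dedupe(grants):
    """Keep the first copy of each alphabetic-equivalence class."""
    seen, unique = set(), []
    for g in grants:
        k = alpha_key(g)
        if k not in seen:
            seen.add(k)
            unique.append(g)
    return unique

# Two renewals of the same grant differ only in the grant id, so one survives.
grants = ["Grant R01-1234: study of X", "Grant R01-5678: study of X"]
```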
Table 1. List of Institutions

NIDCD   National Institute on Deafness and Other Communication Disorders
NIGMS   National Institute of General Medical Sciences
NEI     National Eye Institute
NIAID   National Institute of Allergy and Infectious Diseases
NIAAA   National Institute on Alcohol Abuse and Alcoholism
NIDDK   National Institute of Diabetes and Digestive and Kidney Diseases
NIDA    National Institute on Drug Abuse
NHLBI   National Heart, Lung, and Blood Institute
NICHD   National Institute of Child Health and Human Development
NSF     National Science Foundation
NINDS   National Institute of Neurological Disorders and Stroke
NCRR    National Center for Research Resources
NIMH    National Institute of Mental Health
NIEHS   National Institute of Environmental Health Sciences
NIA     National Institute on Aging
NCI     National Cancer Institute
NIDCR   National Institute of Dental and Craniofacial Research
DOD     Department of Defense
NIAMS   National Institute of Arthritis and Musculoskeletal and Skin Diseases

Dictionary Generation

In part because of the typos and fused words in our data, we decided to use a pre-determined external dictionary. We chose the Carnegie Mellon University Pronouncing Dictionary, which contains 133286 words [1]. Amongst these, 9680 entries exist only to address pronunciation-specific concerns, so we ignored them. We also removed 319 stop words taken from a blog post by Armand Brahaj [2]. Finally, we used the Porter stemming algorithm, in a Python implementation by Michael Dirolf [3], to further collapse our dictionary. Stemming condenses multiple conjugations, tenses, and plural forms into a single canonical form by following general rules of English. When applying our dictionary in a bag-of-words approach, we treated all occurrences of members of the same stemming equivalence class as the same word; words in the same stemming equivalence class as a stop word were also ignored.
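The idea of collapsing the dictionary into stemming equivalence classes and dropping classes that contain a stop word can be sketched as below. Note this uses a deliberately oversimplified suffix stripper standing in for the Porter stemmer the authors actually used, and the function names are our own; it only illustrates the equivalence-class bookkeeping.

```python
def simple_stem(word: str) -> str:
    """Toy suffix stripper (NOT the Porter algorithm): strips a few common
    endings so that conjugations and plurals share one canonical form."""
    for suffix in ("ing", "ies", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Stem classes of stop words are dropped entirely (tiny illustrative list).
STOP_STEMS = {simple_stem(w) for w in ("the", "and", "was")}

def collapse(dictionary):
    """Map each stem to the set of spellings in its equivalence class,
    skipping any class that coincides with a stop word's stem."""
    classes = {}
    for word in dictionary:
        stem = simple_stem(word)
        if stem in STOP_STEMS:
            continue
        classes.setdefault(stem, set()).add(word)
    return classes
```

Under this scheme "bind", "binds", and "binding" all count as occurrences of the single stem "bind" in the bag-of-words representation.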
After removing stop words and collapsing by stemming, we arrived at a dictionary of 83435 stems, covering 120636 distinct accepted spellings.

Figure 1a. Histogram of stems per abstract. Figure 1b. Histogram of the number of appearances per unique stem.

Data Analysis Pipeline

We used a variant of grep written by Dai et al. at the University of Michigan [5] to annotate our documents with all occurrences of our dictionary words in our sample of grants. The data then became a sparse matrix in Matlab, from which we also generated a TF-IDF matrix of the same size using the raw term-document frequency matrix. Because of the sheer volume of data, our sparse matrix turned out not to be sparse enough for feasible manipulation, so we resorted to SVD. The singular values dropped off very sharply, as shown in Figure 2; based on these values, we decided to use the first 100 eigenvectors as a simpler representation of the TF-IDF matrix.

Figure 2. Distribution of Singular Values

Analysis of Eigenvectors

For the first 9 eigenvectors, we show a breakdown of the components with the highest absolute values. Because the vast majority of the components are close to 0, we ignored values within [-0.02, 0.02] in Figure 3. The first eigenvector consists exclusively of non-negative components.

Figure 3. The First 9 Eigenvectors

Contributions of the top words fall off quite quickly. This pattern applies to the rest of the eigenvectors as well, not only those shown in the figure. We also considered the words associated with the top components of the top eigenvectors; these words have the biggest impact on our classification system. Table 2 shows the top words from