Stanford CS 224 - Study Notes - D2812316

School name Stanford University

Course Cs 224- N Natural Language Processing with Deep Learning

We examined the results by hand for several of the BLOCKS. For BLOCK IPB001636E the corresponding InterPro entry (http://www.ebi.ac.uk/interpro/IEntry?ac=IPR001636) is shown below:
SAICAR synthetase
Histone-like bacterial DNA-binding protein

Annotating Protein Clusters and Motifs
Serge Saxonov and Iwei Yeh
CS224N Final Project, Spring 2002

Abstract:
Proteins can be clustered into groups based on many biological metrics, such as sequence similarity or expression profiles. One question that arises after clustering is: “What are the distinguishing characteristics of the cluster?” This question is often answered by going to the primary biomedical literature and doing some research. We wanted to see if we could come up with key words and phrases to differentiate clusters using NLP and the primary biomedical literature.

Introduction:
Given the explosion of biological data in recent years, it’s not surprising that several groups have tried NLP approaches for automated extraction and organization of biological knowledge. Among these were methods for extracting protein-protein interactions, associating genes with controlled-vocabulary terms and assigning sub-cellular localization properties to gene products.

In this project we confront the problem of assigning annotation to protein sequence motifs and, more generally, clusters of genes. Protein motifs are short stretches of amino acids whose sequence is conserved across families of proteins and which have conserved structure and function. Much work has been done on automatic detection of protein motifs through multiple sequence alignment. Once a protein motif is identified, people research the function and biological significance of the motif, often by sifting through the primary literature.

One source of automatically detected sequence motifs is BLOCKS. BLOCKS is a database of ungapped multiple sequence alignments over highly conserved regions (http://www.blocks.fhcrc.org/) (Henikoff, Henikoff et al.
1995). The proteins in each block are associated with a particular InterPro family. The multiple alignments in a block may then be clustered, providing a further subgrouping of the block. Most proteins in BLOCKS are annotated in SWISS-PROT (http://us.expasy.org/sprot/), an annotated sequence database (Bairoch and Apweiler 1997). The annotation usually includes keywords from a controlled vocabulary for each gene and the PubMed ids of the primary literature from which the annotation has been abstracted.

The PubMed ids are unique identifiers for citations and abstracts in MEDLINE, a database of biomedical references and abstracts maintained by the National Library of Medicine. These abstracts are potentially a rich source of biological information about the proteins grouped into BLOCKS.

Our goal is to pull out meaningful keywords and phrases for a BLOCK to assist in annotation of the protein motif. We decided to take a statistical approach to finding keywords. This approach has been used to automatically annotate protein function from MEDLINE abstracts (Andrade and Valencia 1998). Here we do not limit our scope to protein function, but try to capture any relevant words about the group of genes.

Methods:
First, given a list of BLOCK ids, we extracted all the SWISS-PROT ids associated with each BLOCK and partitioned the SWISS-PROT ids into subgroups, based on the clustering within the BLOCK.

We looked through all SWISS-PROT entries (>100,000 proteins) and pulled out all the PubMed ids that were associated with annotations (74,528 ids). We then retrieved the abstracts from PubMed in HTML format.

We preprocessed the retrieved pages by extracting the abstract portion as plain text. Next, we made sentence calls by looking for a period, question mark or exclamation point followed by a digit or capital letter. We replaced numbers and percentages with a reserved word for each. We also removed some stopwords. All periods, commas, colons and semicolons between words were removed.
We also removed parentheses around words or phrases. It should be noted that, due to the nature of our domain, many valid and useful words contained punctuation such as dashes, parentheses and periods. Therefore we decided against a simple deletion of all punctuation.

Once the abstracts were cleaned up, we were able to obtain a vocabulary for our corpus and calculate the corresponding word and bigram frequencies (within the same sentence). These frequencies followed Zipf’s law.

One of our concerns has been the sparsity of the domain. For that reason we examined the effects of stemming on the results. We used the vocabulary (constructed above) as input to the Porter Stemmer (Porter 1980), a lexicon-free stemmer based on simple cascaded rules. This is a small and fast algorithm, but it does make mistakes of both omission and commission. The frequencies of the words mapped to each stem followed a Zipfian distribution (Figure 1). We examined the outcome of the stemming for domain-specific words (Table 1).

Figure 1.

Stem               Word
carboxyl           Carboxyl
Carboxyl           Carboxylate
Carboxyl           Carboxylated
Carboxyl           Carboxylates
Carboxyl           Carboxylation
Carboxyl           Carboxylic
carboxyl-termin    carboxyl-terminal
carboxyl-termin    carboxyl-terminals
Carboxylas         Carboxylase
Carboxylas         Carboxylases
carboxylesteras    carboxylesterase
carboxylesteras    carboxylesterases
Ionic              Ionic
Ionic              Ionically
Ionis              Ionisable
Ionis              Ionization
Ionis              Ionizing
Ioniz              Ionizable
Ioniz              Ionization
Ioniz              Ionize
Ioniz              Ionized
Ioniz              Ionizes
Ioniz              Ionizing

Table 1.

What one can take from this table, or from similar ones we have looked at, is that stemming can create both desirable mappings (the case of carboxyl) and undesirable ones (the case of ionic, where the distinction between ionized and ionizing is quite important).

We employed our annotation extraction routines for two purposes. One was to annotate blocks relative to the background distribution of literature referenced in SWISS-PROT. The other was to annotate sub-blocks within blocks.
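
The cleanup and counting steps described in Methods could be sketched roughly as follows. This is a minimal illustration in Python, not the authors' code. The reserved words `_NUMBER_` and `_PERCENT_`, the stopword list, all function names, and the requirement of whitespace between the sentence-final mark and the following digit or capital letter are assumptions made for the example, since the report does not specify them.

```python
import re
from collections import Counter

# Hypothetical reserved words and stopword list (not taken from the report).
NUMBER, PERCENT = "_NUMBER_", "_PERCENT_"
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "was", "were"}
EDGE_PUNCT = ".,:;"  # removed only at token edges, i.e. between words


def split_sentences(text):
    """Break after '.', '?' or '!' when the next token starts with a digit or capital."""
    parts = re.split(r"(?<=[.?!])\s+(?=[0-9A-Z])", text)
    return [p.strip() for p in parts if p.strip()]


def strip_wrapping_parens(token):
    """Remove parentheses that enclose a word or phrase, keeping internal ones."""
    if token.startswith("(") and ")" not in token:
        return token[1:]      # opens a phrase, e.g. "(EC"
    if token.endswith(")") and "(" not in token:
        return token[:-1]     # closes a phrase, e.g. "6.3.2.6)"
    if (token.startswith("(") and token.endswith(")")
            and token.count("(") == 1 and token.count(")") == 1):
        return token[1:-1]    # a single wrapped word, e.g. "(ATP)"
    return token              # e.g. "Fe(II)" is left alone


def normalize(sentence):
    """Return the cleaned tokens of one sentence."""
    tokens = []
    for raw in sentence.split():
        tok = strip_wrapping_parens(raw.strip(EDGE_PUNCT)).strip(EDGE_PUNCT)
        if not tok or tok.lower() in STOPWORDS:
            continue
        if re.fullmatch(r"\d+(\.\d+)?%", tok):
            tok = PERCENT
        elif re.fullmatch(r"\d+(\.\d+)?", tok):
            tok = NUMBER
        tokens.append(tok)
    return tokens


def count_ngrams(text):
    """Count words and bigrams; bigrams never cross a sentence boundary."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in split_sentences(text):
        tokens = normalize(sentence)
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams
```

Calling `count_ngrams(abstract_text)` on one abstract returns two `Counter` objects that can be summed over the corpus. Punctuation is stripped only at token edges, so tokens such as `carboxyl-terminal` or an EC number like `6.3.2.6` survive intact, in line with the decision above not to delete all punctuation.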
The success of the first task is easier to measure because most blocks already have some annotation attached to them. The performance of an annotation extractor can be judged to some extent by comparing its results with the preexisting keywords. The second task is more difficult in that the amount of available literature is smaller (we are dealing with sub-blocks, not whole blocks) and the sub-blocks are not annotated. To make our task more manageable at this stage, we picked six random blocks to investigate. The following is
