CS224N final project
Applying NLP techniques to improve Cranfield collection IR precision

Name: Oleg Slezberg
Student ID: 5322713
SUNetID: olegs
Email: [email protected]
Used two late days, submitted on June 3 at 4:30 PM.

Contents
1. Introduction
2. Related work
3. Implementation details
   3.1. POS-based query cleansing
   3.2. Noun phrase query expansion
      3.2.1. Noun phrase extraction
      3.2.2. Query transformation
      3.2.3. Results
   3.3. Thesaurus-based query expansion
4. Future directions
5. Conclusions
6. References
7. Appendix
   7.1. Usage
   7.2. Files
      7.2.1. Source files
      7.2.2. Data files
      7.2.3. Configuration files
      7.2.4. Binary files
      7.2.5. Report

1. Introduction

The Cranfield collection [1] is a well-known IR test collection containing 1,400 aerodynamics documents and 225 common queries. It also includes a set of correct results from human experts, which makes the collection suitable for automatic evaluation of an IR system.

The queries are formulated in plain English. For example, query Q1 is:

What similarity laws must be obeyed when constructing aeroelastic models of heated high speed aircraft?

Remember Q1; we will be using it as an example throughout the document.

Due to the format of the queries and the very specific domain, achieving high precision on the Cranfield collection is not easy. Measured at 10 retrieved documents, Lucene [2] returns only 37.3% out of the box. In fall 2004, CS276A students attempted to tune Lucene's scoring model (TF-IDF) to increase this value [3]. These efforts were largely unsuccessful, resulting in precisions below 45% [4].
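All precision figures in this report are of this kind: precision measured at 10 retrieved documents, averaged over the queries. As a minimal illustrative sketch (function and variable names are my own, not from the project's code), the metric can be computed as:

```python
def precision_at_k(retrieved, relevant, k=10):
    """Fraction of the top-k retrieved document ids that are relevant."""
    top = retrieved[:k]
    return sum(1 for doc in top if doc in relevant) / k

def mean_precision_at_k(runs, k=10):
    """Average precision@k over all queries.

    `runs` maps a query id to a (ranked result list, set of relevant ids) pair.
    """
    return sum(precision_at_k(ranked, rel, k)
               for ranked, rel in runs.values()) / len(runs)
```

For example, a query with 3 relevant documents among its top 10 results contributes 0.3 to the average.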
In addition, [9] used four different methods to increase the precision; however, their best result was only 41.2% (using LSI).

This project tackled the problem from a different angle. Instead of tuning Lucene's scoring model, we improved the precision by leveraging the fact that the Cranfield queries are formulated in natural language. This paved the way for several query transformations based on an NLP understanding of the queries.

The project implemented three query transformations: simple POS-based query cleansing, noun phrase query expansion, and fixed thesaurus-based query expansion (using WordNet [6]). In addition to Lucene and WordNet, the Stanford parser [5] was used in these tasks.

The original project proposal also planned to implement an automatically generated similarity thesaurus [11]. However, after discovering that LSI had already been applied to the Cranfield collection in [9], I decided to implement WordNet and noun phrase expansion instead (as LSI and an automatically generated thesaurus are conceptually very similar).

The rest of the document provides more detail on both the status of research in these areas and the results achieved by this implementation.

2. Related work

Some related work on NLP in IR was done in the last decade, mostly in the context of the TREC [7] competition. The results, however, varied. Noun phrases were successfully used in [12] and [13]; [12] reported a precision improvement of 2.5% from using them. That paper advocates "phrase-based indexing", which is something we get for free from Lucene (once the query is constructed properly).

Even though query expansion is usually effective in improving recall, [14] also showed that similarity-based query expansion can improve precision.
Unfortunately, they do not provide results of applying this technique standalone; their precision figures include other methods as well.

On the other hand, [8] claims that attempts to use NLP in IR have failed and cites numerous failure examples. The presentation, however, concedes that NLP can still be useful for a small domain with short documents, which is exactly what the Cranfield collection is.

Another known attempt at the Cranfield collection is [9]. This work employed four vector-based IR methods: the vector space model (VSM), latent semantic indexing (LSI), the generalized vector space model (GVSM), and approximate dimension equalization (ADE), essentially a combination of the prior three. Their precisions varied from 21.7% (GVSM) to 41.2% (LSI).

3. Implementation details

We started from the CS276A PE1 starter code, which gives a precision of 37.4%. It quickly became clear that all our transformations require a stemmer in order to work properly, so we applied stemming to the baseline code as well, obtaining a baseline precision of 38.7%. This is the baseline we compare ourselves against.

In general, the Cranfield queries return from 159 to 1,356 hits on the baseline. Given that the evaluation was on the top 10 documents, finding the proper top 10 was clearly not a simple job.

Note that we did not reuse any CS276A PE1 TF-IDF optimizations, in order to enable a fair comparison with the former.

All precision numbers in this document refer to the average precision over the 225 Cranfield queries measured at 10 retrieved documents.

3.1. POS-based query cleansing

The default Lucene stop word list contains only 35 terms. This is not nearly enough to filter out noise words in the Cranfield queries. On the other hand, applying the Cornell SMART system stop word file [15] (containing 571 terms) reduced the precision to 38.4%.
This was an unexpected result, as the same stop word file improves precision when applied together with some TF-IDF parameter manipulation.

The compromise idea was to filter out words based on their part of speech. This actually worked well. The best precision was 39.3% (+0.6%), achieved when stripping off the following parts of speech: -LRB-, -RRB-, WDT, WP, DT, TO, CC, '.', SYM, PRP, PRP$, EX, POS. For example, our Q1 becomes:

similarity laws must be obeyed when constructing aeroelastic models of heated high speed aircraft

Other things tried:
1. Removing MD significantly decreases the precision (-0.2%). This is one of the reasons we perform better than the larger stop word list (which contains all the modals).
2. The above POS list is optimal; removing any other parts of speech reduced the precision. So the initial thought that you can throw away adverbs and increase precision was plain wrong; apparently all
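The cleansing step described above amounts to a filter over POS-tagged tokens. The project used the Stanford parser for tagging; as an illustrative sketch (not the project's actual code), assuming a tagger has already produced (token, Penn Treebank tag) pairs, the filter looks like:

```python
# Tags stripped by the best-performing configuration reported above (39.3%).
STRIP_TAGS = {"-LRB-", "-RRB-", "WDT", "WP", "DT", "TO", "CC", ".",
              "SYM", "PRP", "PRP$", "EX", "POS"}

def cleanse(tagged_query):
    """Keep only the tokens whose POS tag is not in STRIP_TAGS."""
    return [token for token, tag in tagged_query if tag not in STRIP_TAGS]

# Q1, hand-tagged here for illustration:
q1 = [("What", "WP"), ("similarity", "NN"), ("laws", "NNS"),
      ("must", "MD"), ("be", "VB"), ("obeyed", "VBN"),
      ("when", "WRB"), ("constructing", "VBG"), ("aeroelastic", "JJ"),
      ("models", "NNS"), ("of", "IN"), ("heated", "JJ"),
      ("high", "JJ"), ("speed", "NN"), ("aircraft", "NN"), ("?", ".")]

# "What" (WP) and "?" (.) are dropped; everything else survives.
```

Note that MD is deliberately absent from STRIP_TAGS, so "must" is kept, consistent with observation 1 above.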

