Unformatted text preview:

Wright State UniversityDepartment of Computer Science and EngineeringCS707 Winter 2011 PrasadAssignment 1 (Due: March 3) (30 pts)In this project you will design and implement your own information retrieval system. The project has twophases. In Phase I, you will build the indexing component, which will take a large collection of text and producea searchable, persistent data structure. In Phase II, you will add the searching component, according to VectorSpace Model.The project may be done individually, or in a group of two. Both members of a group are expected to contributeto all aspects of the project: design, implementation, documentation, and testing.Phase IPhase I of your project will read a set of files, parse them into documents (in case multiple logical documents arein the same physical file) and terms, and produce an inverted index associated data structures. The latter will bestored on disk and used in Phase II.Phase I will have two major components: the lexical analyzer and the inverter . You need to make explicitwhat assumptions you make about the structure and words in the documents. You might choose to have your lexerconfigurable at run-time; the configuration file would then specify how to segment terms, what tag indicates thestart of a new document, how to treat numbers, etc. Pay particular attention to the choice of data structures andalgorithms.Your program will need to save several data structures to disk. At the minimum, these will include the lexicon(the table of words occurring in the text, appropriate metadata/statistics, and pointers into the inverted file), thedocument location list (a table indicating where to find the files on disk at retrieval time), and the inverted fileitself.Note that in order to fully implement the vector space model you may need to retain several pieces of informationabout the documents in the dictionary / document location list / inverted index data structures such as the totalnumber of documents, the maximum term frequency for each document, the length of the weight vector for eachdocument, etc. Try to save values that you anticipate are needed to calculate cosine similarity that can be generatedonly with high computational cost if not present.Phase I resourcesHere are some resources you might find useful in Phase I:Stop ListsThese are some commonly-used stoplists. You may find that you want to edit them somewhat.• SMART’s stop listhttp://www.lextek.com/manuals/onix/stopwords2.html• Stop list from van Rijsbergen’s textbookhttp://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words• Stop list from Frakes and Baeza-Yates’ textbookftp://sunsite.dcc.uchile.cl/pub/users/rbaeza/irbook/StemmerMartin Porter’s web page http://tartarus.org/~martin/PorterStemmer/ contains implementations of thePorter Stemmer in a number of languages.1Milestones for Phase I1. Parse the files containing the documents into individual documents and words.2. Invert the collection in memory.3. Store the inverted file and associated data structures to disk.Benchmarks for Phase I1. Which collection are you using for your project?2. Corpus statistics:• How many documents are there in the collection?• How many words are there in the collection and how many unique words do you index from the collection?• How many postings (inverted file entries) did you create for this collection? What are the lengths of theshortest and longest postings lists? What is the average postings list length? (State your answers interms of number of document entries per posting.)3. Inverting collection:• How much time did it take to index the collection?• How much memory is needed to index the collection?• How much total disk space is required for your index and related data structures? This total shouldinclude everything created in your indexing process that is needed later to answer queries (lexicon,inverted file, document location table, etc).Phase IIFor Phase II, you will write a program which will accept queries from the user and search for documents usingthe data structures produced in Phase I. Implement the inverted search algorithm using the vector space model torank the documents, carefully choosing the weighting function.Your search interface must allow the user to:1. Input a search query. The query should be a free-text or keyword-based query.2. View a list of search results. Each result should display an internal document ID (sequence number), a titleor URL, and possibly a snippet of text from the document.3. Choose a document from the list to view. This should fetch the document from where it is stored and displayit to the user. In other words, implement the basic, single-query-and-result-list search process. Beyond thesecentral requirements, you are free to make design and interface decisions as you see fit.Milestones for Phase II1. Implement the vector space model and inverted file based search algorithm.2. Take a query interactively and display ranked results. Allow the user to select documents to read from thelist.3. Evaulate your search engine in terms of standard metrics such as precision and recall.2Benchmarks for Phase II• Process a user query. Take a (single or multi-word) query interactively from the user, retrieve the top 100documents from the collection, show the top 20 document identifiers to the user with scores, and let the userchoose one to display. Report the amount of time needed for the entire interaction, from after the user firstenters their query to when they can see a full document on the screen.Now repeat this for a number of queries tabulating the queries used and the (wall-clock) time it took tosearch the collection, display the top 20 results, allow the user to select one document, and display theselected document.• Evaluate the search engine. For each test query, provide a table showing precision, recall, F-measure, etc.and/or a variant of precision-recall graph.DeliverablesYou will turn in the following four items for each phase in the form of zip-archives: PhaseI.zip and PhaseII.zip.Design document Describe your application’s architecture and major data structures. Describe how your systemimplements the indexing and search operations (i.e., give pseudo-code for your search algorithm). List anyexternal libraries or resources required by your application.You are encouraged to write the design document prior to writing the code for planning purposes, althoughit will certainly change during the course of the


View Full Document

Wright CS 707 - CS 707 Assignment # 1

Download CS 707 Assignment # 1
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view CS 707 Assignment # 1 and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view CS 707 Assignment # 1 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?