Wright CS 707 - CS 707 Assignment # 1 (3 pages)


CS 707 Assignment # 1





Pages:
3
School:
Wright State University
Course:
CS 707 - Information Retrieval


Wright State University
Department of Computer Science and Engineering
CS707, Winter 2011, Prasad

Assignment 1. Due: March 3. 30 pts.

In this project, you will design and implement your own information retrieval system. The project has two phases. In Phase I, you will build the indexing component, which will take a large collection of text and produce a searchable, persistent data structure. In Phase II, you will add the searching component according to the Vector Space Model.

The project may be done individually or in a group of two. Both members of a group are expected to contribute to all aspects of the project: design, implementation, documentation, and testing.

Phase I

Phase I of your project will read a set of files, parse them into documents (in case multiple logical documents are in the same physical file) and terms, and produce an inverted index and associated data structures. The latter will be stored on disk and used in Phase II. Phase I will have two major components: the lexical analyzer and the inverter.

You need to make explicit what assumptions you make about the structure and words in the documents. You might choose to have your lexer configurable at run time; the configuration file would then specify how to segment terms, what tag indicates the start of a new document, how to treat numbers, etc. Pay particular attention to the choice of data structures and algorithms.

Your program will need to save several data structures to disk. At the minimum, these will include the lexicon (the table of words occurring in the text, along with appropriate metadata, statistics, and pointers into the inverted file), the document location list (a table indicating where to find the files on disk at retrieval time), and the inverted file itself. Note that in order to fully implement the vector space model, you may need to retain several pieces of information about the documents in the dictionary, document location list, and inverted index data structures, such as the total number of documents, the maximum term frequency for each document, the
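As an illustration only, the Phase I pipeline described above (lexical analyzer, then inverter, then saving to disk) might be sketched as follows. Everything specific here is an assumption rather than part of the assignment: the `<DOC>` delimiter, the rule that a term is a lowercase alphanumeric run, the function names, and the single JSON output file. For brevity the lexicon stores only a document frequency per term instead of pointers into a separate inverted file.

```python
import json
import re
from collections import Counter, defaultdict

DOC_TAG = "<DOC>"                    # hypothetical tag marking the start of a document
TOKEN_RE = re.compile(r"[a-z0-9]+")  # assumed term definition: alphanumeric runs


def tokenize(text):
    """Lexical analyzer: lowercase the text and return its terms."""
    return TOKEN_RE.findall(text.lower())


def split_documents(raw):
    """Split one physical file into logical documents at DOC_TAG."""
    return [part.strip() for part in raw.split(DOC_TAG) if part.strip()]


def build_index(files):
    """Inverter. `files` maps a file name to that file's contents.

    Returns (lexicon, postings, doc_table).
    """
    postings = defaultdict(list)  # term -> [(doc_id, term frequency), ...]
    doc_table = []                # doc_id -> where the document lives, plus max tf
    for name in sorted(files):
        for position, body in enumerate(split_documents(files[name])):
            doc_id = len(doc_table)
            counts = Counter(tokenize(body))
            doc_table.append({
                "file": name,
                "position": position,
                "max_tf": max(counts.values(), default=0),
            })
            for term, tf in counts.items():
                postings[term].append((doc_id, tf))
    # Simplified lexicon: document frequency only, no file offsets.
    lexicon = {term: {"df": len(plist)} for term, plist in postings.items()}
    return lexicon, dict(postings), doc_table


def save_index(path, lexicon, postings, doc_table):
    """Persist all structures, plus the document count, as one JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({
            "num_docs": len(doc_table),
            "lexicon": lexicon,
            "postings": postings,
            "documents": doc_table,
        }, fh)
```

Calling `build_index({"a.txt": "<DOC> the cat sat <DOC> the cat and the dog"})` would yield two logical documents from one physical file. A real submission would more likely write the inverted file separately and keep byte offsets in the lexicon, so that Phase II can read one postings list at a time.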


