UMD CMSC 723 - Information Retrieval - D2594950


School: University of Maryland, College Park

Course: CMSC 723 - Computational Linguistics I

Pages: 12


Introduction to Computational Linguistics: Information Retrieval
Christof Monz

What is Information Retrieval?
  Finding relevant information in large collections of data.
  In such a collection you may want to find:
  "Give me information on the history of the Kennedys": an article about the Kennedys (text retrieval)
  "What does a brain tumor look like on a CT scan?": a picture of a brain tumor (image retrieval)
  "It goes like this: hmm hmm hahmmm": a certain song (music retrieval)

Text Retrieval
  Online library catalogs (OPAC)
  Internet search engines such as AltaVista and Google
  Specialized systems (aka vendors):
    MEDLINE (medical articles)
    Lexis-Nexis (legal, business, academic)
    Westlaw (legal articles)
    Dialog (business information)

Retrieval vs. Browsing
  Popular Web directories: Yahoo!, Open Directory Project (dmoz)
  The user has to guess the right directories to find the information.
  The user has to adapt to the designer's conceptualization of the directory.
  The goal of information retrieval is to provide immediate random access to the data.
  The user can specify his information need.

IR vs. Database Querying
  IR is not the same thing as querying a database.
  Database querying assumes that the data is in a standardized format.
  Transforming all information (news articles, web sites) into a database format is difficult, and impossible for large data collections.
  Text retrieval can work with plain, unformatted data.

The Ubiquity of IR
  Information filtering
  E-mail routing
  Text categorization
  Detecting information structure
  Hyperlink generation
  Topic/information detection/screening
  Portal development and maintenance
  Question answering

Relevance as Similarity
  A fundamental idea within IR is: a document is relevant to a query if they are similar.
  Similarity can be defined as:
    string matching/comparison
    similar vocabulary
    same meaning of text

History of IR
  1950: Calvin N. Mooers coins the term "Information Retrieval"
  1959: Luhn describes statistical retrieval
  1960: Maron and Kuhns define a probabilistic model of IR
  1966: Cranfield project defines evaluation measures
  1968: Gerard Salton's first book about the SMART retrieval system
  1972: Lockheed introduces DIALOG as a commercial online service
  Late 1980s: First PC systems incorporate retrieval
  Early 1990s: Cheap disks lead to the information storage revolution
  1992: Westlaw is the first large-scale information service using probabilistic retrieval
  Mid 1990s: Multimedia databases
  1994: The internet and web explosion
  1995: IR techniques are incorporated in all kinds of information management applications

Retrieval Models
  A retrieval model is an idealization or abstraction of an actual retrieval process.
  Conclusions derived from a model depend on whether the model is a good approximation of the retrieval situation.
  Note that a retrieval model is not the same thing as a retrieval implementation.

  [Figure: retrieval model diagram. Labels: user, query formulation, document representations, identify relevant information, display documents to the user]

Components of a Retrieval Model
  The user:
    Search expert (e.g., librarian) vs. non-expert
    Background of the user (knowledge of the topic)
    In-depth searching vs. "just wanna get an idea" searching
  The documents:
    Different languages
    Semi-structured (e.g., HTML or XML) vs. plain

Document Representation
  Meta-descriptions:
    Field information (author, title, date)
    Key words: predefined, or manually extracted by author/editor
  Content: automatically identifying what the document is about

Controlled Vocabularies
  Examples are:
  ACM Computing Classification System. An article on Web search engines would probably be classified as H.3.5, where
    H = Information Systems
    H.3 = Information Storage and Retrieval
    H.3.5 = Online Information Services
  NLM Medical Subject Headings (MeSH)
  Yahoo!

Document Representation (manual vs. automatic, controlled vocabulary vs. free text)
  Manual (controlled vocabulary or free text): current indexing practice
  Automatic, controlled vocabulary: text categorization (intelligent IR)
  Automatic, free text: text search engines (statistical IR)

Manual vs. Automatic Indexing
  Pros of manual indexing:
    Human judgements are most reliable
    Searching controlled vocabularies is more efficient
  Cons of manual indexing:
    Time consuming
    The person using the retrieval system has to be familiar with the classification system
    Classification systems are sometimes incoherent

Automatic Content Representation
  Using natural language understanding?
    Computationally too expensive in real-world settings
    Coverage
    Language dependence
    The resulting representations may be too explicit to deal with the vagueness of a user's information need
  Alternative: a document is simply an unstructured set of the words appearing in it (bag of words).

Example: Bag of Words
  Text:
    Scientists have found compelling new evidence of possible ancient microscopic life on Mars, derived from magnetic crystals in a meteorite that fell to Earth from the red planet, NASA announced on Monday.
  Bag of words (counts above 1 in parentheses):
    a, ancient, announced, compelling, crystals, derived, earth, evidence, fell, found, from (2), have, in, life, magnetic, mars, meteorite, microscopic, monday, nasa, new, of, on (2), planet, possible, red, scientists, that, the, to

Bag-of-Words Approach
  A document is an unordered list of words; grammatical information is lost.
  Tokenization: What is a word? Is "White House" one or two words?
  Case folding: "President Bush" becomes "president bush".
  Stemming or lemmatization: Morphological information is thrown away. "agreements" becomes "agreement" (lemmatization) or even "agree" (stemming).

What is this about?
  added, al, an, and, ballots, been, completed, count, county (2), even, former, gore, ground, had, hand, have (2), he, if, in (2), independent, lost, many, miamidade, might, new, not, of, president, presidential, requested, shows, study, that, the, vice, votes, would

  An independent study shows former Vice President Al Gore would
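The bag-of-words representation described in the slides can be illustrated with a short Python sketch. It is not taken from the lecture: the function name bag_of_words and the tokenization rule (a word is any run of letters, digits, or apostrophes) are assumptions made for illustration. It applies case folding and counting only, with no stemming or lemmatization.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return a Counter mapping each case-folded token to its frequency.

    Assumption (not from the slides): a token is a maximal run of
    letters, digits, or apostrophes; everything else is a separator.
    """
    # Case folding: "President Bush" becomes "president bush".
    folded = text.lower()
    # Tokenization: punctuation and whitespace are discarded.
    tokens = re.findall(r"[a-z0-9']+", folded)
    # Word order is lost; only the counts remain.
    return Counter(tokens)


sentence = (
    "Scientists have found compelling new evidence of possible ancient "
    "microscopic life on Mars, derived from magnetic crystals in a meteorite "
    "that fell to Earth from the red planet, NASA announced on Monday."
)

bag = bag_of_words(sentence)

# Print the bag alphabetically, showing counts only when above 1.
for word in sorted(bag):
    count = bag[word]
    print(word if count == 1 else f"{word} ({count})")
```

With this tokenization rule, "White House" comes out as two separate words, which is one of the choices the tokenization question on the slides points at. A stemmer or lemmatizer would be a further step applied to each token before counting.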
