UMD CMSC 723 - Information Retrieval

Introduction to Computational Linguistics
Information Retrieval
Christof Monz

What is Information Retrieval?
• Finding relevant information in large collections of data
• In such a collection you may want to find:
  ◮ ‘Give me information on the history of the Kennedys’
    An article about the Kennedys (text retrieval)
  ◮ ‘What does a brain tumor look like on a CT-scan’
    A picture of a brain tumor (image retrieval)
  ◮ ‘It goes like this: hmm hmm hahmmm . . . ’
    A certain song (music retrieval)

Text Retrieval
• Online library catalogs (OPAC)
• Internet search engines, such as AltaVista, Google
• Specialized systems (aka vendors):
  ◮ MEDLINE (medical articles)
  ◮ Lexis-Nexis (legal, business, academic, . . . )
  ◮ Westlaw (legal articles)
  ◮ Dialog (business information)

Retrieval vs. Browsing
• Popular Web Directories:
  ◮ Yahoo!, Open Directory Project (dmoz)
• The user has to ‘guess’ the ‘right’ directories to find the information
  ◮ The user has to adapt to the designers’ conceptualization of the directory
• The goal of information retrieval is to provide immediate random access to the data
  ◮ The user can specify his information need

IR vs. Database Querying
• IR is not the same thing as querying a database
• Database querying assumes that the data is in a standardized format
• Transforming all information, news articles, web sites into a database format is difficult and impossible for large data collections
• Text retrieval can work with plain, unformatted data

Relevance as Similarity
• A fundamental idea within IR is:
  ‘A document is relevant to a query if they are similar’
• Similarity can be defined as
  ◮ string matching/comparison
  ◮ similar vocabulary
  ◮ same meaning of text

The Ubiquity of IR
• Information filtering
  ◮ E-mail routing
  ◮ Text categorization
• Detecting information structure
  ◮ Hyperlink generation
  ◮ Topic/Information detection/screening
  ◮ Portal development and maintenance
• Question Answering

History of IR
• 1950: Calvin N. Mooers coins the term ‘Information Retrieval’
• 1959: Luhn describes statistical retrieval
• 1960: Maron and Kuhns define a probabilistic model of IR
• 1966: Cranfield project defines evaluation measures
• 1968: Gerard Salton’s first book about the SMART retrieval system
• 1972: Lockheed introduces DIALOG as commercial online service
• Late 1980’s: First PC systems incorporate retrieval
• Early 1990’s: Cheap disks lead to the information storage revolution
• 1992: Westlaw is the first large-scale information service using probabilistic retrieval
• Mid 1990’s: Multi-media databases
• 1994: The internet and web explosion
• 1995: IR techniques are incorporated in all kinds of information management applications

Retrieval Models
• A retrieval model is an idealization or abstraction of an actual retrieval process
• Conclusions derived from a model depend on whether the model is a good approximation of the retrieval situation
• Note that a retrieval model is not the same thing as a retrieval implementation

[Diagram labels: User; query formulation; identify relevant information; document representations; display documents to the user]

Components of a Retrieval Model
• The user:
  ◮ Search expert (e.g., librarian) vs. non-expert
  ◮ Background of the user (knowledge of the topic)
  ◮ In-depth searching vs. ‘just-wanna-get-an-idea’ searching
• The documents:
  ◮ Different languages
  ◮ Semi-structured (e.g. HTML or XML) vs. plain

Document Representation
• Meta-descriptions
  ◮ Field information (author, title, date)
  ◮ Key words
    - Predefined
    - Manually extracted (by author/editor)
• Content: automatically identifying what the document is about

                        Manual                      Automatic
Controlled Vocabulary   Current indexing practice   Text categorization, ‘intelligent’ IR
Free Text               Current indexing practice   Text search engines, ‘statistical’ IR

Controlled Vocabularies
• Examples are:
  ◮ ACM Computing Classification System
    An article on Web search engines would (probably) be classified as H.3.5 where:
    - H: Information Systems
    - H.3: Information Storage and Retrieval
    - H.3.5: Online Information Services
  ◮ NLM Medical Subject Headings (MeSH)
  ◮ Yahoo!

Manual vs. Automatic Indexing
• Pros of manual indexing:
  + Human judgements are most reliable
  + Searching controlled vocabularies is more efficient
• Cons of manual indexing:
  − Time consuming
  − The person using the retrieval system has to be familiar with the classification system
  − Classification systems are sometimes incoherent

Automatic Content Representation
• Using natural language understanding?
  ◮ Computationally too expensive in real-world settings
  ◮ Coverage
  ◮ Language dependence
  ◮ The resulting representations may be too explicit to deal with the vagueness of a user’s information need
• Alternative: a document is simply an unstructured set of words appearing in it: bag of words

Bag-of-Words Approach
• A document is an unordered list of words
  Grammatical information is lost
• Tokenization: What is a word?
  Is ‘White House’ one or two words?
• Case folding
  ‘President Bush’ becomes ‘president’, ‘bush’
• Stemming or lemmatization
  Morphological information is thrown away
  ‘agreements’ becomes ‘agreement’ (lemmatization) or even ‘agree’ (stemming)

Example Bag of Words
Scientists have found compelling new evidence of possible ancient microscopic life on Mars, derived from magnetic crystals in a meteorite that fell to Earth from the red planet, NASA announced on Monday.

a, ancient, announced, compelling, crystals, derived, earth, evidence, fell, found, from (2×), have, in, life, magnetic, mars, meteorite, microscopic, monday, nasa, new, of, on
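
The bag-of-words steps described above (tokenization, case folding, counting repeated words) can be sketched in Python roughly as follows. This is an illustrative sketch only and is not taken from the slides: the function names bag_of_words and vocabulary_overlap are hypothetical, the regular-expression tokenizer is one simple assumption about what counts as a word, and Jaccard overlap is used merely as one possible reading of ‘similar vocabulary’. Stemming and lemmatization are deliberately left out.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return word counts for `text`, ignoring word order and case."""
    # Case folding: 'Mars' and 'mars' become the same token.
    lowered = text.lower()
    # Naive tokenization (an assumption): maximal runs of letters or digits.
    # Under this rule 'White House' yields two tokens.
    tokens = re.findall(r"[a-z0-9]+", lowered)
    return Counter(tokens)


def vocabulary_overlap(bag_a, bag_b):
    """Jaccard overlap of two vocabularies (a hypothetical similarity measure)."""
    vocab_a, vocab_b = set(bag_a), set(bag_b)
    if not vocab_a and not vocab_b:
        return 0.0
    return len(vocab_a & vocab_b) / len(vocab_a | vocab_b)


sentence = (
    "Scientists have found compelling new evidence of possible ancient "
    "microscopic life on Mars, derived from magnetic crystals in a meteorite "
    "that fell to Earth from the red planet, NASA announced on Monday."
)

document_bag = bag_of_words(sentence)
query_bag = bag_of_words("life on Mars")

# Print the bag in alphabetical order, marking words that occur more than once.
for word in sorted(document_bag):
    count = document_bag[word]
    print(word if count == 1 else f"{word} ({count}x)")

# Share of the combined vocabulary that query and document have in common.
print(vocabulary_overlap(query_bag, document_bag))
```

Because only counts are kept, any reordering of the sentence produces the same bag, which is exactly the loss of grammatical information the slides mention. A real system would typically add stemming or lemmatization and a weighting scheme on top of this.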

