Introduction to Computational Linguistics
Information Retrieval
Christof Monz

What is Information Retrieval?
• Finding relevant information in large collections of data
• In such a collection you may want to find:
  ◮ ‘Give me information on the history of the Kennedys’
    An article about the Kennedys (text retrieval)
  ◮ ‘What does a brain tumor look like on a CT-scan’
    A picture of a brain tumor (image retrieval)
  ◮ ‘It goes like this: hmm hmm hahmmm . . . ’
    A certain song (music retrieval)

Text Retrieval
• Online library catalogs (OPAC)
• Internet search engines, such as AltaVista, Google
• Specialized systems (aka vendors):
  ◮ MEDLINE (medical articles)
  ◮ Lexis-Nexis (legal, business, academic, . . . )
  ◮ Westlaw (legal articles)
  ◮ Dialog (business information)

Retrieval vs. Browsing
• Popular Web Directories:
  ◮ Yahoo!, Open Directory Project (dmoz)
• The user has to ‘guess’ the ‘right’ directories to find the information
  ◮ The user has to adapt to the designers’ conceptualization of the directory
• The goal of information retrieval is to provide immediate random access to the data
  ◮ The user can specify his information need

IR vs. Database Querying
• IR is not the same thing as querying a database
• Database querying assumes that the data is in a standardized format
• Transforming all information, news articles, web sites into a database format is difficult, and impossible for large data collections
• Text retrieval can work with plain, unformatted data

Relevance as Similarity
• A fundamental idea within IR is:
  ‘A document is relevant to a query if they are similar’
• Similarity can be defined as
  ◮ string matching/comparison
  ◮ similar vocabulary
  ◮ same meaning of text

The Ubiquity of IR
• Information filtering
  ◮ E-mail routing
  ◮ Text categorization
• Detecting information structure
  ◮ Hyperlink generation
  ◮ Topic/Information detection/screening
  ◮ Portal development and maintenance
• Question Answering

History of IR
• 1950: Calvin N. Mooers coins the term ‘Information Retrieval’
• 1959: Luhn describes statistical retrieval
• 1960: Maron and Kuhns define a probabilistic model of IR
• 1966: Cranfield project defines evaluation measures
• 1968: Gerard Salton’s first book about the SMART retrieval system
• 1972: Lockheed introduces DIALOG as commercial online service
• Late 1980’s: First PC systems incorporate retrieval

History of IR
• Early 1990’s: Cheap disks lead to the information storage revolution
• 1992: Westlaw is the first large-scale information service using probabilistic retrieval
• Mid 1990’s: Multi-media databases
• 1994: The internet and web explosion
• 1995: IR techniques are incorporated in all kinds of information management applications

Retrieval Models
• A retrieval model is an idealization or abstraction of an actual retrieval process
• Conclusions derived from a model depend on whether the model is a good approximation of the retrieval situation
• Note that a retrieval model is not the same thing as a retrieval implementation

Retrieval Models
[Figure: the retrieval process. Labels: User, query formulation, document representations, identify relevant information, display documents to the user]

Components of a Retrieval Model
• The user:
  ◮ Search expert (e.g., librarian) vs. non-expert
  ◮ Background of the user (knowledge of the topic)
  ◮ In-depth searching vs. ‘just-wanna-get-an-idea’ searching
• The documents:
  ◮ Different languages
  ◮ Semi-structured (e.g. HTML or XML) vs. plain

Document Representation
• Meta-descriptions
  ◮ Field information (author, title, date)
  ◮ Key words
    - Predefined
    - Manually extracted (by author/editor)
• Content: automatically identifying what the document is about

Document Representation
                        Manual                       Automatic
Controlled Vocabulary   Current indexing practice    Text categorization, ‘intelligent’ IR
Free Text               Current indexing practice    Text search engines, ‘statistical’ IR

Controlled Vocabularies
• Examples are:
  ◮ ACM Computing Classification System
    An article on Web search engines would (probably) be classified as H.3.5 where:
    - H: Information Systems
    - H.3: Information Storage and Retrieval
    - H.3.5: Online Information Services
  ◮ NLM Medical Subject Headings (MeSH)
  ◮ Yahoo!

Manual vs. Automatic Indexing
• Pros of manual indexing:
  + Human judgements are most reliable
  + Searching controlled vocabularies is more efficient
• Cons of manual indexing:
  − Time consuming
  − The person using the retrieval system has to be familiar with the classification system
  − Classification systems are sometimes incoherent

Automatic Content Representation
• Using natural language understanding?
  ◮ Computationally too expensive in real-world settings
  ◮ Coverage
  ◮ Language dependence
  ◮ The resulting representations may be too explicit to deal with the vagueness of a user’s information need
• Alternative: a document is simply an unstructured set of words appearing in it: bag of words

Bag-of-Words Approach
• A document is an unordered list of words
  Grammatical information is lost
• Tokenization: What is a word?
  Is ‘White House’ one or two words?
• Case folding
  ‘President Bush’ becomes ‘president’, ‘bush’
• Stemming or lemmatization
  Morphological information is thrown away
  ‘agreements’ becomes ‘agreement’ (lemmatization) or even ‘agree’ (stemming)

Example Bag of Words
Scientists have found compelling new evidence of possible ancient microscopic life on Mars, derived from magnetic crystals in a meteorite that fell to Earth from the red planet, NASA announced on Monday.

a, ancient, announced, compelling, crystals, derived, earth, evidence, fell, found, from (2×), have, in, life, magnetic, mars, meteorite, microscopic, monday, nasa, new, of, on
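
The bag-of-words steps (tokenization, case folding, counting) can be sketched in a few lines of Python. This is only an illustrative sketch, not the method used in the course: the function name bag_of_words is hypothetical, a token is assumed to be any maximal run of ASCII letters (so ‘White House’ would be split into two tokens), and stemming or lemmatization is left out.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Map each case-folded token in `text` to the number of times it occurs."""
    # Tokenization (assumption): a token is a maximal run of ASCII letters.
    tokens = re.findall(r"[A-Za-z]+", text)
    # Case folding: 'Mars' and 'mars' end up as the same entry.
    return Counter(token.lower() for token in tokens)


sentence = (
    "Scientists have found compelling new evidence of possible ancient "
    "microscopic life on Mars, derived from magnetic crystals in a meteorite "
    "that fell to Earth from the red planet, NASA announced on Monday."
)

bag = bag_of_words(sentence)

# Print the bag alphabetically, marking words that occur more than once.
for word in sorted(bag):
    count = bag[word]
    print(word if count == 1 else f"{word} ({count}x)")
```

Word order and punctuation are discarded; only which words occur, and how often, is kept.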