Mt Holyoke CS 336 - Intelligent Information Retrieval

Intelligent Information Retrieval
CS 336
Xiaoyan Li
Spring 2006
Modified from Lisa Ballesteros’s slides

What is Information Retrieval?
• Includes the following:
  – Organization
  – Storage/Representation
  – Manipulation/Analysis
  – Search/Retrieval
• How far back in history can we find examples?

IR Through the Ages
• 3rd Century BCE
  – Library of Alexandria
    • 500,000 volumes
    • catalogs and classifications
• 13th Century A.D.
  – First concordance of the Bible
    • What is a concordance?
• 15th Century A.D.
  – Invention of printing
• 1600
  – University of Oxford Library
    • All books printed in England

IR Through the Ages
• 1755
  – Johnson’s Dictionary
    • Set standard for dictionaries
    • Included common language
    • Helped standardize spelling
• 1800
  – Library of Congress
• 1828
  – Webster’s Dictionary
    • Significantly larger than previous dictionaries
    • Standardized American spelling
• 1852
  – Roget’s Thesaurus

IR Through the Ages
• 1876
  – Dewey Decimal Classification
• 1880’s
  – Carnegie Public Libraries
    • 1,681 built (first public library 1850)
• 1930’s
  – Punched card retrieval systems
• 1940’s
  – Bush’s Memex
  – Shannon’s Communication Theory
  – Zipf’s “Law”

Historical Summary
• 1960’s
  – Basic advances in retrieval and indexing techniques
• 1970’s
  – Probabilistic and vector space models
  – Clustering, relevance feedback
  – Large, on-line, Boolean information services
  – Fast string matching
• 1980’s
  – Natural Language Processing and IR
  – Expert systems and IR
  – Off-the-shelf IR systems

IR Through the Ages
• Late 1980’s
  – First mini-computer and PC systems incorporating “relevance ranking”
• Early 1990’s
  – information storage revolution
• 1992
  – First large-scale information service incorporating probabilistic retrieval (West’s legal retrieval system)

IR Through the Ages
• Mid 1990’s to present
  – Multimedia databases
• 1994 to present
  – The Internet and Web explosion
    • e.g. Google, Yahoo, Lycos, Infoseek (now Go)
• 1995 to present
  – Digital Libraries
  – Data Mining
  – Agents and Filtering
  – Knowledge and Distributed Intelligence
  – Information Organization
  – Knowledge Management

Historical Summary
• 1990’s
  – Large-scale, full-text IR and filtering experiments and systems (TREC)
  – Dominance of ranking
  – Many web-based retrieval engines
  – Interfaces and browsing
  – Multimedia and multilingual
  – Machine learning techniques

Trends in IR Technology
[Figure: on-line information (gigabytes, terabytes, petabytes) against time (1970, 1990). Era labels: Batch systems, Interactive systems, Database Systems, Cheap Storage, Internet, Multimedia. Technologies: Boolean Retrieval and Filtering, Ranked Retrieval, Distributed Retrieval, Concept-Based Retrieval, Image and Video Retrieval, Information Extraction, Visualization, Summarization, Data Mining, Ranked Filtering.]
• 1-page word document without any images = ~10 kilobytes (kb) of disk space.
• 1 terabyte = one-hundred million imageless word docs
• 1 petabyte = one-thousand terabytes.

Historical Summary
• The Future
  – Logic-based IR?
  – NLP?
  – Integration with other functionality
  – Distributed, heterogeneous database access
  – IR in context
  – “Anytime, Anywhere”

Information Retrieval
• Ad Hoc Retrieval
  – Given a query and a large database of text objects, find the relevant objects
• Distributed Retrieval
  – Many distributed databases
• Information Filtering
  – Given a text object from an information stream (e.g. newswire) and many profiles (long-term queries), decide which profiles match
• Multimedia Retrieval
  – Databases of other types of unstructured data, e.g. images, video, audio

Information Retrieval
• Multilingual Retrieval
  – Retrieval in a language other than English
• Cross-language Retrieval
  – Query in one language (e.g. Spanish), retrieve documents in other languages (e.g. Chinese, French, and Spanish)

What does an IR system do?
• Generate a representation of each document
  – essentially pick best words and/or phrases
• Generate query representation
  – if documents processed specially, queries must also be
  – possibly weight query words
• Match queries and documents
  – find relevant documents
• Perhaps, rank and sort documents

Information Retrieval
• Text Representation (Indexing)
  – given a text document, identify the concepts that describe the content and how well they describe it
    • what makes a “good” representation?
    • how is a representation generated from text?
    • what are retrievable objects and how are they organized?
• Representing an Information Need (Query Formulation)
  – describe and refine information needs as explicit queries
    • what is an appropriate query language?
    • how can interactive query formulation and refinement be supported?

Information Retrieval
• Comparing Representations (Retrieval)
  – compare text and information need representations to determine which documents are likely to be relevant
    • what is a “good” model of retrieval?
    • how is uncertainty represented?
• Evaluating Retrieved Text (Feedback)
  – present documents for user evaluation and modify query based on feedback
    • what are good metrics?
    • what constitutes a good experimental testbed?

Information Retrieval and Filtering
[Diagram with boxes: Information Need, Text Objects, Representation (twice), Query, Indexed Objects, Comparison, Retrieved Objects, Evaluation/Feedback.]

Features of a Modern IR Product
• Effective “relevance ranking”
• Simple free text (“natural language”) query capability
• Boolean and proximity operators
• Term weighting
• Query formulation assistance
• Query by example
• Filtering
• Field-based retrieval
• Distributed architecture
• Index anything
• Fast retrieval
• Information Organization

Typical Systems
• IR systems
  – Verity, Fulcrum, Excalibur
• Database systems
  – Oracle, Informix
• Web search and In-house systems
  – West, LEXIS/NEXIS, Dialog
  – Yahoo, Google, MSN, AskJeeves

IR vs. Database Systems
• Emphasis on effective, efficient retrieval of unstructured data
• IR systems typically have very simple schemas
• Query languages emphasize free text although Boolean combinations of words are also common

IR vs. Database Systems
• Matching is more complex than with structured data (semantics less obvious)
  – easy to retrieve the wrong
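
The four steps listed under "What does an IR system do?" (represent each document, represent the query the same way, match, then rank) can be illustrated with a small Python sketch. This is not the course's implementation. The stopword list, the function names (`represent`, `build_index`, `idf`, `search`), the sample documents, and the TF-IDF-style weighting are all assumptions chosen for illustration; the slides only say that query words are "possibly" weighted.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list, for illustration only (not from the slides).
STOPWORDS = {"a", "an", "and", "of", "the", "in", "to", "is"}

def represent(text):
    """Turn text into a bag of index terms: lowercased words minus stopwords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

def build_index(docs):
    """Map each document id to its term-count representation."""
    return {doc_id: represent(text) for doc_id, text in docs.items()}

def idf(term, index):
    """Assumed weighting: terms that occur in fewer documents weigh more."""
    df = sum(1 for terms in index.values() if term in terms)
    return math.log(len(index) / df) + 1.0 if df else 0.0

def search(query, index):
    """Match the query against every document; return (doc_id, score) best-first."""
    q = represent(query)  # the query is processed exactly like a document
    scores = {}
    for doc_id, terms in index.items():
        score = sum(q[t] * terms[t] * idf(t, index) for t in q if t in terms)
        if score > 0:  # keep only documents sharing at least one query term
            scores[doc_id] = score
    # Rank by descending score, breaking ties by document id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Made-up sample documents for the sketch.
docs = {
    "d1": "The Library of Alexandria kept catalogs and classifications.",
    "d2": "Probabilistic retrieval models rank documents by relevance.",
    "d3": "Boolean retrieval matches documents exactly.",
}
index = build_index(docs)
results = search("probabilistic retrieval", index)
```

Here `results` lists `d2` ahead of `d3`: both contain "retrieval", but only `d2` also contains the rarer term "probabilistic". `d1` shares no term with the query and is not returned at all, which is the simplest form of the matching step.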

