Information Retrieval
Prasad L1IntroIR

Adapted from Lectures by Berthier Ribeiro-Neto (Brazil), Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)

Outline:
•Unstructured (text) vs. structured (database) data in 1996
•Unstructured (text) vs. structured (database) data in 2006
•Structured vs unstructured data
•Unstructured data
•Semi-structured data
•What is IR?
•Ultimate Focus of IR
•Information Need : Query, Relevancy
•DIKW Hierarchy
•Information vs Data Retrieval
•User Task
•Logical View of Documents
•IR Basics
•Clustering and classification
•The web and its challenges
•More sophisticated semi-structured search
•More sophisticated information retrieval
•Future Progress: Factors/Trends

Unstructured (text) vs. structured (database) data in 1996

Unstructured (text) vs. structured (database) data in 2006

Structured vs unstructured data
•Structured data : information in “tables”

    Employee   Manager   Salary
    Smith      Jones     50000
    Chang      Smith     60000
    Ivy        Smith     50000

    Typically allows numerical range and exact match (for text) queries, e.g.,
    Salary < 60000 AND Manager = Smith.

Unstructured data
•Typically refers to free text
    Data which does not have clear, semantically overt, easy-for-a-computer structure
•Allows
    Keyword-based queries including operators
    More sophisticated “concept” queries, e.g.,
    •find all web pages dealing with drug abuse

Semi-structured data
•In fact almost no data is “unstructured”
    E.g., this slide has distinctly identified zones such as the Title and Bullets
•Facilitates “semi-structured” search such as
    Title contains data AND Bullets contain search
… to say nothing of linguistic structure

What is IR?
•Representation
    •Keywords/Phrases, Structure/Fonts, Counts, etc
•Organization and Storage
    •Inverted File Index, Compressed, etc
    •Hardware Architecture and Memory Hierarchy
•Access to information items
    •Interface : Spell-checker to tree-structured display
    •Visualization : Labeled Clusters, Timelines, Spring graphs, etc.

Ultimate Focus of IR
•Satisfying user information need
    Emphasis is on retrieval of information (not data)
•User information need : Examples
    Printer reviews
    Printer prices and availability
    Words in which all vowels appear
    Anagram/Permutations of art
•Predicting which documents are relevant, and then linearly ranking them.

Information Need : Query, Relevancy
•An information need is the topic about which the user desires to know more, and is differentiated from a query, which is what the user conveys to the computer in an attempt to communicate the information need.
•A document is relevant if it is one that the user perceives as containing information of value with respect to their personal information need.

DIKW Hierarchy
•Data : Symbolic units
    E.g., Records of customer.
    E.g., Bytes from sensors.
•Information : Data with an interpretation (Who?, What?, When?, Where?).
    E.g., Records of current/new customer grouped by their ages.
    E.g., Variation in temperature readings.

DIKW Hierarchy
•Knowledge : Information organized with theoretical concepts or abstract ideas (How?)
    E.g., How many customers have cancelled the accounts in current fiscal year?
    E.g., Analysis of temperature variation over the years and their causes.
•Wisdom : Understanding of fundamental principles + Human Judgement
    E.g., What strategies can be employed to retain customers in the face of cheaper alternatives?
    E.g., Global warming issues and the future of Earth.

[Figure: DIKW hierarchy (Clark 2004). Labels: Data, Information, Knowledge, Wisdom; Understanding; Context; Researching, Absorbing, Doing, Interacting, Reflecting; Gathering of parts, Connection of parts, Formation of a whole, Joining of wholes; Past, Future; Experience, Novelty.]

You see things; and you say "Why?" But I dream things that never were; and I say "Why not?"
George Bernard Shaw

Information vs Data Retrieval
•DATA:
    Information retrieval: Unstructured : open to interpretation
    Data retrieval: Structured with well-defined semantics
•QUERY :
    Information retrieval: Usually incomplete or ambiguous (w.r.t information need)
    Data retrieval: Well-defined semantics
•QUALITY OF RESULTS:
    Information retrieval: Partial match allowed, relevance-based ranking
    Data retrieval: Exact match required - no or many results
•FOUNDATIONS:
    Information retrieval: Probabilistic underpinnings
    Data retrieval: Foundations: Algebra/Logic
•APPLICATION:
    Information retrieval: Library
    Data retrieval: Accounting

User Task
Retrieval
•Purposeful: HP Multifunction Printer Information
Browsing
•Casual: Big Bang, CBR, Element Genesis, Supernova, ...
•Hyperlink-based
Filtering by Agents
•Push: Podcasts from B.B.C’s Naked Science
[Figure labels: Retrieval, Browsing, Database]

Logical View of Documents
•Abstraction (essentials)
    Structure, fonts, proximity, repetitions, etc
[Figure labels: Docs; structure; Accents, spacing; stopwords; Noun groups; stemming; Manual indexing; structure, Full text, Index terms]

The Retrieval Process
[Figure labels: User Interface; Text Operations; Query Operations; Indexing; Searching; Ranking; Index; Text; query; user need; user feedback; ranked docs; retrieved docs; logical view; inverted file; DB Manager Module; Text Database]

IR Basics
•Models and retrieval evaluation
•Query languages and operations
    •Improve inferring query context
        (query expansion, relevance feedback)
•Text operations
    •Improve gleaning of document semantics
        (stemming keywords)
•Efficient Access: Index and Search
Visualization, Multimedia, Applications, …

Clustering and classification
•Given a set of docs, group them into clusters based on their content.
•Given a set of topics, plus a new doc D, decide which topic(s) D belongs to.

The web and its challenges
•Unusual and diverse documents
•Unusual and diverse users, queries, information needs
•Beyond terms, exploit ideas from social networks
    link analysis, clickstreams, ...
•How do search engines work? And how can we make them better?

More sophisticated semi-structured search
•Title is about Object Oriented Programming AND Author something like stro*rup where * is the wild-card operator
•Issues:
    how do you process “about”?
    how do you rank results?
•The focus of XML search.

More sophisticated information retrieval
•Cross-language information retrieval
•Question
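
The slides name the inverted file index as the storage structure and contrast exact-match data retrieval with the partial-match, relevance-ranked results of IR. A minimal Python sketch of that contrast follows. It is an illustration only, not the lecture's method: the function names, the three sample documents, and the raw term-count scoring are all assumptions made here, and real systems use more refined weighting and text operations such as stemming.

```python
from collections import Counter, defaultdict

PUNCT = '.,;:!?"()'


def tokenize(text):
    """Lowercase the text and split on whitespace, stripping edge punctuation."""
    tokens = (word.strip(PUNCT).lower() for word in text.split())
    return [t for t in tokens if t]


def build_inverted_index(docs):
    """Map each term to {doc_id: number of occurrences in that doc}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index


def boolean_and(index, terms):
    """Exact-match style: return only the docs that contain every term."""
    postings = [set(index.get(term.lower(), {})) for term in terms]
    return set.intersection(*postings) if postings else set()


def ranked_search(index, query):
    """Partial-match style: score docs by summed term counts, best first."""
    scores = Counter()
    for term in tokenize(query):
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += count
    # Sort by descending score, breaking ties by doc id for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical sample collection, loosely based on the "printer" examples.
docs = {
    1: "Printer reviews and printer prices",
    2: "Printer availability in stores",
    3: "Reviews of laser scanners",
}
index = build_inverted_index(docs)

print(boolean_and(index, ["printer", "reviews"]))
print(ranked_search(index, "printer reviews"))
```

With this sample collection the Boolean AND query returns only document 1, while the ranked search also returns documents 2 and 3 with lower scores, which is the "partial match allowed, relevance-based ranking" behaviour listed under Information vs Data Retrieval.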


Wright CS 707 - Information Retrieval
