Basics of Information Retrieval
Lillian N. Cassel
Some of these slides are taken or adapted from
Source: http://www.stanford.edu/class/cs276/cs276-2006-syllabus.html

Outline
• Basic ideas
• Some distinctions
• Databases
• The Web
• Digital Library
• In all cases
• Effectiveness
• The process
• The Collection
• Crawling the web
• Responding to search queries
• Making the connections
• The Vector model
• Anatomy of a web page
• Standard Metatags
• Hubs and authorities
• Some Digital Library examples
• Conclusions

Basic ideas
• Information overload
  • The challenging byproduct of the information age
  • Huge amounts of information available -- how to find what you need when you need it
  • Think about addresses, e-mail messages, files of interesting articles, etc.
• Information retrieval is the formal study of efficient and effective ways to extract the right bit of information from a collection.
  • The web is a special case, as we will discuss.

Some distinctions
• Data, information, knowledge
  • How do you distinguish among them?
  • http://www.systems-thinking.org/dikw/dikw.htm
• Information sources
  • Very well organized, indexed, controlled
  • Totally unorganized, uncharacterized, uncontrolled
  • Something in between

Databases
• Databases hold specific data items
• Organization is explicit
• Keys relate items to each other
• Queries are constrained, but effective in retrieving the data that is there
• Databases generally respond to specific queries with specific results
• Browsing is difficult
• Searching for items not anticipated by the designers can be difficult

The Web
• The Web contains many kinds of elements
• Organization?
• There are no keys to relate items to each other
• Queries are unconstrained; effectiveness depends on the tools used.
• Web queries generally respond to general queries with specific results
• Browsing is possible, though somewhat complicated
• There are no designers of the overall Web structure.
• Describe how you frequently use the web
  • What works easily?
  • What has been difficult?

Digital Library
• Something in between the very structured database and the unstructured
  Web.
• Content is controlled. Someone makes the entries. (Maybe a lot of people make the entries, but there are rules for admission.)
• Searching and browsing are somewhat open, not controlled by fixed keys and anticipated queries.
• Nature of the collection regulates indexing somewhat.

In all cases
• Trying to connect an information user to the specific information wanted.
• Concerned with efficiency and effectiveness
  • Effectiveness - how well did we do?
  • Efficiency - how well did we use available resources?

Effectiveness
• Two measures:
  • Precision: Of the results returned, what percentage are meaningful to the goal of the query?
  • Recall: Of the materials available that match the query, what percentage were returned?
• Ex. Search returns 590,000 responses and 195 are relevant. How well did we do?
  • Not enough information. Did the 590,000 include all relevant responses? If so, recall is perfect.
  • 195/590,000 (about 0.03%) is not good precision!

The process
• Query entered
• Query interpreted
• Index searched
• Items retrieved
• Results ranked

The Collection
• Where does the collection come from?
• How is the index created?
• Those are important distinguishing characteristics
• Inverted Index -- Ordered list of terms related to the collected materials. Each term has an associated pointer to the related material(s).
  www.cs.cityu.edu.hk/~deng/5286/T51.doc

Crawling the web
• The name is a misnomer: the spider or robot does not actually move about the web
• Program sends a normal request for the page, just as a browser would.
• Retrieve the page and parse it.
  • Look for anchors -- pointers to other pages.
    • Put them on a list of URLs to visit
  • Extract key words (possibly all words) to use as index terms related to that page
• Take the next URL and do it again
• Actually, the crawling and processing are parallel activities

Responding to search queries
• Use the query string provided
  • Form a boolean query
    • Join all words with AND? With OR?
• Find the related index terms
• Return the information available about the pages that correspond to the query terms.
• Many variations on how to do this.
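The inverted index and the AND/OR choice for boolean queries described above can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not how any particular search engine works: the three-page collection, the function names build_inverted_index and boolean_query, and the plain whitespace tokenization (no stemming) are all made up for the example.

```python
from collections import defaultdict

# Hypothetical mini-collection: page name -> page text (illustrative data only).
pages = {
    "a.html": "information retrieval on the web",
    "b.html": "database queries retrieve specific data",
    "c.html": "web crawlers retrieve pages for the index",
}

def build_inverted_index(docs):
    """Map each term to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def boolean_query(index, query, mode="AND"):
    """Join all query words with AND (intersection) or OR (union)."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    if not postings:
        return set()
    if mode == "AND":
        return set.intersection(*postings)
    return set.union(*postings)

index = build_inverted_index(pages)
and_hits = sorted(boolean_query(index, "web index", mode="AND"))
or_hits = sorted(boolean_query(index, "web index", mode="OR"))
print(and_hits)  # pages containing both words
print(or_hits)   # pages containing either word
```

Joining the words with AND tends to favor precision (fewer, more specific pages), while OR tends to favor recall (more pages, many of them less relevant).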
• The exact method is usually proprietary to the company.

Making the connections
• Stemming
  • Making sure that simple variations in word form are recognized as equivalent for the purpose of the search: exercise, exercises, exercised, for example.
• Indexing
  • A keyword or group of selected words
  • Any word (more general)
  • How to choose the most relevant terms to use as index elements for a set of documents.
  • Build an inverted file for the chosen index terms.

The Vector model
• Let
  • N be the total number of documents in the collection
  • n_i be the number of documents which contain the term k_i
  • freq(i,j) be the raw frequency of k_i within document d_j
• A normalized tf (term frequency) factor is given by
    tf(i,j) = freq(i,j) / max_l freq(l,j)
  where the maximum is computed over all terms l which occur within the document d_j
• The idf (inverse document frequency) factor is computed as
    idf(i) = log(N / n_i)
  The log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term k_i.

Anatomy of a web page
• Metatags: Information about the page
  • Primary source of indexing information for a search engine.
  • Ex. Title. Never mind what has an H1 tag (though that may be considered), what is in the <title> </title> brackets?
  • Other tags provide information about the page.
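As a rough illustration of reading the title and metatags of a page, the sketch below uses Python's standard html.parser module. The class name MetaTagParser and the sample page are hypothetical, and a real indexer would need to handle malformed HTML and many more cases than this.

```python
from html.parser import HTMLParser

class MetaTagParser(HTMLParser):
    """Collect the <title> text and name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            # Keep only <meta name="..." content="..."> style tags.
            if attr.get("name") and attr.get("content") is not None:
                self.meta[attr["name"].lower()] = attr["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text is added to the title only while inside <title>...</title>.
        if self._in_title:
            self.title += data

# Hypothetical page, for illustration only.
sample_page = """<html><head>
<title>Basics of Information Retrieval</title>
<meta name="description" content="Introductory slides on search">
<meta name="keywords" content="information retrieval, indexing">
</head><body><h1>Welcome</h1></body></html>"""

parser = MetaTagParser()
parser.feed(sample_page)
print(parser.title)
print(parser.meta)
```

Note that the H1 text ("Welcome") is ignored here; only the title and the metatags are collected as candidate index information.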
• Tags are easier for the search engine to use than determining the meaning of the text of the page.
• Dealing with the cheaters
  • False information provided in the web page to make the search engine return this page
  • False metatags, invisible words (repeated many times), etc.

Standard Metatags
• The Dublin Core (http://dublincore.org/)
  • 15 common items to use in labeling any web document

    Title          Contributor      Source
    Creator        Date             Language
    Subject        Resource type    Relation
    Description    Format           Coverage
    Publisher      Identifier       Rights

Hubs and authorities
• A hub points to a lot of other places.
  • CITIDEL is a hub for computing information
  • NSDL is a hub for science, technology, engineering and mathematics education.
• Authorities are pointed to by a lot of other places.
  • W3C.org is an authority for information about the web.
• When Hub or Authority status is captured, the search can be more accurate. If several pages match a query, and one is an authority page, it will be ranked higher.
• When a hub