Villanova CSC 9010 - Basics of Information Retrieval


• Basics of Information Retrieval
• Basic ideas
• Some distinctions
• Databases
• The Web
• Digital Library
• In all cases
• Effectiveness
• The process
• The Collection
• Crawling the web
• Responding to search queries
• Making the connections
• The Vector model
• Anatomy of a web page
• Standard Metatags
• Hubs and authorities
• Some Digital Library examples
• Conclusions

Basics of Information Retrieval
Lillian N. Cassel
Some of these slides are taken or adapted from
Source: http://www.stanford.edu/class/cs276/cs276-2006-syllabus.html

Basic ideas
• Information overload
  • The challenging byproduct of the information age
  • Huge amounts of information are available: how do you find what you need when you need it?
  • Think about addresses, e-mail messages, files of interesting articles, etc.
• Information retrieval is the formal study of efficient and effective ways to extract the right bit of information from a collection.
  • The web is a special case, as we will discuss.

Some distinctions
• Data, information, knowledge
  • How do you distinguish among them?
  • http://www.systems-thinking.org/dikw/dikw.htm
• Information sources
  • Very well organized, indexed, controlled
  • Totally unorganized, uncharacterized, uncontrolled
  • Something in between

Databases
• Databases hold specific data items
• Organization is explicit
• Keys relate items to each other
• Queries are constrained, but effective in retrieving the data that is there
• Databases generally respond to specific queries with specific results
• Browsing is difficult
• Searching for items not anticipated by the designers can be difficult

The Web
• The Web contains many kinds of elements
• Organization?
• There are no keys to relate items to each other
• Queries are unconstrained; effectiveness depends on the tools used.
• Web searches generally respond to general queries with specific results
• Browsing is possible, though somewhat complicated
• There are no designers of the overall Web structure.
• Describe how you frequently use the web
  • What works easily?
  • What has been difficult?

Digital Library
• Something in between the very structured database and the unstructured Web.
• Content is controlled. Someone makes the entries. (Maybe a lot of people make the entries, but there are rules for admission.)
• Searching and browsing are somewhat open, not controlled by fixed keys and anticipated queries.
• The nature of the collection regulates indexing somewhat.

In all cases
• Trying to connect an information user to the specific information wanted.
• Concerned with efficiency and effectiveness
  • Effectiveness - how well did we do?
  • Efficiency - how well did we use available resources?

Effectiveness
• Two measures:
  • Precision: Of the results returned, what percentage are meaningful to the goal of the query?
  • Recall: Of the materials available that match the query, what percentage were returned?
• Ex. Search returns 590,000 responses and 195 are relevant. How well did we do?
  • Not enough information. Did the 590,000 include all relevant responses? If so, recall is perfect.
  • 195/590,000 (about 0.03%) is not good precision!

The process
• Query entered
• Query interpreted
• Index searched
• Items retrieved
• Results ranked

The Collection
• Where does the collection come from?
• How is the index created?
• Those are important distinguishing characteristics
• Inverted index: an ordered list of terms related to the collected materials. Each term has an associated pointer to the related material(s).
• www.cs.cityu.edu.hk/~deng/5286/T51.doc

Crawling the web
• "Crawling" is a misnomer, as the spider or robot does not actually move about the web.
• The program sends a normal request for the page, just as a browser would.
• Retrieve the page and parse it.
  • Look for anchors: pointers to other pages.
    • Put them on a list of URLs to visit
  • Extract key words (possibly all words) to use as index terms related to that page
• Take the next URL and do it again
• Actually, the crawling and processing are parallel activities

Responding to search queries
• Use the query string provided
  • Form a Boolean query
    • Join all words with AND? With OR?
  • Find the related index terms
  • Return the information available about the pages that correspond to the query terms.
• Many variations on how to do this. Usually proprietary to the company.

Making the connections
• Stemming
  • Making sure that simple variations in word form are recognized as equivalent for the purpose of the search: exercise, exercises, exercised, for example.
• Indexing
  • A keyword or group of selected words
  • Any word (more general)
  • How to choose the most relevant terms to use as index elements for a set of documents.
  • Build an inverted file for the chosen index terms.

The Vector model
• Let
  • N be the total number of documents in the collection
  • n_i be the number of documents which contain the index term k_i
  • freq(i,j) be the raw frequency of k_i within document d_j
• A normalized tf (term frequency) factor is given by
    tf(i,j) = freq(i,j) / max_l freq(l,j)
  where the maximum is computed over all terms k_l which occur within the document d_j
• The idf (inverse document frequency) factor is computed as
    idf(i) = log(N / n_i)
  The log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term k_i.

Anatomy of a web page
• Metatags: Information about the page
  • Primary source of indexing information for a search engine.
  • Ex. Title. Never mind what has an H1 tag (though that may be considered), what is in the <title> </title> brackets?
  • Other tags provide information about the page. This is easier for the search engine to use than determining the meaning of the text of the page.
• Dealing with the cheaters
  • False information provided in the web page to make the search engine return this page
  • False metatags, invisible words (repeated many times), etc.

Standard Metatags
• The Dublin Core (http://dublincore.org/)
  • 15 common items to use in labeling any web document

    Title        Contributor    Source
    Creator      Date           Language
    Subject      Resource type  Relation
    Description  Format         Coverage
    Publisher    Identifier     Rights

Hubs and authorities
• A hub points to a lot of other places.
  • CITIDEL is a hub for computing information
  • NSDL is a hub for science, technology, engineering and mathematics education.
• Authorities are pointed to by a lot of other places.
  • W3C.org is an authority for information about the web.
• When Hub or Authority status is captured, the search can be more accurate. If several pages match a query, and one is an authority page, it will be ranked higher.
• When a hub
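
Several of the ideas in these slides (the inverted index, Boolean AND/OR queries, the tf and idf factors of the vector model, and precision and recall) are easier to follow with a small worked example. The Python below is only an illustrative sketch, not the method of any particular search engine: the three-document collection, the whitespace tokenizer, and every function name are assumptions made up for this example, and no stemming is applied.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy collection; the texts are invented for illustration.
DOCS = {
    "d1": "exercise improves health",
    "d2": "health information retrieval",
    "d3": "information retrieval on the web web",
}


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, very naive tokenizer)."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def boolean_query(index, query, mode="AND"):
    """Join all query words with AND (intersection) or OR (union)."""
    postings = [index.get(term, set()) for term in tokenize(query)]
    if not postings:
        return set()
    if mode == "AND":
        return set.intersection(*postings)
    return set.union(*postings)


def tf(term, doc_text):
    """tf(i,j): raw frequency divided by the highest term frequency in d_j."""
    counts = Counter(tokenize(doc_text))
    if not counts:
        return 0.0
    return counts[term] / max(counts.values())


def idf(term, index, n_docs):
    """idf(i) = log(N / n_i); returns 0.0 for a term that occurs nowhere."""
    n_i = len(index.get(term, ()))
    return math.log(n_docs / n_i) if n_i else 0.0


def precision(relevant_returned, total_returned):
    """Share of the returned results that are relevant."""
    return relevant_returned / total_returned


def recall(relevant_returned, relevant_available):
    """Share of the relevant material that was actually returned."""
    return relevant_returned / relevant_available


if __name__ == "__main__":
    index = build_inverted_index(DOCS)
    print(sorted(boolean_query(index, "information retrieval", "AND")))
    print(sorted(boolean_query(index, "health web", "OR")))
    print(tf("web", DOCS["d3"]) * idf("web", index, len(DOCS)))
    print(precision(195, 590_000))  # the example from the Effectiveness slide
```

Choosing AND keeps only documents that contain every query word, which tends to favor precision, while OR accepts any document containing at least one word, which tends to favor recall. Multiplying tf by idf, as in the last-but-one print, is one common way to weight a term, but the slides stop at defining the two factors.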
