UW-Madison CS 838 - Information Retrieval

CS838-1 Advanced NLP: Information Retrieval
Xiaojin Zhu, 2007
Send comments to [email protected]

1 Information Retrieval Tasks

Information Retrieval (IR) covers many aspects of getting information. The most well-known task is ad hoc retrieval (e.g., Google and Yahoo!), where the task is to find the most relevant documents for a given user query.

• The document collection is known and (roughly) fixed. For example, the crawled Web.
• The user issues a query, and the IR system ranks documents.
• In some cases, the user can provide relevance feedback interactively, e.g. by labeling the top documents as good or bad. The system then updates its ranking based on the label information.
• A related technique is called pseudo relevance feedback, where no user interaction is needed. The system treats the top documents as if they were all relevant, and reinforces itself. We will see this again in semi-supervised learning later.
• Meta search tries to merge the ranked lists from multiple IR systems.
• Clustering and visualizing search results is important too. Try the query "crane" at clusty.com.

Information filtering is another task (e.g., Google Alert). The user defines an information need beforehand, and the system picks out relevant documents from a continuous stream. IR extends to images, audio, and video too.

2 IR measures

Given a query, a document collection can be divided into two parts: those truly relevant and those not (let us assume the distinction is clear cut). An IR system retrieves the documents it deems relevant, thus dividing the collection into it-thinks-relevant and it-thinks-not. The two divisions are often not the same! Therefore we have four counts:

                  truly relevant        truly irrelevant
  retrieved       true positive (tp)    false positive (fp)
  not retrieved   false negative (fn)   true negative (tn)

Two common measures gauge the quality of the IR system:

• Precision:

      P = tp / (tp + fp).

  How pure are the retrieved documents?
  Obviously we would like precision close to 1.

• Recall. An IR system can cheat precision by retrieving only its single most confident document. Since that document will very likely be relevant, precision will likely be 1. However, there could be hundreds of truly relevant documents, and a good IR system should get them all. Recall measures this by

      R = tp / (tp + fn).

  Note recall by itself is not a good measure either: the IR system can cheat by returning the whole document collection.

For this reason precision and recall are almost always reported as a pair. An IR system usually has many tunable parameters. Different parameter settings result in different (P, R) pairs (assuming a fixed document collection). One can plot these pairs on a precision-recall curve, where by convention the x-axis is recall and the y-axis is precision. Note the parameters themselves are hidden and not shown. The PR curve is often used to compare two IR systems: the one whose curve lies above (and to the right) is superior.

A single-number summary of precision and recall is the F-score

      F = 2PR / (P + R),

which is the harmonic mean of P and R. There are many other measures, but P, R, and F form the basis.

3 Language Models for IR

There are many ways to do IR; language models (LMs) are one of them. Given a query q, we want to rank the documents. We introduce a binary label r for 'relevant', and compute the probability

      p(r = 1 | q, d)

that a given document d is relevant to the query q. Ranking by this probability gives what we want. Applying Bayes rule,

      p(r = 1 | q, d) = p(d, q | r = 1) p(r = 1) / p(q, d).                                          (1)

It will be convenient to consider the log odds ratio, which preserves the ranking:

      log [ p(r = 1 | q, d) / p(r = 0 | q, d) ]                                                      (2)
    = log [ p(d, q | r = 1) p(r = 1) / ( p(d, q | r = 0) p(r = 0) ) ]                                (3)
    = log [ p(q | d, r = 1) p(d | r = 1) p(r = 1) / ( p(q | d, r = 0) p(d | r = 0) p(r = 0) ) ]      (4)
    = log [ p(q | d, r = 1) p(r = 1 | d) / ( p(q | d, r = 0) p(r = 0 | d) ) ]                        (5)
    = log [ p(q | d, r = 1) / p(q | d, r = 0) ] + log [ p(r = 1 | d) / p(r = 0 | d) ]                (6)

We make the assumption that

      p(q | d, r = 1)

is computed from a (document) unigram language model created from d, and applied to q.
That is, we assume the query is "generated" by the relevant document. Note there will be many separate document LMs, one per document.

We make a second assumption that

      p(q | d, r = 0) = p(q | r = 0),

which says that since d is irrelevant, q and d are conditionally independent. This term is the same for every document, so it can be ignored for ranking.

The term log [ p(r = 1 | d) / p(r = 0 | d) ] measures how likely a document d is to be relevant, without even knowing what query will be asked! It can be viewed as an a priori importance measure. It can be a constant for all documents, but a more interesting model made Google famous, as we discuss later.

4 Vector Models for IR

Each document d is represented by a tf·idf vector (many other variations exist). tf is normalized term frequency, i.e. the word count vector scaled so that the most frequent word type in the document has value 1:

      tf_w = c(w, d) / max_v c(v, d),

where c(w, d) is the count of word w in d. idf is inverse document frequency. Let N be the number of documents in the collection, and let n_w be the number of documents in which word type w appears. Then

      idf_w = log (N / n_w).

idf is an attempt to handle stopwords or near-stopwords: if a word (like "the") appears in most documents, it is probably not very interesting. Finally,

      tf·idf_w = tf_w × idf_w.

The query q is represented by a tf·idf vector too. Documents are ranked by their cosine similarity to q: q and a document d form an angle θ in the tf·idf vector space, and

      sim(d, q) = cos(θ)                           (7)
                = dᵀq / (‖d‖ · ‖q‖)                (8)
                = dᵀq / (√(dᵀd) √(qᵀq)),           (9)

where the dot product uᵀv
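The tf·idf construction and cosine ranking of Section 4 can be sketched in a few lines of Python. This is a minimal illustration; the tokenized toy documents, the query, and all function names are ours, not from the notes:

```python
import math
from collections import Counter

def tf(doc_tokens):
    # normalized term frequency: scale counts so the most frequent
    # word type in this document has value 1
    counts = Counter(doc_tokens)
    m = max(counts.values())
    return {w: c / m for w, c in counts.items()}

def idf(docs_tokens):
    # idf_w = log(N / n_w), where n_w = number of documents containing w
    N = len(docs_tokens)
    nw = Counter(w for toks in docs_tokens for w in set(toks))
    return {w: math.log(N / n) for w, n in nw.items()}

def tfidf(doc_tokens, idf_table):
    # words unseen at idf-estimation time get weight 0
    t = tf(doc_tokens)
    return {w: t[w] * idf_table.get(w, 0.0) for w in t}

def cosine(u, v):
    # cos(theta) = u.v / (|u| |v|) over sparse dict vectors
    dot = sum(x * v[w] for w, x in u.items() if w in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["the", "crane", "is", "a", "bird"],
        ["the", "crane", "lifts", "steel"],
        ["the", "bird", "flew", "south"]]
idf_table = idf(docs)
vecs = [tfidf(d, idf_table) for d in docs]
query = tfidf(["crane", "bird"], idf_table)
# rank documents by cosine similarity to the query
ranking = sorted(range(len(docs)),
                 key=lambda i: cosine(vecs[i], query), reverse=True)
```

With this toy collection, the first document matches both query words and ranks first; note that "the", appearing in every document, gets idf = log(3/3) = 0 and contributes nothing, which is exactly the stopword-damping effect described above.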

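The query-likelihood assumption of Section 3 — scoring each document by p(q | d, r = 1) under a unigram LM estimated from d — can also be sketched. The notes do not specify a smoothing scheme, so this sketch uses add-one (Laplace) smoothing as an illustrative choice; the toy data and names are ours:

```python
import math
from collections import Counter

def log_query_likelihood(query_tokens, doc_tokens, vocab_size):
    # unigram LM estimated from d, with add-one smoothing,
    # applied to the query: log p(q | d) = sum over w in q of log p(w | d)
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    return sum(math.log((counts[w] + 1) / (total + vocab_size))
               for w in query_tokens)

docs = [["the", "crane", "is", "a", "bird"],
        ["the", "crane", "lifts", "steel"]]
V = len({w for d in docs for w in d})  # vocabulary size
scores = [log_query_likelihood(["crane", "bird"], d, V) for d in docs]
```

Here the first document contains both query words and receives the higher log-likelihood, so ranking by this score prefers it; the document-prior term log [ p(r = 1 | d) / p(r = 0 | d) ] from equation (6) is taken as constant and omitted.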
