CU-Boulder CSCI 5417 - Lecture 5 - D2584312

School name University of Colorado at Boulder

Course Csci 5417- Information Retrieval Systems

Pages 38

This preview shows page 1-2-3-18-19-36-37-38 out of 38 pages.


CSCI 5417 Information Retrieval Systems
Jim Martin
Lecture 5, 9/6/2011
(slide footer date: 01/14/19)

Slide titles: Today 9/6; Recap; Beyond Boolean; Scoring; Ranked Retrieval; Back to Term x Document Matrices; Slide 8; Scoring: Beyond Boolean AND; Term Frequency: Local Weight; Potential Problem; Term Frequency tf_{t,d}; Global Weight; Collection vs. Document Frequency; Inverse Document Frequency; Reuters RCV1 800K docs; tf x idf (or tf.idf or tf-idf); Summary: TfxIdf; Real-valued term vectors; Assignment 2; Assignment 2; Sample Doc; Sample Query; Qrels; Evaluation; Assignment; Back to Scoring; Documents as Vectors; Why turn docs into vectors?; Intuition; The Vector Space Model; Cosine Similarity; Cosine similarity; Normalized vectors; So...; But...; Slide 37; Next Time

Slide 2. Today 9/6
- Vector space model
- New homework

Slide 3. Recap
- We've covered a variety of types of indexes
- And a variety of ways to build indexes
- And a variety of ways to process tokens
- And Boolean search
- Now what?

Slide 4. Beyond Boolean
- Thus far, our queries have been Boolean
- Docs either match or they don't
- OK for expert users with a precise understanding of their needs and the corpus
- Not good for (the majority of) users with poor Boolean formulation of their needs
- Most users don't want to wade through 1000s of results (or get 0 results)
- Hence the popularity of search engines which provide a ranking.

Slide 5. Scoring
- Without some form of ranking, Boolean queries usually result in too many or too few results.
- With ranking, the number of returned results is irrelevant.
- The user can start at the top of a ranked list and stop when their information need is satisfied.

Slide 6. Ranked Retrieval
- Given a query, assign a numerical score to each doc in the collection
- Return documents to the user based on the ranking derived from that score
- How?
- A considerable amount of the research in IR over the last 20 years...
- Extremely empirical in nature

Slide 7. Back to Term x Document Matrices

           Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony                        1              1            0       0        0        1
Brutus                        1              1            0       1        0        0
Caesar                        1              1            0       1        1        1
Calpurnia                     0              1            0       0        0        0
Cleopatra                     1              0            0       0        0        0
mercy                         1              0            1       1        1        1
worser                        1              0            1       1        1        0

- Documents and terms can be thought of as vectors of 1's and 0's

Slide 8. Back to Term x Document Matrices

           Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony                      157             73            0       0        0        0
Brutus                        4            157            0       1        0        0
Caesar                      232            227            0       2        1        1
Calpurnia                     0             10            0       0        0        0
Cleopatra                    57              0            0       0        0        0
mercy                         2              0            3       5        5        1
worser                        2              0            1       1        1        0

- Consider instead the number of occurrences of a term t in a document d, denoted $tf_{t,d}$

Slide 9. Scoring: Beyond Boolean AND
- Given a free-text query q and a document d, define

$$\mathrm{Score}(q,d) = \sum_{t \in q} tf_{t,d}$$

- That is, simply add up the term frequencies of all query terms in the document
- Holding the query static, this assigns a score to each document in a collection; now rank documents by this score.

Slide 10. Term Frequency: Local Weight
- What is the relative importance of
  - 0 vs. 1 occurrence of a term in a doc
  - 1 vs. 2 occurrences
  - 2 vs. 3 occurrences ...
- Unclear, but it does seem that more is better, although a lot isn't proportionally better than a few
- One scheme commonly used:

$$wf_{t,d} = \begin{cases} 0 & \text{if } tf_{t,d} = 0 \\ 1 + \log tf_{t,d} & \text{otherwise} \end{cases}$$

Slide 11. Potential Problem
- Consider the query "ides of march"
- Julius Caesar has 5 occurrences of "ides"
- No other play has "ides"
- "march" occurs in over a dozen
- So... Julius Caesar should do well since it has counts from both "ides" and "march"
- BUT all the plays contain "of", some many times. So by this scoring measure, the top-scoring play is likely to be the one with the largest number of "of"s

Slide 12. Term Frequency $tf_{t,d}$
- "of" is a frequent word overall. Longer docs will have more occurrences of "of", but not necessarily more of "march" or "ides"
- Hence longer docs are favored because they're more likely to contain frequent query terms
- Probably not a good thing

Slide 13. Global Weight
- Which of these tells you more about a doc?
  - 10 occurrences of "hernia"?
  - 10 occurrences of "the"?
- Would like to attenuate the weights of common terms
- But what does "common" mean?
- 2 options: look at
  - Collection frequency: the total number of occurrences of a term in the entire collection of documents
  - Document frequency: the number of documents in the collection that contain the term

Slide 14. Collection vs. Document Frequency
- Consider...

Word        cf     df
try         10422  8760
insurance   10440  3997

Slide 15. Inverse Document Frequency
- So how can we formalize that? Terms that appear across a large proportion of the collection are less useful. They don't distinguish among the docs.
- So let's use that proportion as the key.
- And let's think of boosting useful terms rather than demoting useless ones.

$$idf_t = \log\left(\frac{N}{df_t}\right)$$

Slide 16. Reuters RCV1 800K docs
- Logarithms are base 10

Slide 17. tf x idf (or tf.idf or tf-idf)
- We still ought to pay attention to the local weight... so

$$w_{t,d} = tf_{t,d} \times \log\left(\frac{N}{df_t}\right)$$

  where $tf_{t,d}$ = frequency of term t in document d, $N$ = total number of documents, and $df_t$ = the number of documents that contain term t
- Increases with the number of occurrences within a doc
- Increases with the rarity of the term across the whole corpus

Slide 18. Summary: TfxIdf
- TFxIDF is usually used to refer to a family of approaches.

Slide 19. Real-valued term vectors
- Still a bag-of-words model
- Each is a vector in $\mathbb{R}^{M}$
- Here log-scaled tf.idf

           Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony                     13.1           11.4          0.0     0.0      0.0      0.0
Brutus                      3.0            8.3          0.0     1.0      0.0      0.0
Caesar                      2.3            2.3          0.0     0.5      0.3      0.3
Calpurnia                   0.0           11.2          0.0     0.0      0.0      0.0
Cleopatra                  17.7            0.0          0.0     0.0      0.0      0.0
mercy                       0.5            0.0          0.7     0.9      0.9      0.3
worser                      1.2            0.0          0.6     0.6      0.6      0.0

Slide 20. Assignment 2
- Download and install Lucene
- How does Lucene handle (using standard methods)
- Case, stemming, stop lists and multiword
