CSCI 5417 Information Retrieval Systems
Jim Martin
Lecture 5
9/6/2011
01/14/19

Outline: Today 9/6; Recap; Beyond Boolean; Scoring; Ranked Retrieval; Back to Term x Document Matrices; Slide 8; Scoring: Beyond Boolean AND; Term Frequency: Local Weight; Potential Problem; Term Frequency tf_{t,d}; Global Weight; Collection vs. Document Frequency; Inverse Document Frequency; Reuters RCV1 800K docs; tf x idf (or tf.idf or tf-idf); Summary: TfxIdf; Real-valued term vectors; Assignment 2; Assignment 2; Sample Doc; Sample Query; Qrels; Evaluation; Assignment; Back to Scoring; Documents as Vectors; Why turn docs into vectors?; Intuition; The Vector Space Model; Cosine Similarity; Cosine similarity; Normalized vectors; So...; But...; Slide 37; Next Time

Today 9/6
- Vector space model
- New homework

Recap
- We've covered a variety of types of indexes
- And a variety of ways to build indexes
- And a variety of ways to process tokens
- And boolean search
- Now what?

Beyond Boolean
- Thus far, our queries have been Boolean
- Docs either match or they don't
- Ok for expert users with precise understanding of their needs and the corpus
- Not good for (the majority of) users with poor Boolean formulation of their needs
- Most users don't want to wade through 1000's of results (or get 0 results)
- Hence the popularity of search engines which provide a ranking.

Scoring
- Without some form of ranking, boolean queries usually result in too many or too few results.
- With ranking, the number of returned results is irrelevant.
- The user can start at the top of a ranked list and stop when their information need is satisfied

Ranked Retrieval
- Given a query, assign a numerical score to each doc in the collection
- Return documents to the user based on the ranking derived from that score
- How?
- A considerable amount of the research in IR over the last 20 years...
- Extremely empirical in nature

Back to Term x Document Matrices

            Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony      1                     1              0            0       0        1
Brutus      1                     1              0            1       0        0
Caesar      1                     1              0            1       1        1
Calpurnia   0                     1              0            0       0        0
Cleopatra   1                     0              0            0       0        0
mercy       1                     0              1            1       1        1
worser      1                     0              1            1       1        0

- Documents and terms can be thought of as vectors of 1's and 0's

Back to Term x Document Matrices

            Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony      157                   73             0            0       0        0
Brutus      4                     157            0            1       0        0
Caesar      232                   227            0            2       1        1
Calpurnia   0                     10             0            0       0        0
Cleopatra   57                    0              0            0       0        0
mercy       2                     0              3            5       5        1
worser      2                     0              1            1       1        0

- Consider instead the number of occurrences of a term t in a document d, denoted tf_{t,d}

Scoring: Beyond Boolean AND
- Given a free-text query q and a document d define
  $Score(q,d) = \sum_{t \in q} tf_{t,d}$
- That is, simply add up the term frequencies of all query terms in the document
- Holding the query static, this assigns a score to each document in a collection; now rank documents by this score.

Term Frequency: Local Weight
- What is the relative importance of
  - 0 vs. 1 occurrence of a term in a doc
  - 1 vs. 2 occurrences
  - 2 vs. 3 occurrences …
- Unclear, but it does seem like more is better, though a lot isn't proportionally better than a few
- One scheme commonly used:
  $wf_{t,d} = 0$ if $tf_{t,d} = 0$, $1 + \log tf_{t,d}$ otherwise

Potential Problem
- Consider query ides of march
  - Julius Caesar has 5 occurrences of ides
  - No other play has ides
  - march occurs in over a dozen
- SO... Julius Caesar should do well since it has counts from both ides and march
- BUT all the plays contain of, some many times.
- So by this scoring measure, the top-scoring play is likely to be the one with the most of's

Term Frequency tf_{t,d}
- Of is a frequent word overall. Longer docs will have more ofs. But not necessarily more march or ides
- Hence longer docs are favored because they're more likely to contain frequent query terms
- Probably not a good thing

Global Weight
- Which of these tells you more about a doc?
  - 10 occurrences of hernia?
  - 10 occurrences of the?
- Would like to attenuate the weights of common terms
- But what does "common" mean?
- 2 options: Look at
  - Collection frequency: the total number of occurrences of a term in the entire collection of documents
  - Document frequency

Collection vs. Document Frequency
- Consider...

Word        cf      df
try         10422   8760
insurance   10440   3997

Inverse Document Frequency
- So how can we formalize that?
- Terms that appear across a large proportion of the collection are less useful. They don't distinguish among the docs.
- So let's use that proportion as the key.
- And let's think of boosting useful terms rather than demoting useless ones.
  $idf_t = \log\left(\frac{N}{df_t}\right)$

Reuters RCV1 800K docs
- Logarithms are base 10

tf x idf (or tf.idf or tf-idf)
- We still ought to pay attention to the local weight... so
  $w_{t,d} = tf_{t,d} \times \log(N / df_t)$
  - $tf_{t,d}$ = frequency of term t in document d
  - $N$ = total number of documents
  - $df_t$ = the number of documents that contain term t
- Increases with the number of occurrences within a doc
- Increases with the rarity of the term across the whole corpus

Summary: TfxIdf
- "TFxIDF" is usually used to refer to a family of approaches.

Real-valued term vectors
- Still Bag of words model
- Each is a vector in ℝ^M
- Here log-scaled tf.idf

            Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony      13.1                  11.4           0.0          0.0     0.0      0.0
Brutus      3.0                   8.3            0.0          1.0     0.0      0.0
Caesar      2.3                   2.3            0.0          0.5     0.3      0.3
Calpurnia   0.0                   11.2           0.0          0.0     0.0      0.0
Cleopatra   17.7                  0.0            0.0          0.0     0.0      0.0
mercy       0.5                   0.0            0.7          0.9     0.9      0.3
worser      1.2                   0.0            0.6          0.6     0.6      0.0

Assignment 2
- Download and install Lucene
- How does Lucene handle (using standard methods)
  - Case, stemming, stop lists and multiword
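
Aside: the scoring ideas from the earlier slides (raw term-frequency sums, the log-scaled local weight wf, and the idf global weight) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the three-document toy collection, and the use of base-10 logs everywhere are made up for the example and are not taken from the lecture or from Lucene. The tf-idf score here uses the log-scaled local weight rather than raw tf, which is one member of the tf-idf family.

```python
import math
from collections import Counter


def log_tf(tf):
    """Local weight: 0 if the term is absent, otherwise 1 + log10(tf)."""
    return 0.0 if tf == 0 else 1.0 + math.log10(tf)


def idf(n_docs, df):
    """Global weight: log10(N / df_t). Base 10 is an assumption here."""
    return math.log10(n_docs / df)


def document_frequencies(docs):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    return df


def raw_tf_score(query, tokens):
    """Score(q, d) = sum of raw term frequencies of the query terms."""
    counts = Counter(tokens)
    return sum(counts[t] for t in query)


def tf_idf_score(query, tokens, df, n_docs):
    """Sum of log_tf * idf over query terms that occur in the collection."""
    counts = Counter(tokens)
    score = 0.0
    for t in query:
        if df.get(t, 0) == 0:
            continue  # term never seen in the collection: contributes nothing
        score += log_tf(counts[t]) * idf(n_docs, df[t])
    return score


# Hypothetical toy collection (not the real plays), built so that "of"
# appears everywhere while "ides" appears in only one document.
docs = {
    "caesar": ["ides", "of", "march", "ides", "of"],
    "tempest": ["of", "of", "of", "of", "march"],
    "hamlet": ["of", "of", "of", "of", "of", "of"],
}
query = ["ides", "of", "march"]
df = document_frequencies(docs)
n_docs = len(docs)

raw_ranking = sorted(docs, key=lambda d: raw_tf_score(query, docs[d]), reverse=True)
tfidf_ranking = sorted(
    docs, key=lambda d: tf_idf_score(query, docs[d], df, n_docs), reverse=True
)
print("raw tf ranking:", raw_ranking)
print("tf-idf ranking:", tfidf_ranking)
```

In this toy collection the raw-tf score puts the document with the most of's first, which is the problem described on the Potential Problem slide. With idf applied, of gets weight log10(3/3) = 0 because it occurs in every document, and the document containing ides moves to the top.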