CORNELL CS 630 - Basic Language Modeling Approach

CS 630 Lecture 8: 02/21/2006
Lecturer: Lillian Lee
Scribes: Peter Babinski, David Lin
________________________________________________________________________

Basic Language Modeling Approach

I. Special Case of LM-based Approach
   a. Recap of Formulas and Terms
   b. Fixing θj?
   c. About that Multinomial Model…
II. A New Interpretation of our LM Approach
   a. A New Scenario
III. Related Questions

I. Special Case of LM-based Approach

a. Recap of Formulas and Terms

Recall our special case of a language modeling-based approach. We decided to use multinomial "topic" models with corpus-dependent Dirichlet priors, and to score a document d by the probability of the query under a model derived from the document, P_d(q). Using a multinomial model gives us the following equation for P_d(q), with respect to term sequences of the same length:

(1)  P_d(q) = K(q) \cdot \prod_j [\theta_j(d)]^{tf_j(q)}

Here K(q) represents the number of ways we can rearrange the terms of the query. Recall the following equation for θ_j(d):

(2)  \theta_j(d) = \frac{tf_j(d) + \mu \cdot tf_j(C)/|C|}{|d| + \mu}

Recall that we incorporated this smoothing term into θ_j(d):

(3)  \mu \cdot \frac{tf_j(C)}{|C|}

We added (3) to the equation for θ_j(d) in order to avoid situations where we would have a zero count for a particular term v(j) in our vector representation of the document. A zero count for any factor of the product in the equation for P_d(q) whose corresponding count in the query is non-zero would drive the whole probability to zero, so we add counts to each term in proportion to its frequency of occurrence in the corpus.

We then drop the document-independent term K(q), because it does not affect our ranking of documents, and obtain the following scoring equation for P_d(q):

(4)  P_d(q) \;\overset{rank}{\rightarrow}\; \prod_j \left[ \frac{tf_j(d) + \mu \cdot tf_j(C)/|C|}{|d| + \mu} \right]^{tf_j(q)}

Notice that many familiar terms appear in (4). tf_j(d) is our term frequency component. |d| is our length normalization adjustment. tf_j(C)/|C| looks like some form of a reciprocal IDF term.

This new equation seems a bit problematic. The smoothing function adds counts to a term in proportion to its frequency of occurrence in the corpus, the exact opposite of the way in which the IDF measure weights terms. A term such as "the" would gain many counts from this smoothing function. We would have preferred to see a term resembling the IDF quantity appear.

b. Fixing θj?

There are a few ways we could try to fix θ_j(d), shown in (2):

• We could remove the smoothing term (set µ = 0), but this would make our estimate treat a document that is missing one query term identically to documents missing several (or even all) query terms.
• We could change the smoothing term to incorporate a uniform prior. The resulting LM estimate would be different, but it does not take advantage of all of the information we have available from the corpus, and we have little justification for this choice.
• We could back off from the generative approach entirely.

However, before we decide to take any action, shouldn't we see whether we can manipulate (4) to give us the IDF term we want? In other words, is the IDF term already in (4)?

Our approach, following Zhai and Lafferty (2001), will be to try to drive tf_j(C) into the denominator. First we define the following:

(5)  norm(d) = norm(d, q) = (|d| + \mu)^{|q|}

Using this quantity, we can rewrite (4):

(6)  P_d(q) \;\overset{rank}{\rightarrow}\; \frac{1}{norm(d)} \prod_j \left[ tf_j(d) + \mu \cdot \frac{tf_j(C)}{|C|} \right]^{tf_j(q)}

Recall our RSJ derivation. We can use some simple transformations to try to separate out the missing terms.
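Before carrying out that separation, here is a minimal sketch of (1)–(4) in Python. It is an illustration only: the toy corpus statistics, the document, the query, and the value of µ are assumptions, not part of the lecture. It computes the Dirichlet-smoothed estimate θ_j(d) of (2) and the full P_d(q) of (1), including the rearrangement count K(q), and shows that with the smoothing turned off (µ = 0) a query containing a term absent from the document gets probability zero, which is exactly what (3) was introduced to prevent.

```python
# Illustrative sketch only: toy statistics, document, query, and mu are assumed.
import math
from collections import Counter

def theta(term, doc_tf, doc_len, corpus_tf, corpus_len, mu):
    """Eq. (2): (tf_j(d) + mu * tf_j(C)/|C|) / (|d| + mu)."""
    return (doc_tf[term] + mu * corpus_tf[term] / corpus_len) / (doc_len + mu)

def p_d_of_q(query, doc, corpus_tf, corpus_len, mu):
    """Eq. (1): P_d(q) = K(q) * prod_j theta_j(d)^{tf_j(q)},
    where K(q) = |q|! / prod_j tf_j(q)! counts the rearrangements of q."""
    doc_tf, q_tf = Counter(doc), Counter(query)
    k_q = math.factorial(len(query))
    for count in q_tf.values():
        k_q //= math.factorial(count)
    prob = float(k_q)
    for term, count in q_tf.items():
        prob *= theta(term, doc_tf, len(doc), corpus_tf, corpus_len, mu) ** count
    return prob

# Toy corpus statistics: tf_j(C) for each term, and |C| = total token count.
corpus_tf = Counter({"the": 10, "cat": 3, "dog": 2, "sat": 1, "on": 2, "mat": 1})
corpus_len = sum(corpus_tf.values())

doc = "the cat sat on the mat".split()   # contains no "dog"
query = "the dog".split()

print(p_d_of_q(query, doc, corpus_tf, corpus_len, mu=2.0))  # smoothed: small but non-zero
print(p_d_of_q(query, doc, corpus_tf, corpus_len, mu=0.0))  # unsmoothed: exactly 0.0
```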
We begin by splitting the product according to the indices j. One product includes only indices j such that tf_j(d) > 0; the other includes only indices j such that tf_j(d) = 0. Thus we obtain the following from (6):

(7)  \frac{1}{norm(d)} \prod_{j:\, tf_j(d) > 0} \left[ tf_j(d) + \mu \cdot \frac{tf_j(C)}{|C|} \right]^{tf_j(q)} \times \prod_{j:\, tf_j(d) = 0} \left[ \mu \cdot \frac{tf_j(C)}{|C|} \right]^{tf_j(q)}

Notice that the rightmost product in (7) is independent of d except through its index set. We can remove this dependence by multiplying by the following term:

(8)  \prod_{j:\, tf_j(d) > 0} \left[ \mu \cdot \frac{tf_j(C)}{|C|} \right]^{tf_j(q)}

Multiplying by (8) completes the product so that it runs over all indices j; the result is independent of d and can therefore be eliminated under ranking:

(9)  \prod_{j:\, tf_j(d) > 0} \left[ \mu \cdot \frac{tf_j(C)}{|C|} \right]^{tf_j(q)} \times \prod_{j:\, tf_j(d) = 0} \left[ \mu \cdot \frac{tf_j(C)}{|C|} \right]^{tf_j(q)} = \prod_j \left[ \mu \cdot \frac{tf_j(C)}{|C|} \right]^{tf_j(q)}

The term in (9) will not affect our ranking, so we can eliminate it. However, we must also multiply by the reciprocal of (8) in order to maintain equality:

(10)  \frac{1}{norm(d)} \prod_{j:\, tf_j(d) > 0} \left[ tf_j(d) + \mu \cdot \frac{tf_j(C)}{|C|} \right]^{tf_j(q)} \times \prod_{j:\, tf_j(d) = 0} \left[ \mu \cdot \frac{tf_j(C)}{|C|} \right]^{tf_j(q)} \times \prod_{j:\, tf_j(d) > 0} \left[ \mu \cdot \frac{tf_j(C)}{|C|} \right]^{tf_j(q)} \times \prod_{j:\, tf_j(d) > 0} \left[ \frac{|C|}{\mu \cdot tf_j(C)} \right]^{tf_j(q)}

After multiplying through and dropping the d-independent product of (9), which does not affect our ranking, we obtain the following:

(11)  P_d(q) \;\overset{rank}{\rightarrow}\; \frac{1}{norm(d)} \prod_{j:\, tf_j(d) > 0} \left[ 1 + \frac{tf_j(d) \cdot |C|}{\mu \cdot tf_j(C)} \right]^{tf_j(q)}

We still have our term frequency, tf_j(d), and our length normalization, |d|. In addition, we now have our IDF term, |C| / tf_j(C).

c. About that Multinomial Model…

Let us first review some of the aspects and motivations of our language model.

• We wanted to model a probabilistic text-generation process for the query based somehow on the document.
• We wanted a method for assigning probabilities to text. While there are a variety of sophisticated models we could have chosen, we decided upon a multinomial one.

Aside: Our multinomial model is not necessarily the same as a unigram model; treating the two as equal creates some "checksum" issues. For example:

Assume parameters θ_1, …, θ_m such that \sum_j \theta_j = 1 and θ_j > 0 for all j. Then

"P_U(q)" = \prod_j \theta_j^{tf_j(q)},

where "P_U(q)" is the probability of the query under some unigram model, assuming independence of occurrences for each term j. For the probability of any single term v(j), we have

"P_U(v(j))" = \theta_j.

Now these probabilities must sum to 1:

(*)  \sum_j "P_U(v(j))" = \sum_j \theta_j = 1

However, this leaves no probability mass for any sequence of more than one term. We expected P_U(v(1) v(2)) = θ_1 \times θ_2 by our definition of "P_U(q)" above, but because of (*) we find that P_U(v(1) v(2)) = 0: by the rules of probability \sum_j P_U(v(j)) + P_U(v(1) v(2)) \le 1, and the sum already equals 1.

How is our multinomial model different?

P_d(q) = K(q) \times P_U(q)
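The derivation above claims that (11) ranks documents exactly as (4)/(6) do. Here is a minimal sketch that checks this numerically; the toy corpus, the query, the value of µ, and the function names are assumptions made for illustration, not part of the lecture notes. The two log-scores should differ by the same document-independent constant, \sum_j tf_j(q) \log(\mu \cdot tf_j(C)/|C|), for every document, so the induced rankings are identical.

```python
# Illustrative sketch: compare the smoothed score (4)/(6) with the rewritten score (11).
import math
from collections import Counter

corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs and cats are pets".split(),
]
corpus_tf = Counter(t for doc in corpus for t in doc)   # tf_j(C)
corpus_len = sum(corpus_tf.values())                    # |C|
mu = 2.0                                                # Dirichlet parameter (assumed)

def score_eq4(query, doc):
    """Log of (4): sum_j tf_j(q) * log((tf_j(d) + mu*tf_j(C)/|C|) / (|d| + mu))."""
    d_tf, q_tf = Counter(doc), Counter(query)
    return sum(
        q_tf[t] * math.log((d_tf[t] + mu * corpus_tf[t] / corpus_len) / (len(doc) + mu))
        for t in q_tf
    )

def score_eq11(query, doc):
    """Log of (11): -|q|*log(|d| + mu) plus, over j with tf_j(d) > 0,
    tf_j(q) * log(1 + tf_j(d)*|C| / (mu*tf_j(C)))."""
    d_tf, q_tf = Counter(doc), Counter(query)
    score = -len(query) * math.log(len(doc) + mu)   # the 1/norm(d) factor
    for t in q_tf:
        if d_tf[t] > 0:
            score += q_tf[t] * math.log(1 + d_tf[t] * corpus_len / (mu * corpus_tf[t]))
    return score

query = "the cat".split()
for doc in corpus:
    a, b = score_eq4(query, doc), score_eq11(query, doc)
    # a - b should be the same document-independent constant for every doc,
    # so both formulas induce the same ranking.
    print(f"{' '.join(doc):30s} eq4={a:8.4f} eq11={b:8.4f} diff={a - b:8.4f}")
```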

