Conditional Random Fields
William W. Cohen
CALD

Announcements
• Upcoming assignments:
– Today: Sha & Pereira, Lafferty et al
– Mon 2/23: Klein & Manning, Toutanova et al
– Wed 2/25: no writeup due
– Mon 3/1: no writeup due
– Wed 3/3: project proposal due: personnel + 1-2 pages
– Spring break week, no class

Review: motivation for CMMs
Ideally we would like to use many, arbitrary, overlapping features of words.
[Figure: an HMM with states S_{t-1}, S_t, S_{t+1} emitting observations O_{t-1}, O_t, O_{t+1}, annotated with candidate word features: identity of word; ends in "-ski"; is capitalized; is part of a noun phrase; is in a list of city names; is under node X in WordNet; is in bold font; is indented; is in hyperlink anchor; ... For example, "Wisniewski" is part of a noun phrase and ends in "-ski".]

Motivation for CMMs
Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state:
$$\Pr(s_t \mid x_t, s_{t-1}, \ldots)$$
[Same figure and feature list as the previous slide.]

Implications of the model
• Does this do what we want?
• Q: does Y[i-1] depend on X[i+1]?
– "a node is conditionally independent of its non-descendants given its parents"

Label Bias Problem
• Consider this MEMM:
[Figure: a six-state MEMM in which the path 0→1→2→3 reads "rob" and the path 0→4→5→3 reads "rib".]
• P(1 and 2 | ro) = P(2 | 1 and ro) P(1 | ro) = P(2 | 1 and o) P(1 | r)
  P(1 and 2 | ri) = P(2 | 1 and ri) P(1 | ri) = P(2 | 1 and i) P(1 | r)
• In the training data, label value 2 is the only label value observed after label value 1. Therefore P(2 | 1) = 1, so P(2 | 1 and x) = 1 for all x.
• Since P(2 | 1 and x) = 1 for all x, P(1 and 2 | ro) = P(1 and 2 | ri).
• However, we expect P(1 and 2 | ri) to be greater than P(1 and 2 | ro).
• Per-state normalization does not allow the required expectation.

Label Bias Problem
• Consider this MEMM, and enough training data to perfectly model it:
Pr(0123 | rob) = Pr(1|0,r)/Z_1 · Pr(2|1,o)/Z_2 · Pr(3|2,b)/Z_3 = 0.5 · 1 · 1
Pr(0453 | rib) = Pr(4|0,r)/Z_1' · Pr(5|4,i)/Z_2' · Pr(3|5,b)/Z_3' = 0.5 · 1 · 1
Likewise Pr(0123 | rib) = 0.5 and Pr(0453 | rob) = 0.5: once past the first state, the observations are ignored, so the model cannot prefer the "rib" path on input "rib".
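To see this arithmetic run, here is a small Python sketch of the MEMM above (a sketch only: the score table and state numbering are invented to match the figure, and "training" is assumed to have already left states 1 and 4 with a single successor each):

# Toy model of the six-state MEMM above. States 1 and 4 each have one
# outgoing arc, so per-state normalization forces that transition to
# probability 1 whatever the observation is.
scores = {
    (0, "r"): {1: 1.0, 4: 1.0},              # 'r' is ambiguous: rob- vs rib-path
    (1, "o"): {2: 1.0}, (1, "i"): {2: 1.0},  # only successor of 1 is 2
    (4, "i"): {5: 1.0}, (4, "o"): {5: 1.0},  # only successor of 4 is 5
    (2, "b"): {3: 1.0}, (5, "b"): {3: 1.0},
}

def memm_path_prob(path, obs):
    """Product of locally (per-state) normalized transition probabilities."""
    p = 1.0
    for s, s_next, o in zip(path, path[1:], obs):
        succ = scores[(s, o)]
        p *= succ[s_next] / sum(succ.values())   # local normalization
    return p

for word in ("rob", "rib"):
    print(word,
          memm_path_prob([0, 1, 2, 3], word),    # the "rob" state path
          memm_path_prob([0, 4, 5, 3], word))    # the "rib" state path
# Both words score 0.5 on both paths: the label bias effect described above.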
How important is label bias?
• Could be avoided in this case by changing the structure:
[Figure: an alternative state structure for the same example.]
• Our models are always wrong – is this "wrongness" a problem?
• See Klein & Manning's paper for next week...

Another view of label bias [Sha & Pereira]
So what's the alternative?
[Figure from Sha & Pereira.]

Review of maxent
$$\Pr(x) \propto \exp\Big(\sum_i \lambda_i f_i(x)\Big)$$
$$\Pr(x, y) \propto \exp\Big(\sum_i \lambda_i f_i(x, y)\Big)$$
$$\Pr(y \mid x) = \frac{\exp\big(\sum_i \lambda_i f_i(x, y)\big)}{\sum_{y'} \exp\big(\sum_i \lambda_i f_i(x, y')\big)}$$

Review of maxent/MEMM/CMMs
$$\Pr(y \mid x) = \frac{\exp\big(\sum_i \lambda_i f_i(x, y)\big)}{\sum_{y'} \exp\big(\sum_i \lambda_i f_i(x, y')\big)} = \frac{\exp\big(\sum_i \lambda_i f_i(x, y)\big)}{Z(x)}$$
For MEMMs:
$$\Pr(y_1 \ldots y_n \mid x_1 \ldots x_n) = \prod_j \Pr(y_j \mid y_{j-1}, x) = \prod_j \frac{\exp\big(\sum_i \lambda_i f_i(x, y_j, y_{j-1})\big)}{Z_j(x)}$$

Details on CMMs
$$\Pr(y_1 \ldots y_n \mid x_1 \ldots x_n) = \prod_j \frac{\exp\big(\sum_i \lambda_i f_i(x, y_j, y_{j-1})\big)}{Z_j(x)} = \frac{\exp\big(\sum_i \lambda_i F_i(x, y)\big)}{\prod_j Z_j(x)},$$
$$\text{where } F_i(x, y) = \sum_j f_i(x, y_j, y_{j-1})$$

From CMMs to CRFs
$$\prod_j \frac{\exp\big(\sum_i \lambda_i f_i(x, y_j, y_{j-1})\big)}{Z_j(x)} \;\longrightarrow\; \frac{\exp\big(\sum_i \lambda_i F_i(x, y)\big)}{Z(x)}, \quad \text{where } F_i(x, y) = \sum_j f_i(x, y_j, y_{j-1})$$
Recall why we're unhappy: we don't want local normalization.
New model:
$$\Pr(y \mid x) = \frac{\exp\big(\sum_i \lambda_i F_i(x, y)\big)}{Z(x)}$$

What's the new model look like?
$$\frac{\exp\big(\sum_i \lambda_i F_i(x, y)\big)}{Z(x)} = \frac{\exp\big(\sum_j \sum_i \lambda_i f_i(x, y_j, y_{j-1})\big)}{Z(x)}$$
[Figure: observations x_1, x_2, x_3, each linked to its own label, with edges y_1 – y_2 – y_3.]
What's independent?

What's the new model look like?
$$\frac{\exp\big(\sum_i \lambda_i F_i(x, y)\big)}{Z(x)} = \frac{\exp\big(\sum_j \sum_i \lambda_i f_i(x, y_j, y_{j-1})\big)}{Z(x)}$$
[Figure: a single observation node x linked to all of y_1, y_2, y_3.]
What's independent now?
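To make the global normalization concrete, here is a minimal brute-force Python sketch of this model (the label set, toy features, and weights are all invented for illustration; Z(x) is computed by explicit enumeration over every label sequence, which is exponential and only viable at toy sizes; the forward-backward trick later in the deck is the real fix):

import itertools
import math

LABELS = ["B", "I", "O"]   # hypothetical label set

def features(x, y_j, y_prev, j):
    """Local features f_i(x, y_j, y_{j-1}) at position j, as a sparse dict."""
    return {
        ("word", x[j], y_j): 1.0,       # observation/label feature
        ("trans", y_prev, y_j): 1.0,    # label/label feature
    }

def score(x, y, weights):
    """sum_i lambda_i F_i(x, y): weighted local features summed over positions."""
    total, y_prev = 0.0, "START"
    for j, y_j in enumerate(y):
        for f, v in features(x, y_j, y_prev, j).items():
            total += weights.get(f, 0.0) * v
        y_prev = y_j
    return total

def crf_prob(x, y, weights):
    """Globally normalized: one Z(x) over all |LABELS|^n label sequences."""
    Z = sum(math.exp(score(x, list(y2), weights))
            for y2 in itertools.product(LABELS, repeat=len(x)))
    return math.exp(score(x, y, weights)) / Z

x = ["Mr", "Wisniewski", "spoke"]
w = {("word", "Wisniewski", "B"): 2.0, ("trans", "B", "I"): 1.0}
print(crf_prob(x, ["O", "B", "O"], w))

Note that, unlike the MEMM sketch earlier, evidence anywhere in x can raise or lower the score of the whole sequence y, which is exactly what the label bias discussion asked for.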
Hammersley-Clifford
• For positive distributions P(x_1, ..., x_n):
– Pr(x_i | x_1, ..., x_{i-1}, x_{i+1}, ..., x_n) = Pr(x_i | Neighbors(x_i))
– Pr(A | B, S) = Pr(A | S), where A, B are sets of nodes and S is a set that separates A and B
– P can be written as a normalized product of "clique potentials":
$$\Pr(x) = \frac{1}{Z} \prod_{C \,\in\, \mathrm{cliques}} \Phi_C(x)$$
So this is very general: any Markov distribution can be written in this form (modulo nits like "positive distribution").

Definition of CRFs
X is a random variable over data sequences to be labeled.
Y is a random variable over corresponding label sequences.
[Definition slide from Lafferty et al.]

Example of CRFs
[Figure.]

Graphical comparison among HMMs, MEMMs and CRFs
[Figure: HMM, MEMM, and CRF graphical structures side by side.]

Lafferty et al notation
• x is a data sequence
• y is a label sequence
• v is a vertex from vertex set V = set of label random variables
• e is an edge from edge set E over V
• f_k and g_k are given and fixed; g_k is a Boolean vertex feature, f_k is a Boolean edge feature
• k is the number of features
• θ = (λ_1, λ_2, ...; μ_1, μ_2, ...); the λ_k and μ_k are parameters to be estimated
• y|_e is the set of components of y defined by edge e
• y|_v is the set of components of y defined by vertex v
If the graph G = (V, E) of Y is a tree, the conditional distribution over the label sequence Y = y, given X = x, by the fundamental theorem of random fields, is:
$$p_\theta(y \mid x) \propto \exp\Big(\sum_{e \in E,\, k} \lambda_k f_k(e, y|_e, x) + \sum_{v \in V,\, k} \mu_k g_k(v, y|_v, x)\Big)$$

Conditional Distribution (cont'd)
• CRFs use the observation-dependent normalization Z(x) for the conditional distributions:
$$p_\theta(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_{e \in E,\, k} \lambda_k f_k(e, y|_e, x) + \sum_{v \in V,\, k} \mu_k g_k(v, y|_v, x)\Big)$$
Z(x) is a normalization over the data sequence x.
• Learning:
– Lafferty et al's IIS-based method is rather inefficient.
– Gradient-based methods are faster.
– The trickiest bit is computing the normalization, which is over exponentially many y vectors.
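For concreteness, the gradient those methods climb is the observed-minus-expected feature count,
$$\frac{\partial}{\partial \lambda_i} \log \Pr(y \mid x) = F_i(x, y) - \mathbb{E}_{y' \sim \Pr(\cdot \mid x)}\big[F_i(x, y')\big],$$
and here is a brute-force Python sketch of it, reusing LABELS, features, and crf_prob from the earlier illustrative sketch (Sha & Pereira get the expectation from forward-backward rather than enumeration):

import itertools

def log_likelihood_grad(x, y, weights):
    """d/d(lambda_i) of log Pr(y|x): observed minus expected feature counts."""
    grad = {}
    # observed counts F_i(x, y)
    y_prev = "START"
    for j, y_j in enumerate(y):
        for f, v in features(x, y_j, y_prev, j).items():
            grad[f] = grad.get(f, 0.0) + v
        y_prev = y_j
    # minus expected counts under the current model
    for y2 in itertools.product(LABELS, repeat=len(x)):
        p = crf_prob(x, list(y2), weights)
        y_prev = "START"
        for j, y_j in enumerate(y2):
            for f, v in features(x, y_j, y_prev, j).items():
                grad[f] = grad.get(f, 0.0) - p * v
            y_prev = y_j
    return grad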
CRF learning – from Sha & Pereira
[Three slides of derivations from Sha & Pereira; equations not captured in this transcript.]

Something like forward-backward
Idea:
• Define a matrix of y, y' "affinities" at stage i
• M_i[y, y'] = "unnormalized probability" of a transition from y to y' at stage i
• M_i · M_{i+1} = "unnormalized probability" of any path through stages i and i+1
[Figure: trellis over x with label columns y_1, y_2, y_3.]

Forward backward
[Slide content not captured in this transcript.]
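As a sanity check on the matrix-product claim, here is a small numpy sketch (hypothetical throughout: the affinity values are random stand-ins for the exponentiated feature scores exp(sum_i lambda_i f_i(x, y', y)) that M_i would really hold):

import itertools
import numpy as np

K, n = 3, 4                       # toy label-set size and sequence length
rng = np.random.default_rng(0)

start = rng.uniform(0.5, 2.0, size=K)   # affinity of each label at stage 1
M = [rng.uniform(0.5, 2.0, size=(K, K)) for _ in range(n - 1)]
# M[i][y, y2] = unnormalized "probability" of a y -> y2 transition

# Matrix-product view: Z(x) = start . M_1 . ... . M_{n-1} . 1
alpha = start.copy()
for Mi in M:
    alpha = alpha @ Mi            # forward recursion: alpha_i = alpha_{i-1} M_i
Z_matrix = alpha.sum()

# Brute force over all K^n label paths, to verify the identity
Z_brute = 0.0
for path in itertools.product(range(K), repeat=n):
    p = start[path[0]]
    for i in range(n - 1):
        p *= M[i][path[i], path[i + 1]]
    Z_brute += p

print(Z_matrix, Z_brute)          # agree up to floating-point error

The forward recursion costs O(n K^2) instead of O(K^n), which is what makes training with the exact Z(x) feasible.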
Sha & Pereira results
[Results tables not captured in this transcript.]

POS tagging Experiments in Lafferty et al
[Results table not captured in this transcript.]

POS tagging vs MXPost
[Results table not captured in this transcript.]