Conditional Random Fields

William W. Cohen, CALD

Outline
•Announcements
•Review: motivation for CMMs
•Implications of the model
•Label bias problem; how important is label bias?
•Another view of label bias [Sha & Pereira]
•Review of maxent / MEMM / CMMs
•From CMMs to CRFs
•Hammersley-Clifford
•Definition and example of CRFs
•Graphical comparison among HMMs, MEMMs and CRFs
•Lafferty et al notation; the conditional distribution
•CRF learning – from Sha & Pereira
•Forward-backward ideas
•Sha & Pereira results
•POS tagging experiments in Lafferty et al; POS tagging vs MXPost

Announcements
•Upcoming assignments:
–Today: Sha & Pereira, Lafferty et al
–Mon 2/23: Klein & Manning, Toutanova et al
–Wed 2/25: no writeup due
–Mon 3/1: no writeup due
–Wed 3/3: project proposal due: personnel + 1-2 pages
–Spring break week: no class

Review: motivation for CMMs
Ideally we would like to use many, arbitrary, overlapping features of words.
[Figure: HMM graphical model with states S_{t-1}, S_t, S_{t+1} and observations O_{t-1}, O_t, O_{t+1}.]
Example features of the observation: identity of word; ends in "-ski"; is capitalized; is part of a noun phrase; is in a list of city names; is under node X in WordNet; is in bold font; is indented; is in hyperlink anchor; ... For instance, the token "Wisniewski" is part of a noun phrase and ends in "-ski".

Motivation for CMMs
[Figure: the same graphical model and feature list as above.]
Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state:
  \Pr(s_t \mid x_t, s_{t-1}, \ldots)

Implications of the model
•Does this do what we want?
•Q: does Y[i-1] depend on X[i+1]?
–"A node is conditionally independent of its non-descendants given its parents."

Label Bias Problem
•Consider this MEMM:
[Figure: MEMM transition diagram.]
•P(1 and 2 | ro) = P(2 | 1 and ro) P(1 | ro) = P(2 | 1 and o) P(1 | r)
 P(1 and 2 | ri) = P(2 | 1 and ri) P(1 | ri) = P(2 | 1 and i) P(1 | r)
•In the training data, label value 2 is the only label value observed after label value 1. Therefore P(2 | 1) = 1, so P(2 | 1 and x) = 1 for all x.
•Since P(2 | 1 and x) = 1 for all x, P(1 and 2 | ro) = P(1 and 2 | ri).
•However, we expect P(1 and 2 | ri) to be greater than P(1 and 2 | ro).
•Per-state normalization does not allow the required expectation.

Label Bias Problem
•Consider this MEMM, and enough training data to perfectly model it:
[Figure: MEMM with states 0-5 and two paths, 0→1→2→3 and 0→4→5→3, from start to end.]
  \Pr(0123 \mid rob) = \frac{\Pr(1 \mid 0, r)}{Z_1} \cdot \frac{\Pr(2 \mid 1, o)}{Z_2} \cdot \frac{\Pr(3 \mid 2, b)}{Z_3} = 0.5 \cdot 1 \cdot 1
  \Pr(0453 \mid rib) = \frac{\Pr(4 \mid 0, r)}{Z'_1} \cdot \frac{\Pr(5 \mid 4, i)}{Z'_2} \cdot \frac{\Pr(3 \mid 5, b)}{Z'_3} = 0.5 \cdot 1 \cdot 1
Likewise \Pr(0123 \mid rib) = \Pr(0453 \mid rob) = 0.5: the observations after the first symbol cannot make the model prefer one path over the other.

How important is label bias?
•Could be avoided in this case by changing the structure.
•Our models are always wrong – is this "wrongness" a problem?
•See Klein & Manning's paper for next week.

Another view of label bias [Sha & Pereira]
So what's the alternative?

Review of maxent
  \Pr(x) \propto \exp\Big(\sum_i \lambda_i f_i(x)\Big)
  \Pr(x, y) \propto \exp\Big(\sum_i \lambda_i f_i(x, y)\Big)
  \Pr(y \mid x) = \frac{\exp\big(\sum_i \lambda_i f_i(x, y)\big)}{\sum_{y'} \exp\big(\sum_i \lambda_i f_i(x, y')\big)}

Review of maxent/MEMM/CMMs
  \Pr(y \mid x) = \frac{\exp\big(\sum_i \lambda_i f_i(x, y)\big)}{\sum_{y'} \exp\big(\sum_i \lambda_i f_i(x, y')\big)} = \frac{\exp\big(\sum_i \lambda_i f_i(x, y)\big)}{Z(x)}
For an MEMM:
  \Pr(y_1 \ldots y_n \mid x_1 \ldots x_n) = \prod_j \Pr(y_j \mid x_j, y_{j-1}) = \prod_j \frac{\exp\big(\sum_i \lambda_i f_i(x_j, y_j, y_{j-1})\big)}{Z(x_j)}

Details on CMMs
  \Pr(y_1 \ldots y_n \mid x_1 \ldots x_n) = \prod_j \frac{\exp\big(\sum_i \lambda_i f_i(x_j, y_j, y_{j-1})\big)}{Z(x_j)} = \frac{\exp\big(\sum_i \lambda_i F_i(x, y)\big)}{\prod_j Z(x_j)},
  \text{where } F_i(x, y) = \sum_j f_i(x_j, y_j, y_{j-1})

From CMMs to CRFs
Recall why we're unhappy: we don't want local normalization.
  \text{CMM:}\quad \prod_j \frac{\exp\big(\sum_i \lambda_i f_i(x_j, y_j, y_{j-1})\big)}{Z(x_j)}
  \text{New model:}\quad \frac{\exp\big(\sum_i \lambda_i F_i(x, y)\big)}{Z(x)}, \quad \text{where } F_i(x, y) = \sum_j f_i(x_j, y_j, y_{j-1})
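The jump from local to global normalization is easy to see numerically. The sketch below is not from the slides: LABELS, the toy sequence x, the WEIGHTS table (standing in for \sum_i \lambda_i f_i(x_j, y_j, y_{j-1})), and the helper names are all made up for illustration. It scores one label sequence both ways: the CMM normalizes exp(score) at every position, while the CRF divides one globally summed score by a single Z(x).

```python
import itertools
import math

LABELS = ["A", "B"]
x = ["w1", "w2", "w3"]   # toy observation sequence

# Hypothetical feature weights: local_score(x_j, y_j, y_prev) plays the role
# of sum_i lambda_i f_i(x_j, y_j, y_{j-1}); the numbers are made up.
WEIGHTS = {
    ("w1", "A", None): 1.0, ("w1", "B", None): 0.5,
    ("w2", "A", "A"): 0.2,  ("w2", "B", "A"): 1.5,
    ("w2", "A", "B"): 0.7,  ("w2", "B", "B"): 0.1,
    ("w3", "A", "A"): 0.3,  ("w3", "B", "A"): 0.9,
    ("w3", "A", "B"): 1.1,  ("w3", "B", "B"): 0.4,
}

def local_score(x_j, y_j, y_prev):
    return WEIGHTS.get((x_j, y_j, y_prev), 0.0)

def cmm_prob(y_seq):
    """CMM/MEMM: normalize exp(score) separately at every position."""
    prob, y_prev = 1.0, None
    for x_j, y_j in zip(x, y_seq):
        z_j = sum(math.exp(local_score(x_j, y, y_prev)) for y in LABELS)
        prob *= math.exp(local_score(x_j, y_j, y_prev)) / z_j
        y_prev = y_j
    return prob

def crf_prob(y_seq):
    """CRF: a single global normalizer Z(x) over all label sequences."""
    def total_score(ys):
        return sum(local_score(x_j, y_j, y_prev)
                   for x_j, y_j, y_prev in zip(x, ys, (None,) + ys))
    z_x = sum(math.exp(total_score(ys))
              for ys in itertools.product(LABELS, repeat=len(x)))
    return math.exp(total_score(tuple(y_seq))) / z_x

print("CMM:", cmm_prob(("A", "B", "A")), "CRF:", crf_prob(("A", "B", "A")))
```

In general the two probabilities differ, which is the point of the label bias discussion: per-position normalization and global normalization define different distributions over label sequences.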
What's the new model look like?
  \frac{\exp\big(\sum_i \lambda_i F_i(x, y)\big)}{Z(x)} = \frac{\exp\big(\sum_i \lambda_i \sum_j f_i(x_j, y_j, y_{j-1})\big)}{Z(x)}
[Figure: label chain y1, y2, y3, each y_j connected to its own observation x1, x2, x3.]
What's independent?

What's the new model look like?
  \frac{\exp\big(\sum_i \lambda_i F_i(x, y)\big)}{Z(x)} = \frac{\exp\big(\sum_i \lambda_i \sum_j f_i(x, y_j, y_{j-1})\big)}{Z(x)}
[Figure: label chain y1, y2, y3, all connected to a single observation node x.]
What's independent now?

Hammersley-Clifford
•For positive distributions P(x_1, ..., x_n):
–\Pr(x_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) = \Pr(x_i \mid \text{Neighbors}(x_i))
–\Pr(A \mid B, S) = \Pr(A \mid S), where A, B are sets of nodes and S is a set that separates A and B
–P can be written as a normalized product of "clique potentials":
  \Pr(x) = \frac{1}{Z} \prod_{C \in \text{cliques}} \Phi_C(x)
So this is very general: any Markov distribution can be written in this form (modulo nits like "positive distribution").

Definition of CRFs
X is a random variable over data sequences to be labeled.
Y is a random variable over corresponding label sequences.

Example of CRFs
[Figure.]

Graphical comparison among HMMs, MEMMs and CRFs
[Figure: graphical models for an HMM, an MEMM, and a CRF, side by side.]

Lafferty et al notation
•x is a data sequence
•y is a label sequence
•v is a vertex from the vertex set V = set of label random variables
•e is an edge from the edge set E over V
•f_k and g_k are given and fixed; g_k is a Boolean vertex feature and f_k is a Boolean edge feature
•k is the number of features
•\theta = (\lambda_1, \lambda_2, \ldots, \lambda_n; \mu_1, \mu_2, \ldots, \mu_n); \lambda_k and \mu_k are parameters to be estimated
•y|_e is the set of components of y defined by edge e
•y|_v is the set of components of y defined by vertex v
If the graph G = (V, E) of Y is a tree, the conditional distribution over the label sequence Y = y, given X = x, by the fundamental theorem of random fields is:
  p_\theta(y \mid x) \propto \exp\Big(\sum_{e \in E,\, k} \lambda_k f_k(e, y|_e, x) + \sum_{v \in V,\, k} \mu_k g_k(v, y|_v, x)\Big)

Conditional Distribution (cont'd)
•CRFs use the observation-dependent normalization Z(x) for the conditional distributions:
  p_\theta(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_{e \in E,\, k} \lambda_k f_k(e, y|_e, x) + \sum_{v \in V,\, k} \mu_k g_k(v, y|_v, x)\Big)
Z(x) is a normalization over the data sequence x.
•Learning:
–Lafferty et al's IIS-based method is rather inefficient.
–Gradient-based methods are faster.
–The trickiest bit is computing the normalization, which is a sum over exponentially many y vectors.

CRF learning – from Sha & Pereira
Something like forward-backward. Idea:
•Define a matrix of y, y' "affinities" at stage i.
•M_i[y, y'] = "unnormalized probability" of the transition from y to y' at stage i.
•M_i · M_{i+1} = "unnormalized probability" of any path through stages i and i+1.
[Figure: label sequence y1, y2, y3 over observation x.]

Forward-backward
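Computing Z(x) by brute force, as in the earlier sketch, sums over exponentially many label sequences; the M_i matrices defined above make this tractable. Below is a minimal sketch of that forward pass (my own illustration, not Sha & Pereira's code), reusing LABELS, x, and local_score from the earlier snippet: one transition matrix per position, chained together with matrix products.

```python
import numpy as np

def transition_matrix(i, prev_labels, labels):
    """M_i[y, y'] = exp(local_score(x[i], y', y)): unnormalized affinity of
    moving from label y to label y' while emitting x[i]."""
    m = np.zeros((len(prev_labels), len(labels)))
    for a, y_prev in enumerate(prev_labels):
        for b, y in enumerate(labels):
            m[a, b] = np.exp(local_score(x[i], y, y_prev))
    return m

def partition_function():
    """Z(x) as a chain of matrix products: a forward pass over the M_i."""
    # Position 0 conditions on the dummy "start" label None.
    alpha = transition_matrix(0, [None], LABELS)        # shape (1, |Y|)
    for i in range(1, len(x)):
        alpha = alpha @ transition_matrix(i, LABELS, LABELS)
    return float(alpha.sum())

# Should match the brute-force Z(x) hidden inside crf_prob above.
print("Z(x) via matrix products:", partition_function())
```

With two labels and three positions this saves nothing, but for m labels and length n it replaces the O(m^n) enumeration with O(n m^2) work, which is what makes CRF training and inference practical.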

