School of Computer Science
Probabilistic Graphical Models (10-708)
Lecture #21: Large Margin Structured Models
11/28/2007
Eric Xing

[Title-slide figure: a signaling pathway (Receptor A, Receptor B, Kinases C, D, E, TF F, Genes G, H) and its graphical-model representation over variables $X_1, \dots, X_8$.]

Slide 2: OCR example
- Sequential structure: the input $x$ is a sequence of character images; the output $y$ is the corresponding letter sequence (one label in a-z per position), e.g. "brace".

Slide 3: Image segmentation example
- Spatial structure: the input $x$ is an image; the output $y$ assigns a label $y_i$ to each region, and neighboring labels $y_i, y_j$ interact.

Slide 4: Structured models
- A scoring function ranks outputs over a space of feasible outputs.
- Assumptions:
  - the score is a linear combination of features;
  - the score is a sum of part scores, where the index $p$ represents a part in the structure.

Slide 5: Learning w
- Training examples: inputs paired with labeled structured outputs, e.g. an image sequence labeled "b r a c e".
- Probabilistic approach:
  - computing $Z_w(x)$ can be NP-complete;
  - tractable models, but intractable estimation.
- Large margin approach:
  - exact and efficient when prediction is tractable.

Slide 6: Learning strategy
- Recall that in a CRF
  $p(y \mid x) = \frac{1}{Z(\theta, x)} \exp\Big\{ \sum_c \theta_c f_c(x, y_c) \Big\}$
  we predict based on
  $y^*|_{x,\theta} = \arg\max_y p(y \mid x, \theta)$
  and we learn based on
  $\theta^* = \arg\max_\theta \prod_n p(y_n \mid x_n, \theta)$.
- Max-margin: we predict based on
  $y^*|_x = \arg\max_y \sum_c \theta_c f_c(x, y_c) = \arg\max_y w^T F(x, y)$
  and we learn based on
  $w^*|_{\{x_n, y_n\}} = \arg\max_w \Big( w^T F(x_n, y_n) - \max_{y'_n \ne y_n} w^T F(x_n, y'_n) \Big), \ \forall n$.

Slide 7: MLE of feature-based UGMs
- Scaled likelihood function:
  $\tilde\ell(\theta; D) = \ell(\theta; D)/N = \sum_x \tilde p(x) \log p(x \mid \theta) = \sum_x \tilde p(x) \sum_i \theta_i f_i(x) - \log Z(\theta)$.
- Instead of optimizing this objective directly, we attack a lower bound on it.
- The logarithm has a linear upper bound:
  $\log Z(\theta) \le \mu Z(\theta) - \log \mu - 1$.
- This bound holds for all $\mu$, in particular for $\mu = Z(\theta^{(t)})^{-1}$.
- Thus we have
  $\tilde\ell(\theta; D) \ge \sum_x \tilde p(x) \sum_i \theta_i f_i(x) - \frac{Z(\theta)}{Z(\theta^{(t)})} - \log Z(\theta^{(t)}) + 1$.

Slide 8: Generalized Iterative Scaling (GIS)
- Start from the lower bound of the scaled log-likelihood above.
- Define $\Delta\theta_i^{(t)} \overset{\text{def}}{=} \theta_i - \theta_i^{(t)}$, so that
  $\frac{Z(\theta)}{Z(\theta^{(t)})} = \sum_x \frac{\exp\{\sum_i \theta_i^{(t)} f_i(x)\} \exp\{\sum_i \Delta\theta_i^{(t)} f_i(x)\}}{Z(\theta^{(t)})} = \sum_x p(x \mid \theta^{(t)}) \exp\Big\{\sum_i \Delta\theta_i^{(t)} f_i(x)\Big\}$
  and hence
  $\tilde\ell(\theta; D) \ge \sum_x \tilde p(x) \sum_i \theta_i f_i(x) - \sum_x p(x \mid \theta^{(t)}) \exp\Big\{\sum_i \Delta\theta_i^{(t)} f_i(x)\Big\} - \log Z(\theta^{(t)}) + 1$.
- Relax again: assume $f_i(x) \ge 0$ and $\sum_i f_i(x) = 1$.
- Convexity of the exponential gives $\exp\big(\sum_i \pi_i x_i\big) \le \sum_i \pi_i \exp(x_i)$, and therefore
  $\tilde\ell(\theta; D) \ge \sum_x \tilde p(x) \sum_i f_i(x)\, \theta_i - \sum_x p(x \mid \theta^{(t)}) \sum_i f_i(x) \exp\big(\Delta\theta_i^{(t)}\big) - \log Z(\theta^{(t)}) + 1 \overset{\text{def}}{=} \Lambda(\theta)$.
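As a sanity check on the convexity step in the GIS derivation, the short sketch below (my own illustration, not from the lecture) numerically verifies $\exp(\sum_i \pi_i x_i) \le \sum_i \pi_i \exp(x_i)$ for a convex combination $\pi$; in GIS, $\pi_i$ plays the role of $f_i(x)$ and $x_i$ the role of $\Delta\theta_i^{(t)}$:

```python
import math

def convexity_bound(pi, x):
    """Compare exp(sum_i pi_i * x_i) with sum_i pi_i * exp(x_i).

    For nonnegative weights pi summing to 1 (a convex combination),
    Jensen's inequality for the convex function exp gives
        exp(sum_i pi_i x_i) <= sum_i pi_i exp(x_i),
    the step GIS uses with pi_i = f_i(x) and x_i = delta_theta_i.
    """
    lhs = math.exp(sum(p * v for p, v in zip(pi, x)))
    rhs = sum(p * math.exp(v) for p, v in zip(pi, x))
    return lhs, rhs

pi = [0.2, 0.5, 0.3]   # plays the role of f_i(x): nonnegative, sums to 1
x = [1.0, -0.5, 2.0]   # plays the role of the parameter increments
lhs, rhs = convexity_bound(pi, x)
assert lhs <= rhs      # the bound holds for any convex combination
```

The bound is tight when all $x_i$ are equal (or when $\pi$ puts all mass on one coordinate), which is why GIS's requirement $\sum_i f_i(x) = 1$ matters: it makes the per-coordinate decoupling an honest lower bound rather than an approximation.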
Slide 9: GIS
- Lower bound of the scaled log-likelihood:
  $\tilde\ell(\theta; D) \ge \sum_x \tilde p(x) \sum_i f_i(x)\, \theta_i - \sum_x p(x \mid \theta^{(t)}) \sum_i f_i(x) \exp\big(\Delta\theta_i^{(t)}\big) - \log Z(\theta^{(t)}) + 1 \overset{\text{def}}{=} \Lambda(\theta)$.
- Take the derivative:
  $\frac{\partial \Lambda}{\partial \theta_i} = \sum_x \tilde p(x) f_i(x) - \sum_x p(x \mid \theta^{(t)}) \exp\big(\Delta\theta_i^{(t)}\big) f_i(x)$.
- Set it to zero:
  $e^{\Delta\theta_i^{(t)}} = \frac{\sum_x \tilde p(x) f_i(x)}{\sum_x p(x \mid \theta^{(t)}) f_i(x)} = Z(\theta^{(t)}) \frac{\sum_x \tilde p(x) f_i(x)}{\sum_x p^{(t)}(x) f_i(x)}$,
  where $p^{(t)}(x)$ is the unnormalized version of $p(x \mid \theta^{(t)})$.
- Update:
  $\theta_i^{(t+1)} = \theta_i^{(t)} + \Delta\theta_i^{(t)} \;\Rightarrow\; p^{(t+1)}(x) = p^{(t)}(x)\, e^{\sum_i \Delta\theta_i^{(t)} f_i(x)} = p^{(t)}(x) \prod_i \left( \frac{\sum_x \tilde p(x) f_i(x)}{\sum_x p(x \mid \theta^{(t)}) f_i(x)} \right)^{f_i(x)}$.

Slide 10: Learning w (recap)
- Probabilistic approach: computing $Z_w(x)$ can be NP-complete; tractable models, but intractable estimation.
- Large margin approach: exact and efficient when prediction is tractable.

Slide 11: OCR example
- We want: $\arg\max_{\text{word}} w^T f(\text{image}, \text{word}) = \text{"brace"}$.
- Equivalently:
  $w^T f(\text{image}, \text{"brace"}) > w^T f(\text{image}, \text{"aaaaa"})$
  $w^T f(\text{image}, \text{"brace"}) > w^T f(\text{image}, \text{"aaaab"})$
  ...
  $w^T f(\text{image}, \text{"brace"}) > w^T f(\text{image}, \text{"zzzzz"})$
  ... a lot of constraints!

Slide 12: Large margin estimation
- Given a training example $(x, y^*)$, we want:
  $w^T f(x, y^*) > w^T f(x, y) \quad \forall y \ne y^*$  (*Taskar et al. 03)
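To make the "a lot of constraints" point concrete, here is a minimal sketch (a hypothetical toy feature map over a two-letter alphabet, not the lecture's pixel features) that enumerates every competing output and checks the margin constraints by brute force, which is exactly what becomes infeasible at OCR scale (26^5 competitors for a five-letter word):

```python
from itertools import product

# Toy stand-in for the OCR example: score a length-3 string over a
# 2-letter alphabet with per-position unary weights only.
ALPHABET = "ab"

def score(w, y):
    # w maps (position, letter) -> weight; the score is a sum of part scores
    return sum(w.get((i, c), 0.0) for i, c in enumerate(y))

def margin_violations(w, y_star):
    """Enumerate all |ALPHABET|^L - 1 competing outputs and return those
    violating w^T f(x, y*) > w^T f(x, y) -- the brute-force constraint set."""
    s_star = score(w, y_star)
    return [
        "".join(y)
        for y in product(ALPHABET, repeat=len(y_star))
        if "".join(y) != y_star and score(w, "".join(y)) >= s_star
    ]

w = {(0, "b"): 1.0, (1, "a"): 1.0, (2, "b"): 1.0}
print(margin_violations(w, "bab"))  # -> []: all 7 constraints satisfied
```

With the zero weight vector every competitor ties the true output, so all $2^3 - 1 = 7$ constraints are violated; the min-max formulation on the following slides avoids ever materializing this list.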
Slide 12 (continued)
- Maximize the margin $\gamma$.
- Mistake-weighted margin: require a margin that scales with $\ell(y^*, y)$, the number of mistakes in $y$.

Slide 13: Large margin estimation
- Recall from SVMs: maximizing the margin $\gamma$ is equivalent to minimizing the square of the L2-norm of the weight vector $w$.
- New objective function: minimize $\frac{1}{2}\|w\|^2$ subject to the (mistake-weighted) margin constraints above.

Slide 14: Min-max formulation
- Brute-force enumeration of constraints: the number of constraints is exponential in the size of the structure.
- Alternatives:
  - the min-max formulation;
  - adding only the most violated constraint.
- These handle more general loss functions, and only a polynomial number of constraints is needed.

Slide 15: Min-max formulation
- Key step: convert the maximization in the constraint from a discrete optimization to a continuous one.
- This enables us to plug it into a QP.
- How to do this conversion? A linear-chain example follows in the next slides.

Slide 16: y ⇒ z map for linear-chain structures
[Figure: a label sequence over the alphabet {A, B} encoded as 0/1 indicator vectors, one block of indicators per node and one per edge.]

Slide 17: y ⇒ z map for linear-chain structures
- Rewrite the maximization function in terms of indicator variables, subject to normalization constraints (the indicators at each node sum to one) and agreement constraints (edge indicators are consistent with the node indicators they connect).

Slide 18: Min-max formulation
- Original problem: a discrete maximization over outputs $y$.
- Transformed problem: a continuous (linear) program over the indicator variables.
- The relaxation has integral solutions for chains and trees; it can be fractional for untriangulated networks.

Slide 19: Min-max formulation
- Using strong Lagrangian duality (beyond the scope of this lecture), replace the inner maximization by its dual minimization.
- Use this result to minimize jointly over $w$ and $\mu$.

Slide 20: Min-max formulation
- The formulation produces a compact QP for:
  - low-treewidth Markov networks;
  - associative Markov networks;
  - context-free grammars;
  - bipartite matchings;
  - any problem with compact LP inference.

Slide 21: Results: handwriting recognition
- Data: word length ~8 characters; each letter is a 16x8 pixel image; 10-fold train/test split with 5000/50000 letters and 600/6000 words.
- Models: multiclass SVMs (*Crammer & Singer 01), CRFs, and M^3-nets.
- Features: raw pixels, quadratic kernel, cubic kernel.
[Figure: average per-character test error of CRFs, multiclass SVMs, and M^3-nets under each feature set; lower is better.]
- M^3-nets achieve a 45% error reduction over linear CRFs and a 33% error reduction over multiclass SVMs.
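The "most violated constraint" idea from the min-max slides amounts to loss-augmented decoding: find $\arg\max_y \big[ w^T f(x, y) + \ell(y^*, y) \big]$. The toy below (a hypothetical unary feature map over a two-letter alphabet, my own sketch) does this search by brute force; for a chain-structured model, this same maximization is what the y ⇒ z indicator LP, or equivalently Viterbi-style dynamic programming, computes in polynomial time:

```python
from itertools import product

ALPHABET = "ab"  # toy label set; real OCR would use a-z

def hamming(y_star, y):
    # l(y*, y): number of per-position mistakes (the mistake-weighted margin loss)
    return sum(a != b for a, b in zip(y_star, y))

def loss_augmented_argmax(w, y_star):
    """Most violated constraint: argmax_y [w^T f(y) + l(y*, y)].

    Brute force over all |ALPHABET|^L outputs; with a chain-structured
    feature map the same maximum is found by dynamic programming over
    the node/edge indicator variables.
    """
    def augmented_score(y):
        score = sum(w.get((i, c), 0.0) for i, c in enumerate(y))
        return score + hamming(y_star, y)
    candidates = ("".join(y) for y in product(ALPHABET, repeat=len(y_star)))
    return max(candidates, key=augmented_score)

# With confident weights the truth itself maximizes the augmented score,
# i.e. no constraint is violated; with zero weights the loss term dominates.
w = {(0, "b"): 2.0, (1, "a"): 2.0, (2, "b"): 2.0}
print(loss_augmented_argmax(w, "bab"))   # -> "bab"
print(loss_augmented_argmax({}, "bab"))  # -> "aba"
```

In a cutting-plane training loop, the returned output would be added as a new constraint whenever its augmented score exceeds the score of $y^*$, which is why only polynomially many constraints are ever needed.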
Slide 22: Results: hypertext classification
- WebKB dataset: four CS department websites, 1300 pages / 3500 links.
- Task: classify each page as faculty, course, student, project, or other.
- Train on three universities, test on the fourth.
[Figure: test error of SVMs, RMNs (*Taskar et al. 02), and M^3-nets; lower is better.]
- M^3-nets achieve a 53% error reduction over SVMs and a 38% error reduction over RMNs.

Slide 23: Summary
- Structured maximum-margin networks:
  - concise representation;
  - efficient QP algorithms.
- Applications: sequences, trees, matchings, ...
- Strong empirical results.
- Acknowledgments: adapted from two different presentations given by Ben Taskar.

Slide 24: Open problems
- Unsupervised CRF learning and max-margin learning.
- We want to recognize a pattern that is ...