Graphical Models meet Margin-based Learning
Machine Learning 10-701/15-781
Carlos Guestrin
Carnegie Mellon University
April 13th, 2005

Next few lectures
- Today: an advanced topic in graphical models
- Next week: learning to make decisions with reinforcement learning
- Week after: dealing with very large datasets, active learning, and the big picture

Handwriting Recognition
Character recognition: kernel SVMs classify each character image independently.
[Figure: handwritten character images with their predicted letters]

Support Vector Machines
Advantages of SVMs:
- High-dimensional learning (kernels)
- Generalization bounds

Handwriting Recognition 2
SVMs for sequences? Treating each whole word as one class means the candidate
labels range over "aaaaa", ..., "brace", "brake", "brick", ..., "zzzzz".
Problem: the number of classes is exponential in the sequence length.

Handwriting Recognition 2
Graphical models (HMMs, Markov networks) instead use one variable per
character, so the representation is linear in the length.

SVMs vs. MNs
Advantages:                                SVM   MN
  High-dim learning (kernels)               ✓
  Generalization bounds                     ✓
  Efficiently exploit label correlations          ✓

SVMs, MNs vs. M3Ns
Advantages:                                SVM   MN   M3N
  High-dim learning (kernels)               ✓          ✓
  Generalization bounds                     ✓          ✓
  Efficiently exploit label correlations          ✓    ✓

Chain Markov Net (aka CRF*)
P(y|x) = (1/Z(x)) Π_i φ(x_i, y_i) Π_i φ(y_i, y_{i+1})
  φ(x_i, y_i) = exp{ Σ_α w_α f_α(x_i, y_i) },          e.g. f_α(x, y) = I(x_p = 1, y = 'z')
  φ(y_i, y_{i+1}) = exp{ Σ_β w_β f_β(y_i, y_{i+1}) },  e.g. f_β(y, y') = I(y = 'z', y' = 'a')
[Figure: a chain of label variables y_i (each ranging over a-z), one per observed character image x_i]
*Lafferty et al. 01

Chain Markov Net (aka CRF*)
Collecting each feature's occurrences across the chain:
  Π_i φ(x_i, y_i) = exp{ Σ_α w_α Σ_i f_α(x_i, y_i) }
  Π_i φ(y_i, y_{i+1}) = exp{ Σ_β w_β Σ_i f_β(y_i, y_{i+1}) }
so
  P(y|x) = (1/Z(x)) Π_i φ(x_i, y_i) Π_i φ(y_i, y_{i+1}) = (1/Z(x)) exp{ w^T f(x, y) }
where
  w = [..., w_α, ..., w_β, ...]
  f(x, y) = [..., f_α(x, y), ..., f_β(x, y), ...]
and the features are now counts over the chain, e.g. f_α(x, y) = #(x_p = 1, y = 'z')
and f_β(x, y) = #(y = 'z', y' = 'a').
*Lafferty et al. 01
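The log-linear factorization above is exactly what keeps inference tractable: even though there are 26^L candidate words, argmax_y w^T f(x, y) decomposes into node and edge scores and can be found by dynamic programming. Below is a minimal sketch, not from the slides; the binary pixel encoding, the W_node/W_edge weight matrices, and the random parameters are illustrative assumptions.

```python
# A minimal sketch of the chain Markov net score w^T f(x, y) and Viterbi
# decoding. W_node plays the role of the w_alpha weights (one per pixel/label
# pair) and W_edge the w_beta weights (one per label pair); both are
# illustrative assumptions, not the lecture's trained parameters.
import numpy as np

K = 26          # label alphabet size (a-z)
P = 16 * 8      # pixels per character image (16x8, as in the OCR data)

rng = np.random.default_rng(0)
W_node = rng.normal(size=(P, K))   # w_alpha: weight for I(x_p = 1, y_i = k)
W_edge = rng.normal(size=(K, K))   # w_beta:  weight for I(y_i = k, y_{i+1} = k')

def score(x, y):
    """w^T f(x, y): sum of node scores plus edge scores along the chain."""
    s = sum(x[i] @ W_node[:, y[i]] for i in range(len(y)))
    s += sum(W_edge[y[i], y[i + 1]] for i in range(len(y) - 1))
    return s

def viterbi(x):
    """argmax_y w^T f(x, y) in O(L K^2) time, despite K^L candidate labelings."""
    L = len(x)
    node = x @ W_node                      # (L, K) node scores
    best = node[0].copy()                  # best score ending in each label
    back = np.zeros((L, K), dtype=int)
    for i in range(1, L):
        cand = best[:, None] + W_edge      # (K, K): prev label -> next label
        back[i] = cand.argmax(axis=0)
        best = cand.max(axis=0) + node[i]
    y = [int(best.argmax())]
    for i in range(L - 1, 0, -1):          # follow backpointers
        y.append(int(back[i, y[-1]]))
    return y[::-1]

x = rng.integers(0, 2, size=(5, P)).astype(float)  # 5 binary character images
y_hat = viterbi(x)
# Sanity check: the Viterbi labeling beats random labelings.
assert all(score(x, y_hat) >= score(x, list(rng.integers(0, K, 5)))
           for _ in range(100))
```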
Max (Conditional) Likelihood
Given data D = {(x^1, t(x^1)), ..., (x^m, t(x^m))} and features f(x, y):
estimate w, then classify. We don't need to learn the entire distribution!

OCR Example
We want: argmax_word w^T f([image of "brace"], word) = "brace".
Equivalently:
  w^T f([image], "brace") > w^T f([image], "aaaaa")
  w^T f([image], "brace") > w^T f([image], "aaaab")
  ...
  w^T f([image], "brace") > w^T f([image], "zzzzz")
That's a lot of constraints!

Max Margin Estimation
Goal: find w such that
  w^T f(x, t(x)) > w^T f(x, y)   for all x ∈ D and all y ≠ t(x),
i.e. w^T Δf_x(y) = w^T [f(x, t(x)) − f(x, y)] > 0.
Maximize the margin γ, and let the required gain over y grow with Δt_x(y),
the number of mistakes in y:
  w^T Δf_x(y) ≥ γ Δt_x(y).
For the target "brace": Δt("craze") = 2 and Δt("zzzzz") = 5, so we require
  w^T Δf("craze") ≥ 2γ   and   w^T Δf("zzzzz") ≥ 5γ.

M3Ns
Given D = {(x^1, t(x^1)), ..., (x^m, t(x^m))} and f(x, y): estimation, then
classification, as above.

M3Ns
Estimation is a dual quadratic program. The naive dual QP has exponential
size; the factored dual reduces it to polynomial size.

M3N Dual
The dual has one variable per margin constraint, e.g. α("craze") for
w^T Δf("craze") ≥ 2γ and α("zzzzz") for w^T Δf("zzzzz") ≥ 5γ.
- Exponential number of variables
- But α_x(y) represents a probability distribution over labelings y
- Key insight from graphical models: we can use the network structure to
  factorize this distribution

Dual = Probability Distribution
[Figure: an example distribution α_x(y) over whole words and the node and
edge marginals it induces]

Factored Dual Variables
Introduce factored dual variables: the node and edge marginals µ of α_x(y).
There are only linearly many of these in the size of the network. Rewriting
the dual using the µ's:
  maximize QuadraticObjective(µ)
  s.t. µ ∈ ConsistentMarginals   (linear constraints)
This is equivalent to the original dual!

Factored Objective
[Figure: the marginals µ and per-word losses Δt computed from the example
distribution over words]

Factored Constraints
If the network is a tree: normalization, non-negativity, and agreement
(edge marginals must agree with node marginals). Otherwise: triangulate and
add clique tree constraints.

Factored Dual
- The objective is quadratic in the network size
- The constraint set is exponential in the tree-width: linear for sequences
  and trees
- Complexity is the same as inference and as max likelihood

Factored Dual
The kernel trick works! Node and edge potentials can use kernels.

Generalization Bound
Theorem: with probability at least 1 − δ, the test-set per-label error is
bounded by the training-set per-label γ-error plus a term that grows only
logarithmically in L, the number of nodes and edges.
- Distribution-free
- First per-label bound
- Dependence on L is logarithmic vs. linear [Collins 01]

Handwriting Recognition (results)
Setup: words of length ~8 characters, 16x8-pixel letters; 10-fold train/test
splits of 5000/50000 letters (600/6000 words). Models: multiclass SVMs*,
CRFs, and M3Ns.
[Chart: test error (average per character) for raw pixels, quadratic kernel,
and cubic kernel; lower is better]
M3Ns achieve a 45% error reduction over linear CRFs and a 33% error reduction
over multiclass SVMs.
*Crammer & Singer 01

Named Entity Recognition
Locate and classify named entities in sentences into 4 categories:
organization, person, location, misc. E.g. "U.N. official Richard Butler
heads for Baghdad". CoNLL 03 data set (200K words train, 50K words test).
Labels y_i ∈ {org, per, loc, misc, none}; features such as
  f(y_i, x) = [..., I(y_i = org, x_i = "U.N."), I(y_i = per, x_i = capitalized),
               I(y_i = loc, x_i = known city), ...]
[Chart: test F1 for CRFs, M3N linear, and M3N quadratic; higher is better]
32% error reduction over CRFs.

Hypertext Classification
WebKB dataset: four CS department websites, 1300 pages / 3500 links. Classify
each page as faculty, course, student, project, or other; train on three
universities, test on the fourth. M3Ns here use a relaxed dual with loopy
belief propagation.
[Chart: test error for SVMs, RMNs (*Taskar et al 02), and M3Ns; lower is
better]
53% error reduction over SVMs; 38% error reduction over RMNs.

M3Ns
The basic algorithm works for any low tree-width graphical model.
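The slides note that the factored dual makes estimation no harder than inference. One equivalent way to see why (a common constraint-generation view, not the lecture's formulation, which solves the factored QP directly): the most violated margin constraint, argmax_y [w^T f(x, y) + Δt_x(y)], can be found with the same Viterbi recursion as plain decoding, because the Hamming loss Δt_x(y) decomposes over nodes. A minimal sketch, reusing the illustrative W_node/W_edge parameterization assumed above:

```python
# Loss-augmented Viterbi: find argmax_y [ w^T f(x, y) + Delta_t_x(y) ].
# Since the Hamming loss adds 1 per mislabeled node, we simply add 1 to every
# node score except the true label's, then run ordinary Viterbi. This is a
# sketch of the constraint-generation view, not the paper's factored-dual QP.
import numpy as np

def loss_augmented_viterbi(x, y_true, W_node, W_edge):
    L, K = len(x), W_edge.shape[0]
    node = x @ W_node                       # (L, K) node scores
    node += 1.0                             # +1 everywhere ...
    node[np.arange(L), y_true] -= 1.0       # ... except the true labels
    best = node[0].copy()
    back = np.zeros((L, K), dtype=int)
    for i in range(1, L):
        cand = best[:, None] + W_edge       # prev label -> next label
        back[i] = cand.argmax(axis=0)
        best = cand.max(axis=0) + node[i]
    y = [int(best.argmax())]
    for i in range(L - 1, 0, -1):
        y.append(int(back[i, y[-1]]))
    return y[::-1]
```

Inside a cutting-plane or subgradient learner, one compares the score of the returned labeling plus its loss against the score of the true labeling, and updates w whenever the constraint w^T Δf_x(y) ≥ γ Δt_x(y) is violated; the exponential constraint set is never enumerated explicitly, matching the "complexity same as inference" point above.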
References
- Max-Margin Markov Networks, B. Taskar, C. Guestrin and D. Koller. Neural Information Processing Systems Conference (NIPS03), 2003. http://www.cs.berkeley.edu/~taskar/pubs/mmmn.ps
- Learning Associative Markov Networks, B. Taskar, V. Chatalbashev and D. Koller. Twenty-First International Conference on Machine Learning (ICML04), 2004. http://www.cs.berkeley.edu/~taskar/pubs/mmamn.ps
- Max-Margin Parsing, B. Taskar, D. Klein, M. Collins, D. Koller and C. Manning. Empirical Methods in Natural Language Processing (EMNLP04), 2004. http://www.cs.berkeley.edu/~taskar/pubs/mmcfg.ps