References:
- "Max-Margin Markov Networks." B. Taskar, C. Guestrin, and D. Koller. Neural Information Processing Systems Conference (NIPS03), 2003. http://www.cs.berkeley.edu/~taskar/pubs/mmmn.ps
- "Learning Associative Markov Networks." B. Taskar, V. Chatalbashev, and D. Koller. Twenty-First International Conference on Machine Learning (ICML04), 2004. http://www.cs.berkeley.edu/~taskar/pubs/mmamn.ps
- "Max-Margin Parsing." B. Taskar, D. Klein, M. Collins, D. Koller, and C. Manning. Empirical Methods in Natural Language Processing (EMNLP04), 2004. http://www.cs.berkeley.edu/~taskar/pubs/mmcfg.ps

Graphical Models meet Margin-based Learning
Machine Learning 10-701/15-781, Carlos Guestrin, Carnegie Mellon University, April 13th, 2005

Next few lectures
- Today: an advanced topic in graphical models
- Next week: learning to make decisions with reinforcement learning
- Week after: dealing with very large datasets, active learning, and the BIG PICTURE

Handwriting Recognition
Character recognition, e.g., with kernel SVMs.
[figure: individual character images classified one at a time; some letters (r vs. c, a vs. z, etc.) are ambiguous in isolation]

Support Vector Machines
Advantages: high-dimensional learning via kernels; generalization bounds.

Handwriting Recognition 2
- SVMs for sequences: problem: the number of classes is exponential in the word length (brare, brace, brick, brake, aaaaa, ..., zzzzz).
- Graphical models (HMMs, MNs): inference is linear in the sequence length.

SVMs vs. MNs vs. M3Ns
SVMs bring high-dimensional learning and generalization bounds; Markov networks efficiently exploit label correlations; max-margin Markov networks (M3Ns) combine all three:

  Advantage                             SVM   MN   M3N
  High-dimensional learning (kernels)    X    -     X
  Generalization bounds                  X    -     X
  Efficiently exploit label correlations -    X     X

Chain Markov Net (a.k.a. CRF) [Lafferty et al. 01]
P(y|x) = (1/Z(x)) prod_i phi(x_i, y_i) prod_i phi(y_i, y_{i+1})
with node potentials phi(x_i, y_i) = exp(w . f(x_i, y_i)) and edge potentials phi(y_i, y_{i+1}) = exp(w . f(y_i, y_{i+1})).
Example indicator features: f(y_i, y_{i+1}) = I(y_i = "z", y_{i+1} = "a"); f(x_i, y_i) = I(x_pixel = 1, y_i = "z").
Collecting terms, the model is log-linear in a single joint feature vector:
P(y|x) = (1/Z(x)) exp(w^T f(x, y)),  where  f(x, y) = sum_i f(x_i, y_i) + sum_i f(y_i, y_{i+1}).

Max Conditional Likelihood
Data D = {(x^1, t(x^1)), ..., (x^m, t(x^m))}.
Estimation: maximize prod_j P(t(x^j) | x^j, w).
Classification: argmax_y w^T f(x, y).
Don't need to learn the entire distribution!

OCR Example
We want argmax_word w^T f(x, word) = "brace". Equivalently:
  w^T f(x, "brace") > w^T f(x, "aaaaa")
  w^T f(x, "brace") > w^T f(x, "aaaab")
  ...
  w^T f(x, "brace") > w^T f(x, "zzzzz")
... a lot of constraints!

Max Margin Estimation
Goal: find w such that w^T f(x, t(x)) > w^T f(x, y) for all x in D and all y != t(x), i.e., w^T Df_x(y) > 0, where Df_x(y) = f(x, t(x)) - f(x, y).
Maximize the margin gamma, where the required gain over y grows with the number of mistakes in y, Dt_x(y) (Hamming distance to t(x)):
  w^T Df_x("craze") >= gamma * 2   (Dt_x("craze") = 2)
  w^T Df_x("zzzzz") >= gamma * 5   (Dt_x("zzzzz") = 5)

M3Ns
Data D = {(x^1, t(x^1)), ..., (x^m, t(x^m))}, joint features f(x, y).
Estimation: the max-margin QP above. Classification: argmax_y w^T f(x, y).

M3Ns Estimation
Exponential-size QP -> dual quadratic program -> polynomial-size factored dual.

M3N Dual
One dual variable alpha_x(y) per margin constraint, e.g., alpha("craze") for w^T Df_x("craze") >= gamma * 2 and alpha("zzzzz") for w^T Df_x("zzzzz") >= gamma * 5: an exponential number of variables!
But alpha_x(.) represents a probability distribution over y. Key insight from graphical models: we can use the network structure to factorize this distribution.

Dual Probability Distribution
[figure: a distribution alpha_x(y) over whole words induces single-node marginals at each position (e.g., mu(c) = .2, mu(r) = .8, mu(a) = .6, mu(o) = .4) and pairwise edge marginals (e.g., mu(c,a) = .2, mu(r,o) = .4, mu(r,a) = .4)]

Factored Dual Variables
Introduce factored dual variables mu: node marginals mu_x(y_i) and edge marginals mu_x(y_i, y_{i+1}), linear in the size of the network.
Rewrite the dual using the mus:
  maximize QuadraticObjective(mu)  s.t.  ConsistentMarginals(mu)   (linear constraints)
Equivalent to the original dual!

Factored Objective
[figure: the dual objective rewritten as a sum of terms over node and edge marginals]

Factored Constraints (ConsistentMarginals)
- normalization: sum_{y_i} mu_x(y_i) = 1
- non-negativity: mu >= 0
- agreement: sum_{y_{i+1}} mu_x(y_i, y_{i+1}) = mu_x(y_i)
If the network is a tree, these constraints are exact; else, add clique-tree constraints (normalization, non-negativity, agreement over a triangulation).

Factored Dual
The objective is quadratic in the network size (number of nodes and edges). The constraint set is exponential in the tree-width: linear for sequences and trees. Complexity is the same as inference and max likelihood.
The kernel trick works: node and edge potentials can use kernels.
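Both classification (argmax_y w^T f(x, y)) and finding margin violations hinge on a maximization over exponentially many labelings, which for a chain reduces to a Viterbi-style dynamic program that is linear in the sequence length. The following minimal Python sketch is not from the lecture: node_score and edge_score stand in for the w^T f terms, and passing a target sequence adds the per-position Hamming term Dt_x(y); all names are hypothetical.

```python
# Minimal sketch (not from the slides): Viterbi decoding for a chain model,
# optionally augmented with the Hamming loss Dt(y) used when searching for
# margin violations. Runs in O(n * k^2) for n positions and k labels.
# node_score[i][s] stands in for w . f(x_i, y_i = s);
# edge_score[r][s] stands in for w . f(y_i = r, y_{i+1} = s).

def loss_augmented_viterbi(node_score, edge_score, target=None):
    n, k = len(node_score), len(node_score[0])

    def loss(i, s):
        # Per-position Hamming term; zero when doing plain classification.
        return 1.0 if target is not None and s != target[i] else 0.0

    # delta[i][s]: best score of any prefix y_1..y_i ending with y_i = s
    delta = [[0.0] * k for _ in range(n)]
    back = [[0] * k for _ in range(n)]
    for s in range(k):
        delta[0][s] = node_score[0][s] + loss(0, s)
    for i in range(1, n):
        for s in range(k):
            prev = max(range(k), key=lambda r: delta[i - 1][r] + edge_score[r][s])
            delta[i][s] = (delta[i - 1][prev] + edge_score[prev][s]
                           + node_score[i][s] + loss(i, s))
            back[i][s] = prev
    # Backtrack from the best final label.
    y = [max(range(k), key=lambda s: delta[n - 1][s])]
    for i in range(n - 1, 0, -1):
        y.append(back[i][y[-1]])
    return y[::-1]
```

With target=None this computes the classification argmax; with target=t(x) it finds the labeling that most violates the margin-scaled constraints, the oracle a cutting-plane solver for the exponential QP would call. The factored dual presented in the lecture avoids enumerating constraints altogether, but relies on the same network structure.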
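To make the ConsistentMarginals constraint set concrete, here is a small sketch that builds the normalization, non-negativity, and agreement constraints for a chain. It assumes numpy and cvxpy (neither is mentioned in the lecture), and the objective is a deliberately simplified placeholder: the real M3N dual objective couples the marginals with the features and labels, so this only illustrates that the variables and constraints are linear in the network size.

```python
# Sketch (assumed libraries: numpy, cvxpy) of the factored dual's
# ConsistentMarginals constraint set for a chain with n nodes and k labels.
import numpy as np
import cvxpy as cp

n, k = 5, 26                      # e.g., a 5-letter word over 26 letters
mu_node = cp.Variable((n, k), nonneg=True)                          # node marginals
mu_edge = [cp.Variable((k, k), nonneg=True) for _ in range(n - 1)]  # edge marginals

constraints = [cp.sum(mu_node[i]) == 1 for i in range(n)]   # normalization
for i in range(n - 1):                                      # agreement
    constraints += [cp.sum(mu_edge[i], axis=1) == mu_node[i],
                    cp.sum(mu_edge[i], axis=0) == mu_node[i + 1]]

# Placeholder concave objective (NOT the actual M3N dual objective):
# a random linear score minus a quadratic term, just to exercise the polytope.
rng = np.random.default_rng(0)
theta = rng.standard_normal((n, k))
objective = cp.Maximize(cp.sum(cp.multiply(theta, mu_node))
                        - 0.5 * cp.sum_squares(mu_node))
cp.Problem(objective, constraints).solve()
print(mu_node.value.round(2))
```

Note the variable count: n*k node entries plus (n-1)*k^2 edge entries, one block per node and edge, versus the k^n word-level dual variables of the unfactored dual.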
Generalization Bound
Theorem: with probability at least 1 - delta, (test per-label error) <= (training per-label margin error) + ...
- Distribution-free.
- First per-label bound!
- Dependence on L (the number of nodes and edges): logarithmic, vs. linear in [Collins 01].

Handwriting Recognition: results
Words of length ~8 chars; each letter is a 16x8 pixel image; 10-fold CV with Train/Test = 5000/50000 letters (600/6000 words). Test error is averaged per character.
[figure: test error (0-30% axis, lower is better) for raw pixels, quadratic kernel, and cubic kernel, across MC-SVMs, CRFs, and M3 nets]
Models: multiclass SVMs [Crammer & Singer 01], CRFs, M3 nets.
- 45% error reduction over linear CRFs
- 33% error reduction over multiclass SVMs

Named Entity Recognition
Locate and classify named entities in sentences.
CoNLL-03 data set: 200K words train, 50K words test; 4 categories: organization, person, location, misc.
e.g., "U.N. official Richard Butler heads for Baghdad": y_i in {org, per, loc, misc, none}.
Example features f(y_i, x): I(y_i = org, x_i = "U.N."); I(y_i = per, x_i capitalized); I(y_i = loc, x_i = known city).
[figure: test F1 (84-92 axis, higher is better) for CRFs, M3N linear, M3N quadratic]
32% error reduction over CRFs.

Hypertext Classification
WebKB dataset: four CS department websites, 1300 pages, 3500 links. Classify each page as faculty, course, student, project, or other. Train on three universities, test on the fourth.
[figure: test error (0-20% axis, lower is better) for SVMs, RMNs [Taskar et al. 02], and M3Ns; the M3N uses the relaxed dual with loopy belief propagation]
- 53% error reduction over SVMs
- 38% error reduction over RMNs

M3Ns
The basic algorithm works for any low tree-width graphical model:
exponential-size estimation -> dual quadratic program -> polynomial-size factored dual.
Other possible max-margin learning problems:
- large tree-width Markov networks with attractive potentials
- parsing using probabilistic context-free grammars
- learning to cluster
- max-margin learning of any poly-time problem

Associative Markov Networks
Point features: spin images, point height. Edge features: length of edge, edge orientation.
Associative restriction: edge potentials phi_ij(y_i, y_j) favor neighboring points taking the same label.

Max-margin AMNs: results
Labels: ground, building, tree, shrub. Training: 30 thousand points; testing: 3 million points.
Segmentation results (hand-labeled 180K test points):

  Model   Accuracy
  SVM     68%
  V-SVM   73%
  M3N     93%

Max-margin parsing
A classic learning problem: estimating rule probabilities P(NP -> NP PP), P(NP -> DT NN), ... Usually a PCFG's probabilities are learned with counts; here we instead learn a max-margin discriminative model.
[figure: example parse tree with rules such as NP -> DT NN PP, PP -> IN NP, NN -> "sea"]

Disulfide bonds: non-bipartite matching
The sequence RSCCPCYWGGCPWGQNCYPEGCSGPKV contains six cysteines (numbered 1-6); predicting the disulfide connectivity means choosing a perfect matching among them. [Fariselli & Casadio 01; Baldi et al. 04]
[figure: cysteine positions along the sequence, with candidate bond edges drawn between them]

Scoring function
[figure: two candidate matchings over the cysteines of RSCCPCYWGGCPWGQNCYPEGCSGPKV, each edge assigned a score] String …
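The scoring-function slide is cut off above, so here is only a sketch of the computational core it sets up: treating connectivity prediction as maximum-weight matching on a general (non-bipartite) graph. It assumes the networkx library, and bond_score is a hypothetical stand-in for the learned edge score w^T f(x, i, j).

```python
# Sketch (assumes networkx; bond_score is a hypothetical stand-in for the
# learned edge score w . f(x, i, j)): disulfide connectivity prediction as
# maximum-weight matching on a general, non-bipartite graph.
import networkx as nx

def predict_bonds(cys_positions, bond_score):
    G = nx.Graph()
    m = len(cys_positions)
    for a in range(m):
        for b in range(a + 1, m):
            G.add_edge(a + 1, b + 1,       # 1-based cysteine indices
                       weight=bond_score(cys_positions[a], cys_positions[b]))
    # Blossom-based exact algorithm: polynomial time even though the graph
    # is non-bipartite; maxcardinality=True forces a perfect matching.
    return nx.max_weight_matching(G, maxcardinality=True)

seq = "RSCCPCYWGGCPWGQNCYPEGCSGPKV"
cys = [i for i, ch in enumerate(seq) if ch == "C"]   # the six cysteines
toy_score = lambda p, q: -abs(p - q)                 # toy scorer, not learned
print(predict_bonds(cys, toy_score))                 # prints a set of index pairs
```

Because the matching itself is solvable in polynomial time, this fits the lecture's closing point that max-margin learning extends to any poly-time prediction problem.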