Modeling and Predicting Sequences: HMM and (maybe) CRF
Amr Ahmed, 10-701, Feb 25

Big Picture: Predicting a Single Label
- Input x: a set of features, e.g. the bag of words in a document.
- Output y: a class label, e.g. the topic of the document.
- Notation note: I use normal-face letters for scalars, as in y, and boldface letters for vectors, like x and y.

Predicting a Sequence of Labels
- Input x: a set of features with an order structure among them, e.g. the sequence of words in a sentence.
- Output y: the part-of-speech (POS) tag of each word.

Predicting Sequences: Examples
- POS tagging:
    x: Students  found  the  HW  easy
    y: NN        VBD    DT   NN  JJ
- NP chunking:
    x: Rockwell  International  Corp  signed  an  agreement  with  Boeing  Co
    y: B         I              E     O       B   E          O     B       E

Back to the Big Picture: Single Output
- Generative models: model $P(x, y)$; predict using Bayes' rule, $\arg\max_y P(y \mid x)$. Example: Naive Bayes.
- Discriminative models: model $P(y \mid x)$ directly; predict using $\arg\max_y P(y \mid x)$. Example: logistic regression.

Back to the Big Picture: Sequence of Outputs
- Generative models: model $P(x, y)$; predict using Bayes' rule, $\arg\max_y P(y \mid x)$. Example: HMM.
- Discriminative models: model $P(y \mid x)$ directly; predict using $\arg\max_y P(y \mid x)$. Example: CRF.

HMM
- Defines a generative model over $P(x, y)$.
- If each $x_t$ has M options and each $y_t$ has K options, a full joint table over length-T sequences would need on the order of $M^T K^T$ entries.
- We need to add some conditional independence assumptions to make things manageable; we have done that before in Naive Bayes ($y \to x_1, \ldots, x_T$).
- The HMM assumption: a chain $y_1 \to y_2 \to y_3 \to \cdots \to y_T$, with each state $y_t$ emitting its observation $x_t$.

HMM: What We Need to Define
- Initial state: $P(y_1)$. Shorthand: $\pi_i = P(y_1 = i)$, $1 \le i \le K$.
- Transition: $P(y_{t+1} \mid y_t)$. Shorthand: $a_{ij} = P(y_{t+1} = j \mid y_t = i)$, $1 \le i, j \le K$.
- Emission: $P(x_t \mid y_t)$. Shorthand: $b_{ik} = P(x_t = k \mid y_t = i)$, $1 \le i \le K$, $1 \le k \le M$.

Factorization
$$P(x_1, \ldots, x_T, y_1, \ldots, y_T) = P(y_1)\, p(x_1 \mid y_1)\, P(y_2 \mid y_1)\, p(x_2 \mid y_2) \cdots P(y_T \mid y_{T-1})\, P(x_T \mid y_T) = P(y_1) \prod_t P(y_t \mid y_{t-1})\, P(x_t \mid y_t)$$

Tasks
- Inference: find $P(y \mid x)$.
    MPA: $\arg\max_{y_t} P(y_t \mid x)$ at each position t.
    Viterbi: $\arg\max_y P(y \mid x)$.
- Learning: learn the model parameters ($\pi_i$, $a_{ij}$, $b_{ik}$) using MLE.
    Fully observed: count and normalize.
    Unsupervised: EM.

Inference: MPA
Find $\arg\max_i P(y_t = i \mid x)$. We need to compute $P(y_t = i \mid x)$ first:
$$p(y_t = i \mid x_1, \ldots, x_T) = \frac{p(y_t = i, x_1, \ldots, x_T)}{p(x_1, \ldots, x_T)} = \frac{p(y_t = i, x_1, \ldots, x_t)\, p(x_{t+1}, \ldots, x_T \mid y_t = i, x_1, \ldots, x_t)}{p(x_1, \ldots, x_T)}$$
$$= \frac{p(y_t = i, x_1, \ldots, x_t)\, p(x_{t+1}, \ldots, x_T \mid y_t = i)}{p(x_1, \ldots, x_T)} = \frac{\alpha_t^i\, \beta_t^i}{p(x_1, \ldots, x_T)}$$
(The conditional simplifies because, given $y_t$, the future observations are independent of the past ones.) A trick that we will use often: add a variable and marginalize over it, to be able to apply recursion.

Forward Recursion
We need $\alpha_t^i = p(y_t = i, x_1, \ldots, x_t)$ for every $t = 1, 2, \ldots, T$, so define a recursive program that writes $\alpha_t^i$ in terms of $\alpha_{t-1}^j$. Divide the variables into three sets, $\{x_1, \ldots, x_{t-1}, y_{t-1}\}$, then $y_t$, then $x_t$, to be able to see the $\alpha_{t-1}$ terms, and apply the chain rule:
$$\alpha_t^k = P(x_1, \ldots, x_{t-1}, x_t, y_t = k) = \sum_{y_{t-1}} P(x_1, \ldots, x_{t-1}, x_t, y_{t-1}, y_t = k)$$
$$= \sum_{y_{t-1}} P(x_1, \ldots, x_{t-1}, y_{t-1})\, P(y_t = k \mid y_{t-1}, x_1, \ldots, x_{t-1})\, P(x_t \mid y_t = k, x_1, \ldots, x_{t-1}, y_{t-1})$$
$$= \sum_{y_{t-1}} P(x_1, \ldots, x_{t-1}, y_{t-1})\, P(y_t = k \mid y_{t-1})\, P(x_t \mid y_t = k) = P(x_t \mid y_t = k) \sum_{i} \alpha_{t-1}^i\, a_{i,k}$$
Summing over $y_{t-1}$ is just summing over $y_{t-1} = 1, \ldots, K$.

Forward Algorithm
Base case: $\alpha_1^k = P(x_1, y_1 = k) = \pi_k\, P(x_1 \mid y_1 = k)$ for $k = 1, \ldots, K$.
Recursion, sweeping the trellis left to right: $\alpha_2^k = P(x_2 \mid y_2 = k) \sum_i \alpha_1^i\, a_{i,k}$, and so on up to $\alpha_T^1, \ldots, \alpha_T^K$.

Backward Recursion
The MPA decomposition also needs $\beta_t^i = p(x_{t+1}, \ldots, x_T \mid y_t = i)$ for every $t$, so define a recursive program (the same add-and-marginalize trick) that writes $\beta_t^i$ in terms of $\beta_{t+1}^j$. Divide the variables into three sets, $y_{t+1}$, $x_{t+1}$, $\{x_{t+2}, \ldots, x_T\}$, to be able to see the $\beta_{t+1}$ terms, and apply the chain rule:
$$\beta_t^k = P(x_{t+1}, \ldots, x_T \mid y_t = k) = \sum_{y_{t+1}} P(x_{t+1}, x_{t+2}, \ldots, x_T, y_{t+1} \mid y_t = k)$$
$$= \sum_{i} P(y_{t+1} = i \mid y_t = k)\, p(x_{t+1} \mid y_{t+1} = i, y_t = k)\, P(x_{t+2}, \ldots, x_T \mid x_{t+1}, y_{t+1} = i, y_t = k)$$
$$= \sum_{i} P(y_{t+1} = i \mid y_t = k)\, p(x_{t+1} \mid y_{t+1} = i)\, P(x_{t+2}, \ldots, x_T \mid y_{t+1} = i) = \sum_{i} a_{k,i}\, p(x_{t+1} \mid y_{t+1} = i)\, \beta_{t+1}^i$$

Backward Algorithm
Base case: $\beta_T^i = 1$ for all $i$.
Recursion, sweeping the trellis right to left: $\beta_{T-1}^k = \sum_i a_{k,i}\, p(x_T \mid y_T = i)\, \beta_T^i$, and so on back to $\beta_1^1, \ldots, \beta_1^K$.
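To make the two recursions concrete, here is a minimal sketch in NumPy. The representation is my choice, not the slides': pi is a length-K array of initial probabilities, A is a K-by-K array with A[i, j] = a_ij, B is a K-by-M array with B[i, k] = b_ik, and x is a sequence of integer observation indices.

```python
import numpy as np

def forward(pi, A, B, x):
    """Unscaled forward pass: alpha[t, i] = P(x_1..x_t, y_t = i)."""
    T, K = len(x), len(pi)
    alpha = np.zeros((T, K))
    alpha[0] = pi * B[:, x[0]]                 # alpha_1^k = pi_k P(x_1 | y_1 = k)
    for t in range(1, T):
        # alpha_t^k = P(x_t | y_t = k) * sum_i alpha_{t-1}^i a_{i,k}
        alpha[t] = B[:, x[t]] * (alpha[t - 1] @ A)
    return alpha

def backward(A, B, x):
    """Unscaled backward pass: beta[t, i] = P(x_{t+1}..x_T | y_t = i)."""
    T, K = len(x), A.shape[0]
    beta = np.zeros((T, K))
    beta[T - 1] = 1.0                          # base case: beta_T^i = 1
    for t in range(T - 2, -1, -1):
        # beta_t^k = sum_i a_{k,i} P(x_{t+1} | y_{t+1} = i) beta_{t+1}^i
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])
    return beta
```

With these, alpha[-1].sum() gives $P(x_1, \ldots, x_T)$ and alpha[t, i] * beta[t, i] / alpha[-1].sum() gives the MPA posterior, exactly the evaluation step derived next.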
Evaluation
Recall the MPA decomposition $p(y_t = i \mid x_1, \ldots, x_T) = \alpha_t^i \beta_t^i / p(x_1, \ldots, x_T)$; the remaining piece is the denominator:
$$P(x_1, \ldots, x_T) = \sum_{y_T} P(x_1, \ldots, x_T, y_T) = \sum_{i=1}^{K} P(x_1, \ldots, x_T, y_T = i) = \sum_{i=1}^{K} \alpha_T^i$$
Now we have everything to compute
$$p(y_t = i \mid x_1, \ldots, x_T) = \frac{\alpha_t^i\, \beta_t^i}{p(x_1, \ldots, x_T)}$$

Practical Considerations
$\alpha$ and $\beta$ are products of many terms, so you are likely to run into underflow for any sequence more than about 10 steps long. Can we use logs?
$$\alpha_t^k = P(x_t \mid y_t = k) \sum_i \alpha_{t-1}^i\, a_{i,k} \;\Rightarrow\; \log \alpha_t^k = \log P(x_t \mid y_t = k) + \log \sum_i \alpha_{t-1}^i\, a_{i,k}$$
In general we don't get logs of the individual terms on the right-hand side (the log does not distribute over the sum), but you can use a technique called log-add that I didn't discuss in the recitation. Solution: rescaling, i.e. normalize after each step.

Scaling
Normalize after each step; $c_t$ is a normalization constant, and we keep track of $c_t$ for all $t$:
$$\hat\alpha_t^k = \frac{P(x_t \mid y_t = k) \sum_i \hat\alpha_{t-1}^i\, a_{i,k}}{\sum_j P(x_t \mid y_t = j) \sum_i \hat\alpha_{t-1}^i\, a_{i,j}}, \qquad c_t = \sum_j P(x_t \mid y_t = j) \sum_i \hat\alpha_{t-1}^i\, a_{i,j}$$

Scaling Interpretation
How to interpret $c_t$ and the normalized …
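As a sketch of the scaling fix, here is a rescaled forward pass reusing the hypothetical pi/A/B/x arrays from the earlier sketch. A standard property of this rescaling (not on the surviving slides, but easy to verify by induction) is that $\prod_t c_t = P(x_1, \ldots, x_T)$, so the log-likelihood can be accumulated as $\sum_t \log c_t$ without underflow.

```python
import numpy as np

def forward_scaled(pi, A, B, x):
    """Scaled forward pass: each alpha_hat[t] sums to 1, and c[t] is the
    per-step normalizer, so log P(x_1..x_T) = np.log(c).sum()."""
    T, K = len(x), len(pi)
    alpha_hat = np.zeros((T, K))
    c = np.zeros(T)
    a = pi * B[:, x[0]]                        # unnormalized alpha_1
    c[0] = a.sum()
    alpha_hat[0] = a / c[0]
    for t in range(1, T):
        # c_t = sum_j P(x_t | y_t = j) sum_i alpha_hat_{t-1}^i a_{i,j}
        a = B[:, x[t]] * (alpha_hat[t - 1] @ A)
        c[t] = a.sum()
        alpha_hat[t] = a / c[t]
    return alpha_hat, c
```

In the standard treatment (e.g. Rabiner's HMM tutorial), the same $c_t$ are reused to rescale the backward pass, so the posteriors can be formed directly from the scaled quantities.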