Machine Learning CS6375, Spring 2015
HMM Training
Instructor: Yang Liu
(Slides modified from Vincent Ng, Dan Jurafsky)

Hidden Markov Models

[Figure: HMM state-transition diagram, states emitting symbols A, B, O]

An HMM consists of the following elements:
- N is a positive integer specifying the number of states in the model.
- Σ is a set of output symbols. (Note: we are discussing the discrete case for now.)

Other parameters:
- π_j is the probability of choosing state j as an initial state.
- a_jk is the probability of transitioning from state j to state k.
- b_j(o) is the probability of emitting symbol o from state j.

In total the model has N + N² + N|Σ| parameters. Note that we have the following constraints:

  Σ_{j=1}^{N} π_j = 1
  Σ_{k=1}^{N} a_jk = 1   for all j
  Σ_{o∈Σ} b_j(o) = 1     for all j

Hidden Markov Models (cont.)

An HMM specifies a probability for each possible (x, y) pair, where x is a sequence of symbols drawn from Σ and y is a sequence of states drawn from the integers 1…N. The sequences x and y are restricted to have the same length.

For example, say we have an HMM with N = 2, Σ = {a, b}, and some choice of the parameters. Take x = a a b b and y = 1 2 2 1. Then in this case

  P(x, y) = π_1 · a_12 · a_22 · a_21 · a_13 · b_1(a) · b_2(a) · b_2(b) · b_1(b)

(state 3 is a dummy final state).

Three Problems in Hidden Markov Models

- Question 1 (Evaluation): What is the probability of the observation sequence O_1 O_2 … O_T, i.e., P(O_1 O_2 … O_T)?
- Question 2 (Most Probable Path): Given O_1 O_2 … O_T, what is the most probable path that I took?
- Question 3 (Learning HMMs): Given O_1 O_2 … O_T, what is the maximum-likelihood HMM that could have produced this string of observations?

Starting Out with Observable Markov Models

Here we have the state sequences, so training A = {a_ij} is simple relative-frequency counting:

  a_ij = C(i → j) / Σ_{q∈Q} C(i → q)

Training Hidden Markov Models

Say we have an HMM with N = 2 and K = {e, f, g, h}.

- The supervised HMM training problem: given paired sequences (e,1)(g,2), (e,1)(h,2), (f,1)(h,2), (f,1)(g,2), how do we choose parameter values for a_ij and b_i(o)?
- The unsupervised HMM training problem: given output sequences e g, e h, f h, f g, how do we choose parameter values for a_ij and b_i(o)?

How? Maximum-likelihood estimation. What is the log-likelihood function L?
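The joint probability P(x, y) from the example above can be checked numerically. The sketch below is a minimal implementation; the concrete parameter values are illustrative assumptions (the slides leave them unspecified), and state 3 is the dummy final state.

```python
# Joint probability P(x, y) of an observation sequence x and a state sequence y,
# following the factorization in the example above:
#   P(x, y) = pi_{y1} * prod_t a_{y_t, y_{t+1}} * a_{y_T, final} * prod_t b_{y_t}(x_t)
# Parameter values below are illustrative assumptions, not from the slides.

pi = {1: 1.0, 2: 0.0}                      # initial-state probabilities pi_j
a = {1: {1: 0.0, 2: 0.6, 3: 0.4},          # transition probabilities a_jk
     2: {1: 0.5, 2: 0.5, 3: 0.0}}          # (state 3 = dummy final state)
b = {1: {'a': 0.7, 'b': 0.3},              # emission probabilities b_j(o)
     2: {'a': 0.2, 'b': 0.8}}

def joint_prob(x, y, final_state=3):
    p = pi[y[0]]                           # choose the initial state
    for t in range(len(y) - 1):            # state-to-state transitions
        p *= a[y[t]][y[t + 1]]
    p *= a[y[-1]][final_state]             # transition into the dummy final state
    for t in range(len(x)):                # one emission per time step
        p *= b[y[t]][x[t]]
    return p

# x = a a b b, y = 1 2 2 1:
# P(x, y) = pi_1 * a_12 * a_22 * a_21 * a_13 * b_1(a) * b_2(a) * b_2(b) * b_1(b)
print(joint_prob(['a', 'a', 'b', 'b'], [1, 2, 2, 1]))
```

With these assumed values the product is 1.0 · 0.6 · 0.5 · 0.5 · 0.4 · 0.7 · 0.2 · 0.8 · 0.3 = 0.002016.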
MLE for HMMs

We have two sets X and Y and a joint distribution P(x, y). In an HMM, each x ∈ X is an output sequence o_1 … o_T, and each y ∈ Y is a state sequence q_1 … q_T.

- If we have fully observed data, (x_i, y_i) pairs:
    L = Σ_i log P(x_i, y_i)
- If we have partially observed data, x_i examples only:
    L = Σ_i log P(x_i) = Σ_i log Σ_{y∈Y} P(x_i, y)

HMM Training: Caveat

The network structure of the HMM is always created by hand; no algorithm for double induction of optimal structure and probabilities has been able to beat simple hand-built structures. The question is: how to choose parameter values for a_ij and b_i(o)?

Supervised HMM Training: An Example

We have an HMM with N = 2 and K = {e, f, g, h}. We see the following paired sequences in the training data: (e,1)(g,2), (e,1)(h,2), (f,1)(h,2), (f,1)(g,2).

Maximum-likelihood estimates (intuitive relative frequencies):

  π:       π_1 = 1,  π_2 = 0

  a_ij:           j = 1   j = 2   j = 3
       i = 1        0       1       0
       i = 2        0       0       1

  b_i(o):         o = e   o = f   o = g   o = h
       i = 1       0.5     0.5      0       0
       i = 2        0       0      0.5     0.5

(Note: state 3 is a dummy final state.)

Supervised HMM Training: The Likelihood Function

Notation: say (x, y) = (o_1 … o_T, q_1 … q_T), and
- f(i, x, y) = number of times state i is the initial state in (x, y)
- f(i→j, x, y) = number of times state j follows state i in (x, y)
- f(i→o, x, y) = number of times state i is paired with observation o in (x, y)

Then

  P(x, y) = ∏_{i=1}^{N} π_i^{f(i,x,y)} · ∏_{i=1}^{N} ∏_{j=1}^{N+1} a_ij^{f(i→j,x,y)} · ∏_{i=1}^{N} ∏_{o∈K} b_i(o)^{f(i→o,x,y)}

(where j = N + 1 is the dummy final state).

If we have training examples (x_l, y_l) for l = 1 … m:

  L = Σ_{l=1}^{m} log P(x_l, y_l)
    = Σ_{l=1}^{m} [ Σ_{i=1}^{N} f(i, x_l, y_l) log π_i
                  + Σ_{i=1}^{N} Σ_{j=1}^{N+1} f(i→j, x_l, y_l) log a_ij
                  + Σ_{i=1}^{N} Σ_{o∈K} f(i→o, x_l, y_l) log b_i(o) ]

Maximizing the Likelihood Function

Maximizing this function gives the MLEs:

  π_i    = Σ_l f(i, x_l, y_l)   / Σ_k Σ_l f(k, x_l, y_l)
  a_ij   = Σ_l f(i→j, x_l, y_l) / Σ_k Σ_l f(i→k, x_l, y_l)
  b_i(o) = Σ_l f(i→o, x_l, y_l) / Σ_{o'∈K} Σ_l f(i→o', x_l, y_l)

Just like our intuitive results.

What About the Hidden-State Case?

For an HMM we cannot compute those counts directly from the observed sequences. Instead we use the Baum-Welch algorithm. Intuitions: iteratively estimate the counts — start with an estimate for a_ij and b_i(o), then iteratively improve the estimates. This is a special case of Expectation-Maximization (EM).
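The supervised relative-frequency estimates above can be sketched directly: count initial states, transitions, and emissions from the paired training sequences, then normalize each count by its row total. The sketch below reproduces the worked example (N = 2, K = {e, f, g, h}, state 3 a dummy final state); the helper names are my own.

```python
from collections import Counter

# Supervised MLE for HMM parameters by relative frequency, following the
# worked example above. A dummy final state (3) is appended to every
# state sequence so transitions out of the last real state are counted.

data = [  # (output sequence, state sequence) pairs from the example
    ('eg', [1, 2]), ('eh', [1, 2]), ('fh', [1, 2]), ('fg', [1, 2]),
]
FINAL = 3

init, trans, emit = Counter(), Counter(), Counter()
for x, y in data:
    init[y[0]] += 1                        # f(i): state i is initial
    for i, j in zip(y, y[1:] + [FINAL]):   # f(i->j), incl. dummy final state
        trans[(i, j)] += 1
    for o, q in zip(x, y):                 # f(i->o): state q emits symbol o
        emit[(q, o)] += 1

def normalize(counts, key_of):
    """Divide each count by the total for its conditioning key (the source state)."""
    totals = Counter()
    for k, c in counts.items():
        totals[key_of(k)] += c
    return {k: c / totals[key_of(k)] for k, c in counts.items()}

pi = {i: c / sum(init.values()) for i, c in init.items()}
a = normalize(trans, key_of=lambda k: k[0])
b = normalize(emit, key_of=lambda k: k[0])

print(pi)   # {1: 1.0}
print(a)    # a_12 = 1.0, a_23 = 1.0
print(b)    # b_1(e) = b_1(f) = 0.5, b_2(g) = b_2(h) = 0.5
```

The resulting tables match the slide: π_1 = 1, a_12 = a_23 = 1, and each state splits its emission mass evenly between its two observed symbols.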
EM uses the forward and backward probabilities.

Recap: Forward Probabilities

Given an input sequence x_1 … x_T, define

  α_t(i) = P(x_1 … x_t, q_t = i)

- Base case: α_1(i) = π_i · b_i(o_1), for all i.
- Recursive case: α_t(j) = [ Σ_i α_{t-1}(i) · a_ij ] · b_j(o_t), for all j = 1 … N and t = 2 … T.

Forward Procedure

Computation of α_t(j) by summing over all previous values α_{t-1}(i), for all i.
[Figure: trellis showing the values α_{t-1}(i) feeding into α_t(j)]

The Backward Probability

We define the backward probability as follows:

  β_t(i) = P(o_{t+1} o_{t+2} … o_T | q_t = i, λ)

This is the probability of generating the partial observations o_{t+1} … o_T from time t+1 to the end, given that the HMM is in state i at time t (and, of course, given the model λ). We compute it by induction:

- Initialization: β_T(i) = 1, for 1 ≤ i ≤ N.
- Induction: β_t(i) = Σ_{j=1}^{N} a_ij · b_j(o_{t+1}) · β_{t+1}(j), for t = T-1, …, 1 and 1 ≤ i ≤ N.

Backward Procedure

Computation of β_t(i) by a weighted sum of all successive values β_{t+1}(j).
[Figure: trellis showing the values β_{t+1}(j) feeding back into β_t(i)]

Intuition for Re-estimation of a_ij

We will estimate â_ij via this intuition:

  â_ij = (expected number of transitions from state i to state j) / (expected number of transitions from state i)

Numerator intuition: assume we had some estimate of the probability that a given transition i → j was taken at time t in the observation sequence. If we knew this probability for each time t, we could sum over all t to get the expected …