Machine Learning
CS6375 --- Spring 2015
HMM Training
Instructor: Yang Liu
Slides modified from Vincent Ng, Dan Jurafsky

Hidden Markov Models
An HMM (N, Σ, π, A, B) consists of the following elements:
• N is a positive integer specifying the number of states in the model.
• Σ is a set of output symbols (note: we are discussing discrete cases now).
(Figure: a chain of states S connected by transition probabilities A, each emitting an output O via emission probabilities B.)

Hidden Markov Models
Other parameters Θ:
• π_j is the probability of choosing state j as the initial state.
• a_jk is the probability of transitioning from state j to state k.
• b_j(o) is the probability of emitting symbol o from state j.
• In total, Θ has N + N^2 + N·|Σ| parameters.
Note that we have the following constraints:
$\sum_{k=1}^{N} a_{jk} = 1$ for all $j$, $\qquad \sum_{j=1}^{N} \pi_j = 1$, $\qquad \sum_{o \in \Sigma} b_j(o) = 1$ for all $j$.

Hidden Markov Models
An HMM specifies a probability for each possible (x, y) pair, where x is a sequence of symbols drawn from Σ and y is a sequence of states drawn from the integers 1, …, N. The sequences x and y are restricted to have the same length.
For example, say we have an HMM with N = 2, Σ = {a, b}, and some choice of the parameters Θ. Take x = <a, a, b, b> and y = <1, 2, 2, 1>. Then in this case,
$P(x, y \mid \Theta) = \pi_1\, b_1(a)\, a_{12}\, b_2(a)\, a_{22}\, b_2(b)\, a_{21}\, b_1(b)\, a_{13}$
(state 3 is a dummy final state; see the supervised training example below).

Three Problems in Hidden Markov Models
• Question 1: Evaluation. What is the probability of the observation sequence O_1 O_2 … O_T, i.e., P(O_1 O_2 … O_T | Θ)?
• Question 2: Most Probable Path. Given O_1 O_2 … O_T, what is the most probable path that I took?
• Question 3: Learning HMMs. Given O_1 O_2 … O_T, what is the maximum likelihood HMM that could have produced this string of observations?

Starting Out with Observable Markov Models
We have state sequences. How do we train?
A = {a_ij}:
$a_{ij} = \frac{C(i \to j)}{\sum_{q \in Q} C(i \to q)}$

Training Hidden Markov Models
Say we have an HMM with N = 2, Σ = {e, f, g, h}.
• The supervised HMM training problem: given paired sequences {(e/1 g/2), (e/1 h/2), (f/1 h/2), (f/1 g/2)}, how do we choose parameter values for π, a_ij, and b_i(o)?
• The unsupervised HMM training problem: given output sequences {(e g), (e h), (f h), (f g)}, how do we choose parameter values for π, a_ij, and b_i(o)?
How? Maximum likelihood estimation! What is the log-likelihood function L(Θ)?

MLE for HMMs
We have two sets X and Y, and a joint distribution P(x, y | Θ). In an HMM:
• each x ∈ X is an output sequence o_1, …, o_T
• each y ∈ Y is a state sequence q_1, …, q_T
If we have fully observed data, (x_i, y_i) pairs, then
$L(\Theta) = \sum_i \log P(x_i, y_i \mid \Theta)$
If we have partially observed data, x_i examples only, then
$L(\Theta) = \sum_i \log P(x_i \mid \Theta) = \sum_i \log \sum_{y \in Y} P(x_i, y \mid \Theta)$

HMM Training: Caveat
The network structure of an HMM is always created by hand; no algorithm for double-induction of optimal structure and probabilities has been able to beat simple hand-built structures.
The question is: how do we choose parameter values for π, a_ij, and b_i(o)?

Supervised HMM Training: An Example
We have an HMM with N = 2, Σ = {e, f, g, h}. We see the following paired sequences in the training data:
e/1 g/2
e/1 h/2
f/1 h/2
f/1 g/2
Maximum likelihood estimates: π_1 = 1.0, π_2 = 0.0.
For the parameters a_ij:
        j=1   j=2   j=3
  i=1   0     1     0
  i=2   0     0     1
For the parameters b_i(o):
        o=e   o=f   o=g   o=h
  i=1   0.5   0.5   0     0
  i=2   0     0     0.5   0.5
Intuitive? Relative frequency! (Note: state 3 is a dummy final state.)
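To make the relative-frequency intuition concrete, here is a minimal Python sketch (my own illustration, not from the slides) that recovers the estimates above by counting and normalizing. The data layout and the dummy final state FINAL = 3 follow the example; zero-probability entries simply do not appear in the resulting dictionaries.

```python
from collections import defaultdict

# Toy training data from the slides: each pair is (observation, state).
data = [
    [("e", 1), ("g", 2)],
    [("e", 1), ("h", 2)],
    [("f", 1), ("h", 2)],
    [("f", 1), ("g", 2)],
]
FINAL = 3  # dummy final state every sequence transitions into

init, trans, emit = defaultdict(int), defaultdict(int), defaultdict(int)
for seq in data:
    states = [s for _, s in seq] + [FINAL]
    init[states[0]] += 1                      # f(i, x, y): initial-state counts
    for i, j in zip(states, states[1:]):
        trans[(i, j)] += 1                    # f(i, j, x, y): transition counts
    for o, s in seq:
        emit[(s, o)] += 1                     # f(i, o, x, y): emission counts

pi = {s: c / len(data) for s, c in init.items()}   # pi_1 = 1.0

def normalize(counts, key_of):
    """Turn raw counts into relative frequencies within each conditioning state."""
    totals = defaultdict(int)
    for k, c in counts.items():
        totals[key_of(k)] += c
    return {k: c / totals[key_of(k)] for k, c in counts.items()}

a = normalize(trans, key_of=lambda k: k[0])   # a_12 = 1.0, a_23 = 1.0
b = normalize(emit, key_of=lambda k: k[0])    # b_1(e) = b_1(f) = 0.5, b_2(g) = b_2(h) = 0.5
print(pi, a, b)
```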
Supervised HMM Training: The Likelihood Function
Notation: say (x, y) = {o_1 … o_T, q_1 … q_T}, and
• f(i, j, x, y) = number of times state j follows state i in (x, y)
• f(i, x, y) = number of times state i is the initial state in (x, y)
• f(i, o, x, y) = number of times state i is paired with observation o
Then
$P(x, y) = \prod_{i=1}^{N} \pi_i^{f(i,x,y)} \; \prod_{i=1}^{N} \prod_{j=1}^{N+1} a_{ij}^{f(i,j,x,y)} \; \prod_{i=1}^{N} \prod_{o \in \Sigma} b_i(o)^{f(i,o,x,y)}$
(transitions run to state N+1 to include the dummy final state).

Supervised HMM Training: The Likelihood Function
If we have training examples (x_l, y_l) for l = 1 … m,
$L(\Theta) = \sum_{l=1}^{m} \log P(x_l, y_l \mid \Theta) = \sum_{l=1}^{m} \Big( \sum_{i=1}^{N} f(i, x_l, y_l) \log \pi_i + \sum_{i=1}^{N} \sum_{j=1}^{N+1} f(i, j, x_l, y_l) \log a_{ij} + \sum_{i=1}^{N} \sum_{o \in \Sigma} f(i, o, x_l, y_l) \log b_i(o) \Big)$

Maximizing the Likelihood Function
Maximizing this function gives the MLEs:
$\pi_i = \frac{\sum_l f(i, x_l, y_l)}{\sum_l \sum_k f(k, x_l, y_l)}, \qquad a_{ij} = \frac{\sum_l f(i, j, x_l, y_l)}{\sum_l \sum_k f(i, k, x_l, y_l)}, \qquad b_i(o) = \frac{\sum_l f(i, o, x_l, y_l)}{\sum_l \sum_{o' \in \Sigma} f(i, o', x_l, y_l)}$
Just like our intuitive results!

What About the Hidden-State Case?
For an HMM, we cannot compute those counts directly from the observed sequences. Instead we use the Baum-Welch algorithm. Intuitions:
• Iteratively estimate the counts: start with an estimate for a_ij and b_j(o), then iteratively improve the estimates.
• It is a special case of Expectation-Maximization (EM).
• It uses forward and backward probabilities.

Recap: Forward Probabilities
Given an input sequence o_1 … o_T, the forward probabilities are
$\alpha_t(i) = P(o_1 \ldots o_t, q_t = i \mid \Theta)$
Base case: $\alpha_1(i) = \pi_i\, b_i(o_1)$ for all i.
Recursive case: $\alpha_t(j) = \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, b_j(o_t)$ for all j = 1 … N and t = 2 … T.

Forward Procedure
(Figure: computation of α_t(j) by summing over all previous values α_{t-1}(i) for all i.)

The Backward Probability
We define the backward probability as follows:
$\beta_t(i) = P(o_{t+1}, o_{t+2}, \ldots, o_T \mid q_t = i, \Theta)$
This is the probability of generating the partial observations o_{t+1} … o_T (from time t+1 to the end), given that the HMM is in state i at time t and, of course, given the model Θ.
We compute it by induction:
Initialization: $\beta_T(i) = 1, \quad 1 \le i \le N$
Induction: $\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \quad t = T-1, \ldots, 1, \quad 1 \le i \le N$

Backward Procedure
(Figure: computation of β_t(i) as a weighted sum of all successive values β_{t+1}(j).)

Intuition for Re-estimation of a_ij
We will estimate â_ij via this intuition:
$\hat{a}_{ij} = \frac{\text{expected number of transitions from state } i \text{ to state } j}{\text{expected number of transitions from state } i}$
Numerator intuition: assume we had some estimate of the probability that a given transition i → j was taken at time t in the observation sequence. If we knew this probability for each time t, we could sum over all t to get the expected value (count) for i → j.

Re-estimation of a_ij
Let γ_t(i, j) be the probability of being in state i at time t and state j at time t+1, given O_{1..T} and the model Θ:
$\gamma_t(i, j) = P(q_t = i, q_{t+1} = j \mid O, \Theta) = \frac{P(q_t = i, q_{t+1} = j, O \mid \Theta)}{P(O \mid \Theta)}$

Computing the Numerator for γ_t
The four components of $P(q_t = i, q_{t+1} = j, O \mid \Theta)$ are $\alpha_t(i)$, $a_{ij}$, $b_j(o_{t+1})$, and $\beta_{t+1}(j)$.

Computing the Numerator for γ
$\gamma_t(i, j) = \frac{P(q_t = i, q_{t+1} = j, O \mid \Theta)}{P(O \mid \Theta)} = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_i \alpha_t(i)\, \beta_t(i)}$
P(O | Θ) can be computed using the forward probabilities only, the backward probabilities only, or a combination of forward and backward.

From γ to a_ij
Summing over time turns these probabilities into expected counts:
$\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \gamma_t(i, j)}{\sum_{t=1}^{T-1} \sum_{j'} \gamma_t(i, j')}$

Re-estimating the Observation Likelihood b_j
$\hat{b}_j(v_k) = \frac{\text{expected number of times in state } j \text{ and observing symbol } v_k}{\text{expected number of times in state } j}$

Computing ξ
Computation of ξ_j(t), the probability of being in state j at time t:
$\xi_j(t) = \frac{\alpha_t(j)\, \beta_t(j)}{P(O \mid \Theta)}$

Re-estimating the Observation Likelihood
$\hat{b}_j(v_k) = \frac{\sum_{t \,:\, o_t = v_k} \xi_j(t)}{\sum_{t=1}^{T} \xi_j(t)}$
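The forward and backward recursions translate directly into code. Below is a minimal NumPy sketch (my own illustration, not the course's implementation), assuming pi is a length-N vector, A is the N×N transition matrix, B is the N×|Σ| emission matrix, and observations are encoded as integer column indices; for simplicity it omits the dummy final state.

```python
import numpy as np

def forward(pi, A, B, obs):
    """alpha[t, i] = P(o_1 .. o_t, q_t = i | Theta)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                        # base case
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]    # recursive case
    return alpha

def backward(A, B, obs):
    """beta[t, i] = P(o_{t+1} .. o_T | q_t = i, Theta)."""
    T, N = len(obs), A.shape[0]
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                      # initialization
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])  # induction
    return beta

# Example usage with made-up parameters (N = 2 states, 2 symbols):
pi = np.array([1.0, 0.0])
A = np.array([[0.6, 0.4], [0.3, 0.7]])
B = np.array([[0.8, 0.2], [0.1, 0.9]])
obs = [0, 1, 1]
alpha, beta = forward(pi, A, B, obs), backward(A, B, obs)
# P(O | Theta) from the forward pass alone, or from forward and backward combined:
assert np.isclose(alpha[-1].sum(), (alpha[0] * beta[0]).sum())
```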
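Putting the pieces together, one Baum-Welch iteration computes γ_t(i, j) and ξ_j(t) in the E-step and renormalizes the expected counts in the M-step. This sketch is again an illustration under my own assumptions, not the instructor's code: it reuses forward and backward from the sketch above and keeps the slides' naming, where γ is the transition posterior and ξ_j(t) the state posterior.

```python
def baum_welch_step(pi, A, B, obs):
    """One EM iteration: E-step computes gamma and xi, M-step renormalizes counts."""
    T, N = len(obs), len(pi)
    alpha, beta = forward(pi, A, B, obs), backward(A, B, obs)
    likelihood = alpha[-1].sum()                        # P(O | Theta)

    # gamma[t, i, j] = alpha_t(i) a_ij b_j(o_{t+1}) beta_{t+1}(j) / P(O | Theta)
    gamma = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        gamma[t] = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :] \
                   / likelihood

    # xi[t, j] = alpha_t(j) beta_t(j) / P(O | Theta): prob. of state j at time t
    xi = alpha * beta / likelihood

    new_pi = xi[0]
    new_A = gamma.sum(axis=0) / gamma.sum(axis=(0, 2))[:, None]
    new_B = np.zeros_like(B)
    for t, o in enumerate(obs):                         # expected emission counts
        new_B[:, o] += xi[t]
    new_B /= xi.sum(axis=0)[:, None]
    return new_pi, new_A, new_B
```

In practice one would repeat baum_welch_step until P(O | Θ) stops improving; each iteration is guaranteed not to decrease the likelihood, as with EM in general.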