Probabilistic Graphical Models
Lecture 12 – Dynamical Models
CS/CNS/EE 155, Andreas Krause

Announcements
- Homework 3 out tonight. Start early!
- Project milestones due today. Please email them to the TAs.

Parameter learning for log-linear models
- Log-linear model over an undirected graph G, with feature functions φ_1(C_1), …, φ_k(C_k) defined over cliques. The domains C_i can overlap.
- Joint distribution: P(x | w) = (1/Z(w)) exp( Σ_i w_i φ_i(c_i) )
- How do we get the weights w_i?

Log-linear conditional random field
- Define a log-linear model over the outputs Y; make no assumptions about the inputs X.
- Feature functions φ_i(C_i, x) are defined over cliques and inputs.
- Conditional distribution: P(y | x, w) = (1/Z(x, w)) exp( Σ_i w_i φ_i(c_i, x) )

Example: CRFs in NLP
- Task: classify each word as Person, Location, or Other.
- "Mrs. Greene spoke today in New York. Green chairs the finance committee."
- [Figure: chain CRF with word inputs X_1, …, X_12 and label outputs Y_1, …, Y_12]

Example: CRFs in vision
- [Figure]

Gradient of the conditional log-likelihood
- Partial derivative: ∂/∂w_i log P(y | x, w) = φ_i(y, x) − E_{y′ ~ P(· | x, w)} [ φ_i(y′, x) ]
- Requires one inference per training example (to compute the expected feature counts).
- Can optimize using conjugate gradient.

Exponential family distributions
- Log-linear models are one instance; more generally, an exponential family distribution has the form
  P(x | w) = h(x) exp( wᵀ φ(x) − A(w) )
  h(x): base measure; w: natural parameters; φ(x): sufficient statistics; A(w): log-partition function.
- Here x can be continuous (defined over any set).

Examples
- The Gaussian distribution is an exponential family distribution.
- Other examples: Multinomial, Poisson, Exponential, Gamma, Weibull, chi-square, Dirichlet, Geometric, …

Moments and gradients
- Correspondence between moments and the log-partition function (just as in log-linear models): ∇_w A(w) = E_{x ~ P(· | w)} [ φ(x) ]
- Can compute moments from derivatives, and derivatives from moments!
- MLE ⇒ moment matching: set E[φ(x)] equal to the empirical feature averages.

Conjugate priors in the exponential family
- Any exponential family likelihood has a conjugate prior.

Exponential family graphical models
- So far, we have only defined graphical models over discrete variables.
- Can define graphical models over continuous distributions, using exponential family distributions!
- Much of what we discussed (VE, junction trees, parameter learning, etc.) carries over to such exponential family models.
- Important example: Gaussian networks.

Multivariate Gaussian distribution
- Joint distribution over n random variables: P(X_1, …, X_n) = N(µ, Σ)
- σ_jk = E[ (X_j − µ_j)(X_k − µ_k) ]
- X_j and X_k independent ⇒ σ_jk = 0

Marginalization
- Suppose (X_1, …, X_n) ~ N(µ, Σ). What is P(X_1)?
- More generally: let A = {i_1, …, i_k} ⊆ {1, …, n} and write X_A = (X_{i_1}, …, X_{i_k}). Then X_A ~ N(µ_A, Σ_AA).

Conditioning
- Suppose (X_1, …, X_n) ~ N(µ, Σ), decomposed as (X_A, X_B). What is P(X_A | X_B)?
- P(X_A = x_A | X_B = x_B) = N(x_A; µ_{A|B}, Σ_{A|B}), where
  µ_{A|B} = µ_A + Σ_AB Σ_BB⁻¹ (x_B − µ_B)
  Σ_{A|B} = Σ_AA − Σ_AB Σ_BB⁻¹ Σ_BA
- Computable using linear algebra!

Conditional linear Gaussians
- A conditional linear Gaussian (CLG) models P(Y | X = x) = N(y; A x + b, Σ), i.e., the mean depends linearly on the conditioning variable.

Canonical representation
- Multivariate Gaussians are in the exponential family!
- Standard form: N(µ, Σ). Canonical form: parameters (η, Λ), with density proportional to exp( ηᵀ x − ½ xᵀ Λ x ).
- Conversion: µ = Λ⁻¹ η, Σ = Λ⁻¹ (so Λ = Σ⁻¹ is the precision matrix and η = Σ⁻¹ µ).

Gaussian networks
- Zeros in the precision matrix Λ indicate missing edges in the log-linear model!

Inference in Gaussian networks
- Can compute marginal distributions in O(n³)!
- For large numbers n of variables, this is still intractable.
- If the Gaussian network has low treewidth, we can use variable elimination / junction tree inference.
- Need to be able to multiply and marginalize factors!

Multiplying factors in Gaussians
- In canonical form, multiplying factors just adds their parameters: (η_1, Λ_1) · (η_2, Λ_2) → (η_1 + η_2, Λ_1 + Λ_2)

Conditioning in canonical form
- Joint distribution over (X_A, X_B) in canonical form, with parameters η = (η_A, η_B) and blocks Λ_AA, Λ_AB, Λ_BA, Λ_BB.
- Conditioning: P(X_A | X_B = x_B) has canonical parameters
  η_{A|B=x_B} = η_A − Λ_AB x_B
  Λ_{A|B=x_B} = Λ_AA

Marginalizing in canonical form
- Recall the conversion formulas µ = Λ⁻¹ η, Σ = Λ⁻¹.
- Marginal distribution of X_A in canonical form:
  η′ = η_A − Λ_AB Λ_BB⁻¹ η_B
  Λ′ = Λ_AA − Λ_AB Λ_BB⁻¹ Λ_BA
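Before summarizing the two parameterizations, a few short numerical sketches (added here for concreteness, not from the original slides) illustrate the preceding material. First, the moment / log-partition correspondence ∇_w A(w) = E[φ(x)] for the 1-D Gaussian as an exponential family, with sufficient statistics φ(x) = (x, x²); the particular µ and σ values are arbitrary.

```python
import numpy as np

# 1-D Gaussian N(mu, sigma^2) as an exponential family with phi(x) = (x, x^2).
# We check numerically that grad A(w) matches the moments E[phi(x)].
mu, sigma = 1.0, 2.0

# Natural parameters: w = (mu / sigma^2, -1 / (2 sigma^2)).
w = np.array([mu / sigma**2, -1.0 / (2 * sigma**2)])

def A(w):
    # Log-partition function: A(w) = -w1^2 / (4 w2) - 0.5 log(-2 w2).
    return -w[0]**2 / (4 * w[1]) - 0.5 * np.log(-2 * w[1])

# Gradient of A via central finite differences.
eps = 1e-6
grad = np.array([(A(w + eps * e) - A(w - eps * e)) / (2 * eps)
                 for e in np.eye(2)])

# Moments E[phi(x)] = (E[x], E[x^2]) of N(mu, sigma^2).
moments = np.array([mu, sigma**2 + mu**2])
print(grad, moments)  # the two should agree up to finite-difference error
```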
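Next, marginalization and conditioning in standard form, following the formulas above. The 3-variable µ and Σ are made-up illustrations.

```python
import numpy as np

# Toy 3-variable Gaussian; the numbers in mu and Sigma are made up.
mu = np.array([1.0, 0.0, 2.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
A, B = [0], [1, 2]  # X_A = (X_1), X_B = (X_2, X_3)

# Marginalization in standard form: just read off the sub-vector / sub-matrix.
mu_A = mu[A]
Sigma_AA = Sigma[np.ix_(A, A)]

# Conditioning in standard form: P(X_A | X_B = x_B) = N(mu_{A|B}, Sigma_{A|B}).
x_B = np.array([0.5, 1.0])
Sigma_AB = Sigma[np.ix_(A, B)]
Sigma_BB = Sigma[np.ix_(B, B)]
mu_cond = mu_A + Sigma_AB @ np.linalg.solve(Sigma_BB, x_B - mu[B])
Sigma_cond = Sigma_AA - Sigma_AB @ np.linalg.solve(Sigma_BB, Sigma_AB.T)
print(mu_cond, Sigma_cond)
```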
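And the same operations in canonical form, where conditioning is cheap and marginalization needs a Schur complement. Again, the numbers are illustrative only.

```python
import numpy as np

# Same toy Gaussian as above, now in canonical form (eta, Lambda).
mu = np.array([1.0, 0.0, 2.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
A, B = [0], [1, 2]
x_B = np.array([0.5, 1.0])

# Conversion: Lambda = Sigma^{-1}, eta = Lambda mu.
Lam = np.linalg.inv(Sigma)
eta = Lam @ mu

# Conditioning is easy in canonical form:
#   eta_{A|B} = eta_A - Lambda_AB x_B,  Lambda_{A|B} = Lambda_AA.
eta_cond = eta[A] - Lam[np.ix_(A, B)] @ x_B
Lam_cond = Lam[np.ix_(A, A)]

# Marginalization needs a Schur complement in canonical form:
Lam_BB_inv = np.linalg.inv(Lam[np.ix_(B, B)])
eta_marg = eta[A] - Lam[np.ix_(A, B)] @ Lam_BB_inv @ eta[B]
Lam_marg = Lam[np.ix_(A, A)] - Lam[np.ix_(A, B)] @ Lam_BB_inv @ Lam[np.ix_(B, A)]

# Sanity check: converting back to standard form recovers mu_A and Sigma_AA.
print(np.linalg.solve(Lam_marg, eta_marg), np.linalg.inv(Lam_marg))
```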
Standard vs. canonical form

  Operation        Standard form (µ, Σ)           Canonical form (η, Λ)
  Marginalization  easy (sub-vector / sub-matrix) requires Schur complement
  Conditioning     requires Schur complement      easy (subtract Λ_AB x_B)

- In standard form, marginalization is easy; in canonical form, conditioning is easy!

Variable elimination
- In Gaussian Markov networks, variable elimination = Gaussian elimination (fast for low-bandwidth = low-treewidth matrices).

Dynamical models

HMMs / Kalman filters
- Most famous graphical models: the naïve Bayes model, the hidden Markov model, and the Kalman filter.
- Hidden Markov models: speech recognition, sequence analysis in computational biology.
- Kalman filters: control (e.g., cruise control in cars), GPS navigation devices, tracking missiles, …
- Very simple models, but very powerful!

HMMs / Kalman filters
- X_1, …, X_T: unobserved (hidden) variables. Y_1, …, Y_T: observations.
- HMMs: X_i multinomial, Y_i arbitrary.
- Kalman filters: X_i, Y_i Gaussian distributions.
- Non-linear Kalman filters: X_i Gaussian, Y_i arbitrary.
- [Figure: chain X_1 → … → X_6 with observations Y_1, …, Y_6]

HMMs for speech recognition
- Infer spoken words from audio signals.
- [Figure: hidden phoneme states X_1, …, X_6, audio observations Y_1, …, Y_6, words "He ate the cookies on the couch"]

Hidden Markov models
- Inference: in principle, can use VE, junction trees, etc.
- But new variables X_t, Y_t arrive at each time step ⇒ we would need to rerun inference.
- Bayesian filtering: suppose we have already computed P(X_t | y_{1:t}); we want to efficiently compute P(X_{t+1} | y_{1:t+1}).

Bayesian filtering
- Start with P(X_1). At time t, assume we have P(X_t | y_{1:t−1}).
- Condition: P(X_t | y_{1:t})
- Prediction: P(X_{t+1}, X_t | y_{1:t})
- Marginalization: P(X_{t+1} | y_{1:t})
- (A code sketch of this recursion appears at the end of the section.)

Parameter learning in HMMs
- Assume we have labels for the hidden variables, and assume stationarity: P(X_{t+1} | X_t) is the same over all time steps, and so is P(Y_t | X_t).
- This violates parameter independence (parameter "sharing").
- Example: compute the parameters for P(X_{t+1} = x | X_t = x′).
- What if we don't have labels for the hidden variables? Use EM (later this course).

Kalman filters (Gaussian HMMs)
- X_1, …, X_T: location of the object being tracked. Y_1, …, Y_T: observations.
- P(X_1): prior belief about the location at time 1.
- P(X_{t+1} | X_t): the "motion model". How do I expect my target to move in the environment? Represented as a CLG: X_{t+1} = A X_t + N(0, Σ_M).
- P(Y_t | X_t): the "sensor model". What do I observe if the target is at location X_t? Represented as a CLG: Y_t = H X_t + N(0, Σ_O).

Understanding the motion model
- [Figure]

Understanding the sensor model
- [Figure]

Bayesian filtering for Kalman filters
- Can use Gaussian elimination to perform inference in the "unrolled" model.
- Start with the prior belief P(X_1). At every time step, we have the belief P(X_t | y_{1:t−1}).
- Condition on the observation: P(X_t | y_{1:t})
- Predict (multiply in the motion model): P(X_{t+1}, X_t | y_{1:t})
- "Roll up" (marginalize the previous time step): P(X_{t+1} | y_{1:t})
- (See the Kalman filter sketch at the end of the section.)

Implementation
- Current belief in canonical form: P(x_t | y_{1:t−1}) = N(x_t; η_{X_t}, Λ_{X_t})
- Multiply in the sensor and motion models (addition of canonical parameters), then marginalize.

What if the observations are not "linear"?
- Linear observations: Y_t = H X_t + noise.
- Nonlinear observations: Y_t = f(X_t) + noise, for a nonlinear f.

Incorporating non-Gaussian observations
- Nonlinear observation model: P(Y_t | X_t) is not Gaussian.
- First approach: approximate P(Y_t | X_t) as a CLG by linearizing it around the current estimate E[X_t | y_{1:t−1}]. Known as the Extended Kalman Filter (EKF); can perform poorly if P(Y_t | X_t) is highly nonlinear.
- Second approach: approximate P(Y_t, X_t) as a joint Gaussian. This takes the correlation in X_t into account; after obtaining the approximation, condition on Y_t = y_t (now a "linear" observation).
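To make the filtering recursions concrete, here is a minimal sketch (not from the original slides) of the HMM condition / predict / marginalize loop from the Bayesian filtering slide. The transition matrix, observation matrix, and observation sequence are all made up for illustration.

```python
import numpy as np

# Two hidden states, two observation symbols; all numbers are hypothetical.
T = np.array([[0.9, 0.1],     # T[i, j] = P(X_{t+1}=j | X_t=i)
              [0.2, 0.8]])
O = np.array([[0.7, 0.3],     # O[i, k] = P(Y_t=k | X_t=i)
              [0.1, 0.9]])
prior = np.array([0.5, 0.5])  # P(X_1)

def filter_step(belief, y):
    """Map P(X_t | y_{1:t-1}) and observation y_t to P(X_{t+1} | y_{1:t})."""
    conditioned = belief * O[:, y]   # condition: P(X_t | y_{1:t}), unnormalized
    conditioned /= conditioned.sum()
    return conditioned @ T           # predict + marginalize X_t

belief = prior                       # P(X_1), i.e. P(X_1 | y_{1:0})
for y in [0, 0, 1, 1]:               # hypothetical observation sequence
    belief = filter_step(belief, y)
print(belief)                        # P(X_5 | y_{1:4})
```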
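Likewise, a sketch of one step of the (linear) Kalman filter in standard form, using the slides' motion model X_{t+1} = A X_t + N(0, Σ_M) and sensor model Y_t = H X_t + N(0, Σ_O). The concrete A, H, noise covariances, and observations are hypothetical; the conditioning step is written in the standard Kalman-gain form of the Gaussian conditioning formula.

```python
import numpy as np

# Hypothetical 2-D state (position, velocity) with position-only observations.
A = np.array([[1.0, 1.0],
              [0.0, 1.0]])           # motion model X_{t+1} = A X_t + noise
H = np.array([[1.0, 0.0]])           # sensor model Y_t = H X_t + noise
Sigma_M = 0.1 * np.eye(2)            # motion noise covariance
Sigma_O = np.array([[0.5]])          # observation noise covariance

def kf_step(mu, Sigma, y):
    """Map belief P(X_t | y_{1:t-1}) = N(mu, Sigma) and y_t to P(X_{t+1} | y_{1:t})."""
    # Condition on Y_t = y via the Kalman gain:
    S = H @ Sigma @ H.T + Sigma_O    # innovation covariance
    K = Sigma @ H.T @ np.linalg.inv(S)
    mu_post = mu + K @ (y - H @ mu)  # P(X_t | y_{1:t})
    Sigma_post = Sigma - K @ H @ Sigma
    # Predict (multiply in the motion model) and roll up (marginalize X_t):
    return A @ mu_post, A @ Sigma_post @ A.T + Sigma_M

mu, Sigma = np.zeros(2), np.eye(2)   # prior belief P(X_1)
for y in [np.array([0.9]), np.array([2.1])]:  # hypothetical observations
    mu, Sigma = kf_step(mu, Sigma, y)
print(mu, Sigma)                     # belief P(X_3 | y_{1:2})
```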