Learning about State
Geoff Gordon, Machine Learning Department, Carnegie Mellon University
Joint work with Byron Boots, Sajid Siddiqi, Le Song, and Alex Smola
15-780, April 2011

What's out there?
[Video: steam rising from a grate.]
We see only a stream of observations: ⋯, o_{t-2}, o_{t-1}, o_t, o_{t+1}, o_{t+2}, ⋯

What's out there? A dynamical system
⋯, o_{t-2}, o_{t-1}  ← Past | State | Future →  o_t, o_{t+1}, o_{t+2}, ⋯
A dynamical system is a recursive rule for updating state based on observations.

Learning a dynamical system
Given past observations from a partially observable system, predict future observations.

Examples
• Baum-Welch EM algorithm for HMMs
• Tomasi-Kanade structure from motion
• Black-Scholes model of stock prices
• SLAM (from lidars, cameras, beacons, …)
• System identification for Kalman filters
• …

A general principle
data about past (many samples) → compress → state (bottleneck) → expand → predict → data about future (many samples)
If the bottleneck is a rank constraint, we get a spectral method.

Why spectral methods?
• There are many ways to learn models of dynamical systems:
‣ maximum likelihood via EM, gradient descent, …
‣ Bayesian inference via Gibbs sampling, Metropolis-Hastings, …
• In contrast to these, spectral methods give
‣ no local optima!
‣ ⇒ a huge gain in computational efficiency
‣ a slight loss in statistical efficiency

Example: SSID for the Kalman filter [van Overschee & de Moor, 1996]
Model: x = A x⁻ + noise, o = C x + noise, with A ∈ ℝ^{n×n} and C ∈ ℝ^{m×n}.
• Past data = the last k observations
• Future data = the next k′ observations (k and k′ must be big enough)
• Prediction = linear regression: look at the empirical covariance of past and future
• Spectral: the bottleneck is an SVD of that covariance

Kalman SSID
Assume for simplicity that m ≥ n and that A and C are both full rank. Write P = E[x_t x_t^⊤]. For k ≥ 1, the observations o_t and o_{t+k} are conditionally independent given x_t, and E[o_{t+k} | x_t] = C A^k x_t, so

$$
\begin{aligned}
\Sigma_k := E[o_{t+k} o_t^\top]
&= E\big[\,E[o_{t+k} o_t^\top \mid x_t]\,\big] \\
&= E\big[\,E[o_{t+k} \mid x_t]\; E[o_t^\top \mid x_t]\,\big] \\
&= E\big[\,C A^k x_t (C x_t)^\top\,\big] \\
&= C A^k\, E[x_t x_t^\top]\, C^\top \\
&= C A^k P C^\top
\end{aligned}
$$

Kalman SSID
Let U be the n leading left singular vectors of Σ₁, and write S = U^⊤ C A (invertible under our assumptions). Plugging Σ_k = C A^k P C^⊤ into the definitions below recovers A and C up to the similarity transform S:

$$
\begin{aligned}
\hat A &:= U^\top \Sigma_2 \,(U^\top \Sigma_1)^\dagger
= U^\top C A^2 P C^\top \,(U^\top C A P C^\top)^\dagger \\
&= (U^\top C A)\, A\, P C^\top (P C^\top)^\dagger\, (U^\top C A)^{-1}
= S A S^{-1} \\[4pt]
\hat C &:= U \hat A^{-1}
= U (S A S^{-1})^{-1}
= U S A^{-1} S^{-1}
= U (U^\top C A)\, A^{-1} S^{-1}
= C S^{-1}
\end{aligned}
$$

(Here (P C^⊤)(P C^⊤)^† = I because P C^⊤ has full row rank n, and U U^⊤ C = C because range(U) = range(Σ₁) = range(C).)

Kalman SSID
• Algorithm: estimate Σ₁ and Σ₂ from data, get Û by SVD, and plug in to obtain Â and Ĉ
• Consistent: continuity of the formulas for Â and Ĉ, plus the law of large numbers for Σ₁ and Σ₂
‣ wrinkle: the SVD giving Û isn't continuous, but range(Û) is
• Can also recover the steady-state x

Variations
• Use arbitrary features of length-k windows of past and future observations
‣ work from the covariance of past and future features
‣ good features make a big difference in practice
• Impose constraints on the learned model (e.g., stability)
(A minimal code sketch of the basic recipe appears below.)
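To make this concrete, here is a minimal numpy sketch of the recipe above. It is an illustration under simplifying assumptions, not the full van Overschee & de Moor algorithm: past and future are single observations (k = k′ = 1), the process is assumed stationary and zero-mean, and the function name and test parameters are my own.

```python
import numpy as np

def kalman_ssid(obs, n):
    """Spectral SSID sketch. obs: T x m array of observations o_t.
    Returns (A_hat, C_hat), the parameters up to a similarity transform.
    Assumes a stationary zero-mean process, m >= n, A and C full rank."""
    T, _ = obs.shape
    # Empirical output covariances Sigma_k = E[o_{t+k} o_t^T] for k = 1, 2
    Sigma1 = obs[1:T-1].T @ obs[:T-2] / (T - 2)
    Sigma2 = obs[2:T].T @ obs[:T-2] / (T - 2)
    # Bottleneck: U = n leading left singular vectors of Sigma1
    U = np.linalg.svd(Sigma1)[0][:, :n]
    # A_hat = U^T Sigma2 (U^T Sigma1)^+ = S A S^{-1};  C_hat = U A_hat^{-1} = C S^{-1}
    A_hat = U.T @ Sigma2 @ np.linalg.pinv(U.T @ Sigma1)
    C_hat = U @ np.linalg.inv(A_hat)
    return A_hat, C_hat

# Sanity check on a simulated system: the eigenvalues of A_hat should
# approach those of A, since eigenvalues are invariant to the unknown
# similarity transform S.
rng = np.random.default_rng(0)
n, m, T = 3, 5, 200_000
A = rng.standard_normal((n, n))
A *= 0.8 / np.abs(np.linalg.eigvals(A)).max()      # make the system stable
C = rng.standard_normal((m, n))
x, obs = np.zeros(n), np.empty((T, m))
for t in range(T):
    x = A @ x + 0.1 * rng.standard_normal(n)       # x = A x- + noise
    obs[t] = C @ x + 0.1 * rng.standard_normal(m)  # o = C x + noise
A_hat, _ = kalman_ssid(obs, n)
print(np.sort_complex(np.linalg.eigvals(A_hat)))
print(np.sort_complex(np.linalg.eigvals(A)))
```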
Kalman SSID: example
• Works well for video textures, where the observation is the vector of raw pixel values over time:
‣ the steam-grate clip above
‣ a fountain
[Video frames omitted.]

Structure from motion [Tomasi & Kanade, 1992]
• Track N features over T steps and collect their image coordinates into an N × 2T measurement matrix:

$$
W = \begin{bmatrix}
x_{11} & y_{11} & x_{12} & y_{12} & \cdots & x_{1T} & y_{1T}\\
x_{21} & y_{21} & x_{22} & y_{22} & \cdots & x_{2T} & y_{2T}\\
\vdots & \vdots & \vdots & \vdots & & \vdots & \vdots\\
x_{N1} & y_{N1} & x_{N2} & y_{N2} & \cdots & x_{NT} & y_{NT}
\end{bmatrix}
$$

Structure from motion
• x_{it} is the projection of feature i onto the camera's horizontal axis at time t (and y_{it} onto its vertical axis), where
‣ [u_i, v_i, w_i] = the 3D coordinates of feature i
‣ [h_{1t}, h_{2t}, h_{3t}] = the camera's horizontal axis at time t
‣ [v_{1t}, v_{2t}, v_{3t}] = the camera's vertical axis at time t
• So W factors into an N × 3 structure matrix times a 3 × 2T motion matrix, and hence rank(W) ≤ 3:

$$
W = \begin{bmatrix}
u_1 & v_1 & w_1\\
u_2 & v_2 & w_2\\
\vdots & \vdots & \vdots\\
u_N & v_N & w_N
\end{bmatrix}
\begin{bmatrix}
h_{11} & v_{11} & h_{12} & v_{12} & \cdots & h_{1T} & v_{1T}\\
h_{21} & v_{21} & h_{22} & v_{22} & \cdots & h_{2T} & v_{2T}\\
h_{31} & v_{31} & h_{32} & v_{32} & \cdots & h_{3T} & v_{3T}
\end{bmatrix}
$$

Structure from motion
• The factorization is only determined up to an invertible transform: for any invertible 3 × 3 matrix Q, replacing the factors by (structure · Q, Q^{-1} · motion) fits the data equally well.

SfM as SSID
• Past data: an indicator of the time step and of the h/v axis
‣ this means we get to memorize each time step; no attempt is made to learn dynamics
• Future data: the observed screen coordinates (one column of W)
• With these features, the past/future covariance is just the measurement matrix W itself, and the rank bottleneck is its rank-3 SVD (a toy factorization sketch follows)
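Here is a minimal numpy sketch of the factorization, under toy assumptions of my own (an orthographic camera orbiting a centered point cloud, no noise, no missing tracks): it builds the measurement matrix W and recovers structure and motion, up to the invertible transform, via a rank-3 SVD.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 40, 30
P = rng.standard_normal((N, 3))        # N points: rows [u_i, v_i, w_i]
W = np.empty((N, 2 * T))
for t in range(T):
    theta = 2 * np.pi * t / T          # camera orbits the scene
    h = np.array([np.cos(theta), 0.0, np.sin(theta)])  # horizontal axis at t
    v = np.array([0.0, 1.0, 0.0])                      # vertical axis at t
    W[:, 2 * t] = P @ h                # x_{it} = <point i, h_t>
    W[:, 2 * t + 1] = P @ v            # y_{it} = <point i, v_t>

U, s, Vt = np.linalg.svd(W, full_matrices=False)
print("leading singular values:", s[:5])   # only 3 are numerically nonzero
structure = U[:, :3] * s[:3]               # = P Q for some invertible Q
motion = Vt[:3]                            # = Q^{-1} times the camera axes
print("rank-3 reconstruction error:", np.linalg.norm(W - structure @ motion))
```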
Kalman SSID: failure
[Figure: side-by-side samples from an HMM learned by Baum-Welch, a Kalman filter learned by SSID, and a preview of the method to come; all models have 10 latent dimensions.]

Can we generalize?
• Goal: get rid of the Gaussian noise assumption
• An HMM has the same form as the Kalman filter model above (x = A x⁻ + noise, o = C x + noise, A ∈ ℝ^{n×n}, C ∈ ℝ^{m×n}), but
‣ A ≥ 0 with columns summing to 1 (A^⊤1 = 1), and likewise C ≥ 0 with C^⊤1 = 1
‣ the noise is multinomial
‣ x and o are indicator vectors: e.g., "4" = [0 0 0 1 0]^⊤

Derivations for Kalman vs. HMM
• The covariance derivation above never used Gaussianity, so it carries over unchanged:

$$
E[o_{t+k} o_t^\top]
= E\big[\,E[o_{t+k}\mid x_t]\,E[o_t^\top\mid x_t]\,\big]
= E\big[\,C A^k x_t (C x_t)^\top\,\big]
= C A^k P C^\top
$$

• For an HMM, x_t is an indicator vector, so x_t x_t^⊤ = diag(x_t) and P = E[x_t x_t^⊤] = diag(π), where π is the stationary state distribution. (A small simulation checking this formula appears below.)
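As a sanity check on this formula, here is a small numpy simulation (all parameters are toy choices of my own): it samples a long trajectory from a random HMM and compares the empirical one-step covariance E[o_{t+1} o_t^⊤] against C A diag(π) C^⊤.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, T = 3, 4, 500_000
# Random column-stochastic parameters: column j of A is P(next state | state j),
# column j of C is P(observation | state j).
A = rng.random((n, n)); A /= A.sum(axis=0)
C = rng.random((m, n)); C /= C.sum(axis=0)

# Stationary distribution pi: the eigenvector of A with eigenvalue 1.
evals, evecs = np.linalg.eig(A)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()

# Sample a trajectory of indicator observations o_t.
state = rng.choice(n, p=pi)
obs = np.zeros((T, m))
for t in range(T):
    state = rng.choice(n, p=A[:, state])           # x ~ column of A
    obs[t, rng.choice(m, p=C[:, state])] = 1.0     # o ~ column of C

Sigma1_emp = obs[1:].T @ obs[:-1] / (T - 1)   # empirical E[o_{t+1} o_t^T]
Sigma1_thy = C @ A @ np.diag(pi) @ C.T        # C A P C^T with P = diag(pi)
print(np.abs(Sigma1_emp - Sigma1_thy).max())  # -> small for large T
```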