CS 287: Advanced Robotics, Fall 2009
Lecture 4: Control 3: Optimal control --- discretization (function approximation)
Pieter Abbeel, UC Berkeley EECS

Announcement
- Tuesday Sept 15: **no** lecture

Today and forthcoming lectures
- Optimal control: provides a general computational approach to tackle control problems, both under- and fully actuated.
- Dynamic programming
  - Discretization
  - Dynamic programming for linear systems
- Extensions to nonlinear settings:
  - Local linearization
  - Differential dynamic programming
  - Feedback linearization
- Model predictive control (MPC)
- Examples

Today and Thursday
- Optimal control formalism [Tedrake, Ch. 6; Sutton and Barto, Ch. 1-4]
- Discrete Markov decision processes (MDPs); solution through value iteration [Tedrake, Ch. 6; Sutton and Barto, Ch. 1-4]
- Solution methods for continuous problems:
  - HJB equation [Tedrake, Ch. 7 (optional)]
  - Markov chain approximation method, i.e., continuous → discrete [Chow and Tsitsiklis, 1991; Munos and Moore, 2001; Kushner and Dupuis, 2001 (optional)]
- Error bounds:
  - Value function: Chow and Tsitsiklis; Kushner and Dupuis; function approximation: Gordon, 1995; Tsitsiklis and Van Roy, 1996
  - Value function close to optimal ⇒ resulting policy is good
- Speed-ups and accuracy/performance improvements

Optimal control formulation
- Given:
  - dynamics: ẋ(t) = f(x(t), u(t), t)
  - cost function: g(x, u, t)
- Task: find a policy u(t) = π(x, t) which optimizes
    J^π(x_0) = h(x(T)) + ∫_0^T g(x(t), u(t), t) dt
- Applicability: g and f are often easier to specify than π.

Finite horizon, discrete time
- Markov decision process (MDP): (S, A, P, H, g)
  - S: set of states
  - A: set of actions
  - P: dynamics model, P(x_{t+1} = x' | x_t = x, u_t = u)
  - H: horizon
  - g: S × A → R: cost function
- Policy: π = (µ_0, µ_1, ..., µ_H), with µ_k: S → A
- Cost-to-go of a policy π: J^π(x) = E[ Σ_{t=0}^{H} g(x_t, u_t) | x_0 = x, π ]
- Goal: find π* ∈ argmin_{π ∈ Π} J^π

Dynamic programming (aka value iteration)
Let J*_k = min_{µ_k, ..., µ_H} E[ Σ_{t=k}^{H} g(x_t, u_t) ]. Then we have:
  J*_H(x) = min_u g(x, u)
  J*_{H-1}(x) = min_u [ g(x, u) + Σ_{x'} P(x' | x, u) J*_H(x') ]
  ...
  J*_k(x) = min_u [ g(x, u) + Σ_{x'} P(x' | x, u) J*_{k+1}(x') ]
  ...
  J*_0(x) = min_u [ g(x, u) + Σ_{x'} P(x' | x, u) J*_1(x') ]
and
  µ*_k(x) = argmin_u [ g(x, u) + Σ_{x'} P(x' | x, u) J*_{k+1}(x') ]
- Running time: O(|S|² |A| H), whereas naïve search over all policies would require evaluating |A|^(|S| H) policies.

Discounted infinite horizon
- Markov decision process (MDP): (S, A, P, γ, g)
  - γ: discount factor
- Policy: π = (µ_0, µ_1, ...), with µ_k: S → A
- Value of a policy π: J^π(x) = E[ Σ_{t=0}^{∞} γ^t g(x_t, u_t) | x_0 = x, π ]
- Goal: find π* ∈ argmin_{π ∈ Π} J^π

Discounted infinite horizon: dynamic programming (DP), aka value iteration (VI)
- For i = 0, 1, ...
    For all s ∈ S:
      J^(i+1)(s) ← min_{u ∈ A} [ g(s, u) + γ Σ_{s'} P(s' | s, u) J^(i)(s') ]
- Facts:
  - J^(i) → J* as i → ∞
  - There is an optimal stationary policy π* = (µ*, µ*, ...), which satisfies
      µ*(x) = argmin_u [ g(x, u) + γ Σ_{x'} P(x' | x, u) J*(x') ]

Continuous time and state-action space
- Hamilton-Jacobi-Bellman (HJB) equation / approach:
  - Continuous equivalent of the discrete case we already discussed
  - We will see 2 slides.
- Variational / Markov chain approximation method:
  - Numerically solve a continuous problem by directly approximating the continuous MDP with a discrete MDP
  - We will study this approach in detail.

Hamilton-Jacobi-Bellman (HJB) [*]
- Can also derive the HJB equation for the stochastic setting. Keywords for finding out more: controlled diffusions / diffusion jump processes.
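For the deterministic, finite-horizon setting, a standard statement of the HJB equation corresponding to the cost functional defined earlier (textbook form, e.g. Tedrake Ch. 7 or Bertsekas, given here as a reference sketch rather than taken verbatim from the slides) is: the optimal cost-to-go J*(x, t) satisfies

\[
-\frac{\partial J^*}{\partial t}(x, t) \;=\; \min_{u}\Big[\, g(x, u, t) \;+\; \nabla_x J^*(x, t)^{\top} f(x, u, t) \,\Big],
\qquad J^*(x, T) = h(x),
\]

with the optimal policy given by the minimizing control: π*(x, t) ∈ argmin_u [ g(x, u, t) + ∇_x J*(x, t)ᵀ f(x, u, t) ].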
- For special cases, the HJB equation can assist in finding / verifying analytical solutions.
- However, for most cases, one needs to resort to numerical solution methods for the corresponding PDE --- or directly approximate the control problem with a Markov chain.
- References: Tedrake, Ch. 7; Bertsekas, "Dynamic Programming and Optimal Control"; Oksendal, "Stochastic Differential Equations: An Introduction with Applications"; Oksendal and Sulem, "Applied Stochastic Control of Jump Diffusions"; Michael Steele, "Stochastic Calculus and Financial Applications"; Markov chain approximations: Kushner and Dupuis, 1992/2001.

Markov chain approximation ("discretization")
- Original MDP: (S, A, P, g, γ)
- Discretized MDP:
  - Grid the state space: the vertices are the discrete states.
  - Reduce the action space to a finite set.
    - Sometimes not needed:
      - when the Bellman back-up can be computed exactly over the continuous action space, or
      - when we know only certain controls are part of the optimal policy (e.g., when we know the problem has a "bang-bang" optimal solution).
  - Transition function remains to be resolved!

Discretization: example 1
[Figure: triangulated grid of discrete states ξ_1, ..., ξ_N; from continuous state s, action a leads to s', which lies in the triangle with vertices ξ_2, ξ_3, ξ_6.]
- Discrete states: { ξ_1, ..., ξ_N }
- P(ξ_2 | s, a) = p_A,  P(ξ_3 | s, a) = p_B,  P(ξ_6 | s, a) = p_C,  such that s' = p_A ξ_2 + p_B ξ_3 + p_C ξ_6
  (i.e., the weights are the barycentric coordinates of s' in the triangle, so they are nonnegative and sum to 1).
- This results in a discrete MDP, which we know how to solve.
- Policy when in a "continuous state" s:
    π(s) = argmin_a [ g(s, a) + γ Σ_{s'} P(s' | s, a) Σ_i P(ξ_i ; s') J(ξ_i) ]
- Note: the interpolation need not be triangular. [See also: Munos and Moore, 2001.]

Discretization: example 1 (ctd)
- Discretization turns deterministic transitions into stochastic transitions.
- If the MDP is already stochastic: repeat the procedure to account for all possible transitions and weight accordingly.
- If a (state, action) pair can result in infinitely many different next states: sample next states from the next-state distribution.

Discretization: example 1 (ctd)
- Discretization results in a finite-state stochastic MDP, hence we know value iteration will converge.
- Alternative interpretation: the Bellman back-ups in the finite-state MDP (a) are back-ups on a subset of the full state space, and (b) use linear interpolation to compute the required "next-state cost-to-go" whenever the next state is not in the discrete set
  = value iteration with function approximation.

Discretization: example 2
[Figure: grid of discrete states ξ_1, ..., ξ_N; from continuous state s, action a leads to s', whose nearest grid vertex is ξ_2.]
- Discrete states: { ξ_1, ..., ξ_N }
- P(ξ_2 | s, a) = 1; similarly define the transition probabilities for all ξ_i.
- This results in a discrete MDP, which we know how to solve.
- Policy when in a "continuous state": this is nearest neighbor. (Both variants appear in the sketch below.)
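To make the discretization pipeline from the last few slides concrete, here is a minimal sketch (not from the lecture) for a 1-D state space; the grid, dynamics f, cost g, discount factor, and action set below are hypothetical placeholders. It builds example-1-style transitions by spreading each continuous next state over its two neighboring grid vertices with linear-interpolation weights, runs value iteration on the resulting finite MDP, and reuses the same interpolation when acting from a continuous state.

```python
import numpy as np

# Discretize a 1-D continuous-state control problem into a finite MDP
# (example-1 style: linear-interpolation transition weights), solve it with
# value iteration, then read out a greedy policy at continuous states.
# f(), g(), the grid, and gamma are placeholder choices for illustration.

xs = np.linspace(-2.0, 2.0, 81)      # grid vertices xi_1..xi_N (discrete states)
us = np.linspace(-1.0, 1.0, 11)      # discretized action set
gamma, dt = 0.95, 0.1

def f(x, u):
    # placeholder deterministic dynamics: x_{t+1} = f(x_t, u_t)
    return x + dt * u

def g(x, u):
    # placeholder cost g(x, u)
    return dt * (x**2 + 0.1 * u**2)

def interp_weights(x_next):
    """Indices (i, i+1) and weights (pA, pB) with pA + pB = 1 and
    pA*xs[i] + pB*xs[i+1] equal to x_next (clipped to the grid)."""
    x_next = np.clip(x_next, xs[0], xs[-1])
    i = int(np.clip(np.searchsorted(xs, x_next) - 1, 0, len(xs) - 2))
    pB = (x_next - xs[i]) / (xs[i + 1] - xs[i])
    return i, 1.0 - pB, pB

# Value iteration on the induced finite-state MDP.
J = np.zeros(len(xs))
for _ in range(500):
    J_new = np.full(len(xs), np.inf)
    for si, x in enumerate(xs):
        for u in us:
            i, pA, pB = interp_weights(f(x, u))
            q = g(x, u) + gamma * (pA * J[i] + pB * J[i + 1])
            J_new[si] = min(J_new[si], q)
    if np.max(np.abs(J_new - J)) < 1e-6:
        J = J_new
        break
    J = J_new

def policy(x):
    """Greedy action at a *continuous* state x, using the same interpolation
    to evaluate the next-state cost-to-go."""
    def q(u):
        i, pA, pB = interp_weights(f(x, u))
        return g(x, u) + gamma * (pA * J[i] + pB * J[i + 1])
    return min(us, key=q)

print(policy(0.37))   # greedy control at a state that is not a grid vertex
```

Replacing interp_weights with one that puts all weight on the single nearest vertex recovers example 2 (nearest neighbor); in higher dimensions the same construction uses multilinear or barycentric weights over the enclosing cell, as in Munos and Moore, 2001.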