15-780: Grad AI
Lecture 22: MDPs
Geoff Gordon (this lecture), Tuomas Sandholm
TAs: Erik Zawadzki, Abe Othman

Review: Bayesian learning
Bayesian learning: P(θ | D) = P(D | θ) P(θ) / Z
‣ P(θ): prior over parametric model class
‣ P(D | θ): likelihood
‣ or, P(θ | X, Y) = P(Y | θ, X) P(θ) / Z, as long as X ⊥ θ
Predictive distribution

Review: Bayesian learning
Exact Bayes w/ conjugate prior, or numerical integration (this example: logistic regression)
Or, MLE/MAP
[Figure: logistic regression example]

Review: MDPs
Sequential decision problem under uncertainty
States, actions, costs, transitions, discounting
Policy, execution trace
State-value (J) and action-value (Q) functions
‣ (1−γ) × immediate cost + γ × future cost

Review: MDPs
Tree search
Receding-horizon tree search w/ heuristic
Dynamic programming (value iteration)
Pruning (once we realize a branch is bad), or subsampling scenarios
Curse of dimensionality

Alternate algorithms for "small" systems: policy evaluation
Linear equations: so, Gaussian elimination, biconjugate gradient, Gauss-Seidel iteration, …
‣ VI is essentially the Jacobi iterative method for matrix inversion
Stochastic-gradient-descent-like
‣ TD(λ), Q-learning
Qπ(s, a) = (1 − γ) C(s, a) + γ E[Jπ(s′) | s′ ∼ T(·|s, a)]
Jπ(s) = E[Qπ(s, a) | a ∼ π(·|s)]

Alternate algorithms for "small" systems: policy optimization
Policy iteration: alternately
‣ use any of the above methods to evaluate the current π
‣ replace π with the greedy policy: at each state s, π(s) := arg min_a Q(s, a) (min, since we are minimizing cost)
Actor-critic: like policy iteration, but interleave solving for Jπ and updating π
‣ e.g., run biconjugate gradient for a few steps
‣ warm start: each Jπ is probably similar to the next
SARSA = AC w/ TD(λ) critic, ϵ-greedy policy

(Stochastic) policy gradient
‣ pick a parameterized policy class πθ(a | s)
‣ compute or estimate g = ∇θ Jπ(s1)
‣ θ ← θ − ηg, repeat
More detail:
‣ can estimate g quickly by simulating a few trajectories
‣ can also use the natural gradient to get faster convergence

Alternate algorithms for "small" systems: policy optimization
Linear programming
‣ analogy: use an LP to compute min(3, 6, 5)
‣ note min v. max
max J  s.t.  J ≤ 3,  J ≤ 6,  J ≤ 5

Alternate algorithms for "small" systems: policy optimization
Linear programming
Variables J(s) and Q(s, a) for all s, a
Note: the dual of this LP is interesting
‣ it generalizes single-source shortest paths
max J(s1)  s.t.
  Q(s, a) = (1 − γ) C(s, a) + γ E[J(s′) | s′ ∼ T(·|s, a)]
  J(s) ≤ Q(s, a)  ∀ s, a
(a small code sketch of this LP follows below)
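To make the LP concrete, here is a minimal sketch, not from the lecture, that solves a small tabular cost-MDP with scipy.optimize.linprog. The function name solve_mdp_lp, the toy transition tensor T and cost matrix C, and the choice to maximize Σ_s J(s) rather than J(s1) are all illustrative assumptions; the Q variables are eliminated by substituting their defining equalities into J(s) ≤ Q(s, a).

# Minimal LP sketch for a tabular cost-MDP (illustrative only).
# After substituting Q(s,a) = (1-gamma) C(s,a) + gamma * sum_s2 T(s2|s,a) J(s2)
# into J(s) <= Q(s,a), the only variables left are J(s); maximizing sum_s J(s)
# then recovers the optimal (normalized) cost-to-go at every state.
import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(T, C, gamma):
    """T[s, a, s2]: transition probabilities; C[s, a]: costs in [0, 1]; gamma: discount."""
    n_states, n_actions, _ = T.shape
    # One inequality per (s, a):  J(s) - gamma * sum_s2 T(s2|s,a) J(s2) <= (1-gamma) C(s,a)
    A_ub = np.zeros((n_states * n_actions, n_states))
    b_ub = np.zeros(n_states * n_actions)
    for s in range(n_states):
        for a in range(n_actions):
            row = s * n_actions + a
            A_ub[row, s] += 1.0
            A_ub[row, :] -= gamma * T[s, a, :]
            b_ub[row] = (1.0 - gamma) * C[s, a]
    # linprog minimizes, so maximize sum_s J(s) by minimizing its negation.
    res = linprog(c=-np.ones(n_states), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * n_states, method="highs")
    J = res.x
    # Recover Q and the greedy (cost-minimizing) policy:  pi(s) = arg min_a Q(s, a).
    Q = (1.0 - gamma) * C + gamma * np.einsum("sap,p->sa", T, J)
    return J, Q.argmin(axis=1)

# Made-up 2-state, 2-action example.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
C = np.array([[1.0, 0.5],
              [0.0, 0.2]])
print(solve_mdp_lp(T, C, gamma=0.9))

Because every constraint only bounds J(s) from above by some Q(s, a), the maximization pushes each J(s) up to the smallest of its Q values, exactly the min-via-max trick in the min(3, 6, 5) analogy.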
Model requirements
What do we have to know about the MDP in order to plan?
‣ full model
‣ simulation model
‣ no model: only the real world

Model requirements
VI and LP require a full model
PI and actor-critic inherit the requirements of their policy-evaluation subroutine
TD(λ), SARSA, policy gradient: OK with a simulation model or no model
‣ horribly data-inefficient if used directly on the real world with no model: don't do this!
‣ note: the model can be just { all of my data }

A word on performance measurements
Multiple criteria we might care about:
‣ data (from the real world)
‣ runtime
‣ calls to the model (under some API)
Measure the convergence rate of:
‣ J(s) or Q(s, a)
‣ π(s)
‣ actual (expected total discounted) cost

Building a model
How to handle lack of a model without horrible data inefficiency? Build one!
‣ hard inference problem; getting it wrong is bad
‣ this is why { all of my data } is a popular model
What do we do with a posterior over models?
‣ just use the MAP model ("certainty equivalent")
‣ compute the posterior over π*: slow, still wrong
‣ even slower: max_π E[Jπ(s) | data, model class]
‣ except policy gradient (Ng's helicopter)

Algorithms for large systems
Policy gradient: no change
Any value-based method: can't even write down J(s) or Q(s, a)
So, use a linear architecture:
  J(s) = Σ_i w_i φ_i(s)
  Q(s, a) = Σ_i w_i φ_i(s, a)
[Figure: feature values q_i(s) and value estimate J(s) vs. state]

Algorithms for large systems
Evaluation: TD(λ), LSTD
Optimization:
‣ policy iteration or actor-critic
‣ e.g., LSTD → LSPI
‣ approximate LP
‣ value iteration: only special cases, e.g., a finite-element grid

Least-squares temporal differences (LSTD)
Data: τ = (s1, a1, c1, s2, a2, c2, …) ∼ π
Want Q(st, at) ≈ (1−γ) ct + γ Q(st+1, at+1)
‣ wᵀΦ(st, at) ≈ (1−γ) ct + γ wᵀΦ(st+1, at+1)
‣ Φ = vector of k features, w = weight vector
Qπ(s, a) = (1 − γ) C(s, a) + γ E[Jπ(s′) | s′ ∼ T(·|s, a)]
Jπ(s) = E[Qπ(s, a) | a ∼ π(·|s)]

LSTD
wᵀΦ(st, at) ≈ (1−γ) ct + γ wᵀΦ(st+1, at+1)
Vector notation (row t of F is Φ(st, at)ᵀ, row t of F1 is Φ(st+1, at+1)ᵀ, c is the vector of costs):
‣ Fw ≈ (1−γ) c + γ F1 w
Overconstrained: multiply both sides by Fᵀ
‣ FᵀFw = (1−γ) Fᵀc + γ FᵀF1 w
(a code sketch of this solve follows at the end of the section)

LSTD: example
100 states in a line; move left or right at cost 1 per state; goals at both ends; discount 0.99
[Figures: feature values q_i(s) and fitted value estimate J(s) vs. state]

LSPI
[Sequence of figures: q_i(s) and estimated J(s) vs. state over successive LSPI iterations on the same 100-state example]
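A rough sketch of the LSTD solve above, not from the lecture: it assumes trajectory data as a list of (s, a, c) tuples gathered while following π, plus a hypothetical feature map phi(s, a). The lstd function, the Gaussian-bump features, and the random-walk simulation of the 100-state chain are all illustrative choices.

# LSTD sketch: solve  F'F w = (1-gamma) F'c + gamma F'F1 w  for the weight vector w,
# where row t of F is Phi(s_t, a_t) and row t of F1 is Phi(s_{t+1}, a_{t+1}).
import numpy as np

def lstd(trajectory, phi, gamma):
    """trajectory: list of (s, a, c) tuples sampled while following pi.
       phi(s, a): feature vector.  Returns w with Q_pi(s, a) ~= w . phi(s, a)."""
    F  = np.array([phi(s, a) for (s, a, c) in trajectory[:-1]])
    F1 = np.array([phi(s, a) for (s, a, c) in trajectory[1:]])
    c  = np.array([cost for (s, a, cost) in trajectory[:-1]])
    A = F.T @ (F - gamma * F1)                 # F'F - gamma F'F1
    b = (1.0 - gamma) * (F.T @ c)
    return np.linalg.lstsq(A, b, rcond=None)[0]   # least-squares solve; robust if A is ill-conditioned

# Illustrative use on the 100-state chain: goals at both ends (cost 0, absorbing),
# cost 1 elsewhere, discount 0.99, uniformly random policy, Gaussian-bump features.
n_states, gamma = 100, 0.99
centers = np.linspace(0, n_states - 1, 10)

def phi(s, a):
    bumps = np.exp(-((s - centers) ** 2) / (2 * 10.0 ** 2))
    feats = np.zeros(2 * len(centers))         # one block of features per action
    feats[a * len(centers):(a + 1) * len(centers)] = bumps
    return feats

rng = np.random.default_rng(0)
s, traj = n_states // 2, []
for _ in range(5000):
    a = int(rng.integers(2))                   # 0 = move left, 1 = move right
    traj.append((s, a, 0.0 if s in (0, n_states - 1) else 1.0))
    if s not in (0, n_states - 1):
        s += 1 if a == 1 else -1

w = lstd(traj, phi, gamma)
Q_hat = np.array([[w @ phi(s, a) for a in (0, 1)] for s in range(n_states)])
greedy = Q_hat.argmin(axis=1)                  # one LSPI-style improvement step

LSPI then wraps this evaluation in policy iteration: fit w for the current policy with LSTD, switch to the greedy (arg min) policy implied by the fitted Q, gather or re-weight data under the new policy, and repeat until the policy stops changing.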

