15-780: Grad AI
Lecture 22: MDPs
Geoff Gordon (this lecture), Tuomas Sandholm
TAs: Erik Zawadzki, Abe Othman

Review: Bayesian learning
Bayesian learning: P(θ | D) = P(D | θ) P(θ) / Z
‣ P(θ): prior over parametric model class
‣ P(D | θ): likelihood
‣ or, P(θ | X, Y) = P(Y | θ, X) P(θ) / Z, as long as X ⊥ θ
Predictive distribution

Review: Bayesian learning
Exact Bayes w/ conjugate prior, or numerical integration (this example: logistic regression)
Or, MLE/MAP
[Figure: logistic regression example]

Review: MDPs
Sequential decision problem under uncertainty
States, actions, costs, transitions, discounting
Policy, execution trace
State-value (J) and action-value (Q) functions
‣ (1−γ) × immediate cost + γ × future cost

Review: MDPs
Tree search
Receding-horizon tree search w/ heuristic
Dynamic programming (value iteration)
Pruning (once we realize a branch is bad), or subsampling scenarios
Curse of dimensionality

Alternate algorithms for "small" systems: policy evaluation
Linear equations: so, Gaussian elimination, biconjugate gradient, Gauss-Seidel iteration, …
‣ VI is essentially the Jacobi iterative method for matrix inversion
Stochastic-gradient-descent-like
‣ TD(λ), Q-learning
Qπ(s, a) = (1 − γ) C(s, a) + γ E[Jπ(s′) | s′ ∼ T(·|s, a)]
Jπ(s) = E[Qπ(s, a) | a ∼ π(·|s)]

Alternate algorithms for "small" systems: policy optimization
Policy iteration: alternately
‣ use any of the above methods to evaluate the current π
‣ replace π with the greedy policy: at each state s, π(s) := arg min_a Q(s, a) (min, since we are minimizing cost)
Actor-critic: like policy iteration, but interleave solving for Jπ and updating π
‣ e.g., run biconjugate gradient for a few steps
‣ warm start: each Jπ is probably similar to the next
SARSA = AC w/ TD(λ) critic, ϵ-greedy policy

(Stochastic) policy gradient
‣ pick a parameterized policy class πθ(a | s)
‣ compute or estimate g = ∇θ Jπ(s1)
‣ θ ← θ − ηg, repeat
More detail:
‣ can estimate g quickly by simulating a few trajectories
‣ can also use the natural gradient to get faster convergence

Alternate algorithms for "small" systems: policy optimization
Linear programming
‣ analogy: use an LP to compute min(3, 6, 5)
‣ note min v. max
max J  s.t.  J ≤ 3,  J ≤ 6,  J ≤ 5

Alternate algorithms for "small" systems: policy optimization
Linear programming
Variables J(s) and Q(s, a) for all s, a
Note: the dual of this LP is interesting
‣ it generalizes single-source shortest paths
max J(s1)  s.t.
  Q(s, a) = (1 − γ) C(s, a) + γ E[J(s′) | s′ ∼ T(·|s, a)]
  J(s) ≤ Q(s, a)  ∀ s, a
(a small code sketch of this LP follows below)
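To make the LP concrete, here is a minimal sketch, not from the lecture, that solves a small tabular cost-MDP with scipy.optimize.linprog. The function name solve_mdp_lp, the toy transition tensor T and cost matrix C, and the choice to maximize Σ_s J(s) rather than J(s1) are all illustrative assumptions; the Q variables are eliminated by substituting their defining equalities into J(s) ≤ Q(s, a).

# Minimal LP sketch for a tabular cost-MDP (illustrative only).
# After substituting Q(s,a) = (1-gamma) C(s,a) + gamma * sum_s2 T(s2|s,a) J(s2)
# into J(s) <= Q(s,a), the only variables left are J(s); maximizing sum_s J(s)
# then recovers the optimal (normalized) cost-to-go at every state.
import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(T, C, gamma):
    """T[s, a, s2]: transition probabilities; C[s, a]: costs in [0, 1]; gamma: discount."""
    n_states, n_actions, _ = T.shape
    # One inequality per (s, a):  J(s) - gamma * sum_s2 T(s2|s,a) J(s2) <= (1-gamma) C(s,a)
    A_ub = np.zeros((n_states * n_actions, n_states))
    b_ub = np.zeros(n_states * n_actions)
    for s in range(n_states):
        for a in range(n_actions):
            row = s * n_actions + a
            A_ub[row, s] += 1.0
            A_ub[row, :] -= gamma * T[s, a, :]
            b_ub[row] = (1.0 - gamma) * C[s, a]
    # linprog minimizes, so maximize sum_s J(s) by minimizing its negation.
    res = linprog(c=-np.ones(n_states), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * n_states, method="highs")
    J = res.x
    # Recover Q and the greedy (cost-minimizing) policy:  pi(s) = arg min_a Q(s, a).
    Q = (1.0 - gamma) * C + gamma * np.einsum("sap,p->sa", T, J)
    return J, Q.argmin(axis=1)

# Made-up 2-state, 2-action example.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
C = np.array([[1.0, 0.5],
              [0.0, 0.2]])
print(solve_mdp_lp(T, C, gamma=0.9))

Because every constraint only bounds J(s) from above by some Q(s, a), the maximization pushes each J(s) up to the smallest of its Q values, exactly the min-via-max trick in the min(3, 6, 5) analogy.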
Model requirements
What do we have to know about the MDP in order to plan?
‣ full model
‣ simulation model
‣ no model: only the real world

Model requirements
VI and LP require a full model
PI and actor-critic inherit the requirements of their policy-evaluation subroutine
TD(λ), SARSA, policy gradient: OK with a simulation model or no model
‣ horribly data-inefficient if used directly on the real world with no model: don't do this!
‣ note: the model can be just { all of my data }

A word on performance measurements
Multiple criteria we might care about:
‣ data (from the real world)
‣ runtime
‣ calls to the model (under some API)
Measure the convergence rate of:
‣ J(s) or Q(s, a)
‣ π(s)
‣ actual (expected total discounted) cost

Building a model
How to handle lack of a model without horrible data inefficiency? Build one!
‣ hard inference problem; getting it wrong is bad
‣ this is why { all of my data } is a popular model
What do we do with a posterior over models?
‣ just use the MAP model ("certainty equivalent")
‣ compute the posterior over π*: slow, still wrong
‣ even slower: max_π E[Jπ(s) | data, model class]
‣ except policy gradient (Ng's helicopter)

Algorithms for large systems
Policy gradient: no change
Any value-based method: can't even write down J(s) or Q(s, a)
So, use a linear architecture:
  J(s) = Σ_i w_i φ_i(s)
  Q(s, a) = Σ_i w_i φ_i(s, a)
[Figure: feature values q_i(s) and value estimate J(s) vs. state]

Algorithms for large systems
Evaluation: TD(λ), LSTD
Optimization:
‣ policy iteration or actor-critic
‣ e.g., LSTD → LSPI
‣ approximate LP
‣ value iteration: only special cases, e.g., a finite-element grid

Least-squares temporal differences (LSTD)
Data: τ = (s1, a1, c1, s2, a2, c2, …) ∼ π
Want Q(st, at) ≈ (1−γ) ct + γ Q(st+1, at+1)
‣ wᵀΦ(st, at) ≈ (1−γ) ct + γ wᵀΦ(st+1, at+1)
‣ Φ = vector of k features, w = weight vector
Qπ(s, a) = (1 − γ) C(s, a) + γ E[Jπ(s′) | s′ ∼ T(·|s, a)]
Jπ(s) = E[Qπ(s, a) | a ∼ π(·|s)]

LSTD
wᵀΦ(st, at) ≈ (1−γ) ct + γ wᵀΦ(st+1, at+1)
Vector notation (row t of F is Φ(st, at)ᵀ, row t of F1 is Φ(st+1, at+1)ᵀ, c is the vector of costs):
‣ Fw ≈ (1−γ) c + γ F1 w
Overconstrained: multiply both sides by Fᵀ
‣ FᵀFw = (1−γ) Fᵀc + γ FᵀF1 w
(a code sketch of this solve follows at the end of the section)

LSTD: example
100 states in a line; move left or right at cost 1 per state; goals at both ends; discount 0.99
[Figures: feature values q_i(s) and fitted value estimate J(s) vs. state]

LSPI
[Sequence of figures: q_i(s) and estimated J(s) vs. state over successive LSPI iterations on the same 100-state example]
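A rough sketch of the LSTD solve above, not from the lecture: it assumes trajectory data as a list of (s, a, c) tuples gathered while following π, plus a hypothetical feature map phi(s, a). The lstd function, the Gaussian-bump features, and the random-walk simulation of the 100-state chain are all illustrative choices.

# LSTD sketch: solve  F'F w = (1-gamma) F'c + gamma F'F1 w  for the weight vector w,
# where row t of F is Phi(s_t, a_t) and row t of F1 is Phi(s_{t+1}, a_{t+1}).
import numpy as np

def lstd(trajectory, phi, gamma):
    """trajectory: list of (s, a, c) tuples sampled while following pi.
       phi(s, a): feature vector.  Returns w with Q_pi(s, a) ~= w . phi(s, a)."""
    F  = np.array([phi(s, a) for (s, a, c) in trajectory[:-1]])
    F1 = np.array([phi(s, a) for (s, a, c) in trajectory[1:]])
    c  = np.array([cost for (s, a, cost) in trajectory[:-1]])
    A = F.T @ (F - gamma * F1)                 # F'F - gamma F'F1
    b = (1.0 - gamma) * (F.T @ c)
    return np.linalg.lstsq(A, b, rcond=None)[0]   # least-squares solve; robust if A is ill-conditioned

# Illustrative use on the 100-state chain: goals at both ends (cost 0, absorbing),
# cost 1 elsewhere, discount 0.99, uniformly random policy, Gaussian-bump features.
n_states, gamma = 100, 0.99
centers = np.linspace(0, n_states - 1, 10)

def phi(s, a):
    bumps = np.exp(-((s - centers) ** 2) / (2 * 10.0 ** 2))
    feats = np.zeros(2 * len(centers))         # one block of features per action
    feats[a * len(centers):(a + 1) * len(centers)] = bumps
    return feats

rng = np.random.default_rng(0)
s, traj = n_states // 2, []
for _ in range(5000):
    a = int(rng.integers(2))                   # 0 = move left, 1 = move right
    traj.append((s, a, 0.0 if s in (0, n_states - 1) else 1.0))
    if s not in (0, n_states - 1):
        s += 1 if a == 1 else -1

w = lstd(traj, phi, gamma)
Q_hat = np.array([[w @ phi(s, a) for a in (0, 1)] for s in range(n_states)])
greedy = Q_hat.argmin(axis=1)                  # one LSPI-style improvement step

LSPI then wraps this evaluation in policy iteration: fit w for the current policy with LSTD, switch to the greedy (arg min) policy implied by the fitted Q, gather or re-weight data under the new policy, and repeat until the policy stops changing.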

