CS 287: Advanced Robotics
Fall 2009
Lecture 12: Reinforcement Learning
Pieter Abbeel, UC Berkeley EECS

Outline
- LP approach for finding the optimal value function of MDPs
- Model-free approaches

Solving an MDP with linear programming

The optimal value function satisfies

  \forall s : V(s) = \max_a \sum_{s'} T(s,a,s') [R(s,a,s') + \gamma V(s')].

We can relax these non-linear equality constraints to inequality constraints:

  \forall s : V(s) \geq \max_a \sum_{s'} T(s,a,s') [R(s,a,s') + \gamma V(s')].

Equivalently (x \geq \max_i y_i is equivalent to \forall i : x \geq y_i), we have

  \forall s, \forall a : V(s) \geq \sum_{s'} T(s,a,s') [R(s,a,s') + \gamma V(s')].   (1)

The relaxation still has the optimal value function as one of its solutions, but we might have introduced new solutions. So we look for an objective function that will favor the optimal value function over the other solutions of (1). To this end, we observe the following monotonicity property of the Bellman operator T:

  \forall s : V_1(s) \geq V_2(s)  implies  \forall s : (T V_1)(s) \geq (T V_2)(s).

Any solution to (1) satisfies V \geq T V, hence by monotonicity also T V \geq T^2 V, hence also T^2 V \geq T^3 V, and so on; since T^k V \to V^* as k \to \infty, stringing these together gives, for any solution V of (1),

  V \geq V^*.

Hence to find V^* as the solution to (1), it suffices to add an objective function which favors the smallest solution:

  \min_V c^\top V   s.t.   \forall s, \forall a : V(s) \geq \sum_{s'} T(s,a,s') [R(s,a,s') + \gamma V(s')].   (2)

If c(s) > 0 for all s, the unique solution to (2) is V^*.

The dual LP

Taking the Lagrange dual of (2), we obtain another interesting LP:

  \max_{\lambda \geq 0} \sum_{s,a,s'} T(s,a,s') \lambda(s,a) R(s,a,s')
  s.t.  \forall s : \sum_a \lambda(s,a) = c(s) + \gamma \sum_{s',a} \lambda(s',a) T(s',a,s)

The dual LP: interpretation
- Meaning of \lambda(s,a)?  (One natural reading: an expected discounted state-action visitation frequency.)
- Meaning of c(s)?  (One natural reading: a weighting over initial states.)

LP approach recap
- Primal (2): minimize c^\top V subject to the relaxed Bellman inequalities (1); the dual is an LP over state-action frequencies \lambda.

Announcements
- PS 1: posted on class website, due Monday October 26.
- Final project abstracts due tomorrow.

Value iteration
- Start with V_0(s) = 0 for all s.
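As a concrete sketch, the primal LP (2) and the value-iteration scheme that starts from V_0 = 0 can both be run on a small MDP and checked against each other. The 2-state, 2-action model below and the use of `scipy.optimize.linprog` are illustrative assumptions, not part of the lecture:

```python
# Toy 2-state, 2-action MDP; all numbers are made up for illustration.
import numpy as np
from scipy.optimize import linprog

n_s, n_a = 2, 2
gamma = 0.9
# T[s, a, s']: transition probabilities; R[s, a, s']: rewards
T = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [0.0, 0.0]]])

# Primal LP (2): min c^T V  s.t.  V(s) >= sum_s' T(s,a,s') [R + gamma V(s')],
# rearranged into linprog's "A_ub @ x <= b_ub" convention.
c = np.ones(n_s)                     # any c(s) > 0 recovers V*
A_ub, b_ub = [], []
for s in range(n_s):
    for a in range(n_a):
        A_ub.append(-np.eye(n_s)[s] + gamma * T[s, a])
        b_ub.append(-T[s, a] @ R[s, a])
res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * n_s)
V_lp = res.x

# Value iteration on the same MDP: start from V_0 = 0 and apply the
# Bellman backup until (numerical) convergence.
V_vi = np.zeros(n_s)
for _ in range(1000):
    V_vi = (T * (R + gamma * V_vi)).sum(axis=2).max(axis=1)
```

Because c(s) > 0 for every state, the LP's unique minimizer coincides with the value-iteration fixed point V*.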
- Iterate until convergence:

  V_{k+1}(s) = \max_a \sum_{s'} T(s,a,s') [R(s,a,s') + \gamma V_k(s')]

Policy iteration
- Policy evaluation: iterate the backup for the fixed policy \pi until the values converge.
- Policy improvement: make \pi greedy with respect to the converged values.

Generalized policy iteration
- Any interleaving of policy evaluation and policy improvement.
- Note: for a particular choice of interleaving, this reduces to value iteration.

Linear programming: as above.

What if T and R are unknown?
- Model-based reinforcement learning:
  - Estimate the model from experience.
  - Solve the MDP as if the estimated model were correct.
- Model-free reinforcement learning:
  - Adaptations of the exact algorithms which only require (s, a, r, s') traces [some of them use (s, a, r, s', a')].
  - No model is built in the process.

Sample average to replace the expectation?
- Who needs T and R? Approximate the expectation with samples (drawn from T!).
- Problem: we need to estimate these too!
- Sample of V^\pi(s): R(s, \pi(s), s') + \gamma V^\pi(s'), with s' drawn from T(s, \pi(s), \cdot); the update to V^\pi(s) averages such samples.
- We could estimate V^\pi(s) for all states simultaneously:
  - Old updates will use very poor estimates of V^\pi(s').
  - This will surely affect our estimates of V^\pi(s) initially, but will it also affect our final estimate?
- Big idea: why bother learning T?
  - Update V(s) each time we experience (s, a, s').
  - Likely s' will contribute updates more often.
- Temporal difference learning (TD, or TD(0)). Policy still fixed!
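The "sample average to replace the expectation" idea can be sketched numerically: hold a policy fixed, draw successors s' from T, and average the targets R + \gamma V(s'). The 2-state chain and all numbers below are made up for illustration:

```python
# Replacing the expectation over s' with a sample average (toy numbers).
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
# Fixed policy pi on a 2-state chain: T_pi[s, s'] = T(s, pi(s), s')
T_pi = np.array([[0.7, 0.3],
                 [0.4, 0.6]])
R_pi = np.array([[1.0, 0.0],
                 [0.0, 2.0]])   # R(s, pi(s), s')
V = np.array([3.0, 5.0])        # some current estimate of V_pi

s = 0
# Exact backup: sum_s' T(s,pi(s),s') [R(s,pi(s),s') + gamma V(s')]
exact = T_pi[s] @ (R_pi[s] + gamma * V)

# Sampled backup: draw s' ~ T(s, pi(s), .) and average the targets.
n = 100_000
s_next = rng.choice(2, size=n, p=T_pi[s])
targets = R_pi[s, s_next] + gamma * V[s_next]
approx = targets.mean()
# approx approaches exact as n grows (law of large numbers)
```

The catch the slides point out: the targets themselves contain the estimate V(s'), so in practice these quantities are being estimated at the same time.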
Move values toward the value of whatever successor actually occurs: a running average!
- Sample of V^\pi(s): sample = R(s, \pi(s), s') + \gamma V^\pi(s')
- Update to V^\pi(s): V^\pi(s) \leftarrow (1 - \alpha) V^\pi(s) + \alpha \cdot sample
- Same update, rewritten: V^\pi(s) \leftarrow V^\pi(s) + \alpha (sample - V^\pi(s))

Exponential Moving Average
- Weighted averages emphasize certain samples.
- The exponential moving average makes recent samples more important and forgets about the past (which contains mistakes in TD).
- Easy to compute from the running average.
- A decreasing learning rate can give converging averages.

TD(0) for estimating V^\pi
- Note: the value being estimated here is really V^\pi.

Convergence guarantees for TD(0)
- Convergence with probability 1 for the states which are visited infinitely often, provided the step-size parameter decreases according to the "usual" stochastic approximation conditions:

  \sum_{k=0}^{\infty} \alpha_k = \infty,   \sum_{k=0}^{\infty} \alpha_k^2 < \infty

- Examples: \alpha_k = 1/k, \alpha_k = C/(C+k).

Experience
- If only a limited number of trials is available: one can repeatedly go through the data and perform the TD updates again.
- Under this procedure, the values converge to the values under the empirical transition and reward model.
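A minimal TD(0) sketch under these step-size conditions, using \alpha_k = 1/k per state: the 2-state chain, its numbers, and the run length are all assumptions for illustration, and the closed-form solve of the Bellman equations gives V^\pi for comparison:

```python
# TD(0) evaluation of a fixed policy on a toy 2-state chain
# (MDP numbers and step-size schedule are illustrative assumptions).
import numpy as np

rng = np.random.default_rng(1)
gamma = 0.5
T_pi = np.array([[0.7, 0.3],
                 [0.4, 0.6]])   # T(s, pi(s), s')
R_pi = np.array([[1.0, 0.0],
                 [0.0, 2.0]])   # R(s, pi(s), s')

# Ground truth for comparison: V_pi = (I - gamma T_pi)^{-1} r
r = (T_pi * R_pi).sum(axis=1)
V_true = np.linalg.solve(np.eye(2) - gamma * T_pi, r)

V = np.zeros(2)
visits = np.zeros(2)
s = 0
for _ in range(200_000):
    s_next = rng.choice(2, p=T_pi[s])
    reward = R_pi[s, s_next]
    visits[s] += 1
    alpha = 1.0 / visits[s]   # 1/k: sum alpha_k = inf, sum alpha_k^2 < inf
    # TD(0): move V(s) toward the sampled target reward + gamma V(s')
    V[s] += alpha * (reward + gamma * V[s_next] - V[s])
    s = s_next
```

Both states of this chain are visited infinitely often under \pi, so the conditions above apply and the TD estimates settle near V^\pi.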