Berkeley COMPSCI 287 - Lecture 12: Reinforcement Learning


CS 287: Advanced Robotics (Fall 2009)
Lecture 12: Reinforcement Learning
Pieter Abbeel, UC Berkeley EECS

Outline
- LP approach for finding the optimal value function of MDPs
- Model-free approaches

Solving an MDP with linear programming

The optimal value function satisfies the Bellman equations:

    ∀s:  V(s) = max_a Σ_{s'} T(s,a,s') [R(s,a,s') + γ V(s')].

We can relax these non-linear equality constraints to inequality constraints:

    ∀s:  V(s) ≥ max_a Σ_{s'} T(s,a,s') [R(s,a,s') + γ V(s')].

Equivalently (x ≥ max_i y_i is equivalent to ∀i: x ≥ y_i), we have:

    ∀s, ∀a:  V(s) ≥ Σ_{s'} T(s,a,s') [R(s,a,s') + γ V(s')].        (1)

The relaxation still has the optimal value function as one of its solutions, but we might have introduced new solutions. So we look for an objective function that will favor the optimal value function over the other solutions of (1). To this end, we use the following monotonicity property of the Bellman operator T (the backup operator, not to be confused with the transition model T):

    ∀s: V1(s) ≥ V2(s)   implies   ∀s: (T V1)(s) ≥ (T V2)(s).

Any solution to (1) satisfies V ≥ T V, hence also T V ≥ T²V, hence also T²V ≥ T³V, and so on, with T^∞ V = V*. Stringing these inequalities together, any solution V of (1) satisfies:

    V ≥ V*.

Hence, to find V* as a solution of (1), it suffices to add an objective function that favors the smallest solution:

    min_V  c⊤V   s.t.   ∀s, ∀a:  V(s) ≥ Σ_{s'} T(s,a,s') [R(s,a,s') + γ V(s')].        (2)

If c(s) > 0 for all s, the unique solution to (2) is V*.

The dual LP

Taking the Lagrange dual of (2), we obtain another interesting LP:

    max_{λ≥0}  Σ_{s,a,s'} T(s,a,s') λ(s,a) R(s,a,s')
    s.t.  ∀s:  Σ_a λ(s,a) = c(s) + γ Σ_{s',a} λ(s',a) T(s',a,s).

Interpretation questions: what is the meaning of λ(s,a)? What is the meaning of c(s)?

Announcements
- PS 1: posted on the class website, due Monday, October 26.
- Final project abstracts due tomorrow.

LP approach recap

Exact methods for computing the optimal value function:
- Value iteration: start with V0(s) = 0 for all s and iterate until convergence:

      V_{k+1}(s) = max_a Σ_{s'} T(s,a,s') [R(s,a,s') + γ V_k(s')].

- Policy iteration:
  - Policy evaluation: iterate the same backup with the action fixed to π(s) until the values converge.
  - Policy improvement: set π(s) to the greedy action with respect to the current value estimates.
- Generalized policy iteration: any interleaving of policy evaluation and policy improvement. Note: for a particular choice of interleaving, this reduces to value iteration.
- Linear programming: the primal LP (2) or its dual, as derived above (a small numerical sketch follows this list).
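A minimal numerical sketch of the primal LP (2), assuming a small randomly generated MDP and the choice c(s) = 1 for all s (both made up for illustration): it builds the constraints of (2), solves them with scipy.optimize.linprog, and cross-checks the result against value iteration.

```python
import numpy as np
from scipy.optimize import linprog

# A small synthetic MDP (sizes, T and R are made up for this example).
rng = np.random.default_rng(0)
nS, nA, gamma = 5, 3, 0.9
T = rng.random((nS, nA, nS))
T /= T.sum(axis=2, keepdims=True)            # T[s, a, :] is a distribution over s'
R = rng.random((nS, nA, nS))                 # R[s, a, s']

# Expected immediate reward: r(s, a) = sum_{s'} T(s,a,s') R(s,a,s').
r = (T * R).sum(axis=2)

# Primal LP (2):  min_V c^T V  s.t.  V(s) >= r(s,a) + gamma * sum_{s'} T(s,a,s') V(s').
# Rearranged into linprog's  A_ub @ x <= b_ub  form:
#   gamma * sum_{s'} T(s,a,s') V(s') - V(s) <= -r(s, a)   for every (s, a).
c = np.ones(nS)                              # any c(s) > 0 works
A_ub = np.zeros((nS * nA, nS))
b_ub = np.zeros(nS * nA)
for s in range(nS):
    for a in range(nA):
        row = s * nA + a
        A_ub[row] = gamma * T[s, a]
        A_ub[row, s] -= 1.0
        b_ub[row] = -r[s, a]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(None, None))  # V is free in sign
V_lp = res.x

# Cross-check: value iteration on the same MDP converges to the same V*.
V = np.zeros(nS)
for _ in range(2000):
    V = (r + gamma * (T @ V)).max(axis=1)    # V_{k+1}(s) = max_a [r(s,a) + gamma * E V_k(s')]
print(np.allclose(V_lp, V, atol=1e-4))       # expected: True
```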
What if T and R are unknown?
- Model-based reinforcement learning: estimate the model from experience, then solve the MDP as if the estimated model were correct.
- Model-free reinforcement learning: adaptations of the exact algorithms that only require (s, a, r, s') traces [some of them use (s, a, r, s', a')]. No model is built in the process.

Sample average to replace expectation?
- Who needs T and R? Approximate the expectation with samples (drawn from T!).
- Problem: we need to estimate these values too!
- We could estimate Vπ(s) for all states simultaneously:

      sample of Vπ(s):   sample = R(s, π(s), s') + γ Vπ(s')
      update to Vπ(s):   Vπ(s) ← (1 - α) Vπ(s) + α · sample
      same update:       Vπ(s) ← Vπ(s) + α (sample - Vπ(s))

- Old updates will use very poor estimates of Vπ(s'). This will surely affect our estimates of Vπ(s) initially, but will this also affect our final estimate?

Temporal difference learning (TD or TD(0))
- Big idea: why bother learning T?
  - Update V(s) each time we experience a transition (s, a, s').
  - Likely s' will contribute updates more often.
- Policy still fixed!
- Move values toward the value of whatever successor occurs: a running average, with the same sample and update as above.

Exponential moving average
- Weighted averages emphasize certain samples.
- Exponential moving average:
  - makes recent samples more important,
  - forgets about the past (which contains mistakes in TD),
  - is easy to compute from the running average:

        avg_n = (1 - α) · avg_{n-1} + α · x_n.

- Decreasing learning rate can give converging averages.

TD(0) for estimating Vπ
- Repeatedly apply the update above along transitions (s, r, s') experienced while following π. Note: the value function being estimated here is really Vπ. (A minimal tabular sketch follows at the end of these notes.)

Convergence guarantees for TD(0)
- Convergence with probability 1 for the states which are visited infinitely often, provided the step-size parameter decreases according to the "usual" stochastic approximation conditions:

      Σ_{k=0}^∞ α_k = ∞   and   Σ_{k=0}^∞ α_k² < ∞.

- Examples: α_k = 1/k, α_k = C/(C + k).

Experience
- If only a limited number of trials is available: we could repeatedly go through the data and perform the TD updates again.
- Under this procedure, the values will converge to the values under the empirical transition and reward model.
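A minimal tabular TD(0) sketch for estimating Vπ, assuming a small randomly generated MDP, a fixed randomly chosen policy, and the decreasing step size α = C/(C + k) with a made-up constant C = 10, where k is the per-state visit count; the update and the step-size schedule follow the conditions above, and the result is compared against exact policy evaluation.

```python
import numpy as np

# A small synthetic MDP and a fixed policy (all made up for this example).
rng = np.random.default_rng(1)
nS, nA, gamma = 5, 2, 0.9
T = rng.random((nS, nA, nS))
T /= T.sum(axis=2, keepdims=True)
R = rng.random((nS, nA, nS))
pi = rng.integers(nA, size=nS)               # pi[s] is the (fixed) action taken in state s

# TD(0): V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)) along experienced transitions.
V = np.zeros(nS)
visits = np.zeros(nS)                        # per-state visit counts k for the step size
C = 10.0                                     # step-size constant in alpha = C / (C + k)
s = 0
for _ in range(200_000):
    a = pi[s]
    s_next = rng.choice(nS, p=T[s, a])       # sample s' ~ T(s, a, .)
    reward = R[s, a, s_next]
    visits[s] += 1
    alpha = C / (C + visits[s])              # satisfies sum alpha = inf, sum alpha^2 < inf
    sample = reward + gamma * V[s_next]      # sample of V_pi(s)
    V[s] += alpha * (sample - V[s])          # same as V(s) <- (1 - alpha) V(s) + alpha * sample
    s = s_next

# Reference: exact policy evaluation, V_pi = (I - gamma * T_pi)^(-1) r_pi.
T_pi = T[np.arange(nS), pi]                  # T_pi[s, s'] = T(s, pi(s), s')
r_pi = (T_pi * R[np.arange(nS), pi]).sum(axis=1)
V_exact = np.linalg.solve(np.eye(nS) - gamma * T_pi, r_pi)
print(np.round(V, 2))                        # TD(0) estimate
print(np.round(V_exact, 2))                  # should be close to the estimate above
```

The same update can also be swept repeatedly over a fixed set of stored transitions, which is the limited-data procedure described under "Experience" above.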

