CS 287: Advanced Robotics
Fall 2009
Lecture 13: Reinforcement Learning
Pieter Abbeel
UC Berkeley EECS

Outline
- Model-free approaches
- Recap: TD(0)
- Sarsa
- Q learning
- TD(λ), sarsa(λ), Q(λ)
- Function approximation and TD
- TD Gammon

TD(0) for estimating V^π
- Stochastic version of the policy evaluation update:
    V(s_t) ← V(s_t) + α [R(s_t) + γ V(s_{t+1}) − V(s_t)]
- Note: this is really V^π.

Problems with TD value learning
- TD value learning is model-free for policy evaluation.
- However, if we want to turn our value estimates into a policy---as required for a policy update step---we're sunk: extracting the greedy policy from V requires a one-step lookahead through the transition model.
- Idea: learn Q-values directly.
- Makes action selection model-free too!

Update Q values directly
- When experiencing (s_t, a_t, s_{t+1}, r_{t+1}, a_{t+1}), perform the following "sarsa" update:
    Q^π(s_t, a_t) ← (1 − α) Q^π(s_t, a_t) + α [r(s_t, a_t, s_{t+1}) + γ Q^π(s_{t+1}, a_{t+1})]
                  = Q^π(s_t, a_t) + α [r(s_t, a_t, s_{t+1}) + γ Q^π(s_{t+1}, a_{t+1}) − Q^π(s_t, a_t)]
- Will find the Q values for the current policy π.
- How about Q(s, a) for an action a inconsistent with the policy π at state s?
- Converges (w.p. 1) to the Q function of the current policy π for all states and actions *if* all states and actions are visited infinitely often (assuming proper step-sizing).

Exploration aspect
- To ensure convergence for all Q(s, a) we need to visit every (s, a) infinitely often.
- The policy π needs to include some randomness.
- Simplest: random actions (ε-greedy)
  - Every time step, flip a coin.
  - With probability ε, act randomly.
  - With probability 1 − ε, act according to some current policy.
- This results in a new policy π'.
- We end up finding the Q values for this new policy π'.

Does policy iteration still work when we execute ε-greedy policies?
- Policy iteration iterates:
  - Evaluate the value of the current policy, V^π.
  - Improve the policy by choosing the greedy policy w.r.t. V^π.
- Answer: executing ε-greedy policies can be interpreted as running policy iteration w.r.t. a related MDP which differs slightly in its transition model: with probability ε the transition is according to a random action in the new MDP.

Need not wait till convergence with the policy improvement step
- Recall: generalized policy iteration methods interleave policy improvement and policy evaluation, and are guaranteed to converge to the optimal policy as long as the value of every state is updated infinitely often.
- Sarsa: continuously update the policy by choosing actions ε-greedy w.r.t. the current Q function.

Sarsa: updates Q values directly
- Sarsa converges w.p. 1 to an optimal policy and action-value function as long as all state-action pairs are visited an infinite number of times and the policy converges in the limit to the greedy policy (which can be arranged, e.g., by having ε-greedy policies with ε = 1/t).

Q learning
- Directly approximate the optimal Q function Q*:
    Q(s_t, a_t) ← (1 − α) Q(s_t, a_t) + α [r(s_t, a_t, s_{t+1}) + γ max_a Q(s_{t+1}, a)]
- Compare to sarsa:
    Q^π(s_t, a_t) ← (1 − α) Q^π(s_t, a_t) + α [r(s_t, a_t, s_{t+1}) + γ Q^π(s_{t+1}, a_{t+1})]

Q-learning properties
- Will converge to the optimal Q function if:
  - Every (s, a) is visited infinitely often.
  - α is chosen to decay according to standard stochastic approximation requirements.
- Neat property: learns optimal Q-values regardless of the policy used to collect the experience---an "off-policy" method.
- Strictly better than TD, sarsa? Some caveats.

Behaviour of Q-learning vs. sarsa
- Reward = 0 at goal; −100 in the cliff region; −1 everywhere else.
- ε = 0.1
- (A code sketch contrasting the two updates on this cliff world follows below.)
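Not from the slides: a minimal Python sketch contrasting the sarsa and Q-learning updates on a cliff-walking grid of the kind described above. The grid layout, step size α, ε, and episode count are illustrative assumptions; only the two bootstrap targets differ between the algorithms.

```python
# Minimal sketch (not from the slides): tabular sarsa vs. Q-learning with an
# epsilon-greedy behaviour policy on a small cliff-walking grid.
# Grid layout, alpha, epsilon, and episode count are illustrative assumptions.
import random
from collections import defaultdict

ROWS, COLS = 4, 12
START, GOAL = (3, 0), (3, 11)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, action):
    """One transition of the cliff world: -1 per step, -100 for the cliff."""
    r, c = state
    dr, dc = action
    r, c = min(max(r + dr, 0), ROWS - 1), min(max(c + dc, 0), COLS - 1)
    if r == 3 and 0 < c < 11:                   # stepped into the cliff
        return START, -100.0, False
    return (r, c), (0.0 if (r, c) == GOAL else -1.0), (r, c) == GOAL

def eps_greedy(Q, s, eps):
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

def run(algorithm, episodes=500, alpha=0.5, gamma=1.0, eps=0.1):
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = START, False
        a = eps_greedy(Q, s, eps)
        while not done:
            s2, reward, done = step(s, a)
            a2 = eps_greedy(Q, s2, eps)
            if algorithm == "sarsa":
                # on-policy target: uses the action actually taken next
                target = reward + gamma * Q[(s2, a2)] * (not done)
            else:
                # Q-learning target: max over next actions (off-policy)
                target = reward + gamma * max(Q[(s2, b)] for b in ACTIONS) * (not done)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q

if __name__ == "__main__":
    random.seed(0)
    for alg in ("sarsa", "qlearning"):
        Q = run(alg)
        # greedy path length from the start, just to compare the learned policies
        s, steps = START, 0
        while s != GOAL and steps < 50:
            s, _, _ = step(s, eps_greedy(Q, s, eps=0.0))
            steps += 1
        print(alg, "greedy path length:", steps)
```

Because Q-learning's target maxes over next actions while the behaviour policy stays ε-greedy, it typically learns the short path along the cliff edge, whereas sarsa accounts for its own exploration and learns a safer, longer path.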
Exploration / Exploitation
- Several schemes for forcing exploration.
- Simplest: random actions (ε-greedy)
  - Every time step, flip a coin.
  - With probability ε, act randomly.
  - With probability 1 − ε, act according to the current policy.
- Problems with random actions?
  - You do explore the space, but keep thrashing around once learning is done.
  - Takes a long time to explore certain spaces.
- One solution: lower ε over time.
- Another solution: exploration functions.

Exploration functions
- When to explore?
  - Random actions: explore a fixed amount.
  - Better idea: explore areas whose badness is not (yet) established.
- Exploration function: takes a value estimate and a count, and returns an optimistic utility, e.g., by adding a bonus based on the visit count. (Exact form not important---for optimality guarantees it should ensure that every (s, a) is visited infinitely often _or_ that Q(s, a) is always optimistic.)

TD(λ) --- motivation (grid world)

TD(λ) --- motivation
- At time t:
    V(s_t) ← V(s_t) + α [R(s_t) + γ V(s_{t+1}) − V(s_t)]
- At time t+1:
    V(s_{t+1}) ← V(s_{t+1}) + α [R(s_{t+1}) + γ V(s_{t+2}) − V(s_{t+1})]
  also perform:
    V(s_t) ← V(s_t) + α γλ δ_{t+1}
- At time t+2:
    V(s_{t+2}) ← V(s_{t+2}) + α [R(s_{t+2}) + γ V(s_{t+3}) − V(s_{t+2})]
  also:
    V(s_{t+1}) ← V(s_{t+1}) + α γλ δ_{t+2}
    V(s_t) ← V(s_t) + α γ²λ² δ_{t+2}
- This is the TD(λ) "backward view".

TD(λ) --- backward view, wordy
The TD(0) update at time t is
    V(s_t) ← V(s_t) + α [R(s_t) + γ V(s_{t+1}) − V(s_t)],   with δ_t = R(s_t) + γ V(s_{t+1}) − V(s_t).
Similarly, the update at the next time step is
    V(s_{t+1}) ← V(s_{t+1}) + α [R(s_{t+1}) + γ V(s_{t+2}) − V(s_{t+1})],   with δ_{t+1} defined analogously.
Note that at the next time step we update V(s_{t+1}). This (crudely speaking) results in having a better estimate of the value function for state s_{t+1}. TD(λ) takes advantage of the availability of this better estimate to improve the update we performed for V(s_t) in the previous step of the algorithm. Concretely, TD(λ) performs another update on V(s_t) to account for our improved estimate of V(s_{t+1}), as follows:
    V(s_t) ← V(s_t) + α γλ δ_{t+1}
where λ is a fudge factor that determines how heavily we weight changes in the value function for s_{t+1}.
Similarly, at time t+2 we perform the following set of updates:
    V(s_{t+2}) ← V(s_{t+2}) + α [R(s_{t+2}) + γ V(s_{t+3}) − V(s_{t+2})] = V(s_{t+2}) + α δ_{t+2}
    V(s_{t+1}) ← V(s_{t+1}) + α γλ δ_{t+2}
    V(s_t) ← V(s_t) + α γ²λ² δ_{t+2} = V(s_t) + α e(s_t) δ_{t+2}
The term e(s_t) is called the eligibility vector.

TD(λ)

TD(λ) --- example
[Figure: 19-state random walk.]
- Random walk over 19 states. The left- and rightmost states are sinks.
- Rewards are always zero, except when entering the right sink.

TD(λ) --- "forward view"
- TD: V(s_t) ← (1 − α) V(s_t) + α · sample, where the sample can bootstrap after any number of steps:
    R(s_t) + γ V(s_{t+1})
    R(s_t) + γ R(s_{t+1}) + γ² V(s_{t+2})
    R(s_t) + γ R(s_{t+1}) + γ² R(s_{t+2}) + γ³ V(s_{t+3})
    ...
    R(s_t) + γ R(s_{t+1}) + γ² R(s_{t+2}) + ... + γ^T R(s_T)
- TD(λ) weights these samples using λ ∈ [0, 1].
- The forward view is equivalent to the backward view.

Sarsa(λ)

Watkins Q(λ)

Replacing traces
- Recall TD(λ).
- What if a state is visited at two different times t1 and t2?
- (See the code sketch below for accumulating vs. replacing traces.)

Replacing traces: example 1
[Figure: random walk example.]
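Not from the slides: a minimal Python sketch of the tabular TD(λ) backward view on a 19-state random walk like the one in the example above, with a flag switching between accumulating and replacing eligibility traces. The reward magnitude, α, λ, start state, and episode count are illustrative assumptions.

```python
# Minimal sketch (not from the slides): tabular TD(lambda), backward view, on a
# 19-state random walk with two absorbing ends.  Reward magnitude, alpha, and
# lambda are illustrative assumptions; `replacing=True` uses replacing traces,
# otherwise traces accumulate when a state is revisited within an episode.
import random

N = 19                                      # non-terminal states 1..N; 0 and N+1 are sinks
GAMMA, ALPHA, LAM = 1.0, 0.1, 0.8

def run_episode(V, replacing=True):
    e = [0.0] * (N + 2)                     # eligibility trace per state
    s = (N + 1) // 2                        # start in the middle of the chain
    while 0 < s < N + 1:
        s2 = s + random.choice([-1, 1])     # random-walk policy
        # reward only when entering the right sink (magnitude is an assumption)
        r = 1.0 if s2 == N + 1 else 0.0
        v_next = 0.0 if s2 in (0, N + 1) else V[s2]
        delta = r + GAMMA * v_next - V[s]   # TD error delta_t

        # bump the eligibility of the state just visited
        if replacing:
            e[s] = 1.0                      # replacing trace: reset to 1
        else:
            e[s] += 1.0                     # accumulating trace: increment

        # apply the TD error to every eligible state, then decay the traces
        for x in range(1, N + 1):
            V[x] += ALPHA * delta * e[x]
            e[x] *= GAMMA * LAM
        s = s2
    return V

if __name__ == "__main__":
    random.seed(0)
    V = [0.0] * (N + 2)
    for _ in range(2000):
        run_episode(V, replacing=True)
    # values should roughly increase from the left sink toward the right sink
    print([round(v, 2) for v in V[1:N + 1]])
```

With replacing traces, a state revisited within an episode has its eligibility reset to 1 rather than incremented, which is one answer to the repeated-visit question raised on the replacing-traces slide.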