CS 287: Advanced Robotics
Fall 2009
Lecture 13: Reinforcement Learning
Pieter Abbeel
UC Berkeley EECS

Outline
- Model-free approaches
- Recap: TD(0)
- Sarsa
- Q learning
- TD(λ), sarsa(λ), Q(λ)
- Function approximation and TD
- TD Gammon

TD(0) for estimating V^π
- Stochastic version of the policy evaluation update:
    V(s_t) ← V(s_t) + α [R(s_t) + γ V(s_{t+1}) − V(s_t)]
- Note: this is really V^π.

Problems with TD value learning
- TD value learning is model-free for policy evaluation.
- However, if we want to turn our value estimates into a policy---as required for a policy update step---we're sunk: extracting the greedy policy from V requires a one-step lookahead through the transition model.
- Idea: learn Q-values directly.
- Makes action selection model-free too!

Update Q values directly
- When experiencing (s_t, a_t, s_{t+1}, r_{t+1}, a_{t+1}), perform the following "sarsa" update:
    Q^π(s_t, a_t) ← (1 − α) Q^π(s_t, a_t) + α [r(s_t, a_t, s_{t+1}) + γ Q^π(s_{t+1}, a_{t+1})]
                  = Q^π(s_t, a_t) + α [r(s_t, a_t, s_{t+1}) + γ Q^π(s_{t+1}, a_{t+1}) − Q^π(s_t, a_t)]
- Will find the Q values for the current policy π.
- How about Q(s, a) for an action a inconsistent with the policy π at state s?
- Converges (w.p. 1) to the Q function of the current policy π for all states and actions *if* all states and actions are visited infinitely often (assuming proper step-sizing).

Exploration aspect
- To ensure convergence for all Q(s, a) we need to visit every (s, a) infinitely often.
- The policy π needs to include some randomness.
- Simplest: random actions (ε-greedy)
  - Every time step, flip a coin.
  - With probability ε, act randomly.
  - With probability 1 − ε, act according to some current policy.
- This results in a new policy π'.
- We end up finding the Q values for this new policy π'.

Does policy iteration still work when we execute ε-greedy policies?
- Policy iteration iterates:
  - Evaluate the value of the current policy, V^π.
  - Improve the policy by choosing the greedy policy w.r.t. V^π.
- Answer: executing ε-greedy policies can be interpreted as running policy iteration w.r.t. a related MDP which differs slightly in its transition model: with probability ε the transition is according to a random action in the new MDP.

Need not wait till convergence with the policy improvement step
- Recall: generalized policy iteration methods interleave policy improvement and policy evaluation, and are guaranteed to converge to the optimal policy as long as the value of every state is updated infinitely often.
- Sarsa: continuously update the policy by choosing actions ε-greedy w.r.t. the current Q function.

Sarsa: updates Q values directly
- Sarsa converges w.p. 1 to an optimal policy and action-value function as long as all state-action pairs are visited an infinite number of times and the policy converges in the limit to the greedy policy (which can be arranged, e.g., by having ε-greedy policies with ε = 1/t).

Q learning
- Directly approximate the optimal Q function Q*:
    Q(s_t, a_t) ← (1 − α) Q(s_t, a_t) + α [r(s_t, a_t, s_{t+1}) + γ max_a Q(s_{t+1}, a)]
- Compare to sarsa:
    Q^π(s_t, a_t) ← (1 − α) Q^π(s_t, a_t) + α [r(s_t, a_t, s_{t+1}) + γ Q^π(s_{t+1}, a_{t+1})]

Q-learning properties
- Will converge to the optimal Q function if:
  - Every (s, a) is visited infinitely often.
  - α is chosen to decay according to standard stochastic approximation requirements.
- Neat property: learns optimal Q-values regardless of the policy used to collect the experience---an "off-policy" method.
- Strictly better than TD, sarsa? Some caveats.

Behaviour of Q-learning vs. sarsa
- Reward = 0 at goal; −100 in the cliff region; −1 everywhere else.
- ε = 0.1
- (A code sketch contrasting the two updates on this cliff world follows below.)
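Not from the slides: a minimal Python sketch contrasting the sarsa and Q-learning updates on a cliff-walking grid of the kind described above. The grid layout, step size α, ε, and episode count are illustrative assumptions; only the two bootstrap targets differ between the algorithms.

```python
# Minimal sketch (not from the slides): tabular sarsa vs. Q-learning with an
# epsilon-greedy behaviour policy on a small cliff-walking grid.
# Grid layout, alpha, epsilon, and episode count are illustrative assumptions.
import random
from collections import defaultdict

ROWS, COLS = 4, 12
START, GOAL = (3, 0), (3, 11)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, action):
    """One transition of the cliff world: -1 per step, -100 for the cliff."""
    r, c = state
    dr, dc = action
    r, c = min(max(r + dr, 0), ROWS - 1), min(max(c + dc, 0), COLS - 1)
    if r == 3 and 0 < c < 11:                   # stepped into the cliff
        return START, -100.0, False
    return (r, c), (0.0 if (r, c) == GOAL else -1.0), (r, c) == GOAL

def eps_greedy(Q, s, eps):
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

def run(algorithm, episodes=500, alpha=0.5, gamma=1.0, eps=0.1):
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = START, False
        a = eps_greedy(Q, s, eps)
        while not done:
            s2, reward, done = step(s, a)
            a2 = eps_greedy(Q, s2, eps)
            if algorithm == "sarsa":
                # on-policy target: uses the action actually taken next
                target = reward + gamma * Q[(s2, a2)] * (not done)
            else:
                # Q-learning target: max over next actions (off-policy)
                target = reward + gamma * max(Q[(s2, b)] for b in ACTIONS) * (not done)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q

if __name__ == "__main__":
    random.seed(0)
    for alg in ("sarsa", "qlearning"):
        Q = run(alg)
        # greedy path length from the start, just to compare the learned policies
        s, steps = START, 0
        while s != GOAL and steps < 50:
            s, _, _ = step(s, eps_greedy(Q, s, eps=0.0))
            steps += 1
        print(alg, "greedy path length:", steps)
```

Because Q-learning's target maxes over next actions while the behaviour policy stays ε-greedy, it typically learns the short path along the cliff edge, whereas sarsa accounts for its own exploration and learns a safer, longer path.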
Exploration / Exploitation
- Several schemes for forcing exploration.
- Simplest: random actions (ε-greedy)
  - Every time step, flip a coin.
  - With probability ε, act randomly.
  - With probability 1 − ε, act according to the current policy.
- Problems with random actions?
  - You do explore the space, but keep thrashing around once learning is done.
  - Takes a long time to explore certain spaces.
- One solution: lower ε over time.
- Another solution: exploration functions.

Exploration functions
- When to explore?
  - Random actions: explore a fixed amount.
  - Better idea: explore areas whose badness is not (yet) established.
- Exploration function: takes a value estimate and a count, and returns an optimistic utility, e.g., by adding a bonus based on the visit count. (Exact form not important---for optimality guarantees it should ensure that every (s, a) is visited infinitely often _or_ that Q(s, a) is always optimistic.)

TD(λ) --- motivation (grid world)

TD(λ) --- motivation
- At time t:
    V(s_t) ← V(s_t) + α [R(s_t) + γ V(s_{t+1}) − V(s_t)]
- At time t+1:
    V(s_{t+1}) ← V(s_{t+1}) + α [R(s_{t+1}) + γ V(s_{t+2}) − V(s_{t+1})]
  also perform:
    V(s_t) ← V(s_t) + α γλ δ_{t+1}
- At time t+2:
    V(s_{t+2}) ← V(s_{t+2}) + α [R(s_{t+2}) + γ V(s_{t+3}) − V(s_{t+2})]
  also:
    V(s_{t+1}) ← V(s_{t+1}) + α γλ δ_{t+2}
    V(s_t) ← V(s_t) + α γ²λ² δ_{t+2}
- This is the TD(λ) "backward view".

TD(λ) --- backward view, wordy
The TD(0) update at time t is
    V(s_t) ← V(s_t) + α [R(s_t) + γ V(s_{t+1}) − V(s_t)],   with δ_t = R(s_t) + γ V(s_{t+1}) − V(s_t).
Similarly, the update at the next time step is
    V(s_{t+1}) ← V(s_{t+1}) + α [R(s_{t+1}) + γ V(s_{t+2}) − V(s_{t+1})],   with δ_{t+1} defined analogously.
Note that at the next time step we update V(s_{t+1}). This (crudely speaking) results in having a better estimate of the value function for state s_{t+1}. TD(λ) takes advantage of the availability of this better estimate to improve the update we performed for V(s_t) in the previous step of the algorithm. Concretely, TD(λ) performs another update on V(s_t) to account for our improved estimate of V(s_{t+1}), as follows:
    V(s_t) ← V(s_t) + α γλ δ_{t+1}
where λ is a fudge factor that determines how heavily we weight changes in the value function for s_{t+1}.
Similarly, at time t+2 we perform the following set of updates:
    V(s_{t+2}) ← V(s_{t+2}) + α [R(s_{t+2}) + γ V(s_{t+3}) − V(s_{t+2})] = V(s_{t+2}) + α δ_{t+2}
    V(s_{t+1}) ← V(s_{t+1}) + α γλ δ_{t+2}
    V(s_t) ← V(s_t) + α γ²λ² δ_{t+2} = V(s_t) + α e(s_t) δ_{t+2}
The term e(s_t) is called the eligibility vector.

TD(λ)

TD(λ) --- example
[Figure: 19-state random walk.]
- Random walk over 19 states. The left- and rightmost states are sinks.
- Rewards are always zero, except when entering the right sink.

TD(λ) --- "forward view"
- TD: V(s_t) ← (1 − α) V(s_t) + α · sample, where the sample can bootstrap after any number of steps:
    R(s_t) + γ V(s_{t+1})
    R(s_t) + γ R(s_{t+1}) + γ² V(s_{t+2})
    R(s_t) + γ R(s_{t+1}) + γ² R(s_{t+2}) + γ³ V(s_{t+3})
    ...
    R(s_t) + γ R(s_{t+1}) + γ² R(s_{t+2}) + ... + γ^T R(s_T)
- TD(λ) weights these samples using λ ∈ [0, 1].
- The forward view is equivalent to the backward view.

Sarsa(λ)

Watkins Q(λ)

Replacing traces
- Recall TD(λ).
- What if a state is visited at two different times t1 and t2?
- (See the code sketch below for accumulating vs. replacing traces.)

Replacing traces: example 1
[Figure: random walk example.]
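Not from the slides: a minimal Python sketch of the tabular TD(λ) backward view on a 19-state random walk like the one in the example above, with a flag switching between accumulating and replacing eligibility traces. The reward magnitude, α, λ, start state, and episode count are illustrative assumptions.

```python
# Minimal sketch (not from the slides): tabular TD(lambda), backward view, on a
# 19-state random walk with two absorbing ends.  Reward magnitude, alpha, and
# lambda are illustrative assumptions; `replacing=True` uses replacing traces,
# otherwise traces accumulate when a state is revisited within an episode.
import random

N = 19                                      # non-terminal states 1..N; 0 and N+1 are sinks
GAMMA, ALPHA, LAM = 1.0, 0.1, 0.8

def run_episode(V, replacing=True):
    e = [0.0] * (N + 2)                     # eligibility trace per state
    s = (N + 1) // 2                        # start in the middle of the chain
    while 0 < s < N + 1:
        s2 = s + random.choice([-1, 1])     # random-walk policy
        # reward only when entering the right sink (magnitude is an assumption)
        r = 1.0 if s2 == N + 1 else 0.0
        v_next = 0.0 if s2 in (0, N + 1) else V[s2]
        delta = r + GAMMA * v_next - V[s]   # TD error delta_t

        # bump the eligibility of the state just visited
        if replacing:
            e[s] = 1.0                      # replacing trace: reset to 1
        else:
            e[s] += 1.0                     # accumulating trace: increment

        # apply the TD error to every eligible state, then decay the traces
        for x in range(1, N + 1):
            V[x] += ALPHA * delta * e[x]
            e[x] *= GAMMA * LAM
        s = s2
    return V

if __name__ == "__main__":
    random.seed(0)
    V = [0.0] * (N + 2)
    for _ in range(2000):
        run_episode(V, replacing=True)
    # values should roughly increase from the left sink toward the right sink
    print([round(v, 2) for v in V[1:N + 1]])
```

With replacing traces, a state revisited within an episode has its eligibility reset to 1 rather than incremented, which is one answer to the repeated-visit question raised on the replacing-traces slide.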