CS 287: Advanced Robotics, Fall 2009
Lecture 11: Reinforcement Learning
Pieter Abbeel, UC Berkeley EECS

[Drawing from Sutton and Barto, Reinforcement Learning: An Introduction, 1998]

Reinforcement Learning
- Model: Markov decision process (S, A, T, R, γ)
- Goal: find π that maximizes the expected sum of rewards
- T and R might be unknown

Examples
MDP (S, A, T, γ, R), goal: max_π E[ Σ_t γ^t R(s_t, a_t) | π ]
- Cleaning robot
- Walking robot
- Pole balancing
- Games: Tetris, backgammon
- Server management
- Shortest path problems
- Model for animals, people

Canonical Example: Grid World
- The agent lives in a grid
- Walls block the agent's path
- The agent's actions do not always go as planned:
  - 80% of the time, the action North takes the agent North (if there is no wall there)
  - 10% of the time, North takes the agent West; 10% of the time, East
  - If there is a wall in the direction the agent would have been taken, the agent stays put
- Big rewards come at the end

Solving MDPs
- In a deterministic single-agent search problem, we want an optimal plan, or sequence of actions, from start to a goal
- In an MDP, we want an optimal policy π*: S → A
  - A policy π gives an action for each state
  - An optimal policy maximizes expected utility if followed
  - Defines a reflex agent

Example Optimal Policies
[Figure: optimal grid-world policies for living rewards R(s) = -2.0, -0.1, -0.04, -0.02]

Outline: current and next few lectures
- Recap and extend exact methods:
  - Value iteration
  - Policy iteration
  - Generalized policy iteration
  - Linear programming [later]
- Additional challenges we will address by building on top of the above:
  - Unknown transition model and reward function
  - Very large state spaces

Value Iteration
- Algorithm: start with V_0(s) = 0 for all s.
- Given V_i, calculate the values for all states for depth i+1:

    V_{i+1}(s) ← max_a Σ_{s'} T(s, a, s') (R(s, a, s') + γ V_i(s'))

  This is called a value update or Bellman update/backup
- Repeat until convergence

Example: Bellman Updates

Example: Value Iteration
- Information propagates outward from terminal states and eventually all states have correct value estimates
[Figure: value estimates V_2 and V_3 on the grid world]

Convergence
- Infinity norm: ‖V‖_∞ = max_s |V(s)|
- Fact. Value iteration converges to the optimal value function V*, which satisfies the Bellman equation:

    ∀s ∈ S: V*(s) = max_a Σ_{s'} T(s, a, s') (R(s, a, s') + γ V*(s'))

  Or in operator notation: V* = T V*, where T denotes the Bellman operator.
- Fact. If an estimate V satisfies ‖V − T V‖_∞ ≤ ε, then ‖V − V*‖_∞ ≤ ε / (1 − γ).

Practice: Computing Actions
- Which action should we choose from state s, given optimal values V*?
  - The greedy action with respect to V*, i.e. the action choice with one-step lookahead w.r.t. V*:

    π*(s) = argmax_a Σ_{s'} T(s, a, s') (R(s, a, s') + γ V*(s'))

Policy Iteration
- Alternative approach:
  - Step 1: Policy evaluation: calculate the value function for a fixed policy (not optimal!) until convergence
  - Step 2: Policy improvement: update the policy using one-step lookahead with the resulting converged (but not optimal!) value function
- Repeat the steps until the policy converges
- This is policy iteration
- It's still optimal!
- Can converge faster under some conditions

Policy Iteration
- Policy evaluation: with the current policy π fixed, find values with simplified Bellman updates:

    V^π_{i+1}(s) ← Σ_{s'} T(s, π(s), s') (R(s, π(s), s') + γ V^π_i(s'))

  - Iterate until values converge
- Policy improvement: with utilities fixed, find the best action according to one-step lookahead:

    π(s) ← argmax_a Σ_{s'} T(s, a, s') (R(s, a, s') + γ V^π(s'))

Comparison
- Value iteration:
  - Every pass (or "backup") updates both utilities (explicitly, based on current utilities) and the policy (possibly implicitly, based on the current policy)
- Policy iteration:
  - Several passes to update utilities with a frozen policy
  - Occasional passes to update the policy
- Generalized policy iteration:
  - The general idea of two interacting processes revolving around an approximate policy and an approximate value
- Asynchronous versions:
  - Any sequence of partial updates to either policy entries or utilities will converge if every state is visited infinitely often
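The value iteration loop and greedy-action extraction described above can be sketched in Python. The two-state MDP below, the dictionary encoding of T and R, and all variable names are illustrative assumptions for this sketch, not part of the lecture:

```python
# Toy MDP (hypothetical): states 0 and 1, actions 'stay' and 'go'.
S = [0, 1]
A = ['stay', 'go']
gamma = 0.9

# T[(s, a)] = list of (s_next, probability); R[(s, a, s_next)] = reward.
T = {
    (0, 'stay'): [(0, 1.0)],
    (0, 'go'):   [(1, 0.8), (0, 0.2)],
    (1, 'stay'): [(1, 1.0)],
    (1, 'go'):   [(0, 1.0)],
}
R = {(0, 'stay', 0): 0.0, (0, 'go', 1): 1.0, (0, 'go', 0): 0.0,
     (1, 'stay', 1): 1.0, (1, 'go', 0): 0.0}

def bellman_backup(V, s):
    # V_{i+1}(s) = max_a sum_{s'} T(s,a,s') (R(s,a,s') + gamma * V_i(s'))
    return max(sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)])
               for a in A)

def value_iteration(eps=1e-8):
    V = {s: 0.0 for s in S}  # start with V_0(s) = 0 for all s
    while True:
        V_new = {s: bellman_backup(V, s) for s in S}
        # stop when the infinity norm of the change is below eps
        if max(abs(V_new[s] - V[s]) for s in S) < eps:
            return V_new
        V = V_new

def greedy_policy(V):
    # greedy action = one-step lookahead w.r.t. V
    return {s: max(A, key=lambda a: sum(p * (R[(s, a, s2)] + gamma * V[s2])
                                        for s2, p in T[(s, a)]))
            for s in S}

V_star = value_iteration()
pi_star = greedy_policy(V_star)
```

On this toy MDP, staying in state 1 pays 1 per step, so V*(1) converges to 1/(1−γ) = 10 and the greedy policy goes from 0 to 1 and then stays.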
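Policy iteration, with its inner policy-evaluation loop and outer policy-improvement step, can be sketched the same way. Again the two-state MDP and its dictionary encoding are illustrative assumptions, not prescribed by the lecture:

```python
# Same hypothetical toy MDP as above: states 0 and 1, actions 'stay' and 'go'.
S = [0, 1]
A = ['stay', 'go']
gamma = 0.9
T = {
    (0, 'stay'): [(0, 1.0)],
    (0, 'go'):   [(1, 0.8), (0, 0.2)],
    (1, 'stay'): [(1, 1.0)],
    (1, 'go'):   [(0, 1.0)],
}
R = {(0, 'stay', 0): 0.0, (0, 'go', 1): 1.0, (0, 'go', 0): 0.0,
     (1, 'stay', 1): 1.0, (1, 'go', 0): 0.0}

def q_value(V, s, a):
    # expected one-step return of action a in state s under values V
    return sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)])

def policy_evaluation(pi, eps=1e-8):
    # simplified Bellman updates: no max over actions, pi fixes the action
    V = {s: 0.0 for s in S}
    while True:
        V_new = {s: q_value(V, s, pi[s]) for s in S}
        if max(abs(V_new[s] - V[s]) for s in S) < eps:
            return V_new
        V = V_new

def policy_iteration():
    pi = {s: A[0] for s in S}  # arbitrary initial policy
    while True:
        V = policy_evaluation(pi)  # step 1: evaluate the frozen policy
        # step 2: improve via greedy one-step lookahead w.r.t. V
        pi_new = {s: max(A, key=lambda a: q_value(V, s, a)) for s in S}
        if pi_new == pi:           # policy converged
            return pi, V
        pi = pi_new

pi_star, V_star = policy_iteration()
```

Note the comparison from the slides shows up directly in the code: value iteration interleaves the max into every backup, while policy iteration runs many cheap max-free evaluation sweeps and only occasionally recomputes the policy.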