Page 1
CS 287: Advanced Robotics, Fall 2009
Lecture 11: Reinforcement Learning
Pieter Abbeel, UC Berkeley EECS
[Drawing from Sutton and Barto, Reinforcement Learning: An Introduction, 1998]

Reinforcement Learning
- Model: Markov decision process (S, A, T, R, γ)
- Goal: find a policy π that maximizes the expected sum of rewards
- T and R might be unknown

Page 2
Examples
  MDP (S, A, T, γ, R), goal: max_π E[ Σ_t γ^t R(s_t, a_t) | π ]
  - Cleaning robot
  - Walking robot
  - Pole balancing
  - Games: Tetris, backgammon
  - Server management
  - Shortest-path problems
  - Model for animals and people

Canonical Example: Grid World
- The agent lives in a grid; walls block the agent's path.
- The agent's actions do not always go as planned:
  - 80% of the time, the action North takes the agent North (if there is no wall there)
  - 10% of the time, North takes the agent West; 10% of the time, East
  - If there is a wall in the direction the agent would have been taken, the agent stays put
- Big rewards come at the end.

Page 3
Solving MDPs
- In a deterministic single-agent search problem, we want an optimal plan, i.e. a sequence of actions from the start to a goal.
- In an MDP, we want an optimal policy π*: S → A
  - A policy π gives an action for each state.
  - An optimal policy maximizes expected utility if followed.
  - It defines a reflex agent.

Example Optimal Policies
[Figure: optimal grid-world policies for per-step rewards R(s) = -2.0, -0.1, -0.04, -0.02]

Page 4
Outline: current and next few lectures
- Recap and extend exact methods:
  - Value iteration
  - Policy iteration
  - Generalized policy iteration
  - Linear programming [later]
- Additional challenges we will address by building on top of the above:
  - Unknown transition model and reward function
  - Very large state spaces

Value Iteration
- Algorithm: start with V_0(s) = 0 for all s. Given V_i, calculate the values for all states at depth i+1:
    V_{i+1}(s) = max_a Σ_{s'} T(s, a, s') ( R(s, a, s') + γ V_i(s') )
- This is called a value update or Bellman update/backup.
- Repeat until convergence.

Page 5
Example: Bellman Updates
[Figure]
Example: Value Iteration
[Figure: value estimates V_2 and V_3]
- Information propagates outward from the terminal states, and eventually all states have correct value estimates.

Page 6
Convergence
  Infinity norm: ‖V‖_∞ = max_s |V(s)|
  Fact. Value iteration converges to the optimal value function V*, which satisfies the Bellman equation:
    ∀ s ∈ S:  V*(s) = max_a Σ_{s'} T(s, a, s') ( R(s, a, s') + γ V*(s') )
  Or in operator notation: V* = T V*, where T denotes the Bellman operator.
  Fact. If an estimate V satisfies ‖V − T V‖_∞ ≤ ε, then ‖V − V*‖_∞ ≤ ε / (1 − γ).

Practice: Computing Actions
- Which action should we choose from state s, given the optimal values V*?
    π*(s) = argmax_a Σ_{s'} T(s, a, s') ( R(s, a, s') + γ V*(s') )
- This is the greedy action with respect to V*, i.e. the action choice with one-step lookahead w.r.t. V*.
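The value update above is compact enough to sketch in code. Below is a minimal Python sketch of tabular value iteration with greedy policy extraction; it is not from the lecture: the two-state toy MDP, the names STATES/ACTIONS/T/R, the discount of 0.9, and the tolerance are all illustrative assumptions. The stopping test follows the convergence fact above: once a Bellman update changes no value by more than ε (in the infinity norm), the estimate is within ε/(1−γ) of V*.

```python
# Minimal value iteration sketch. The tiny two-state MDP below is illustrative,
# not the grid world from the slides.

GAMMA = 0.9        # discount factor (assumed for this example)
EPSILON = 1e-6     # convergence tolerance (assumed)

STATES = ["low", "high"]
ACTIONS = ["stay", "jump"]

# T[s][a] lists (next_state, probability); R[s][a][s2] is the reward for that transition.
T = {
    "low":  {"stay": [("low", 1.0)],                "jump": [("high", 0.8), ("low", 0.2)]},
    "high": {"stay": [("high", 0.9), ("low", 0.1)], "jump": [("low", 1.0)]},
}
R = {
    "low":  {"stay": {"low": 0.0},               "jump": {"high": 1.0, "low": -1.0}},
    "high": {"stay": {"high": 2.0, "low": 0.0},  "jump": {"low": 0.0}},
}


def q_value(V, s, a):
    """One-step lookahead: sum over s' of T(s,a,s') * (R(s,a,s') + gamma * V(s'))."""
    return sum(p * (R[s][a][s2] + GAMMA * V[s2]) for s2, p in T[s][a])


def value_iteration():
    """Bellman updates from V_0 = 0 until no value changes by more than EPSILON."""
    V = {s: 0.0 for s in STATES}
    while True:
        V_new = {s: max(q_value(V, s, a) for a in ACTIONS) for s in STATES}
        if max(abs(V_new[s] - V[s]) for s in STATES) < EPSILON:
            return V_new
        V = V_new


def greedy_policy(V):
    """Greedy action w.r.t. V: the one-step lookahead argmax from the 'Computing Actions' slide."""
    return {s: max(ACTIONS, key=lambda a: q_value(V, s, a)) for s in STATES}


if __name__ == "__main__":
    V_star = value_iteration()
    print(V_star)
    print(greedy_policy(V_star))
```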
value function Repeat steps until policy converges This is policy iteration It’s still optimal! Can converge faster under some conditions13Policy Iteration Policy evaluation: with fixed current policy π, find values with simplified Bellman updates: Iterate until values converge Policy improvement: with fixed utilities, find the best action according to one-step look-ahead14Page 8Comparison Value iteration: Every pass (or “backup”) updates both utilities (explicitly, based on current utilities) and policy (possibly implicitly, based on current policy) Policy iteration: Several passes to update utilities with frozen policy Occasional passes to update policies Generalized policy iteration: General idea of two interacting processes revolving around an approximate policy and an approximate value Asynchronous versions: Any sequences of partial updates to either policy entries or utilities will converge if every state is visited infinitely


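To illustrate the "asynchronous versions" bullet, here is a hedged sketch of an in-place (asynchronous) value update loop: states are backed up one at a time, in an arbitrary order, always reusing the freshest values. It assumes the same dictionary-based MDP containers as the sketches above; the random schedule and the backup count are arbitrary illustrative choices, not part of the lecture.

```python
# Asynchronous (in-place) Bellman backups: update one state at a time, in any
# order, reusing the freshest values. With every state updated infinitely often
# (approximated here by many random picks), the values approach V*.
# Assumes the dictionary-based STATES/ACTIONS/T/R/GAMMA from the sketches above.
import random


def asynchronous_value_iteration(STATES, ACTIONS, T, R, GAMMA, num_backups=10000):
    V = {s: 0.0 for s in STATES}
    for _ in range(num_backups):
        s = random.choice(STATES)   # any schedule works as long as no state is starved
        V[s] = max(
            sum(p * (R[s][a][s2] + GAMMA * V[s2]) for s2, p in T[s][a])
            for a in ACTIONS
        )
    return V
```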