CS 287: Advanced Robotics, Fall 2009
Lecture 11: Reinforcement Learning
Pieter Abbeel, UC Berkeley EECS
[Drawing from Sutton and Barto, Reinforcement Learning: An Introduction, 1998]

Reinforcement Learning
- Model: Markov decision process (S, A, T, R, γ)
- Goal: Find π that maximizes the expected sum of rewards
- T and R might be unknown
- MDP (S, A, T, γ, R), goal: max_π E[ ∑_t γ^t R(s_t, a_t) | π ]

Examples
- Cleaning robot
- Walking robot
- Pole balancing
- Games: Tetris, backgammon
- Server management
- Shortest path problems
- Model for animals, people

Canonical Example: Grid World
- The agent lives in a grid
- Walls block the agent's path
- The agent's actions do not always go as planned:
  - 80% of the time, the action North takes the agent North (if there is no wall there)
  - 10% of the time, North takes the agent West; 10% East
  - If there is a wall in the direction the agent would have been taken, the agent stays put
- Big rewards come at the end

Solving MDPs
- In a deterministic single-agent search problem, we want an optimal plan, or sequence of actions, from start to a goal
- In an MDP, we want an optimal policy π*: S → A
  - A policy π gives an action for each state
  - An optimal policy maximizes expected utility if followed
  - Defines a reflex agent

Example Optimal Policies
[Figure: optimal grid-world policies for living rewards R(s) = -2.0, -0.1, -0.04, and -0.02]

Outline: current and next few lectures
- Recap and extend exact methods:
  - Value iteration
  - Policy iteration
  - Generalized policy iteration
  - Linear programming [later]
- Additional challenges we will address by building on top of the above:
  - Unknown transition model and reward function
  - Very large state spaces

Value Iteration
- Algorithm: Start with V_0(s) = 0 for all s. Given V_i, calculate the values for all states for depth i+1:
    V_{i+1}(s) = max_a ∑_{s'} T(s, a, s') (R(s, a, s') + γ V_i(s'))
- This is called a value update or Bellman update/back-up
- Repeat until convergence (a code sketch follows the Computing Actions slide below)

Example: Bellman Updates
[Figure: Bellman updates on the grid-world example]

Example: Value Iteration
- Information propagates outward from terminal states and eventually all states have correct value estimates
[Figure: value estimates V_2 and V_3 on the grid world]

Convergence
- Infinity norm: ‖V‖_∞ = max_s |V(s)|
- Fact: Value iteration converges to the optimal value function V*, which satisfies the Bellman equation:
    ∀ s ∈ S: V*(s) = max_a ∑_{s'} T(s, a, s') (R(s, a, s') + γ V*(s'))
  Or in operator notation: V* = T V*, where T denotes the Bellman operator.
- Fact: If an estimate V satisfies ‖V − T V‖_∞ ≤ ε, then ‖V − V*‖_∞ ≤ ε / (1 − γ).

Practice: Computing Actions
- Which action should we choose from state s, given optimal values V*?
  - The greedy action with respect to V*
  - I.e., the action choice with one-step lookahead w.r.t. V*:
    π*(s) = argmax_a ∑_{s'} T(s, a, s') (R(s, a, s') + γ V*(s'))
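To make the Bellman update and the greedy one-step lookahead concrete, here is a minimal tabular sketch in Python. It is illustrative, not from the lecture: it assumes an MDP encoded as a dict mapping each state to a dict of actions, where each action gives a list of (probability, next_state, reward) outcome tuples, and the names value_iteration, mdp, eps, etc. are my own.

```python
# Minimal tabular value iteration sketch (illustrative; assumed encoding:
# mdp[s][a] -> list of (prob, next_state, reward) tuples, i.e. T and R together).

def value_iteration(mdp, gamma=0.9, eps=1e-6):
    """Return (V, pi): approximately optimal state values and a greedy policy."""
    V = {s: 0.0 for s in mdp}                      # V_0(s) = 0 for all s
    while True:
        V_new = {}
        for s, actions in mdp.items():
            if not actions:                        # terminal state: no actions
                V_new[s] = 0.0
                continue
            # Bellman update: V_{i+1}(s) = max_a sum_{s'} T(s,a,s') (R + gamma V_i(s'))
            V_new[s] = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in actions.values()
            )
        # Stop when the Bellman residual ||T V - V||_inf is below eps;
        # by the bound in the notes, V is then within eps / (1 - gamma) of V*.
        if max(abs(V_new[s] - V[s]) for s in mdp) < eps:
            V = V_new
            break
        V = V_new

    # Greedy policy: one-step lookahead with respect to V
    pi = {}
    for s, actions in mdp.items():
        if actions:
            pi[s] = max(
                actions,
                key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in actions[a]),
            )
    return V, pi


if __name__ == "__main__":
    # Tiny two-state example: from "a", "stay" gives no reward, "go" usually
    # reaches the terminal state "b" with reward 1.
    toy = {
        "a": {"stay": [(1.0, "a", 0.0)], "go": [(0.8, "b", 1.0), (0.2, "a", 0.0)]},
        "b": {},  # terminal
    }
    V, pi = value_iteration(toy)
    print(V, pi)   # V["a"] ≈ 0.98, pi["a"] == "go"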
value function Repeat steps until policy converges This is policy iteration It’s still optimal! Can converge faster under some conditions13Policy Iteration Policy evaluation: with fixed current policy π, find values with simplified Bellman updates: Iterate until values converge Policy improvement: with fixed utilities, find the best action according to one-step look-ahead14Comparison Value iteration: Every pass (or “backup”) updates both utilities (explicitly, based on current utilities) and policy (possibly implicitly, based on current policy) Policy iteration: Several passes to update utilities with frozen policy Occasional passes to update policies Generalized policy iteration: General idea of two interacting processes revolving around an approximate policy and an approximate value Asynchronous versions: Any sequences of partial updates to either policy entries or utilities will converge if every state is visited infinitely

