CS 287: Advanced Robotics, Fall 2009
Lecture 11: Reinforcement Learning
Pieter Abbeel, UC Berkeley EECS

[Drawing from Sutton and Barto, Reinforcement Learning: An Introduction, 1998]

Reinforcement Learning
- Model: Markov decision process (S, A, T, R, γ)
- Goal: find π that maximizes the expected sum of rewards
- T and R might be unknown

Examples
MDP (S, A, T, γ, R), goal: max_π E[ Σ_t γ^t R(s_t, a_t) | π ]
- Cleaning robot
- Walking robot
- Pole balancing
- Games: Tetris, backgammon
- Server management
- Shortest path problems
- Model for animals, people

Canonical Example: Grid World
- The agent lives in a grid
- Walls block the agent's path
- The agent's actions do not always go as planned:
  - 80% of the time, the action North takes the agent North (if there is no wall there)
  - 10% of the time, North takes the agent West; 10% of the time, East
  - If there is a wall in the direction the agent would have been taken, the agent stays put
- Big rewards come at the end

Solving MDPs
- In a deterministic single-agent search problem, we want an optimal plan, or sequence of actions, from start to a goal
- In an MDP, we want an optimal policy π*: S → A
  - A policy π gives an action for each state
  - An optimal policy maximizes expected utility if followed
  - Defines a reflex agent

Example Optimal Policies
[Figure: optimal grid-world policies for living rewards R(s) = -2.0, -0.1, -0.04, -0.02]

Outline: current and next few lectures
- Recap and extend exact methods:
  - Value iteration
  - Policy iteration
  - Generalized policy iteration
  - Linear programming [later]
- Additional challenges we will address by building on top of the above:
  - Unknown transition model and reward function
  - Very large state spaces

Value Iteration
- Algorithm: start with V_0(s) = 0 for all s.
- Given V_i, calculate the values for all states for depth i+1:

    V_{i+1}(s) ← max_a Σ_{s'} T(s, a, s') (R(s, a, s') + γ V_i(s'))

  This is called a value update or Bellman update/backup
- Repeat until convergence

Example: Bellman Updates

Example: Value Iteration
- Information propagates outward from terminal states and eventually all states have correct value estimates
[Figure: value estimates V_2 and V_3 on the grid world]

Convergence
- Infinity norm: ‖V‖_∞ = max_s |V(s)|
- Fact. Value iteration converges to the optimal value function V*, which satisfies the Bellman equation:

    ∀s ∈ S: V*(s) = max_a Σ_{s'} T(s, a, s') (R(s, a, s') + γ V*(s'))

  Or in operator notation: V* = T V*, where T denotes the Bellman operator.
- Fact. If an estimate V satisfies ‖V − T V‖_∞ ≤ ε, then ‖V − V*‖_∞ ≤ ε / (1 − γ).

Practice: Computing Actions
- Which action should we choose from state s, given optimal values V*?
  - The greedy action with respect to V*, i.e. the action choice with one-step lookahead w.r.t. V*:

    π*(s) = argmax_a Σ_{s'} T(s, a, s') (R(s, a, s') + γ V*(s'))

Policy Iteration
- Alternative approach:
  - Step 1: Policy evaluation: calculate the value function for a fixed policy (not optimal!) until convergence
  - Step 2: Policy improvement: update the policy using one-step lookahead with the resulting converged (but not optimal!) value function
- Repeat the steps until the policy converges
- This is policy iteration
- It's still optimal!
- Can converge faster under some conditions

Policy Iteration
- Policy evaluation: with the current policy π fixed, find values with simplified Bellman updates:

    V^π_{i+1}(s) ← Σ_{s'} T(s, π(s), s') (R(s, π(s), s') + γ V^π_i(s'))

  - Iterate until values converge
- Policy improvement: with utilities fixed, find the best action according to one-step lookahead:

    π(s) ← argmax_a Σ_{s'} T(s, a, s') (R(s, a, s') + γ V^π(s'))

Comparison
- Value iteration:
  - Every pass (or "backup") updates both utilities (explicitly, based on current utilities) and the policy (possibly implicitly, based on the current policy)
- Policy iteration:
  - Several passes to update utilities with a frozen policy
  - Occasional passes to update the policy
- Generalized policy iteration:
  - The general idea of two interacting processes revolving around an approximate policy and an approximate value
- Asynchronous versions:
  - Any sequence of partial updates to either policy entries or utilities will converge if every state is visited infinitely often
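The value iteration loop and greedy-action extraction described above can be sketched in Python. The two-state MDP below, the dictionary encoding of T and R, and all variable names are illustrative assumptions for this sketch, not part of the lecture:

```python
# Toy MDP (hypothetical): states 0 and 1, actions 'stay' and 'go'.
S = [0, 1]
A = ['stay', 'go']
gamma = 0.9

# T[(s, a)] = list of (s_next, probability); R[(s, a, s_next)] = reward.
T = {
    (0, 'stay'): [(0, 1.0)],
    (0, 'go'):   [(1, 0.8), (0, 0.2)],
    (1, 'stay'): [(1, 1.0)],
    (1, 'go'):   [(0, 1.0)],
}
R = {(0, 'stay', 0): 0.0, (0, 'go', 1): 1.0, (0, 'go', 0): 0.0,
     (1, 'stay', 1): 1.0, (1, 'go', 0): 0.0}

def bellman_backup(V, s):
    # V_{i+1}(s) = max_a sum_{s'} T(s,a,s') (R(s,a,s') + gamma * V_i(s'))
    return max(sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)])
               for a in A)

def value_iteration(eps=1e-8):
    V = {s: 0.0 for s in S}  # start with V_0(s) = 0 for all s
    while True:
        V_new = {s: bellman_backup(V, s) for s in S}
        # stop when the infinity norm of the change is below eps
        if max(abs(V_new[s] - V[s]) for s in S) < eps:
            return V_new
        V = V_new

def greedy_policy(V):
    # greedy action = one-step lookahead w.r.t. V
    return {s: max(A, key=lambda a: sum(p * (R[(s, a, s2)] + gamma * V[s2])
                                        for s2, p in T[(s, a)]))
            for s in S}

V_star = value_iteration()
pi_star = greedy_policy(V_star)
```

On this toy MDP, staying in state 1 pays 1 per step, so V*(1) converges to 1/(1−γ) = 10 and the greedy policy goes from 0 to 1 and then stays.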
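Policy iteration, with its inner policy-evaluation loop and outer policy-improvement step, can be sketched the same way. Again the two-state MDP and its dictionary encoding are illustrative assumptions, not prescribed by the lecture:

```python
# Same hypothetical toy MDP as above: states 0 and 1, actions 'stay' and 'go'.
S = [0, 1]
A = ['stay', 'go']
gamma = 0.9
T = {
    (0, 'stay'): [(0, 1.0)],
    (0, 'go'):   [(1, 0.8), (0, 0.2)],
    (1, 'stay'): [(1, 1.0)],
    (1, 'go'):   [(0, 1.0)],
}
R = {(0, 'stay', 0): 0.0, (0, 'go', 1): 1.0, (0, 'go', 0): 0.0,
     (1, 'stay', 1): 1.0, (1, 'go', 0): 0.0}

def q_value(V, s, a):
    # expected one-step return of action a in state s under values V
    return sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)])

def policy_evaluation(pi, eps=1e-8):
    # simplified Bellman updates: no max over actions, pi fixes the action
    V = {s: 0.0 for s in S}
    while True:
        V_new = {s: q_value(V, s, pi[s]) for s in S}
        if max(abs(V_new[s] - V[s]) for s in S) < eps:
            return V_new
        V = V_new

def policy_iteration():
    pi = {s: A[0] for s in S}  # arbitrary initial policy
    while True:
        V = policy_evaluation(pi)  # step 1: evaluate the frozen policy
        # step 2: improve via greedy one-step lookahead w.r.t. V
        pi_new = {s: max(A, key=lambda a: q_value(V, s, a)) for s in S}
        if pi_new == pi:           # policy converged
            return pi, V
        pi = pi_new

pi_star, V_star = policy_iteration()
```

Note the comparison from the slides shows up directly in the code: value iteration interleaves the max into every backup, while policy iteration runs many cheap max-free evaluation sweeps and only occasionally recomputes the policy.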