Berkeley COMPSCI 188 - Lecture 9: MDPs

CS 188: Artificial Intelligence, Fall 2011
Lecture 9: MDPs (9/22/2011)
Dan Klein – UC Berkeley
Many slides over the course adapted from either Stuart Russell or Andrew Moore.

Grid World
- The agent lives in a grid; walls block the agent's path.
- The agent's actions do not always go as planned:
  - 80% of the time, the action North takes the agent North (if there is no wall there).
  - 10% of the time, North takes the agent West; 10% of the time, East.
  - If there is a wall in the direction the agent would have been taken, the agent stays put.
- Small "living" reward each step; big rewards come at the end.
- Goal: maximize the sum of rewards.

Recap: MDPs
- A Markov decision process consists of:
  - States S
  - Actions A
  - Transitions P(s'|s,a) (or T(s,a,s'))
  - Rewards R(s,a,s') (and a discount γ)
  - Start state s_0
- Quantities:
  - Policy = map from states to actions
  - Episode = one run of an MDP
  - Utility = sum of discounted rewards
  - Values = expected future utility from a state
  - Q-values = expected future utility from a q-state
[DEMO – MDP Quantities]

Optimal Utilities
- The utility of a state s: V*(s) = expected utility starting in s and acting optimally.
- The utility of a q-state (s,a): Q*(s,a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally.
- The optimal policy: π*(s) = optimal action from state s.
- Terminology: s is a state, (s,a) is a q-state, and (s,a,s') is a transition.

Bellman Equations
- The definition of utility leads to a simple one-step lookahead relationship among optimal utility values: total optimal rewards = maximize over the choice of (first action plus optimal future).
- Formally:
  V*(s) = max_a Q*(s,a)
  Q*(s,a) = Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V*(s') ]

Value Estimates
- Calculate estimates V_k*(s):
  - Not the optimal value of s!
  - The optimal value considering only the next k time steps (k rewards).
  - What you'd get with depth-k expectimax.
  - As k → ∞, it approaches the optimal value.
- Almost a solution: recursion (i.e. expectimax).
- Correct solution: dynamic programming.
[DEMO – V_k]

Value Iteration
- Idea:
  - Start with V_0*(s) = 0 for all s, which we know is right (why?).
  - Given V_i*, calculate the values for all states at depth i+1:
    V_{i+1}(s) ← max_a Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V_i(s') ]
  - Throw out the old vector V_i*.
  - Repeat until convergence.
- This is called a value update or Bellman update (a small code sketch follows these slides).
- Theorem: value iteration converges to unique optimal values.
  - Basic idea: the approximations get refined towards the optimal values.
  - The policy may converge long before the values do.

Example: Bellman Updates
- The max happens for a = right; the other actions are not shown.
- Example settings: γ = 0.9, living reward = 0, noise = 0.25.

Example: Value Iteration
- Information propagates outward from the terminal states, and eventually all states have correct value estimates (the slide shows V_2 and V_3).

Convergence*
- Define the max-norm: ||U|| = max_s |U(s)|.
- Theorem: for any two approximations U and V,
  ||U_{i+1} − V_{i+1}|| ≤ γ ||U_i − V_i||
  - I.e. any two distinct approximations must get closer to each other; so, in particular, any approximation must get closer to the true values, and value iteration converges to a unique, stable, optimal solution.
- Theorem: once the change in our approximation is small, it must also be close to correct.
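
To make the value-iteration loop above concrete, here is a minimal Python sketch. The tiny two-state MDP (STATES, ACTIONS, T, R), the discount GAMMA, and the stopping tolerance EPS are illustrative assumptions, not the Grid World or the numbers from the lecture.

# Minimal value-iteration sketch (illustrative only; the toy MDP below is hypothetical).

GAMMA = 0.9     # discount
EPS = 1e-6      # stopping tolerance for the max-norm change (assumed)

STATES = ["s0", "s1"]
ACTIONS = ["stay", "go"]

# T[(s, a)] = list of (next_state, probability); R[(s, a, s')] = reward, default 0
T = {
    ("s0", "stay"): [("s0", 1.0)],
    ("s0", "go"):   [("s1", 0.8), ("s0", 0.2)],
    ("s1", "stay"): [("s1", 1.0)],
    ("s1", "go"):   [("s0", 1.0)],
}
R = {("s0", "go", "s1"): 1.0}   # only this transition pays off

def q_value(V, s, a):
    # One-step lookahead: Q(s,a) = sum over s' of T(s,a,s') * [R(s,a,s') + gamma * V(s')]
    return sum(p * (R.get((s, a, s2), 0.0) + GAMMA * V[s2]) for s2, p in T[(s, a)])

def value_iteration():
    V = {s: 0.0 for s in STATES}            # V_0(s) = 0 for all s
    while True:
        # Bellman update: V_{i+1}(s) = max over a of the one-step lookahead
        V_new = {s: max(q_value(V, s, a) for a in ACTIONS) for s in STATES}
        if max(abs(V_new[s] - V[s]) for s in STATES) < EPS:
            return V_new                    # max-norm change is small: stop
        V = V_new                           # throw out the old vector

print(value_iteration())

The q_value helper is the one-step lookahead from the Bellman equations, and the stopping test uses the max-norm change discussed on the Convergence slide.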
Practice: Computing Actions
- Which action should we choose from state s?
  - Given optimal values V: π*(s) = argmax_a Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V*(s') ]
  - Given optimal q-values Q: π*(s) = argmax_a Q*(s,a)
- Lesson: actions are easier to select from Q's! (A short sketch after these slides illustrates this.)
[DEMO – MDP action selection]

Utilities for a Fixed Policy
- Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy.
- Define the utility of a state s under a fixed policy π:
  V^π(s) = expected total discounted rewards (return) starting in s and following π
- Recursive relation (one-step look-ahead / Bellman equation):
  V^π(s) = Σ_{s'} T(s, π(s), s') [ R(s, π(s), s') + γ V^π(s') ]
[DEMO – Right-Only Policy]

Policy Evaluation
- How do we calculate the V's for a fixed policy?
- Idea one: turn the recursive equation into updates:
  V^π_{i+1}(s) ← Σ_{s'} T(s, π(s), s') [ R(s, π(s), s') + γ V^π_i(s') ]
- Idea two: it's just a linear system; solve it with Matlab (or whatever).

Policy Iteration
- An alternative approach for computing optimal values:
  - Step 1: Policy evaluation: calculate utilities for some fixed policy (not optimal utilities!) until convergence.
  - Step 2: Policy improvement: update the policy using one-step look-ahead, with the resulting converged (but not optimal!) utilities as future values.
  - Repeat the steps until the policy converges.
- This is policy iteration.
  - It's still optimal!
  - It can converge faster under some conditions.

Policy Iteration (continued)
- Policy evaluation: with the current policy π fixed, find values with simplified Bellman updates; iterate until the values converge:
  V^π_{i+1}(s) ← Σ_{s'} T(s, π(s), s') [ R(s, π(s), s') + γ V^π_i(s') ]
- Policy improvement: with the utilities fixed, find the best action according to a one-step look-ahead:
  π_new(s) = argmax_a Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V^π(s') ]
  (A sketch of both steps also follows these slides.)

Comparison
- Both VI and PI compute the same thing (optimal values for all states).
- In value iteration:
  - Every pass (or "backup") updates both the utilities (explicitly, based on the current utilities) and the policy (implicitly, based on the current utilities).
  - Tracking the policy isn't necessary; we take the max.
- In policy iteration:
  - Several passes update the utilities with a fixed policy.
  - After the policy is evaluated, a new policy is chosen.
- Both are dynamic programs for solving MDPs.

Asynchronous Value Iteration*
- In value iteration, we update every state in each iteration.
- Actually, any sequence of Bellman updates will converge if every state is visited infinitely often.
- In fact, we can update the policy as seldom or as often as we like, and we will still converge.
- Idea: update states whose value we expect to change: if |V_{i+1}(s) − V_i(s)| is large, then update the predecessors of s.
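
As a small illustration of the "Practice: Computing Actions" slide, the two helpers below contrast selecting an action from values versus from q-values. They reuse the hypothetical ACTIONS and q_value from the value-iteration sketch above, so they belong to the same toy setup, not to the lecture itself.

def action_from_values(V, s):
    # With only V* we still need the model (T and R) for a one-step lookahead.
    return max(ACTIONS, key=lambda a: q_value(V, s, a))

def action_from_q_values(Q, s):
    # With Q* there is no lookahead and no model: just argmax over the stored q-values.
    return max(ACTIONS, key=lambda a: Q[(s, a)])

This is the sense in which actions are easier to select from Q's: the second helper never touches T or R.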

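The Policy Evaluation and Policy Iteration slides translate into a similarly small sketch. It reuses the hypothetical STATES, ACTIONS, EPS, and q_value from the value-iteration sketch above; the evaluation step is "idea one" (turn the recursive equation into updates), and the linear-system route ("idea two") would replace that inner loop with a direct solve.

def policy_evaluation(pi):
    # Simplified Bellman updates: no max, because the action is fixed by the policy pi.
    V = {s: 0.0 for s in STATES}
    while True:
        V_new = {s: q_value(V, s, pi[s]) for s in STATES}
        if max(abs(V_new[s] - V[s]) for s in STATES) < EPS:
            return V_new
        V = V_new

def policy_improvement(V):
    # One-step lookahead using the converged (but not optimal!) utilities as future values.
    return {s: max(ACTIONS, key=lambda a: q_value(V, s, a)) for s in STATES}

def policy_iteration():
    pi = {s: ACTIONS[0] for s in STATES}    # start from an arbitrary policy
    while True:
        V = policy_evaluation(pi)
        pi_new = policy_improvement(V)
        if pi_new == pi:                    # the policy has converged
            return pi, V
        pi = pi_new

print(policy_iteration())

On the toy MDP both routines should arrive at the same optimal values, which is what the Comparison slide says: VI and PI compute the same thing.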
