CS 188: Artificial Intelligence, Fall 2011
Lecture 9: MDPs (9/22/2011)
Dan Klein – UC Berkeley
Many slides over the course adapted from either Stuart Russell or Andrew Moore.

Grid World
- The agent lives in a grid; walls block the agent's path.
- The agent's actions do not always go as planned: 80% of the time, the action North takes the agent North (if there is no wall there); 10% of the time, North takes the agent West; 10% East.
- If there is a wall in the direction the agent would have been taken, the agent stays put.
- Small "living" reward each step; big rewards come at the end.
- Goal: maximize the sum of rewards.

Recap: MDPs
- Markov decision processes:
  - States S
  - Actions A
  - Transitions P(s'|s,a) (or T(s,a,s'))
  - Rewards R(s,a,s') (and discount γ)
  - Start state s0
- Quantities:
  - Policy = map of states to actions
  - Episode = one run of an MDP
  - Utility = sum of discounted rewards
  - Values = expected future utility from a state
  - Q-values = expected future utility from a q-state
[DEMO – MDP Quantities]

Optimal Utilities
- The utility of a state s: V*(s) = expected utility starting in s and acting optimally.
- The utility of a q-state (s,a): Q*(s,a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally.
- The optimal policy: π*(s) = optimal action from state s.
- (Diagram labels: s is a state, (s,a) is a q-state, (s,a,s') is a transition.)

Bellman Equations
- The definition of utility leads to a simple one-step lookahead relationship amongst optimal utility values: total optimal rewards = maximize over the choice of (first action plus optimal future).
- Formally:
  V*(s) = max_a Q*(s,a)
  Q*(s,a) = Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V*(s') ]

Value Estimates
- Calculate estimates V_k*(s): not the optimal value of s!
- The optimal value considering only the next k time steps (k rewards).
- What you'd get with depth-k expectimax.*
- As k → ∞, it approaches the optimal value.*
- Almost a solution: recursion (i.e. expectimax). Correct solution: dynamic programming.
[DEMO – V_k]

Value Iteration
- Idea: start with V_0*(s) = 0 for all s, which we know is right (why?).
- Given V_i*, calculate the values for all states for depth i+1:
  V_{i+1}*(s) = max_a Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V_i*(s') ]
- Throw out the old vector V_i*; repeat until convergence.
- This is called a value update or Bellman update.
- Theorem: will converge to unique optimal values.
- Basic idea: approximations get refined towards the optimal values.
- The policy may converge long before the values do.

Example: Bellman Updates
- γ = 0.9, living reward = 0, noise = 0.2.
- (In the slide's grid example, the max happens for a = right; the other actions are not shown.)

Example: Value Iteration
- Information propagates outward from the terminal states, and eventually all states have correct value estimates (the slides show V_2 and V_3 on the grid).

Convergence*
- Define the max-norm: ||U|| = max_s |U(s)|.
- Theorem: for any two approximations U and V,
  ||U_{i+1} − V_{i+1}|| ≤ γ ||U_i − V_i||
  I.e. any distinct approximations must get closer to each other, so, in particular, any approximation must get closer to the true U, and value iteration converges to a unique, stable, optimal solution.
- Theorem: if ||V_{i+1} − V_i|| < ε, then ||V_{i+1} − V*|| < ε γ / (1 − γ).
  I.e. once the change in our approximation is small, it must also be close to correct.
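The value iteration update above translates almost directly into code. The following is a minimal Python sketch, not taken from the lecture or the CS 188 projects; the MDP encoding (a `states` list, an `actions(s)` function, transition lists `T[(s, a)]` of `(next_state, probability)` pairs, and a reward table `R[(s, a, s')]`) is an assumption made just for this illustration.

```python
# Minimal value iteration sketch (illustrative only, not from the slides).
# Assumed MDP encoding:
#   states            : iterable of hashable states
#   actions(s)        : function returning the legal actions in s ([] if terminal)
#   T[(s, a)]         : list of (next_state, probability) pairs
#   R[(s, a, s_next)] : immediate reward for that transition
def value_iteration(states, actions, T, R, gamma=0.9, eps=1e-6):
    V = {s: 0.0 for s in states}                 # V_0(s) = 0 for all s
    while True:
        new_V = {}
        for s in states:
            # One-step lookahead: expected reward plus discounted future value
            q_values = [sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)])
                        for a in actions(s)]
            new_V[s] = max(q_values) if q_values else 0.0   # terminal states stay at 0
        delta = max(abs(new_V[s] - V[s]) for s in states)   # max-norm change
        V = new_V                                 # throw out the old vector
        if delta < eps:
            return V

# Tiny usage example (hypothetical two-state MDP):
states = ["A", "END"]
T = {("A", "go"): [("END", 1.0)]}
R = {("A", "go", "END"): 1.0}
acts = lambda s: ["go"] if s == "A" else []
print(value_iteration(states, acts, T, R))        # {'A': 1.0, 'END': 0.0}
```

Keeping a separate `new_V` and swapping it in only after a full sweep mirrors the "throw out the old vector" step; an in-place variant appears under Asynchronous Value Iteration below.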
Practice: Computing Actions
- Which action should we choose from state s:
  - Given the optimal values V?
  - Given the optimal q-values Q?
- Lesson: actions are easier to select from Q's!
[DEMO – MDP action selection]

Utilities for a Fixed Policy
- Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy.
- Define the utility of a state s under a fixed policy π: V^π(s) = expected total discounted rewards (return) starting in s and following π.
- Recursive relation (one-step look-ahead / Bellman equation):
  V^π(s) = Σ_{s'} T(s, π(s), s') [ R(s, π(s), s') + γ V^π(s') ]
[DEMO – Right-Only Policy]

Policy Evaluation
- How do we calculate the V's for a fixed policy?
- Idea one: turn the recursive equations into updates.
- Idea two: it's just a linear system; solve with Matlab (or whatever).

Policy Iteration
- Alternative approach for optimal values:
  - Step 1: Policy evaluation: calculate utilities for some fixed policy (not optimal utilities!) until convergence.
  - Step 2: Policy improvement: update the policy using one-step look-ahead with the resulting converged (but not optimal!) utilities as future values.
  - Repeat the steps until the policy converges.
- This is policy iteration. It's still optimal, and it can converge faster under some conditions.
- Policy evaluation: with the current policy π fixed, find the values with simplified Bellman updates:
  V^π_{i+1}(s) = Σ_{s'} T(s, π(s), s') [ R(s, π(s), s') + γ V^π_i(s') ]
  Iterate until the values converge.
- Policy improvement: with the utilities fixed, find the best action according to one-step look-ahead:
  π_new(s) = argmax_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V(s') ]

Comparison
- Both VI and PI compute the same thing (optimal values for all states).
- In value iteration: every pass (or "backup") updates both the utilities (explicitly, based on the current utilities) and the policy (implicitly, based on the current utilities). Tracking the policy isn't necessary; we take the max.
- In policy iteration: several passes update the utilities with a fixed policy; after the policy is evaluated, a new policy is chosen.
- Both are dynamic programs for solving MDPs.

Asynchronous Value Iteration*
- In value iteration, we update every state in each iteration.
- Actually, any sequence of Bellman updates will converge if every state is visited infinitely often.
- In fact, we can update the policy as seldom or as often as we like, and we will still converge.
- Idea: update states whose value we expect to change: if |V_{i+1}(s) − V_i(s)| is large, then update the predecessors of s.
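To make the "Practice: Computing Actions" lesson concrete, here is a hedged sketch of the two ways to pick an action, under the same assumed MDP encoding as the earlier sketch: recovering an action from V* requires a one-step lookahead through the model (T and R), while recovering it from Q* is a plain argmax.

```python
# Action selection sketch (illustrative only; same assumed MDP encoding as above).
# Both functions assume s has at least one legal action.

def action_from_values(s, actions, T, R, V, gamma=0.9):
    # Needs the model: expectimax over one step, then follow the given values V
    return max(actions(s),
               key=lambda a: sum(p * (R[(s, a, s2)] + gamma * V[s2])
                                 for s2, p in T[(s, a)]))

def action_from_q_values(s, actions, Q):
    # No model needed: the q-values already bake in the one-step lookahead
    return max(actions(s), key=lambda a: Q[(s, a)])
```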

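The policy evaluation and policy iteration slides can likewise be sketched in a few lines. Again, this is illustrative code under the assumed dictionary encoding, not the course's implementation: `policy_evaluation` runs the simplified Bellman updates (no max, since the policy fixes the action), and `policy_iteration` alternates evaluation with greedy one-step improvement until the policy stops changing.

```python
# Policy iteration sketch (illustrative only; same assumed MDP encoding as above).

def policy_evaluation(policy, states, T, R, gamma=0.9, eps=1e-6):
    # Simplified Bellman updates: the fixed policy removes the max over actions
    V = {s: 0.0 for s in states}
    while True:
        new_V = {}
        for s in states:
            a = policy.get(s)
            if a is None:                          # terminal state: no action
                new_V[s] = 0.0
            else:
                new_V[s] = sum(p * (R[(s, a, s2)] + gamma * V[s2])
                               for s2, p in T[(s, a)])
        delta = max(abs(new_V[s] - V[s]) for s in states)
        V = new_V
        if delta < eps:
            return V

def policy_iteration(states, actions, T, R, gamma=0.9):
    # Start from an arbitrary policy, then alternate evaluation and improvement
    policy = {s: (actions(s)[0] if actions(s) else None) for s in states}
    while True:
        V = policy_evaluation(policy, states, T, R, gamma)
        new_policy = {}
        for s in states:
            if not actions(s):
                new_policy[s] = None
            else:
                # One-step look-ahead with the evaluated (not optimal) utilities
                new_policy[s] = max(actions(s),
                                    key=lambda a: sum(p * (R[(s, a, s2)] + gamma * V[s2])
                                                      for s2, p in T[(s, a)]))
        if new_policy == policy:                   # policy converged
            return new_policy, V
        policy = new_policy
```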

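For the Asynchronous Value Iteration idea, one simple instance (a sketch, not the slides' prioritized scheme) is in-place value iteration: sweep the states in any order and overwrite each value immediately, reusing the freshest estimates. This converges as long as every state keeps getting updated.

```python
# Asynchronous (in-place) value iteration sketch (illustrative only;
# same assumed MDP encoding as above).

def asynchronous_value_iteration(states, actions, T, R, gamma=0.9, sweeps=1000):
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s in states:            # any schedule that hits every state works
            q_values = [sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)])
                        for a in actions(s)]
            V[s] = max(q_values) if q_values else 0.0   # in-place Bellman update
    return V
```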