Berkeley COMPSCI 188 - Lecture 9: MDPs

CS 188: Artificial Intelligence, Fall 2009
Lecture 9: MDPs (9/24/2009)
Dan Klein – UC Berkeley
Many slides over the course adapted from either Stuart Russell or Andrew Moore

Announcements
- Assignments: W1 due today (drop box in 283 Soda or after lecture); P2 due 9/30 (Wednesday); P3 out now, due 10/12
- Readings: for MDPs / reinforcement learning we are using an online reading. It uses a different treatment and notation than the R&N book, so beware! The lecture version is the standard for this class.
- The contest is live!

Reinforcement Learning
- Basic idea: receive feedback in the form of rewards
- The agent's utility is defined by the reward function
- The agent must learn to act so as to maximize expected rewards
[DEMOS]

Grid World
- The agent lives in a grid; walls block the agent's path
- The agent's actions do not always go as planned: 80% of the time, the action North takes the agent North (if there is no wall there); 10% of the time, North takes the agent West; 10% East
- If there is a wall in the direction the agent would have been taken, the agent stays put
- Small "living" reward each step; big rewards come at the end
- Goal: maximize the sum of rewards
[DEMO – Gridworld Intro]

Markov Decision Processes
- An MDP is defined by:
  - A set of states s ∈ S
  - A set of actions a ∈ A
  - A transition function T(s, a, s'): the probability that a from s leads to s', i.e., P(s' | s, a); also called the model
  - A reward function R(s, a, s'); sometimes just R(s) or R(s')
  - A start state (or distribution)
  - Maybe a terminal state
- MDPs are a family of non-deterministic search problems
- Reinforcement learning: MDPs where we don't know the transition or reward functions

What is Markov about MDPs?
- Andrey Markov (1856-1922)
- "Markov" generally means that given the present state, the future and the past are independent
- For Markov decision processes, "Markov" means that action outcomes depend only on the current state:
  P(s_{t+1} = s' | s_t, a_t, s_{t-1}, a_{t-1}, …, s_0) = P(s_{t+1} = s' | s_t, a_t)

Solving MDPs
- In deterministic single-agent search problems, we want an optimal plan, or sequence of actions, from the start to a goal
- In an MDP, we want an optimal policy π*: S → A
  - A policy π gives an action for each state
  - An optimal policy maximizes expected utility if followed
  - It defines a reflex agent
[Figure: optimal policy when R(s, a, s') = -0.03 for all non-terminal states s]
[Demo]

Example Optimal Policies
[Figure: optimal gridworld policies for living rewards R(s) = -2.0, -0.4, -0.03, and -0.01]

Example: High-Low
- Three card types: 2, 3, 4; infinite deck, twice as many 2's
- Start with 3 showing
- After each card, you say "high" or "low", and a new card is flipped
- If you're right, you win the points shown on the new card; ties are no-ops; if you're wrong, the game ends
- Differences from expectimax: #1, you get rewards as you go; #2, you might play forever!

High-Low as an MDP (a code sketch of this model follows after this section)
- States: 2, 3, 4, done
- Actions: High, Low
- Model T(s, a, s'):
  - P(s'=4 | 4, Low) = 1/4; P(s'=3 | 4, Low) = 1/4; P(s'=2 | 4, Low) = 1/2; P(s'=done | 4, Low) = 0
  - P(s'=4 | 4, High) = 1/4; P(s'=3 | 4, High) = 0; P(s'=2 | 4, High) = 0; P(s'=done | 4, High) = 3/4
  - …
- Rewards R(s, a, s'): the number shown on s' if s ≠ s', 0 otherwise
- Start: 3

Example: High-Low
[Figure: expectimax-like search tree for High-Low, with transition probabilities and rewards on the chance branches: T = 0.5, R = 2; T = 0.25, R = 3; T = 0, R = 4; T = 0.25, R = 0]

MDP Search Trees
- Each MDP state gives an expectimax-like search tree
- s is a state; (s, a) is a q-state
- (s, a, s') is called a transition, with T(s, a, s') = P(s' | s, a) and reward R(s, a, s')

Utilities of Sequences
- In order to formalize the optimality of a policy, we need to understand utilities of sequences of rewards
- Typically we consider stationary preferences: if [a_1, a_2, …] is preferred to [b_1, b_2, …], then [r, a_1, a_2, …] is preferred to [r, b_1, b_2, …]
- Theorem: there are only two ways to define stationary utilities
  - Additive utility: U([r_0, r_1, r_2, …]) = r_0 + r_1 + r_2 + …
  - Discounted utility: U([r_0, r_1, r_2, …]) = r_0 + γ r_1 + γ^2 r_2 + …
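To make the High-Low model concrete, here is a minimal Python sketch of it as plain functions. This is my own illustration, not the course's project code; the names CARD_PROBS, STATES, ACTIONS, transition, reward, and discounted_return are all assumptions. It fills in the transition rows the slide elides with "…" by deriving them from the stated deck proportions, and adds a small helper for the discounted utility of a reward sequence.

```python
# A minimal sketch of the High-Low MDP from the slides, as plain Python.
# All names here are illustrative, not from the CS 188 project code.

# Card distribution: infinite deck, twice as many 2's as 3's or 4's.
CARD_PROBS = {2: 0.5, 3: 0.25, 4: 0.25}

STATES = [2, 3, 4, 'done']
ACTIONS = ['High', 'Low']

def transition(s, a):
    """Return {s': T(s, a, s')}. The slide gives the s = 4 rows explicitly;
    the remaining rows are derived from the same game rules."""
    if s == 'done':
        return {'done': 1.0}
    probs = {'done': 0.0}
    for card, p in CARD_PROBS.items():
        correct = (a == 'High' and card > s) or (a == 'Low' and card < s)
        tie = (card == s)
        if correct or tie:
            # right guess or tie (no-op): the game continues with the new card
            probs[card] = probs.get(card, 0.0) + p
        else:
            # wrong guess: the game ends
            probs['done'] += p
    return probs

def reward(s, a, s_next):
    """R(s, a, s'): the number shown on s' if s' differs from s, else 0."""
    if s_next == 'done' or s_next == s:
        return 0
    return s_next

def discounted_return(rewards, gamma):
    """Discounted utility of a reward sequence: r_0 + gamma*r_1 + gamma^2*r_2 + ..."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Sanity check against the slide's numbers for state 4:
assert transition(4, 'Low') == {'done': 0.0, 2: 0.5, 3: 0.25, 4: 0.25}
assert transition(4, 'High') == {'done': 0.75, 4: 0.25}
```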
Infinite Utilities?!
- Problem: infinite state sequences have infinite rewards
- Solutions:
  - Finite horizon: terminate episodes after a fixed T steps (e.g., a lifetime); this gives nonstationary policies (π depends on the time left)
  - Absorbing state: guarantee that for every policy a terminal state will eventually be reached (like "done" for High-Low)
  - Discounting: for 0 < γ < 1, U([r_0, …, r_∞]) = Σ_{t=0}^{∞} γ^t r_t ≤ R_max / (1 − γ)
- A smaller γ means a smaller "horizon" – a shorter-term focus

Discounting
- Typically we discount rewards by γ < 1 each time step
- Sooner rewards have higher utility than later rewards
- Discounting also helps the algorithms converge

Recap: Defining MDPs
- Markov decision processes: states S, start state s_0, actions A, transitions P(s' | s, a) (or T(s, a, s')), rewards R(s, a, s') (and discount γ)
- MDP quantities so far:
  - Policy = choice of action for each state
  - Utility (or return) = sum of discounted rewards

Optimal Utilities
- Fundamental operation: compute the values (optimal expectimax utilities) of states s
- Why? Optimal values define optimal policies! (see the policy-extraction sketch below)
- The value of a state s: V*(s) = expected utility starting in s and acting optimally
- The value of a q-state (s, a): Q*(s, a) = expected utility starting in s, taking action a, and thereafter acting optimally
- The optimal policy: π*(s) = optimal action from state s
[DEMO – Grid Values]

The Bellman Equations
- The definition of "optimal utility" leads to a simple one-step lookahead relationship among optimal utility values: optimal rewards = maximize over the first action and then follow the optimal policy
- Formally:
  V*(s) = max_a Q*(s, a)
  Q*(s, a) = Σ_{s'} T(s, a, s') [R(s, a, s') + γ V*(s')]
  so V*(s) = max_a Σ_{s'} T(s, a, s') [R(s, a, s') + γ V*(s')]

Solving MDPs
- We want to find the optimal policy π*
- Proposal 1: modified expectimax search, starting from each state s, computing V*(s) = max_a Σ_{s'} T(s, a, s') [R(s, a, s') + γ V*(s')]

Why Not Search Trees?
- Why not solve with expectimax? Problems:
  - This tree is usually infinite (why?)
  - The same states appear over and over (why?)
  - We would search once per state (why?)
- Idea: value iteration
  - Compute optimal values for all states all at once using successive approximations
  - It will be a bottom-up dynamic program similar in cost to memoization
  - Do all planning offline; no replanning is needed!

Value Estimates
- Calculate estimates V_k*(s)
  - Not the optimal value of s!
  - The optimal value considering only the next k time steps (k rewards)
  - As k → ∞, it approaches the optimal value
- Why this works:
  - If discounting, distant rewards become negligible
  - If terminal states are reachable from everywhere, the fraction of episodes that never end becomes negligible
  - Otherwise, we can get infinite expected utility and this approach actually won't work

Value Iteration
- Idea: Start … [preview ends here; a sketch of the value iteration update in code follows below]
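The preview stops just as value iteration is introduced. As a hedged sketch (not the lecture's own pseudocode), the standard update it is building toward computes successive estimates V_k via the Bellman backup V_{k+1}(s) = max_a Σ_{s'} T(s, a, s') [R(s, a, s') + γ V_k(s')]. The function below applies that backup to the High-Low helpers sketched earlier; value_iteration and its parameters are my own names.

```python
def value_iteration(states, actions, transition, reward, gamma=0.9,
                    iterations=100):
    """Successive approximation of V*(s) via the Bellman backup:
    V_{k+1}(s) = max_a sum_{s'} T(s,a,s') * [R(s,a,s') + gamma * V_k(s')]."""
    V = {s: 0.0 for s in states}              # V_0(s) = 0 for every state
    for _ in range(iterations):
        V_next = {}
        for s in states:
            if s == 'done':                   # absorbing terminal state
                V_next[s] = 0.0
                continue
            V_next[s] = max(
                sum(p * (reward(s, a, s2) + gamma * V[s2])
                    for s2, p in transition(s, a).items())
                for a in actions)
        V = V_next
    return V
```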

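As the Optimal Utilities slide notes, optimal values define optimal policies: π*(s) = argmax_a Q*(s, a), where Q*(s, a) is a one-step lookahead over the converged values. A small follow-up sketch, again with my own names (extract_policy, q_value) and assuming the definitions above:

```python
def extract_policy(states, actions, transition, reward, V, gamma=0.9):
    """One-step lookahead: pi*(s) = argmax_a Q*(s, a)."""
    def q_value(s, a):
        return sum(p * (reward(s, a, s2) + gamma * V[s2])
                   for s2, p in transition(s, a).items())
    return {s: max(actions, key=lambda a: q_value(s, a))
            for s in states if s != 'done'}

V = value_iteration(STATES, ACTIONS, transition, reward)
policy = extract_policy(STATES, ACTIONS, transition, reward, V)
print(V)       # estimated V*(s) for 2, 3, 4, and done
print(policy)  # e.g. from state 2, High is at least as good as Low
```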
