MIT 16.412J - Markov Decision Processes

Slide titles (full deck): Planning to Maximize Reward: Markov Decision Processes; Reading and Assignments; How Might a Mouse Search a Maze for Cheese?; Ideas in this lecture; MDP Problem; MDP Problem: Model; Markov Decision Processes (MDPs); MDP Environment Assumptions; MDP Problem: Lifetime Reward; Lifetime Reward; MDP Problem: Policy; Value Function Vπ for a Given Policy π; An Optimal Policy π* Given Value Function V*; Example: Mapping Value Function to Policy; Value Function V* for an Optimal Policy π*; Solving MDPs by Value Iteration; Convergence of Value Iteration; Example of Value Iteration; Crib Sheet: MDPs by Value Iteration.

Slide 1: Planning to Maximize Reward: Markov Decision Processes
Brian C. Williams, 16.412J/6.835J, October 23rd, 2002.
Slides adapted from Manuela Veloso, Reid Simmons, and Tom Mitchell (CMU).

Slide 2: Markov Decision Processes (outline)
• Motivation
• What are Markov Decision Processes?
• Computing Action Policies From a Model
• Summary

Slide 3: Reading and Assignments
• Markov Decision Processes: read AIMA Chapter 17, Sections 1–4 (or the equivalent in the new text).
• Reinforcement Learning: read AIMA Chapter 20 (or the equivalent in the new text).
• Next homework: involves coding MDPs, RL, and HMM belief update.
Lecture based on the development in "Machine Learning" by Tom Mitchell, Chapter 13: Reinforcement Learning.

Slide 4: How Might a Mouse Search a Maze for Cheese?
• State space search?
• As a constraint satisfaction problem?
• Goal-directed planning?
• As a rule or production system?
• What is missing?

Slide 5: Ideas in this lecture
• The objective is to accumulate rewards, rather than to reach goal states.
• Objectives are achieved along the way, rather than at the end.
• The task is to generate policies for how to act in all situations, rather than a plan for a single starting situation.
• Policies fall out of value functions, which describe the greatest lifetime reward achievable at every state.
• Value functions are iteratively approximated.

Slide 6: Markov Decision Processes (outline)
• Motivation
• What are Markov Decision Processes (MDPs)? Models, lifetime reward, policies
• Computing Policies From a Model
• Summary

Slides 7–8: MDP Problem / MDP Problem: Model
The agent and the environment interact in a loop: from state s_0 the agent takes action a_0, receives reward r_0, and the environment moves to s_1, then a_1, r_1, s_2, and so on.
Given an environment model as an MDP, create a policy for acting that maximizes the lifetime reward
    V = r_0 + γ·r_1 + γ²·r_2 + ...

Slide 9: Markov Decision Processes (MDPs)
Model:
• Finite set of states, S
• Finite set of actions, A
• Probabilistic state transitions, δ(s, a)
• Reward for each state and action, R(s, a)
Process:
• Observe state s_t in S
• Choose action a_t in A
• Receive immediate reward r_t
• State changes to s_{t+1}
Example: a grid world with goal state G; only the legal transitions are shown, the transitions into G carry reward 10, and the reward on all unlabeled transitions is 0.

Slide 10: MDP Environment Assumptions
• Markov assumption: the next state and reward are a function only of the current state and action:
    s_{t+1} = δ(s_t, a_t)
    r_t = r(s_t, a_t)
• Uncertain and unknown environment: δ and r may be nondeterministic and unknown.
Today: deterministic case only.

Slide 12: MDP Problem: Model
(Repeat of the agent–environment diagram.) Given an environment model as an MDP, create a policy for acting that maximizes the lifetime reward V = r_0 + γ·r_1 + γ²·r_2 + ...
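The model on Slide 9 is just four pieces, S, A, δ(s, a), and R(s, a), together with the observe / choose / receive reward / transition loop. The Python below is a minimal sketch of such a deterministic MDP; the 2x3 grid, the state names, and the reward of 10 for entering G are made-up stand-ins for the grid world drawn on the slide, not the slide's actual figure.

```python
# A minimal deterministic MDP in the sense of Slide 9: finite states S,
# finite actions A, a transition function delta(s, a), and a reward
# function r(s, a).  The 2x3 grid below (top row s1 s2 G, bottom row
# s4 s5 s6) is an illustrative stand-in, not the slide's exact example.

S = ["s1", "s2", "G", "s4", "s5", "s6"]
A = ["up", "down", "left", "right"]

# Only legal transitions are listed; a missing (state, action) pair is illegal.
delta = {
    ("s1", "right"): "s2", ("s1", "down"): "s4",
    ("s2", "left"): "s1",  ("s2", "right"): "G", ("s2", "down"): "s5",
    ("s4", "up"): "s1",    ("s4", "right"): "s5",
    ("s5", "left"): "s4",  ("s5", "up"): "s2",   ("s5", "right"): "s6",
    ("s6", "left"): "s5",  ("s6", "up"): "G",
}

def r(s, a):
    """Reward for taking action a in state s: 10 for moving into G, else 0."""
    return 10 if delta.get((s, a)) == "G" else 0

def step(s, a):
    """One step of the process on Slide 9: observe s, choose a,
    receive the immediate reward r(s, a), and move to delta(s, a)."""
    if (s, a) not in delta:
        raise ValueError(f"action {a} is not legal in state {s}")
    return delta[(s, a)], r(s, a)

if __name__ == "__main__":
    s, total = "s1", 0
    for a in ["right", "right"]:      # s1 -> s2 -> G
        s, reward = step(s, a)
        total += reward
    print(s, total)                   # prints: G 10
```

Note that G has no outgoing entries in delta here, i.e. it is treated as absorbing; whether the goal state instead loops back to itself with further reward is a modeling choice, not something fixed by the slide.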
Slide 13: MDP Problem: Lifetime Reward
(Repeat of the agent–environment diagram.) Given an environment model as an MDP, create a policy for acting that maximizes the lifetime reward V = r_0 + γ·r_1 + γ²·r_2 + ...

Slide 14: Lifetime Reward
• Finite horizon: rewards accumulate for a fixed period, e.g. $100K + $100K + $100K = $300K.
• Infinite horizon: if reward is assumed to accumulate forever, $100K + $100K + ... grows without bound.
• Discounting: future rewards are not worth as much as immediate ones (a bird in hand ...). Introduce a discount factor γ, so that $100K + γ·$100K + γ²·$100K + ... converges. This will make the math work.

Slides 15–16: MDP Problem / MDP Problem: Policy
(Repeat of the agent–environment diagram and the problem statement.)

Slide 17: Policies (assume a deterministic world)
• A policy π : S → A selects an action for each state.
• An optimal policy π* : S → A selects, for each state, the action that maximizes lifetime reward.

Slide 18
• There are many policies, and not all of them are optimal.
• There may be several optimal policies.

Slide 19: Markov Decision Processes (outline)
• Motivation
• Markov Decision Processes
• Computing Policies From a Model: value functions, mapping value functions to policies, computing value functions
• Summary

Slide 20: Value Function Vπ for a Given Policy π
Vπ(s_t) is the accumulated lifetime reward obtained by starting in state s_t and repeatedly executing policy π:
    Vπ(s_t) = r_t + γ·r_{t+1} + γ²·r_{t+2} + ... = Σ_i γ^i · r_{t+i}
where r_t, r_{t+1}, r_{t+2}, ... are generated by following π starting at s_t. The slide illustrates this on the grid-world example for a fixed discount factor.

Slide 21: An Optimal Policy π* Given Value Function V*
Idea: given state s,
1. Examine the possible actions a_i in state s.
2. Select the action a_i with the greatest lifetime reward.
The lifetime reward Q(s, a_i) is the immediate reward of taking the action, r(s, a), plus the lifetime reward starting in the resulting state, V*(δ(s, a)), discounted by γ:
    π*(s) = argmax_a [ r(s, a) + γ·V*(δ(s, a)) ]
This requires the value function and an environment model:
• δ : S × A → S
• r : S × A → ℝ

Slides 22–24: Example: Mapping Value Function to Policy
The agent selects the optimal action from V:
    π(s) = argmax_a [ r(s, a) + γ·V(δ(s, a)) ]
with γ = 0.9, on a grid-world model whose states carry values such as 90, 81, and 100 and whose transitions into the goal G are rewarded with 100. For the state with candidate actions a and b:
• a: 0 + 0.9 × 100 = 90
• b: 0 + 0.9 × 81 = 72.9
so the agent selects a. For the state next to G: a: 100 + 0.9 × 0 = 100; b: 0 + ...
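Slides 21–24 turn a value function into a policy with π*(s) = argmax_a [ r(s, a) + γ·V*(δ(s, a)) ], and the later slides in the outline (Solving MDPs by Value Iteration, Crib Sheet: MDPs by Value Iteration) compute V* itself. The Python below is a hedged sketch of the deterministic case only, not the lecture's crib sheet: it repeatedly applies the update V(s) ← max_a [ r(s, a) + γ·V(δ(s, a)) ] until the values stop changing, then reads the greedy policy off the result. The function names and the three-state chain in the demo are illustrative assumptions; with γ = 0.9 and reward 100 for entering G, the demo reproduces the 90 and 100 values used in the mapping example above.

```python
# Value iteration for a deterministic MDP (S, A, delta, r), followed by
# greedy policy extraction as on Slides 21-24.  A sketch under the
# deterministic assumption made in this lecture, not the crib sheet verbatim.

def value_iteration(S, A, delta, r, gamma=0.9, eps=1e-6):
    """Return V approximating V*: iterate V(s) <- max_a [r(s,a) + gamma*V(delta(s,a))]."""
    V = {s: 0.0 for s in S}
    while True:
        max_change = 0.0
        for s in S:
            legal = [a for a in A if (s, a) in delta]
            if not legal:                      # absorbing / terminal state
                continue
            best = max(r(s, a) + gamma * V[delta[(s, a)]] for a in legal)
            max_change = max(max_change, abs(best - V[s]))
            V[s] = best
        if max_change < eps:
            return V

def greedy_policy(S, A, delta, r, V, gamma=0.9):
    """Map a value function to a policy: pi(s) = argmax_a [r(s,a) + gamma*V(delta(s,a))]."""
    pi = {}
    for s in S:
        legal = [a for a in A if (s, a) in delta]
        if legal:
            pi[s] = max(legal, key=lambda a: r(s, a) + gamma * V[delta[(s, a)]])
    return pi

if __name__ == "__main__":
    # Made-up 3-state chain s1 -> s2 -> G with reward 100 for entering G.
    S = ["s1", "s2", "G"]
    A = ["right"]
    delta = {("s1", "right"): "s2", ("s2", "right"): "G"}

    def r(s, a):
        return 100 if delta.get((s, a)) == "G" else 0

    V = value_iteration(S, A, delta, r, gamma=0.9)
    print(V)                                   # {'s1': 90.0, 's2': 100.0, 'G': 0.0}
    print(greedy_policy(S, A, delta, r, V))    # {'s1': 'right', 's2': 'right'}
```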

