CMU CS 10701 - mdps

Markov Decision Processes (MDPs)
Machine Learning 10-701/15-781
Carlos Guestrin, Carnegie Mellon University
April 30th, 2007
(c) 2005-2007 Carlos Guestrin

Thus far this semester
- Regression
- Classification
- Density estimation

Learning to act: reinforcement learning
- An agent makes sensor observations, must select actions, and receives rewards [Ng et al. '05]
- Rewards are positive for "good" states and negative for "bad" states

Learning to play backgammon [Tesauro '95]
- Combines reinforcement learning with neural networks
- Played 300,000 games against itself
- Achieved grandmaster level

Roadmap to learning about reinforcement learning
- When we learned about Bayes nets, we first talked about the formal framework (representation, inference), and then about learning for BNs
- For reinforcement learning we follow the same roadmap: first the formal framework, Markov decision processes, then learning

Real-time strategy game (peasants, footmen, buildings)
- Peasants collect resources and build
- Footmen attack enemies
- Buildings train peasants and footmen

States and actions
- State space: the joint state x of the entire system
- Action space: the joint action a = (a1, ..., an) for all agents

States change over time
- As in an HMM, the state changes over time
- The next state depends on the current state and the action selected; e.g., the action "build castle" is likely to lead to a state where you have a castle
- Transition model: the dynamics of the entire system, P(x' | x, a)

Some states and actions are better than others
- Each state x is associated with a reward, e.g., a positive reward for a successful attack and a negative reward for a loss
- Reward function: total reward R(x)

Discounted rewards
- An assistant professor gets paid, say, $20K per year. How much in total will the A.P. earn in their life?
  20 + 20 + 20 + 20 + 20 + ... = infinity
- What's wrong with this argument?

Discounted rewards
- A reward (payment) in the future is not worth quite as much as a reward now, because of the chance of obliteration and because of inflation
- Example: being promised $10,000 next year is worth only 90% as much as receiving $10,000 right now
- Assuming a payment n years in the future is worth only (0.9)^n of a payment now, what is the A.P.'s future discounted sum of rewards?

Discount factors
- People in economics and probabilistic decision making do this all the time
- The discounted sum of future rewards using discount factor γ is
  (reward now) + γ (reward in 1 time step) + γ^2 (reward in 2 time steps) + γ^3 (reward in 3 time steps) + ... (an infinite sum)

The academic life (assume discount factor γ = 0.9)
- A Markov chain over five states, each with a per-year reward:
  A = Assistant Prof ($20), B = Assoc. Prof ($60), T = Tenured Prof ($400), S = On the Street ($10), D = Dead ($0)
- [Transition diagram: arrows among these states with probabilities 0.6, 0.2, 0.2 out of A and B, and 0.7, 0.3 out of T and S]
- Define V_A = expected discounted future rewards starting in state A, and similarly V_B, V_T, V_S, V_D
- How do we compute V_A, V_B, V_T, V_S, V_D?

Computing the future rewards of an academic
- Same chain as above, with discount factor γ = 0.9

Joint decision space: Markov decision process (MDP) representation
- State space: the joint state x of the entire system
- Action space: the joint action a = (a1, ..., an) for all agents
- Reward function: total reward R(x, a); sometimes the reward can depend on the action
- Transition model: the dynamics of the entire system, P(x' | x, a)

Policy
- A policy assigns a joint action to each state: π(x) = a, the action taken by all agents at state x
- Example: π(x0) = both peasants get wood; π(x1) = one peasant builds a barrack, the other gets gold; π(x2) = peasants get gold, footmen attack

Value of a policy
- Value V_π(x): the expected long-term reward starting from x and following policy π
- Starting from x0: V_π(x0) = E[R(x0) + γ R(x1) + γ^2 R(x2) + γ^3 R(x3) + γ^4 R(x4) + ...]
- Future rewards are discounted by γ ∈ [0, 1)

Computing the value of a policy
- V_π(x0) = E[R(x0) + γ R(x1) + γ^2 R(x2) + γ^3 R(x3) + ...]
- The discounted value of a state is the value of starting from that state and continuing with policy π from then on
- This gives a recursion: V_π(x) = R(x) + γ Σ_x' P(x' | x, π(x)) V_π(x')

Computing the value of a policy, 1: the matrix-inversion approach
- The recursion is linear in V_π, so it can be solved by simple matrix inversion: V_π = (I - γ P_π)^(-1) R

Computing the value of a policy, 2: iteratively
- If you have 1,000,000 states, inverting a 1,000,000 × 1,000,000 matrix is hard
- Instead, solve with a simple convergent iterative approach, a.k.a. dynamic programming
- Start with some guess V_0 and iterate: V_{t+1} = R + γ P_π V_t
- Stop when ||V_{t+1} - V_t||_∞ ≤ ε; this guarantees ||V_π - V_{t+1}||_∞ ≤ ε γ / (1 - γ)
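To make the two evaluation approaches concrete, here is a minimal Python sketch that computes V_A, ..., V_D for the academic-life chain, both by matrix inversion and by the iterative update V_{t+1} = R + γ P V_t. The exact transition probabilities below are an assumption (the usual version of this example); the slides' diagram did not survive the text extraction.

import numpy as np

# "Academic life" Markov chain: states, per-year rewards, and (assumed) transitions.
states = ["A", "B", "T", "S", "D"]            # Assistant, Associate, Tenured, Street, Dead
R = np.array([20.0, 60.0, 400.0, 10.0, 0.0])  # reward received in each state
gamma = 0.9                                   # discount factor

# P[i, j] = probability of moving from state i to state j in one step (assumed numbers)
P = np.array([
    [0.6, 0.2, 0.0, 0.2, 0.0],   # A: stay, promoted to B, or on the street
    [0.0, 0.6, 0.2, 0.2, 0.0],   # B: stay, tenured, or on the street
    [0.0, 0.0, 0.7, 0.0, 0.3],   # T: stay or die
    [0.0, 0.0, 0.0, 0.7, 0.3],   # S: stay or die
    [0.0, 0.0, 0.0, 0.0, 1.0],   # D: absorbing
])

# Approach 1: matrix inversion.  V = R + gamma * P V is linear, so V = (I - gamma P)^(-1) R.
V_exact = np.linalg.solve(np.eye(len(states)) - gamma * P, R)

# Approach 2: dynamic programming.  Iterate V_{t+1} = R + gamma * P V_t until convergence.
def evaluate_iteratively(R, P, gamma, epsilon=1e-6):
    V = np.zeros_like(R)
    while True:
        V_next = R + gamma * P @ V
        if np.max(np.abs(V_next - V)) <= epsilon:
            # then ||V_true - V_next||_inf <= epsilon * gamma / (1 - gamma)
            return V_next
        V = V_next

V_iter = evaluate_iteratively(R, P, gamma)
for s, v in zip(states, V_exact):
    print(f"V_{s} = {v:8.2f}")

Both approaches agree up to the stopping tolerance; the linear solve is exact but scales cubically with the number of states, which is exactly the motivation the slides give for the iterative approach.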
But we want to learn a policy
- So far we have only said how good a given policy is; how can we choose the best policy?
- Recall the policy π(x) = a: e.g., π(x0) = both peasants get wood; π(x1) = one peasant builds a barrack, the other gets gold; π(x2) = peasants get gold, footmen attack
- Suppose there were only one time step: the world is about to end, so select the action that maximizes the immediate reward

Another recursion: two time steps
- With two time steps we must address the tradeoff between a good reward now and a better reward in the future

Unrolling the recursion
- Choose actions that lead to the best value in the long run
- Optimal value: the optimal policy π* achieves the optimal value V*

Bellman equation [Bellman '60]
- From evaluating a fixed policy to computing the optimal value V*
- Bellman equation: V*(x) = max_a [ R(x, a) + γ Σ_x' P(x' | x, a) V*(x') ]

Optimal long-term plan
- Optimal value function V*(x) and optimal policy π*(x)
- Define Q*(x, a) = R(x, a) + γ Σ_x' P(x' | x, a) V*(x')
- Optimal policy: π*(x) = argmax_a Q*(x, a)

Interesting fact: unique value
- V*(x) = max_a [ R(x, a) + γ Σ_x' P(x' | x, a) V*(x') ]
- Slightly surprising fact: there is only one V* that solves the Bellman equation, even though there may be many optimal policies that achieve V*
- Surprising fact: optimal policies are good everywhere, regardless of the state you start from

Solving an MDP
- Solving the Bellman equation yields the optimal value V*(x) and the optimal policy π*(x):
  V*(x) = max_a [ R(x, a) + γ Σ_x' P(x' | x, a) V*(x') ]
- The Bellman equation is non-linear
- Many algorithms solve the Bellman equations:
  Policy iteration [Howard '60, Bellman '57]
  Value iteration [Bellman '57]
  Linear programming [Manne '60]

Value iteration (a.k.a. dynamic programming), the simplest ...
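The preview cuts off just as the value-iteration slide begins. As a rough sketch of where it is headed, the code below applies the Bellman backup V_{t+1}(x) = max_a [ R(x, a) + γ Σ_x' P(x' | x, a) V_t(x') ] to a small tabular MDP and then extracts the greedy policy from Q; the array layout, function names, and toy MDP are illustrative assumptions, not the slides' notation.

import numpy as np

def value_iteration(R, P, gamma, epsilon=1e-6):
    """Tabular value iteration.

    R: (n_states, n_actions) array of rewards R(x, a)
    P: (n_actions, n_states, n_states) array with P[a, x, x'] = P(x' | x, a)
    Returns the (near-)optimal value function and a greedy policy.
    """
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        # Q(x, a) = R(x, a) + gamma * sum_x' P(x'|x, a) V(x')
        Q = R + gamma * np.einsum("axy,y->xa", P, V)
        V_next = Q.max(axis=1)               # Bellman backup: V(x) = max_a Q(x, a)
        if np.max(np.abs(V_next - V)) <= epsilon:
            break
        V = V_next
    # Greedy policy with respect to the converged values: pi*(x) = argmax_a Q*(x, a)
    Q = R + gamma * np.einsum("axy,y->xa", P, V_next)
    return V_next, Q.argmax(axis=1)

# Tiny made-up MDP with 2 states and 2 actions, just to show the call.
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
P = np.array([[[0.9, 0.1], [0.1, 0.9]],      # transitions under action 0
              [[0.5, 0.5], [0.5, 0.5]]])     # transitions under action 1
V_star, pi_star = value_iteration(R, P, gamma=0.9)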

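The slides also list policy iteration [Howard '60, Bellman '57] among the algorithms that solve the Bellman equations, without detail in this preview. A minimal sketch under the same assumed tabular layout as above: alternate exact evaluation of the current policy with greedy improvement until the policy stops changing.

import numpy as np

def policy_iteration(R, P, gamma):
    """Tabular policy iteration (illustrative sketch).

    R: (n_states, n_actions), P: (n_actions, n_states, n_states)."""
    n_states, n_actions = R.shape
    policy = np.zeros(n_states, dtype=int)        # start from an arbitrary policy
    while True:
        # Policy evaluation by matrix inversion: V = (I - gamma * P_pi)^(-1) R_pi
        P_pi = P[policy, np.arange(n_states), :]  # row x is P(. | x, policy(x))
        R_pi = R[np.arange(n_states), policy]     # reward of the chosen action in each state
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to Q
        Q = R + gamma * np.einsum("axy,y->xa", P, V)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return V, policy
        policy = new_policy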
