CMU CS 10701 - mdps

Markov Decision Processes (MDPs)
Machine Learning 10-701/15-781
Carlos Guestrin, Carnegie Mellon University
April 30th, 2007
(c) 2005-2007 Carlos Guestrin

Thus far this semester
- Regression
- Classification
- Density estimation

Learning to act: reinforcement learning
- An agent makes sensor observations, must select actions, and receives rewards [Ng et al. '05]
- Rewards are positive for "good" states and negative for "bad" states

Learning to play backgammon [Tesauro '95]
- Combines reinforcement learning with neural networks
- Played 300,000 games against itself
- Achieved grandmaster level

Roadmap to learning about reinforcement learning
- When we learned about Bayes nets, we first talked about the formal framework (representation, inference), and then about learning for BNs
- For reinforcement learning we follow the same roadmap: first the formal framework, Markov decision processes, then learning

Real-time strategy game (peasants, footmen, buildings)
- Peasants collect resources and build
- Footmen attack enemies
- Buildings train peasants and footmen

States and actions
- State space: the joint state x of the entire system
- Action space: the joint action a = (a1, ..., an) for all agents

States change over time
- As in an HMM, the state changes over time
- The next state depends on the current state and the action selected; e.g., the action "build castle" is likely to lead to a state where you have a castle
- Transition model: the dynamics of the entire system, P(x' | x, a)

Some states and actions are better than others
- Each state x is associated with a reward, e.g., a positive reward for a successful attack and a negative reward for a loss
- Reward function: total reward R(x)

Discounted rewards
- An assistant professor gets paid, say, $20K per year. How much in total will the A.P. earn in their life?
  20 + 20 + 20 + 20 + 20 + ... = infinity
- What's wrong with this argument?

Discounted rewards
- A reward (payment) in the future is not worth quite as much as a reward now, because of the chance of obliteration and because of inflation
- Example: being promised $10,000 next year is worth only 90% as much as receiving $10,000 right now
- Assuming a payment n years in the future is worth only (0.9)^n of a payment now, what is the A.P.'s future discounted sum of rewards?

Discount factors
- People in economics and probabilistic decision making do this all the time
- The discounted sum of future rewards using discount factor γ is
  (reward now) + γ (reward in 1 time step) + γ^2 (reward in 2 time steps) + γ^3 (reward in 3 time steps) + ... (an infinite sum)

The academic life (assume discount factor γ = 0.9)
- A Markov chain over five states, each with a per-year reward:
  A = Assistant Prof ($20), B = Assoc. Prof ($60), T = Tenured Prof ($400), S = On the Street ($10), D = Dead ($0)
- [Transition diagram: arrows among these states with probabilities 0.6, 0.2, 0.2 out of A and B, and 0.7, 0.3 out of T and S]
- Define V_A = expected discounted future rewards starting in state A, and similarly V_B, V_T, V_S, V_D
- How do we compute V_A, V_B, V_T, V_S, V_D?

Computing the future rewards of an academic
- Same chain as above, with discount factor γ = 0.9

Joint decision space: Markov decision process (MDP) representation
- State space: the joint state x of the entire system
- Action space: the joint action a = (a1, ..., an) for all agents
- Reward function: total reward R(x, a); sometimes the reward can depend on the action
- Transition model: the dynamics of the entire system, P(x' | x, a)

Policy
- A policy assigns a joint action to each state: π(x) = a, the action taken by all agents at state x
- Example: π(x0) = both peasants get wood; π(x1) = one peasant builds a barrack, the other gets gold; π(x2) = peasants get gold, footmen attack

Value of a policy
- Value V_π(x): the expected long-term reward starting from x and following policy π
- Starting from x0: V_π(x0) = E[R(x0) + γ R(x1) + γ^2 R(x2) + γ^3 R(x3) + γ^4 R(x4) + ...]
- Future rewards are discounted by γ ∈ [0, 1)

Computing the value of a policy
- V_π(x0) = E[R(x0) + γ R(x1) + γ^2 R(x2) + γ^3 R(x3) + ...]
- The discounted value of a state is the value of starting from that state and continuing with policy π from then on
- This gives a recursion: V_π(x) = R(x) + γ Σ_x' P(x' | x, π(x)) V_π(x')

Computing the value of a policy, 1: the matrix-inversion approach
- The recursion is linear in V_π, so it can be solved by simple matrix inversion: V_π = (I - γ P_π)^(-1) R

Computing the value of a policy, 2: iteratively
- If you have 1,000,000 states, inverting a 1,000,000 × 1,000,000 matrix is hard
- Instead, solve with a simple convergent iterative approach, a.k.a. dynamic programming
- Start with some guess V_0 and iterate: V_{t+1} = R + γ P_π V_t
- Stop when ||V_{t+1} - V_t||_∞ ≤ ε; this guarantees ||V_π - V_{t+1}||_∞ ≤ ε γ / (1 - γ)
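To make the two evaluation approaches concrete, here is a minimal Python sketch that computes V_A, ..., V_D for the academic-life chain, both by matrix inversion and by the iterative update V_{t+1} = R + γ P V_t. The exact transition probabilities below are an assumption (the usual version of this example); the slides' diagram did not survive the text extraction.

import numpy as np

# "Academic life" Markov chain: states, per-year rewards, and (assumed) transitions.
states = ["A", "B", "T", "S", "D"]            # Assistant, Associate, Tenured, Street, Dead
R = np.array([20.0, 60.0, 400.0, 10.0, 0.0])  # reward received in each state
gamma = 0.9                                   # discount factor

# P[i, j] = probability of moving from state i to state j in one step (assumed numbers)
P = np.array([
    [0.6, 0.2, 0.0, 0.2, 0.0],   # A: stay, promoted to B, or on the street
    [0.0, 0.6, 0.2, 0.2, 0.0],   # B: stay, tenured, or on the street
    [0.0, 0.0, 0.7, 0.0, 0.3],   # T: stay or die
    [0.0, 0.0, 0.0, 0.7, 0.3],   # S: stay or die
    [0.0, 0.0, 0.0, 0.0, 1.0],   # D: absorbing
])

# Approach 1: matrix inversion.  V = R + gamma * P V is linear, so V = (I - gamma P)^(-1) R.
V_exact = np.linalg.solve(np.eye(len(states)) - gamma * P, R)

# Approach 2: dynamic programming.  Iterate V_{t+1} = R + gamma * P V_t until convergence.
def evaluate_iteratively(R, P, gamma, epsilon=1e-6):
    V = np.zeros_like(R)
    while True:
        V_next = R + gamma * P @ V
        if np.max(np.abs(V_next - V)) <= epsilon:
            # then ||V_true - V_next||_inf <= epsilon * gamma / (1 - gamma)
            return V_next
        V = V_next

V_iter = evaluate_iteratively(R, P, gamma)
for s, v in zip(states, V_exact):
    print(f"V_{s} = {v:8.2f}")

Both approaches agree up to the stopping tolerance; the linear solve is exact but scales cubically with the number of states, which is exactly the motivation the slides give for the iterative approach.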
But we want to learn a policy
- So far we have only said how good a given policy is; how can we choose the best policy?
- Recall the policy π(x) = a: e.g., π(x0) = both peasants get wood; π(x1) = one peasant builds a barrack, the other gets gold; π(x2) = peasants get gold, footmen attack
- Suppose there were only one time step: the world is about to end, so select the action that maximizes the immediate reward

Another recursion: two time steps
- With two time steps we must address the tradeoff between a good reward now and a better reward in the future

Unrolling the recursion
- Choose actions that lead to the best value in the long run
- Optimal value: the optimal policy π* achieves the optimal value V*

Bellman equation [Bellman '60]
- From evaluating a fixed policy to computing the optimal value V*
- Bellman equation: V*(x) = max_a [ R(x, a) + γ Σ_x' P(x' | x, a) V*(x') ]

Optimal long-term plan
- Optimal value function V*(x) and optimal policy π*(x)
- Define Q*(x, a) = R(x, a) + γ Σ_x' P(x' | x, a) V*(x')
- Optimal policy: π*(x) = argmax_a Q*(x, a)

Interesting fact: unique value
- V*(x) = max_a [ R(x, a) + γ Σ_x' P(x' | x, a) V*(x') ]
- Slightly surprising fact: there is only one V* that solves the Bellman equation, even though there may be many optimal policies that achieve V*
- Surprising fact: optimal policies are good everywhere, regardless of the state you start from

Solving an MDP
- Solving the Bellman equation yields the optimal value V*(x) and the optimal policy π*(x):
  V*(x) = max_a [ R(x, a) + γ Σ_x' P(x' | x, a) V*(x') ]
- The Bellman equation is non-linear
- Many algorithms solve the Bellman equations:
  Policy iteration [Howard '60, Bellman '57]
  Value iteration [Bellman '57]
  Linear programming [Manne '60]

Value iteration (a.k.a. dynamic programming), the simplest ...
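The preview cuts off just as the value-iteration slide begins. As a rough sketch of where it is headed, the code below applies the Bellman backup V_{t+1}(x) = max_a [ R(x, a) + γ Σ_x' P(x' | x, a) V_t(x') ] to a small tabular MDP and then extracts the greedy policy from Q; the array layout, function names, and toy MDP are illustrative assumptions, not the slides' notation.

import numpy as np

def value_iteration(R, P, gamma, epsilon=1e-6):
    """Tabular value iteration.

    R: (n_states, n_actions) array of rewards R(x, a)
    P: (n_actions, n_states, n_states) array with P[a, x, x'] = P(x' | x, a)
    Returns the (near-)optimal value function and a greedy policy.
    """
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        # Q(x, a) = R(x, a) + gamma * sum_x' P(x'|x, a) V(x')
        Q = R + gamma * np.einsum("axy,y->xa", P, V)
        V_next = Q.max(axis=1)               # Bellman backup: V(x) = max_a Q(x, a)
        if np.max(np.abs(V_next - V)) <= epsilon:
            break
        V = V_next
    # Greedy policy with respect to the converged values: pi*(x) = argmax_a Q*(x, a)
    Q = R + gamma * np.einsum("axy,y->xa", P, V_next)
    return V_next, Q.argmax(axis=1)

# Tiny made-up MDP with 2 states and 2 actions, just to show the call.
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
P = np.array([[[0.9, 0.1], [0.1, 0.9]],      # transitions under action 0
              [[0.5, 0.5], [0.5, 0.5]]])     # transitions under action 1
V_star, pi_star = value_iteration(R, P, gamma=0.9)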

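The slides also list policy iteration [Howard '60, Bellman '57] among the algorithms that solve the Bellman equations, without detail in this preview. A minimal sketch under the same assumed tabular layout as above: alternate exact evaluation of the current policy with greedy improvement until the policy stops changing.

import numpy as np

def policy_iteration(R, P, gamma):
    """Tabular policy iteration (illustrative sketch).

    R: (n_states, n_actions), P: (n_actions, n_states, n_states)."""
    n_states, n_actions = R.shape
    policy = np.zeros(n_states, dtype=int)        # start from an arbitrary policy
    while True:
        # Policy evaluation by matrix inversion: V = (I - gamma * P_pi)^(-1) R_pi
        P_pi = P[policy, np.arange(n_states), :]  # row x is P(. | x, policy(x))
        R_pi = R[np.arange(n_states), policy]     # reward of the chosen action in each state
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to Q
        Q = R + gamma * np.einsum("axy,y->xa", P, V)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return V, policy
        policy = new_policy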
