Machine Learning CS6375, Spring 2015
Markov Decision Processes
(Slides based on Andrew Moore's MDP tutorial.)

Rewards

An assistant professor gets paid, say, $20K per year. How much in total will the A.P. earn in their life?

  20 + 20 + 20 + 20 + 20 + ... = infinity

What's wrong with this argument?

Discounted Rewards

A reward (payment) in the future is not worth quite as much as a reward now:
- because of the chance of obliteration
- because of inflation

Example: being promised $10,000 next year is worth only 90% as much as receiving $10,000 right now. Assuming a payment n years in the future is worth only (0.9)^n of a payment now, what is the A.P.'s future discounted sum of rewards?

Discount Factors

People in economics and probabilistic decision making do this all the time. The discounted sum of future rewards using discount factor γ is

  (reward now) + γ (reward in 1 time step) + γ^2 (reward in 2 time steps) + γ^3 (reward in 3 time steps) + ...   (an infinite sum)

Discounting

This sum always converges if γ < 1 and the reward function R is bounded.
- γ close to 0: "instant gratification"; pay little attention to future reward.
- γ close to 1: extremely conservative; consider profits and losses no matter how far in the future.
The resulting model is the discounted reward model.

The Academic Life

Define J(A) = expected discounted future rewards starting in state A, and likewise J(B), J(T), J(S), J(D) for states B, T, S, D. How do we compute J(A), J(B), J(T), J(S), J(D)?

A Markov System with Rewards

- Has a set of states {S_1, S_2, ..., S_N}.
- Has a transition probability matrix P = [P_ij], where P_ij = Prob(NextState = S_j | ThisState = S_i).
- Each state has a reward: r_1, r_2, ..., r_N.
- There is a discount factor γ, with 0 < γ < 1.

On each time step:
0. Assume your state is S_i.
1. You get given reward r_i.
2. You randomly move to another state: P(NextState = S_j | ThisState = S_i) = P_ij.
3. All future rewards are discounted by γ.

Solving a Markov System

Write J(S_i) = expected discounted sum of future rewards starting in state S_i. Then

  J(S_i) = r_i + γ × (expected future rewards starting from your next state)
         = r_i + γ (P_i1 J(S_1) + P_i2 J(S_2) + ... + P_iN J(S_N))

Using vector notation, with J = (J(S_1), ..., J(S_N))^T, R = (r_1, ..., r_N)^T, and P the N×N transition matrix, this is

  J = R + γ P J

Question: can you get a closed-form expression for J in terms of R, P and γ?

Solving a Markov System with Matrix Inversion

  J = R + γ P J   =>   J = (I - γP)^(-1) R

Upside: you get an exact answer. Downside: if you have 100,000 states, you're solving a 100,000-by-100,000 system of equations.

Value Iteration: Another Way to Solve a Markov System

Define:
  J^1(S_i) = expected discounted sum of rewards over the next 1 time step
  J^2(S_i) = expected discounted sum of rewards during the next 2 steps
  J^3(S_i) = expected discounted sum of rewards during the next 3 steps
  J^k(S_i) = expected discounted sum of rewards during the next k steps

Then, with N = the number of states:
  J^1(S_i) = r_i
  J^2(S_i) = r_i + γ Σ_{j=1..N} p_ij J^1(S_j)
  ...
  J^{k+1}(S_i) = r_i + γ Σ_{j=1..N} p_ij J^k(S_j)

Let's Do Value Iteration / Value Iteration Example
(worked numerical example on the original slides)

Value Iteration for Solving Markov Systems

- Compute J^1(S_i) for each i.
- Compute J^2(S_i) for each i.
- ...
- Compute J^k(S_i) for each i.

As k → ∞, J^k(S_i) → J(S_i). When to stop: when

  max_i |J^{k+1}(S_i) - J^k(S_i)|

falls below a small tolerance. This is faster than matrix inversion if the transition matrix is sparse.

Different Types of Markov Models

                    No control        Control (actions)
  Fully observable  Markov systems    MDPs
  Hidden state      HMMs              POMDPs

Markov Decision Processes: An Example

I run a startup company. I can choose to either save money or spend money on advertising. If I advertise, I may become famous (50% probability), but I will spend money, so I may become poor. If I save money, I may become rich (50% probability), but I may also become unknown because I don't advertise. What should I do?

A Markov Decision Process

You run a startup company. In every state you must choose between saving money or advertising.

Markov Decision Processes

An MDP has:
- a set of states {S_1, ..., S_N}
- a set of actions {a_1, ..., a_M}
- a set of rewards {r_1, ..., r_N}, one for each state
- a transition probability function: P^k_ij = Prob(NextState = S_j | ThisState = S_i, action = a_k)

On each step:
0. Call the current state S_i.
1. Receive reward r_i.
2. Choose an action from {a_1, ..., a_M}.
3. If you choose action a_k, you move to state S_j with probability P^k_ij.
4. All future rewards are discounted by γ.

What We Are Looking For: a Policy

A policy is a mapping from states to actions. How many policies are there? M^N: one of the M actions for each of the N states.
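The two solvers just described for a Markov system with rewards can be sketched in a few lines of Python. The 3-state system below is purely illustrative (its rewards and transition matrix are assumptions, not taken from the slides); the sketch checks that matrix inversion and value iteration agree.

```python
# Sketch: solving a Markov system with rewards two ways, on a small
# hypothetical 3-state system (numbers are illustrative assumptions).
import numpy as np

gamma = 0.9                          # discount factor
R = np.array([0.0, 0.0, 10.0])       # reward r_i for each state
P = np.array([[0.5, 0.5, 0.0],       # P[i, j] = Prob(next = S_j | this = S_i)
              [0.2, 0.3, 0.5],
              [0.1, 0.0, 0.9]])

# Exact solution: J = R + gamma*P*J  =>  (I - gamma*P) J = R
J_exact = np.linalg.solve(np.eye(3) - gamma * P, R)

# Value iteration: J^{k+1}(S_i) = r_i + gamma * sum_j P_ij * J^k(S_j),
# stopping when successive iterates differ by less than a tolerance.
def value_iterate(R, P, gamma, tol=1e-10):
    J = np.zeros_like(R)
    while True:
        J_next = R + gamma * P @ J
        if np.max(np.abs(J_next - J)) < tol:
            return J_next
        J = J_next

J_vi = value_iterate(R, P, gamma)
print(J_exact)
print(J_vi)   # agrees with the exact solution up to the tolerance
```

For a dense transition matrix, the linear solve is the natural choice; value iteration wins when P is large and sparse, since each sweep is just a sparse matrix-vector product.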
Which one is the best policy, and how do we compute it?

Optimal Decision Policy

A policy π is a mapping from states to actions, π(s) = a: which action should I take in each state? Intuition: π encodes the best action that we can take from any state to maximize future rewards.

The optimal policy is the policy that maximizes the expected utility J(s) of the sequence of states it generates, starting at s.

Example: Finance and Business

- States: status of the company (cash reserves, inventory, etc.).
- Actions: business decisions (advertise, acquire other companies, roll out a product, etc.).
- Uncertainty: due to all the external, uncontrollable factors (the economy, shortages, consumer confidence).
- Optimal policy: the policy for making business decisions that maximizes the expected future profits.

Example: Robotics

- States: 2-D positions.
- Actions: commanded motions (turn by x degrees, move y meters).
- Uncertainty: comes from the fact that the mechanism is not perfect (slippage, etc.) and does not execute the commands exactly.
- Reward: for example, a reward for avoiding forbidden regions.
- Optimal policy: the policy that minimizes the cost to reach the goal.

Key Results

For every MDP there exists an optimal policy: there is no better option, in terms of expected sum of rewards, than to follow this policy. How do we compute the optimal policy?

Computing the Optimal Policy

Idea One: run through all possible policies and select the best. Note that for a given policy, the process is just a Markov system, so we can calculate its expected discounted reward. But this is not tractable: the number of policies is exponential in the number of states.

Computing the Optimal Value Function with Value Iteration

Define J*(S_i) = expected discounted future rewards starting in state S_i, assuming we use the optimal policy.

Define J^n(S_i) = maximum possible expected sum of discounted rewards I can get if I start at state S_i and I live for n time steps. Note that J^1(S_i) = r_i.

Bellman's Equation

  J^{n+1}(S_i) = max_k [ r_i + γ Σ_{j=1..N} P^k_ij J^n(S_j) ]

Value iteration for solving MDPs: compute J^1(S_i) for all i, then J^2(S_i) for all i, ..., then J^n(S_i) for all i, until converged.

Computing J^n(S_i) for Our Example

  n   J^n(PU)   J^n(PF)   J^n(RU)   J^n(RF)
  1    0         0        10        10
  2    0         4.5      14.5      19
  3    2.03      8.55     16.53     25.08
  4    ...

Convergence Results

Convergence is exponentially fast; it becomes slower as γ increases.

Finding the …
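The Bellman backup for the startup MDP can be sketched as follows. The rewards (0 for the poor states, 10 for the rich states) follow the slides; γ = 0.9 and the per-action transition probabilities are assumptions reconstructed from Andrew Moore's tutorial, chosen so that the J^n table above is reproduced.

```python
# Sketch: value iteration (Bellman's equation) for the startup MDP.
# States: PU (poor, unknown), PF (poor, famous), RU (rich, unknown),
# RF (rich, famous). Transition probabilities below are assumptions
# reconstructed from Moore's tutorial, not given verbatim in the slides.
import numpy as np

R = np.array([0.0, 0.0, 10.0, 10.0])  # reward depends only on rich vs poor
gamma = 0.9

# P[a][i, j] = Prob(next = S_j | current = S_i, action = a)
P = {
    "save":      np.array([[1.0, 0.0, 0.0, 0.0],
                           [0.0, 0.5, 0.5, 0.0],
                           [0.5, 0.0, 0.5, 0.0],
                           [0.0, 0.0, 0.5, 0.5]]),
    "advertise": np.array([[0.5, 0.5, 0.0, 0.0],
                           [0.0, 1.0, 0.0, 0.0],
                           [0.5, 0.5, 0.0, 0.0],
                           [0.0, 1.0, 0.0, 0.0]]),
}

J = R.copy()              # J^1(S_i) = r_i
for n in range(2, 4):     # compute J^2 and J^3
    # Bellman backup: J^n(S_i) = r_i + gamma * max_a sum_j P^a_ij J^{n-1}(S_j)
    J = R + gamma * np.max([P[a] @ J for a in P], axis=0)
    print(n, J)
```

Under these assumed transitions the loop yields J^2 = (0, 4.5, 14.5, 19) and J^3 ≈ (2.03, 8.55, 16.53, 25.08), matching the table above; reading off the arg max at each state would give the greedy policy.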