Markov Decision Processes (MDPs)
Machine Learning – 10701/15781
Carlos Guestrin
Carnegie Mellon University
December 2nd, 2009

Markov Decision Process (MDP) Representation
- State space: joint state x of the entire system
- Action space: joint action a = {a1, ..., an} for all agents
- Reward function: total reward R(x,a); sometimes the reward can depend on the action
- Transition model: dynamics of the entire system, P(x'|x,a)

Discount Factors
People in economics and probabilistic decision-making do this all the time. The "discounted sum of future rewards" using discount factor γ is:
(reward now) + γ (reward in 1 time step) + γ² (reward in 2 time steps) + γ³ (reward in 3 time steps) + ...   (an infinite sum)

The Academic Life
Define:
- V_A = expected discounted future rewards starting in state A (Assistant Prof, reward 20)
- V_B = expected discounted future rewards starting in state B (Assoc. Prof, reward 60)
- V_T = expected discounted future rewards starting in state T (Tenured Prof, reward 400)
- V_S = expected discounted future rewards starting in state S (On the Street, reward 10)
- V_D = expected discounted future rewards starting in state D (Dead, reward 0)
How do we compute V_A, V_B, V_T, V_S, V_D?
[Slide diagram: the five states connected by arrows with transition probabilities between 0.2 and 0.7.]

Policy
Policy: π(x) = a — at state x, take action a for all agents.
- π(x0) = both peasants get wood
- π(x1) = one peasant builds a barrack, the other gets gold
- π(x2) = peasants get gold, footmen attack

Value of Policy
Value Vπ(x): the expected long-term reward starting from x. Starting from x0:
Vπ(x0) = Eπ[ R(x0) + γ R(x1) + γ² R(x2) + γ³ R(x3) + γ⁴ R(x4) + ... ]
Future rewards are discounted by γ ∈ [0,1).
[Slide diagram: a tree of trajectories x0 → x1, x1', x1'' → x2 → ... generated by following π, each state accruing its reward R(·).]

Computing the value of a policy
Vπ(x0) = Eπ[ R(x0) + γ R(x1) + γ² R(x2) + γ³ R(x3) + γ⁴ R(x4) + ... ]
The discounted value of a state is the value of starting from x0 and continuing with policy π from then on. This gives a recursion:
Vπ(x) = R(x, π(x)) + γ Σ_{x'} P(x'|x, π(x)) Vπ(x')

Simple approach for computing the value of a policy: iteratively
Can solve using a simple convergent iterative approach (a.k.a. dynamic programming):
- Start with some guess V_0.
- Iteratively set V_{t+1}(x) ← R(x, π(x)) + γ Σ_{x'} P(x'|x, π(x)) V_t(x').
- Stop when ||V_{t+1} − V_t||_∞ < ε; this guarantees ||Vπ − V_{t+1}||_∞ < ε/(1−γ).

But we want to learn a Policy
Policy: π(x) = a — at state x, take action a for all agents (e.g., π(x0) = both peasants get wood; π(x1) = one peasant builds a barrack, the other gets gold; π(x2) = peasants get gold, footmen attack).
So far, we have only said how good a given policy is. But how can we choose the best policy?
Suppose there were only one time step — the world is about to end! Then simply select the action that maximizes the immediate reward.

Unrolling the recursion
Choose actions that lead to the best value in the long run. The optimal policy achieves the optimal value V*.

Bellman equation
Evaluating a fixed policy π:
Vπ(x) = R(x, π(x)) + γ Σ_{x'} P(x'|x, π(x)) Vπ(x')
Computing the optimal value V* — the Bellman equation:
V*(x) = max_a [ R(x,a) + γ Σ_{x'} P(x'|x,a) V*(x') ]

Optimal Long-term Plan
From the optimal value function V*(x) we obtain the optimal policy π*(x):
π*(x) = argmax_a [ R(x,a) + γ Σ_{x'} P(x'|x,a) V*(x') ]
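To make the two computations above concrete — iterative policy evaluation and greedy policy extraction — here is a minimal code sketch for a tiny tabular MDP. The states, actions, rewards, and transition probabilities below are made-up placeholders for illustration, not an example from the lecture.

```python
gamma = 0.9
states  = ["s1", "s2"]
actions = ["a1", "a2"]
# Illustrative reward R(x,a) and transition model P[(x,a)] = {x': P(x'|x,a)} -- not from the lecture.
R = {("s1", "a1"): 0.0, ("s1", "a2"): 1.0,
     ("s2", "a1"): 5.0, ("s2", "a2"): 3.0}
P = {("s1", "a1"): {"s1": 0.5, "s2": 0.5},
     ("s1", "a2"): {"s1": 1.0},
     ("s2", "a1"): {"s2": 1.0},
     ("s2", "a2"): {"s1": 0.5, "s2": 0.5}}

def backup(V, x, a):
    # One-step lookahead: R(x,a) + gamma * sum_x' P(x'|x,a) V(x')
    return R[(x, a)] + gamma * sum(p * V[x2] for x2, p in P[(x, a)].items())

def evaluate_policy(pi, eps=1e-8):
    # Start with some guess V_0 and iterate V_{t+1}(x) = R(x, pi(x)) + gamma * sum_x' P(x'|x, pi(x)) V_t(x')
    V = {x: 0.0 for x in states}
    while True:
        V_new = {x: backup(V, x, pi[x]) for x in states}
        if max(abs(V_new[x] - V[x]) for x in states) < eps:   # ||V_{t+1} - V_t||_inf < eps
            return V_new
        V = V_new

def greedy_policy(V):
    # pi(x) = argmax_a [ R(x,a) + gamma * sum_x' P(x'|x,a) V(x') ]
    return {x: max(actions, key=lambda a: backup(V, x, a)) for x in states}

V_pi = evaluate_policy({"s1": "a1", "s2": "a1"})
print(V_pi, greedy_policy(V_pi))
```

Alternating these two steps — evaluate the current policy, then act greedily with respect to its value — is the policy-iteration scheme cited on the next slide.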
Interesting fact – Unique value
- Slightly surprising fact: there is only one V* that solves the Bellman equation
  V*(x) = max_a [ R(x,a) + γ Σ_{x'} P(x'|x,a) V*(x') ]
  although there may be many optimal policies that achieve V*.
- Surprising fact: optimal policies are good everywhere — the same policy is optimal no matter which state you start from.

Solving an MDP
Many algorithms solve the Bellman equations; solving the Bellman equation gives the optimal value V*(x), and from it the optimal policy π*(x):
- Policy iteration [Howard '60, Bellman '57]
- Value iteration [Bellman '57]
- Linear programming [Manne '60]
- ...
Note that the Bellman equation is non-linear!

Value iteration (a.k.a. dynamic programming) – the simplest of all
- Start with some guess V_0.
- Iteratively set V_{t+1}(x) ← max_a [ R(x,a) + γ Σ_{x'} P(x'|x,a) V_t(x') ].
- Stop when ||V_{t+1} − V_t||_∞ < ε; this guarantees ||V* − V_{t+1}||_∞ < ε/(1−γ).

A simple example
You run a startup company. In every state you must choose between Saving money or Advertising. γ = 0.9.
[Slide diagram: four states — Poor & Unknown (+0), Poor & Famous (+0), Rich & Unknown (+10), Rich & Famous (+10) — connected by Save (S) and Advertise (A) transitions with probabilities 1 or 1/2.]

Let's compute V_t(x) for our example
Using the update V_{t+1}(x) = max_a [ R(x,a) + γ Σ_{x'} P(x'|x,a) V_t(x') ]:

t    V_t(PU)   V_t(PF)   V_t(RU)   V_t(RF)
1    0         0         10        10
2    0         4.5       14.5      19
3    2.03      6.53      25.08     18.55
4    3.852     12.20     29.63     19.26
5    7.22      15.07     32.00     20.40
6    10.03     17.65     33.58     22.43

What you need to know
- What a Markov decision process is: states, actions, transitions, rewards
- What a policy is
- The value function for a policy, and how to compute Vπ
- The optimal value function and the optimal policy
- The Bellman equation
- Solving the Bellman equation with value iteration, policy iteration, and linear programming

Acknowledgment
This lecture contains some material from Andrew Moore's excellent collection of ML tutorials: http://www.cs.cmu.edu/~awm/tutorials

Reinforcement Learning
Machine Learning – 10701/15781
Carlos Guestrin
Carnegie Mellon University
November 29th, 2007

The Reinforcement Learning task
World: You are in state 34. Your immediate reward is 3. You have 3 possible actions.
Robot: I'll take action 2.
World: You are in state 77. Your immediate reward is -7. You have 2 possible actions.
Robot: I'll take action 1.
World: You're in state 34 (again). Your immediate reward is 3. You have 3 possible actions.

Formalizing the (online) reinforcement learning problem
- Given a set of states X and actions A; in some versions of the problem, the sizes of X and A are unknown.
- Interact with the world at each time step t: the world gives state x_t and reward r_t, and you give the next action a_t.
- Goal: (quickly)
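To make the interaction protocol concrete, here is a minimal sketch of that online loop in Python. The `World` class, its methods, and the particular states and rewards it emits are hypothetical stand-ins (loosely echoing the dialogue above), not an interface defined in the lecture.

```python
import random

class World:
    """Hypothetical environment: announces a state and reward, accepts an action.
    The specific numbers are made up for illustration."""
    def __init__(self):
        self.state = 34

    def num_actions(self):
        # In this toy world, state 34 offers 3 actions and state 77 offers 2.
        return 3 if self.state == 34 else 2

    def step(self, action):
        # Move to a new state and return (next state, immediate reward).
        if self.state == 34:
            self.state, reward = (77, 3) if action == 2 else (34, 3)
        else:
            self.state, reward = 34, -7
        return self.state, reward

# Online interaction: at each step t the world gives state x_t and reward r_t,
# and the learner replies with the next action a_t.
world = World()
for t in range(5):
    action = random.randrange(world.num_actions())   # a real learner would choose less blindly
    state, reward = world.step(action)
    print(f"t={t}: took action {action}, now in state {state}, reward {reward}")
```

A learning agent would replace the random choice with a policy that it improves from the observed rewards, which is exactly the problem this second part of the lecture sets up.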