Slide 1: Markov Decision Processes (MDPs)
Machine Learning – 10-701/15-781
Carlos Guestrin, Carnegie Mellon University
May 1st, 2006
Reading: Kaelbling et al. 1996 (see class website)

Slide 2: Announcements
- Project poster session: Friday May 5th, 2-5pm, NSH Atrium. Please arrive a little early to set up.
- FCEs! Please, please give us your feedback, it helps us improve the class! ☺ http://www.cmu.edu/fce

Slide 3: Discount Factors
People in economics and probabilistic decision-making do this all the time. The "discounted sum of future rewards" using discount factor γ is
  (reward now) + γ (reward in 1 time step) + γ² (reward in 2 time steps) + γ³ (reward in 3 time steps) + ...   (an infinite sum)

Slide 4: The Academic Life
Define:
- V_A = expected discounted future rewards starting in state A
- V_B = expected discounted future rewards starting in state B
- V_T, V_S, V_D = the same, starting in states T, S, D
How do we compute V_A, V_B, V_T, V_S, V_D?
[Figure: Markov chain over the states A. Assistant Prof (+20), B. Assoc. Prof (+60), T. Tenured Prof (+400), S. On the Street (+10), D. Dead (+0), with transition probabilities between 0.2 and 0.7 on the arrows.]
Assume discount factor γ = 0.9.

Slide 5: Computing the Future Rewards of an Academic
Same chain as above, with discount factor γ = 0.9.

Slide 6: Joint Decision Space
Markov Decision Process (MDP) representation:
- State space: joint state x of the entire system
- Action space: joint action a = {a_1, ..., a_n} for all agents
- Reward function: total reward R(x,a); sometimes the reward can depend on the action
- Transition model: dynamics of the entire system, P(x'|x,a)

Slide 7: Policy
Policy: π(x) = a, i.e., at state x, take action a (for all agents). For example:
- π(x0) = both peasants get wood
- π(x1) = one peasant builds barrack, other gets gold
- π(x2) = peasants get gold, footmen attack

Slide 8: Value of Policy
Value V^π(x): expected long-term reward starting from x and following π. Starting from x0:
  V^π(x0) = E_π[ R(x0) + γ R(x1) + γ² R(x2) + γ³ R(x3) + γ⁴ R(x4) + ... ]
Future rewards are discounted by γ ∈ [0,1).
[Figure: tree of trajectories x0 → x1, x1', x1'' → x2 → x3 → x4, collecting rewards R(·) along the way under the actions π(·).]

Slide 9: Computing the value of a policy
  V^π(x0) = E_π[ R(x0) + γ R(x1) + γ² R(x2) + γ³ R(x3) + γ⁴ R(x4) + ... ]
The discounted value of a state is the value of starting from x0 and continuing with policy π from then on. This gives a recursion:
  V^π(x) = R(x, π(x)) + γ ∑_{x'} P(x'|x, π(x)) V^π(x')

Slide 10: Computing the value of a policy 1 – the matrix inversion approach
The recursion above is a linear system, V^π = R + γ P_π V^π, so it can be solved by simple matrix inversion:
  V^π = (I − γ P_π)^{-1} R

Slide 11: Computing the value of a policy 2 – iteratively
If you have 1,000,000 states, inverting a 1,000,000 × 1,000,000 matrix is hard! You can instead use a simple convergent iterative approach (a.k.a. dynamic programming):
- Start with some guess V_0.
- Iterate: V_{t+1} = R + γ P_π V_t.
- Stop when ||V_{t+1} − V_t||_∞ ≤ ε, which guarantees ||V^π − V_{t+1}||_∞ ≤ ε/(1−γ).
(Both approaches are sketched in code after slide 12 below.)

Slide 12: But we want to learn a Policy
So far I have told you how good a given policy is (e.g., π(x0) = both peasants get wood; π(x1) = one peasant builds barrack, other gets gold; π(x2) = peasants get gold, footmen attack). But how can we choose the best policy?
Suppose there were only one time step: the world is about to end! Then simply select the action that maximizes the immediate reward.
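To make the two policy-evaluation approaches of slides 10 and 11 concrete, here is a minimal numerical sketch (not from the lecture): `evaluate_policy_exact` solves the linear system V^π = R + γ P_π V^π by matrix inversion, and `evaluate_policy_iterative` applies the update V_{t+1} = R + γ P_π V_t until the change is small. The 3-state chain, rewards, discount, and tolerance below are made-up illustration values.

```python
import numpy as np

def evaluate_policy_exact(R, P_pi, gamma):
    """Exact policy evaluation: solve (I - gamma * P_pi) V = R."""
    n = len(R)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R)

def evaluate_policy_iterative(R, P_pi, gamma, eps=1e-6):
    """Iterative policy evaluation: V_{t+1} = R + gamma * P_pi V_t."""
    V = np.zeros_like(R)
    while True:
        V_new = R + gamma * P_pi @ V
        if np.max(np.abs(V_new - V)) < eps:   # ||V_{t+1} - V_t||_inf < eps
            return V_new
        V = V_new

# Hypothetical 3-state chain induced by some fixed policy pi (rows sum to 1).
P_pi = np.array([[0.9, 0.1, 0.0],
                 [0.0, 0.8, 0.2],
                 [0.0, 0.0, 1.0]])
R = np.array([1.0, 2.0, 0.0])   # reward collected in each state under pi
gamma = 0.9

print(evaluate_policy_exact(R, P_pi, gamma))
print(evaluate_policy_iterative(R, P_pi, gamma))
```

For a handful of states the exact solve is simplest; the iterative version is what scales when factoring an |X| × |X| matrix is infeasible.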
Slide 13: Another recursion!
With two time steps we must address a tradeoff: a good reward now versus a better reward in the future.

Slide 14: Unrolling the recursion
Choose actions that lead to the best value in the long run. The optimal policy achieves the optimal value V*.

Slide 15: Bellman equation
Evaluating a fixed policy π is the recursion above; computing the optimal value V* is the Bellman equation:
  V*(x) = max_a [ R(x,a) + γ ∑_{x'} P(x'|x,a) V*(x') ]

Slide 16: Optimal Long-term Plan
Optimal value function V*(x) and optimal policy π*(x). The optimal policy is greedy with respect to the optimal Q-function:
  π*(x) = argmax_a Q*(x,a),   where   Q*(x,a) = R(x,a) + γ ∑_{x'} P(x'|x,a) V*(x')

Slide 17: Interesting fact – Unique value
- Slightly surprising fact: there is only one V* that solves the Bellman equation! (There may be many optimal policies that achieve V*.)
- Surprising fact: optimal policies are good everywhere!

Slide 18: Solving an MDP
Many algorithms solve the Bellman equation, yielding the optimal value V*(x) and from it the optimal policy π*(x):
- Policy iteration [Howard '60, Bellman '57]
- Value iteration [Bellman '57]
- Linear programming [Manne '60]
- ...
Note that the Bellman equation is non-linear (because of the max)!

Slide 19: Value iteration (a.k.a. dynamic programming) – the simplest of all
- Start with some guess V_0.
- Iterate: V_{t+1}(x) = max_a [ R(x,a) + γ ∑_{x'} P(x'|x,a) V_t(x') ]
- Stop when ||V_{t+1} − V_t||_∞ ≤ ε, which guarantees ||V* − V_{t+1}||_∞ ≤ ε/(1−γ).
(A code sketch appears after slide 26 below.)

Slide 20: A simple example
You run a startup company. In every state you must choose between Saving money (S) or Advertising (A). γ = 0.9.
States and rewards: Poor & Unknown (+0), Poor & Famous (+0), Rich & Unknown (+10), Rich & Famous (+10).
[Figure: transition diagram over the four states; under each action the transition probabilities are either 1 or 1/2.]

Slide 21: Let's compute V_t(x) for our example
Apply  V_{t+1}(x) = max_a [ R(x,a) + γ ∑_{x'} P(x'|x,a) V_t(x') ]  with γ = 0.9 and fill in V_t(PU), V_t(PF), V_t(RU), V_t(RF) for t = 1, ..., 6.

Slide 22: Let's compute V_t(x) for our example

  t | V_t(PU) | V_t(PF) | V_t(RU) | V_t(RF)
  1 |  0      |  0      | 10      | 10
  2 |  0      |  4.5    | 14.5    | 19
  3 |  2.03   |  6.53   | 25.08   | 18.55
  4 |  3.852  | 12.20   | 29.63   | 19.26
  5 |  7.22   | 15.07   | 32.00   | 20.40
  6 | 10.03   | 17.65   | 33.58   | 22.43

Slide 23: Policy iteration – Another approach for computing π*
- Start with some guess for a policy π_0.
- Iterate:
    evaluate the policy:  V^{π_t}(x) = R(x, π_t(x)) + γ ∑_{x'} P(x'|x, π_t(x)) V^{π_t}(x')
    improve the policy:   π_{t+1}(x) = argmax_a [ R(x,a) + γ ∑_{x'} P(x'|x,a) V^{π_t}(x') ]
- Stop when the policy stops changing (this usually happens in about 10 iterations), or when ||V_{t+1} − V_t||_∞ ≤ ε, which guarantees ||V* − V_{t+1}||_∞ ≤ ε/(1−γ).
(Also sketched in code after slide 26 below.)

Slide 24: Policy Iteration & Value Iteration: Which is best?
It depends.
- Lots of actions? Choose policy iteration.
- Already got a fair policy? Policy iteration.
- Few actions, acyclic? Value iteration.
Best of both worlds: modified policy iteration [Puterman], a simple mix of value iteration and policy iteration.
A third approach: linear programming.

Slide 25: LP Solution to MDP
The value can be computed by linear programming [Manne '60]:
  minimize  ∑_x V(x)
  subject to  V(x) ≥ R(x,a) + γ ∑_{x'} P(x'|x,a) V(x')   for all x, for all a
- One variable V(x) for each state.
- One constraint for each state x and action a.
- Polynomial-time solution.

Slide 26: What you ...
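Below is a minimal sketch of value iteration with greedy policy extraction (slides 16 and 19), written against an assumed array layout R[x, a] and P[a, x, x'] that the slides do not specify. The tiny two-state, two-action MDP at the bottom is invented for illustration; it is not the startup example above, whose exact transition diagram is not reproduced here.

```python
import numpy as np

def value_iteration(R, P, gamma, eps=1e-6):
    """R[x, a]: reward; P[a, x, x']: transition probabilities.
    Returns the (approximately) optimal value function and a greedy policy."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        # Q[x, a] = R(x, a) + gamma * sum_x' P(x'|x, a) V(x')
        Q = R + gamma * np.einsum('axy,y->xa', P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < eps:    # ||V_{t+1} - V_t||_inf < eps
            return V_new, Q.argmax(axis=1)     # pi*(x) = argmax_a Q*(x, a)
        V = V_new

# Hypothetical 2-state, 2-action MDP (rows of each P[a] sum to 1).
R = np.array([[0.0, 0.0],     # state 0 pays nothing
              [10.0, 10.0]])  # state 1 pays 10 under either action
P = np.array([
    [[1.0, 0.0],              # action 0: state 0 stays put,
     [0.5, 0.5]],             #           state 1 splits 50/50
    [[0.5, 0.5],              # action 1: state 0 splits 50/50,
     [1.0, 0.0]],             #           state 1 jumps to state 0
])
V_star, pi_star = value_iteration(R, P, gamma=0.9)
print(V_star, pi_star)
```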
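And a matching sketch of policy iteration (slide 23), under the same assumed R[x, a] / P[a, x, x'] layout: evaluate the current policy exactly with the matrix-inversion step of slide 10, improve it greedily, and stop when the policy stops changing.

```python
import numpy as np

def policy_iteration(R, P, gamma):
    """R[x, a]: reward; P[a, x, x']: transition probabilities.
    Alternates exact policy evaluation with greedy policy improvement."""
    n_states, n_actions = R.shape
    pi = np.zeros(n_states, dtype=int)          # some initial guess pi_0
    while True:
        # Evaluate pi: solve (I - gamma * P_pi) V = R_pi.
        P_pi = P[pi, np.arange(n_states), :]    # row x is P(. | x, pi(x))
        R_pi = R[np.arange(n_states), pi]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Improve: pi_{t+1}(x) = argmax_a [ R(x,a) + gamma * sum_x' P(x'|x,a) V(x') ]
        Q = R + gamma * np.einsum('axy,y->xa', P, V)
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):          # policy stopped changing
            return V, pi
        pi = pi_new

# Reusing the same invented 2-state, 2-action MDP as in the value-iteration sketch:
R = np.array([[0.0, 0.0], [10.0, 10.0]])
P = np.array([[[1.0, 0.0], [0.5, 0.5]],
              [[0.5, 0.5], [1.0, 0.0]]])
print(policy_iteration(R, P, gamma=0.9))
```

Run on that invented MDP, it should agree with the value-iteration sketch on both the value function and the greedy policy.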