Markov Decision Processes (MDPs)
Machine Learning – 10-701/15-781
Carlos Guestrin, Carnegie Mellon University
April 30th, 2007
©2005-2007 Carlos Guestrin

Thus far this semester
- Regression:
- Classification:
- Density estimation:

Learning to act
- Reinforcement learning
- An agent
  - makes sensor observations
  - must select an action
  - receives rewards: positive for "good" states, negative for "bad" states
[Ng et al. '05]

Learning to play backgammon [Tesauro '95]
- Combines reinforcement learning with neural networks
- Played 300,000 games against itself
- Achieved grandmaster level!

Roadmap to learning about reinforcement learning
- When we learned about Bayes nets:
  - first the formal framework: representation, inference
  - then learning for BNs
- For reinforcement learning:
  - first the formal framework: Markov decision processes
  - then learning

Real-time strategy game
[Screenshot: peasants, footmen, and buildings]
- Peasants collect resources and build
- Footmen attack enemies
- Buildings train peasants and footmen

States and actions
- State space: joint state x of the entire system
- Action space: joint action a = {a1, ..., an} for all agents

States change over time
- Like an HMM, the state changes over time
- The next state depends on the current state and the action selected
  - e.g., the action "build castle" is likely to lead to a state where you have a castle
- Transition model: dynamics of the entire system, P(x'|x,a)

Some states and actions are better than others
- Each state x is associated with a reward
  - positive reward for a successful attack
  - negative for a loss
- Reward function: total reward R(x)
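The ingredients just defined (state space, action space, transition model P(x'|x,a), and reward R(x)) can be made concrete in a few lines of code. This is a minimal sketch, not from the lecture: the castle example is borrowed from the slide above, but the state names, probabilities, and reward values are invented for illustration.

```python
import random

# A toy MDP, invented for illustration (not the game from the slides).
# transitions maps (state, action) -> {next_state: probability},
# i.e., the transition model P(x'|x,a) from the slides.
transitions = {
    ("no_castle", "build_castle"): {"castle": 0.8, "no_castle": 0.2},
    ("no_castle", "wait"):         {"no_castle": 1.0},
    ("castle",    "wait"):         {"castle": 1.0},
}

# Reward function R(x): each state carries a total reward.
rewards = {"no_castle": 0.0, "castle": 10.0}

def step(state, action):
    """Sample a next state from P(x'|x,a) and return it with its reward."""
    dist = transitions[(state, action)]
    next_state = random.choices(list(dist), weights=list(dist.values()))[0]
    return next_state, rewards[next_state]

print(step("no_castle", "build_castle"))  # e.g., ('castle', 10.0)
```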
Discounted rewards
- An assistant professor gets paid, say, $20K per year. How much, in total, will the A.P. earn in their life?
- 20 + 20 + 20 + 20 + 20 + ... = infinity
- What's wrong with this argument?

Discounted rewards
- "A reward (payment) in the future is not worth quite as much as a reward now."
  - because of the chance of obliteration
  - because of inflation
- Example: being promised $10,000 next year is worth only 90% as much as receiving $10,000 right now.
- Assuming a payment n years in the future is worth only (0.9)^n of a payment now, what is the A.P.'s future discounted sum of rewards?

Discount factors
- People in economics and probabilistic decision-making do this all the time.
- The "discounted sum of future rewards" using discount factor γ is
  (reward now) + γ (reward in 1 time step) + γ² (reward in 2 time steps) + γ³ (reward in 3 time steps) + ... (an infinite sum)
- For the A.P. example this is the geometric series 20 + 0.9·20 + 0.9²·20 + ... = 20/(1 - 0.9) = 200, i.e., $200K.

The Academic Life
- Define:
  - V_A = expected discounted future rewards starting in state A
  - V_B = expected discounted future rewards starting in state B
  - V_T = expected discounted future rewards starting in state T
  - V_S = expected discounted future rewards starting in state S
  - V_D = expected discounted future rewards starting in state D
- How do we compute V_A, V_B, V_T, V_S, V_D?
[State diagram: A. Assistant Prof (reward 20), B. Assoc. Prof (60), T. Tenured Prof (400), S. On the Street (10), D. Dead (0); transition probabilities between 0.2 and 0.7 label the edges. Assume discount factor γ = 0.9.]

Computing the future rewards of an academic
[Same state diagram as above.]

Joint decision space
Markov decision process (MDP) representation:
- State space: joint state x of the entire system
- Action space: joint action a = {a1, ..., an} for all agents
- Reward function: total reward R(x,a); sometimes the reward can depend on the action
- Transition model: dynamics of the entire system, P(x'|x,a)

Policy
- Policy: π(x) = a; at state x, take action a for all agents
- π(x0) = both peasants get wood
- π(x1) = one peasant builds barracks, the other gets gold
- π(x2) = peasants get gold, footmen attack

Value of policy
- Value: V^π(x) = expected long-term reward starting from x
- Start from x0:
  V^π(x0) = E_π[R(x0) + γ R(x1) + γ² R(x2) + γ³ R(x3) + γ⁴ R(x4) + ...]
- Future rewards discounted by γ ∈ [0,1)
[Diagram: a tree of possible trajectories x0 → x1, x1', x1'' → x2 → ..., each state collecting its reward R(·) under the actions chosen by π.]

Computing the value of a policy
- V^π(x0) = E_π[R(x0) + γ R(x1) + γ² R(x2) + γ³ R(x3) + γ⁴ R(x4) + ...]
- Discounted value of a state: the value of starting from x0 and continuing with policy π from then on
- A recursion! V^π(x) = R(x) + γ Σ_{x'} P(x'|x,π(x)) V^π(x')

Computing the value of a policy 1: the matrix inversion approach
- Solve by simple matrix inversion: in matrix form the recursion reads V^π = R + γ P^π V^π, so
  V^π = (I - γ P^π)^{-1} R

Computing the value of a policy 2: iteratively
- If you have 1,000,000 states, inverting a 1,000,000 × 1,000,000 matrix is hard!
- Can solve instead using a simple convergent iterative approach (a.k.a. dynamic programming):
  - start with some guess V_0
  - iteratively set V_{t+1} = R + γ P^π V_t
  - stop when ||V_{t+1} - V_t||_∞ ≤ ε; this guarantees ||V^π - V_{t+1}||_∞ ≤ ε/(1 - γ)

But we want to learn a policy
- (Recall: a policy π(x) = a assigns a joint action to every state, as in the examples above.)
- So far, I've told you how good a policy is...
- But how can we choose the best policy???
- Suppose there was only one time step: the world is about to end!!!
  - select the action that maximizes the immediate reward!
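Before turning to finding the best policy, here is a small NumPy sketch of the two policy-evaluation approaches above. The 3-state chain is hypothetical: the exact edge probabilities of the Academic Life diagram are not recoverable here, so P_pi and the rewards below are only loosely inspired by it. The two formulas, however, are exactly the slides': V^π = (I - γ P^π)^{-1} R and V_{t+1} = R + γ P^π V_t.

```python
import numpy as np

# Hypothetical 3-state chain under a fixed policy pi (numbers invented).
# P_pi[i, j] = P(x' = j | x = i, pi(i)); rows sum to 1.
P_pi = np.array([[0.2, 0.6, 0.2],
                 [0.0, 0.3, 0.7],
                 [0.0, 0.0, 1.0]])
R = np.array([20.0, 60.0, 400.0])  # reward R(x) for each state
gamma = 0.9

# Approach 1: matrix inversion, V = (I - gamma * P_pi)^{-1} R.
V_exact = np.linalg.solve(np.eye(3) - gamma * P_pi, R)

# Approach 2: iterate V_{t+1} = R + gamma * P_pi V_t until the
# infinity-norm change drops below eps (the slide's stopping rule).
V = np.zeros(3)
eps = 1e-6
while True:
    V_next = R + gamma * P_pi @ V
    if np.max(np.abs(V_next - V)) <= eps:
        break
    V = V_next

print(V_exact)  # the two answers agree to within eps / (1 - gamma)
print(V_next)
```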
Another recursion!
- Two time steps: must address a tradeoff
  - a good reward now, vs.
  - a better reward in the future

Unrolling the recursion
- Choose actions that lead to the best value in the long run
- The optimal policy achieves the optimal value V*

Bellman equation [Bellman 60]
- Evaluating a fixed policy π uses the recursion above; computing the optimal value V* uses the Bellman equation:
  V*(x) = max_a [ R(x,a) + γ Σ_{x'} P(x'|x,a) V*(x') ]

Optimal long-term plan
- Optimal value function V*(x) → optimal policy π*(x)
- Optimal policy: π*(x) = argmax_a Q*(x,a), where
  Q*(x,a) = R(x,a) + γ Σ_{x'} P(x'|x,a) V*(x')

Interesting fact: unique value
- Slightly surprising fact: there is only one V* that satisfies the Bellman equation!
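To see the Bellman equation and the greedy policy π*(x) = argmax_a Q*(x,a) in action, here is a minimal value-iteration sketch, applying the Bellman update as a fixed-point iteration (the same trick used for policy evaluation above). The 2-state, 2-action MDP (P, R, and all numbers) is invented for illustration; the update itself is the equation from the Bellman slide.

```python
import numpy as np

# Invented toy MDP: 2 states, 2 actions.
# P[a, x, y] = P(x' = y | x, a); R[x, a] = reward for taking a in x.
P = np.array([[[0.9, 0.1],    # action 0
               [0.4, 0.6]],
              [[0.2, 0.8],    # action 1
               [0.1, 0.9]]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
gamma = 0.9

# Value iteration: repeatedly apply the Bellman update
#   V(x) <- max_a [ R(x,a) + gamma * sum_{x'} P(x'|x,a) V(x') ]
V = np.zeros(2)
for _ in range(1000):
    Q = R + gamma * np.einsum('axy,y->xa', P, V)  # Q(x,a)
    V_next = Q.max(axis=1)
    if np.max(np.abs(V_next - V)) < 1e-8:
        break
    V = V_next

pi_star = Q.argmax(axis=1)  # greedy policy: pi*(x) = argmax_a Q*(x,a)
print(V, pi_star)
```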