Markov Decision Processes (MDPs) (cont.)
Machine Learning – 10701/15781
Carlos Guestrin
Carnegie Mellon University
November 29th, 2007

Markov Decision Process (MDP) Representation
- State space: joint state x of the entire system
- Action space: joint action a = {a1, …, an} for all agents
- Reward function: total reward R(x,a); sometimes the reward can depend on the action
- Transition model: dynamics of the entire system, P(x'|x,a)

Computing the value of a policy
- Vπ(x0) = Eπ[R(x0) + γ R(x1) + γ² R(x2) + γ³ R(x3) + γ⁴ R(x4) + ⋯]
- Discounted value of a state: the value of starting from x0 and continuing with policy π from then on
- A recursion!

Simple approach for computing the value of a policy: iteratively
- Can solve using a simple convergent iterative approach (a.k.a. dynamic programming):
  - Start with some guess V0
  - Iteratively set Vt+1 = R + γ Pπ Vt
  - Stop when ||Vt+1 − Vt||∞ ≤ ε; this guarantees ||Vπ − Vt+1||∞ ≤ ε/(1−γ)

But we want to learn a Policy
- Policy: π(x) = a, i.e., at state x, take action a (for all agents). For example:
  - π(x0) = both peasants get wood
  - π(x1) = one peasant builds a barrack, the other gets gold
  - π(x2) = peasants get gold, footmen attack
- So far, we have only seen how good a given policy is… but how can we choose the best policy?
- Suppose there were only one time step: the world is about to end, so select the action that maximizes immediate reward!

Unrolling the recursion
- Choose actions that lead to the best value in the long run
- The optimal policy achieves the optimal value V*

Bellman equation
- Evaluating policy π: Vπ(x) = R(x, π(x)) + γ Σx' P(x'|x, π(x)) Vπ(x')
- Computing the optimal value V* - the Bellman equation:
  V*(x) = max_a [ R(x,a) + γ Σx' P(x'|x,a) V*(x') ]

Optimal Long-term Plan
- Optimal value function V*(x) gives the optimal policy π*(x):
  π*(x) = argmax_a [ R(x,a) + γ Σx' P(x'|x,a) V*(x') ]

Interesting fact – Unique value
- Slightly surprising fact: there is only one V* that solves the Bellman equation!
  - but there may be many optimal policies that achieve V*
- Surprising fact: optimal policies are good everywhere!

Solving an MDP
- Solve the Bellman equation → optimal value V*(x) → optimal policy π*(x)
- Many algorithms solve the Bellman equations:
  - Policy iteration [Howard '60, Bellman '57]
  - Value iteration [Bellman '57]
  - Linear programming [Manne '60]
  - …
- The Bellman equation is non-linear (because of the max)!

Value iteration (a.k.a. dynamic programming) – the simplest of all
- Start with some guess V0
- Iteratively set: Vt+1(x) = max_a [ R(x,a) + γ Σx' P(x'|x,a) Vt(x') ]
- Stop when ||Vt+1 − Vt||∞ ≤ ε; this guarantees ||V* − Vt+1||∞ ≤ ε/(1−γ)

A simple example
- You run a startup company. In every state you must choose between Saving money or Advertising.
- γ = 0.9
- [State diagram: four states, Poor & Unknown (+0), Poor & Famous (+0), Rich & Unknown (+10), Rich & Famous (+10), each with two actions, S (Save) and A (Advertise); every transition has probability 1 or 1/2.]

Let's compute Vt(x) for our example
- Apply Vt+1(x) = max_a [ R(x,a) + γ Σx' P(x'|x,a) Vt(x') ] with γ = 0.9:

  t   Vt(PU)   Vt(PF)   Vt(RU)   Vt(RF)
  1    0        0       10       10
  2    0        4.5     14.5     19
  3    2.03     6.53    25.08    18.55
  4    3.852   12.20    29.63    19.26
  5    7.22    15.07    32.00    20.40
  6   10.03    17.65    33.58    22.43
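To make the iteration concrete, here is a minimal value iteration sketch in Python for the startup example. A caveat: only the transitions behind the maximizing actions can be read off the table above; the remaining arcs, and which action ("S" or "A") owns each arc, are assumptions reconstructed from the slide's diagram, chosen so they never affect the max.

```python
GAMMA = 0.9
STATES = ["PU", "PF", "RU", "RF"]   # Poor/Rich x Unknown/Famous
R = {"PU": 0.0, "PF": 0.0, "RU": 10.0, "RF": 10.0}

# P[x][a] = list of (probability, next state); partly assumed, see above.
P = {
    "PU": {"S": [(1.0, "PU")],
           "A": [(0.5, "PU"), (0.5, "PF")]},
    "PF": {"S": [(0.5, "PU"), (0.5, "RU")],
           "A": [(1.0, "PF")]},
    "RU": {"S": [(0.5, "RU"), (0.5, "RF")],
           "A": [(0.5, "PU"), (0.5, "PF")]},
    "RF": {"S": [(0.5, "RF"), (0.5, "PU")],
           "A": [(0.5, "PU"), (0.5, "PF")]},
}

def q(V, x, a):
    """One-step lookahead: R(x,a) + gamma * sum_x' P(x'|x,a) V(x')."""
    return R[x] + GAMMA * sum(p * V[nx] for p, nx in P[x][a])

def value_iteration(eps=1e-4):
    V = {x: 0.0 for x in STATES}                # guess V0 = 0
    while True:
        V_new = {x: max(q(V, x, a) for a in P[x]) for x in STATES}
        if max(abs(V_new[x] - V[x]) for x in STATES) <= eps:
            return V_new                        # ||V* - V||inf <= eps/(1-gamma)
        V = V_new

V_star = value_iteration()
pi_star = {x: max(P[x], key=lambda a: q(V_star, x, a)) for x in STATES}
print({x: round(v, 1) for x, v in V_star.items()}, pi_star)
```

With these assumed transitions, printing V_new inside the loop reproduces the table above row for row, and the values converge to roughly V*(PU) ≈ 31.6, V*(PF) ≈ 38.6, V*(RU) ≈ 54.2, V*(RF) ≈ 44.0, with greedy policy "Advertise in Poor & Unknown, Save everywhere else."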
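The contrast with evaluating a fixed policy is worth seeing. Once the max is gone, the recursion Vt+1 = R + γ Pπ Vt from earlier has a linear fixed point, so Vπ can be obtained in closed form by solving (I − γ Pπ) Vπ = Rπ. A NumPy sketch, evaluating the greedy policy found above; the matrix entries follow from the same assumed transitions:

```python
import numpy as np

GAMMA = 0.9
# Rows/columns ordered PU, PF, RU, RF; P_pi[i][j] = P(x_j | x_i, pi(x_i)).
P_pi = np.array([
    [0.5, 0.5, 0.0, 0.0],   # PU, Advertise -> 1/2 PU, 1/2 PF
    [0.5, 0.0, 0.5, 0.0],   # PF, Save      -> 1/2 PU, 1/2 RU
    [0.0, 0.0, 0.5, 0.5],   # RU, Save      -> 1/2 RU, 1/2 RF
    [0.5, 0.0, 0.0, 0.5],   # RF, Save      -> 1/2 RF, 1/2 PU
])
R_pi = np.array([0.0, 0.0, 10.0, 10.0])

# Solve (I - gamma * P_pi) V_pi = R_pi exactly: no iteration needed.
V_pi = np.linalg.solve(np.eye(4) - GAMMA * P_pi, R_pi)
print(dict(zip(["PU", "PF", "RU", "RF"], V_pi.round(1))))
```

This also echoes the "unique value" slide: for a fixed policy the system is linear and has a unique solution; it is the max in the Bellman equation that makes the optimal-value problem non-linear.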
What you need to know
- What's a Markov decision process
  - states, actions, transitions, rewards
- A policy
- Value function for a policy
  - computing Vπ
- Optimal value function and optimal policy
  - Bellman equation
- Solving the Bellman equation with value iteration (other possibilities: policy iteration and linear programming)

Acknowledgment
This lecture contains some material from Andrew Moore's excellent collection of ML tutorials: http://www.cs.cmu.edu/~awm/tutorials

Reinforcement Learning
Machine Learning – 10701/15781
Carlos Guestrin
Carnegie Mellon University
November 29th, 2007

The Reinforcement Learning task
World: You are in state 34. Your immediate reward is 3. You have 3 possible actions.
Robot: I'll take action 2.
World: You are in state 77. Your immediate reward is -7. You have 2 possible actions.
Robot: I'll take action 1.
World: You're in state 34 (again). Your immediate reward is 3. You have 3 possible actions.

Formalizing the (online) reinforcement learning problem
- Given a set of states X and actions A
  - in some versions of the problem, the sizes of X and A are unknown
- Interact with the world at each time step t (a code sketch of this loop follows this list):
  - the world gives state xt and reward rt
  - you give the next action at
- Goal: (quickly) learn a policy that (approximately) maximizes long-term expected discounted reward
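A minimal sketch of this interaction protocol, following the dialogue above. The World interface (reset/step returning a state, a reward, and the number of available actions) is hypothetical scaffolding, not something defined in the lecture, and the agent here is a placeholder that does no learning:

```python
import random

class RandomAgent:
    """Placeholder learner: a real RL agent would update its policy
    from each (state, action, reward, next state) experience here."""
    def act(self, state, reward, n_actions):
        return random.randrange(n_actions)      # "I'll take action 2."

def run(world, agent, horizon=100, gamma=0.9):
    """Interact for `horizon` steps; return the discounted reward accrued."""
    state, reward, n_actions = world.reset()    # "You are in state 34..."
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        action = agent.act(state, reward, n_actions)
        state, reward, n_actions = world.step(action)
        total += discount * reward
        discount *= gamma
    return total
```

The reinforcement learning problem is then to replace RandomAgent with an agent whose act method (quickly) learns, from these interactions alone, a policy that (approximately) maximizes this discounted return.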
The "Credit Assignment" Problem
I'm in state 43, reward = 0, action = 2
I'm in state 39, reward = 0, action = 4
I'm in state 22, reward = 0, action = 1
I'm in state 21, reward = 0, action = 1
I'm in state 21, reward = 0, action = 1
I'm in state 13, reward = 0, action = 2
I'm in state 54, reward = 0, action = 2
I'm in state 26, reward = 100
Yippee! I got to a state with a big reward! But which of my actions along the way actually helped me get there?? This is the Credit Assignment problem.

Exploration-Exploitation tradeoff
- You have visited part of the state space and found a reward of 100. Is this the best you can hope for?
- Exploitation: should I stick with what I know and find a good policy w.r.t. this knowledge?
  - at the risk of missing out on some large reward somewhere
- Exploration: should I look for a region with more reward?
  - at the risk of wasting my time or collecting a lot of negative reward

Two main reinforcement learning approaches
- Model-based approaches:
  - explore the environment, then learn the model (P(x'|x,a) and R(x,a)) (almost) everywhere
  - use the model to plan a policy, MDP-style
  - approach leads to the strongest theoretical results
  - works quite well in practice when state