Markov Decision Processes (MDPs) (cont.)
Machine Learning – 10701/15781
Carlos Guestrin
Carnegie Mellon University
November 29th, 2007

Markov Decision Process (MDP) Representation
- State space: joint state x of the entire system
- Action space: joint action a = {a1, …, an} for all agents
- Reward function: total reward R(x,a); sometimes the reward depends on the action as well as the state
- Transition model: dynamics of the entire system, P(x'|x,a)

Computing the value of a policy
- Discounted value of a state: the value of starting from x0 and continuing with policy π from then on:
  Vπ(x0) = Eπ[R(x0) + γ R(x1) + γ² R(x2) + γ³ R(x3) + γ⁴ R(x4) + …]
- A recursion!

Simple approach for computing the value of a policy: iteratively
- Can solve using a simple convergent iterative approach (a.k.a. dynamic programming), sketched in code below:
  - Start with some guess V0.
  - Iteratively set V_{t+1} = R + γ Pπ V_t.
  - Stop when ||V_{t+1} − V_t||∞ ≤ ε; this guarantees ||Vπ − V_{t+1}||∞ ≤ ε/(1−γ).
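To make the update concrete, here is a minimal sketch of this iteration in Python with NumPy. The function name, the reward vector R (rewards under π stacked into a vector), and the policy transition matrix P_pi are illustrative assumptions, not notation from the slides:

```python
import numpy as np

def evaluate_policy(R, P_pi, gamma=0.9, eps=1e-8):
    """Iterative policy evaluation: V_{t+1} = R + gamma * P_pi @ V_t.

    R    : (n,) vector, R[x] = reward collected in state x under policy pi
    P_pi : (n, n) matrix, P_pi[x, x'] = P(x' | x, pi(x))
    """
    V = np.zeros(len(R))                        # the initial guess V_0
    while True:
        V_next = R + gamma * P_pi @ V           # one backup of the recursion
        if np.max(np.abs(V_next - V)) <= eps:   # ||V_{t+1} - V_t||_inf <= eps
            return V_next                       # within eps/(1-gamma) of V^pi
        V = V_next
```

Since the policy is fixed, V = R + γ Pπ V is a linear system, so one could also solve it exactly as V = (I − γ Pπ)⁻¹ R; the iteration above is the dynamic-programming route the slides describe.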
But we want to learn a Policy
- Policy: π(x) = a; at state x, take action a (for all agents)
- Example: π(x0) = both peasants get wood; π(x1) = one peasant builds barrack, other gets gold; π(x2) = peasants get gold, footmen attack
- So far, we have seen how to measure how good a policy is… but how can we choose the best policy?
- Suppose there were only one time step: the world is about to end! Then simply select the action that maximizes the immediate reward.

Unrolling the recursion
- Choose actions that lead to the best value in the long run
- The optimal policy achieves the optimal value V*

Bellman equation
- Evaluating a fixed policy π:
  Vπ(x) = R(x, π(x)) + γ ∑_{x'} P(x'|x, π(x)) Vπ(x')
- Computing the optimal value V*, the Bellman equation:
  V*(x) = max_a [ R(x,a) + γ ∑_{x'} P(x'|x,a) V*(x') ]

Optimal Long-term Plan
- From the optimal value function V*(x) we can read off an optimal policy π*(x):
  π*(x) = argmax_a [ R(x,a) + γ ∑_{x'} P(x'|x,a) V*(x') ]

Interesting fact – Unique value
- Slightly surprising fact: there is only one V* that solves the Bellman equation! (There may be many optimal policies that achieve V*.)
- Surprising fact: optimal policies are good everywhere!!!

Solving an MDP
- Solve the Bellman equation ⇒ optimal value V*(x) ⇒ optimal policy π*(x)
- The Bellman equation is non-linear (because of the max over actions)!!!
- Many algorithms solve the Bellman equations:
  - Policy iteration [Howard '60, Bellman '57]
  - Value iteration [Bellman '57]
  - Linear programming [Manne '60]
  - …

Value iteration (a.k.a. dynamic programming) – the simplest of all
- Start with some guess V0
- Iteratively set:
  V_{t+1}(x) = max_a [ R(x,a) + γ ∑_{x'} P(x'|x,a) V_t(x') ]
- Stop when ||V_{t+1} − V_t||∞ ≤ ε; this guarantees ||V* − V_{t+1}||∞ ≤ ε/(1−γ)

A simple example
- You run a startup company. In every state you must choose between Saving money (S) or Advertising (A).
- γ = 0.9
- Four states with rewards: Poor & Unknown (+0), Poor & Famous (+0), Rich & Unknown (+10), Rich & Famous (+10)
- [Transition diagram not recoverable from this extraction: each state offers the two actions S and A, and each transition fires with probability 1 or 1/2.]

Let's compute Vt(x) for our example (γ = 0.9), applying the update V_{t+1}(x) = max_a [ R(x,a) + γ ∑_{x'} P(x'|x,a) V_t(x') ] at every step:

t    Vt(PU)   Vt(PF)   Vt(RU)   Vt(RF)
1    0        0        10       10
2    0        4.5      14.5     19
3    2.03     6.53     18.55    25.08
4    3.85     12.20    19.26    29.63
5    7.22     15.07    20.40    32.00
6    10.03    17.65    22.43    33.58

A generic implementation of this procedure follows.
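Here is a short value iteration sketch in Python. The array shapes and the function name are assumptions for illustration; R and P would be filled in from an MDP such as the startup example (whose exact transition matrix is not recoverable from this extraction):

```python
import numpy as np

def value_iteration(R, P, gamma=0.9, eps=1e-8):
    """V_{t+1}(x) = max_a [ R(x,a) + gamma * sum_x' P(x'|x,a) * V_t(x') ].

    R : (n, m) array, R[x, a] = reward for taking action a in state x
    P : (m, n, n) array, P[a, x, x'] = P(x' | x, a)
    """
    V = np.zeros(R.shape[0])                          # the guess V_0
    while True:
        Q = R + gamma * np.einsum('axy,y->xa', P, V)  # one-step backup for every (x, a)
        V_next = Q.max(axis=1)                        # max over actions
        if np.max(np.abs(V_next - V)) <= eps:         # stop: ||V_{t+1} - V_t||_inf <= eps
            return V_next, Q.argmax(axis=1)           # value plus a greedy optimal policy
        V = V_next
```

The second return value implements π*(x) = argmax_a [ R(x,a) + γ ∑_{x'} P(x'|x,a) V*(x') ] from the Optimal Long-term Plan slide.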
What you need to know
- What a Markov decision process is: states, actions, transitions, rewards; a policy
- Value function for a policy, and how to compute Vπ
- Optimal value function and optimal policy; the Bellman equation
- Solving the Bellman equation with value iteration (other possibilities: policy iteration and linear programming)

Acknowledgment
This lecture contains some material from Andrew Moore's excellent collection of ML tutorials: http://www.cs.cmu.edu/~awm/tutorials

Reinforcement Learning
Machine Learning – 10701/15781
Carlos Guestrin
Carnegie Mellon University
November 29th, 2007

The Reinforcement Learning task
World: You are in state 34. Your immediate reward is 3. You have 3 possible actions.
Robot: I'll take action 2.
World: You are in state 77. Your immediate reward is -7. You have 2 possible actions.
Robot: I'll take action 1.
World: You're in state 34 (again). Your immediate reward is 3. You have 3 possible actions.

Formalizing the (online) reinforcement learning problem
- Given a set of states X and actions A (in some versions of the problem, the sizes of X and A are unknown)
- Interact with the world at each time step t: the world gives state xt and reward rt; you give the next action at
- Goal: (quickly) learn a policy that (approximately) maximizes long-term expected discounted reward

The "Credit Assignment" Problem
I'm in state 43, reward = 0, action = 2
I'm in state 39, reward = 0, action = 4
I'm in state 22, reward = 0, action = 1
I'm in state 21, reward = 0, action = 1
I'm in state 21, reward = 0, action = 1
I'm in state 13, reward = 0, action = 2
I'm in state 54, reward = 0, action = 2
I'm in state 26, reward = 100
Yippee! I got to a state with a big reward! But which of my actions along the way actually helped me get there?? This is the Credit Assignment problem.

Exploration-Exploitation tradeoff
- You have visited part of the state space and found a reward of 100. Is this the best I can hope for???
- Exploitation: should I stick with what I know and find a good policy with respect to this knowledge? At the risk of missing out on some large reward somewhere else.
- Exploration: should I look for a region with more reward? At the risk of wasting my time or collecting a lot of negative reward.

Two main reinforcement learning approaches
- Model-based approaches:
  - Explore the environment, then learn the model (P(x'|x,a) and R(x,a)) (almost) everywhere
  - Use the model to plan a policy, MDP-style (see the sketch after this list)
  - This approach leads to the strongest theoretical results and works quite well in practice when the state space is manageable
- Model-free approaches:
  - Don't learn a model; learn the value function directly
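The model-based recipe above can be made concrete in a few lines. The sketch below is an illustrative assumption, not code from the lecture: it estimates the model from a list of observed (x, a, r, x') tuples, with a small hypothetical smoothing constant so unvisited transitions stay well-defined, and then reuses the value_iteration sketch from earlier to plan:

```python
import numpy as np

def fit_model(transitions, n_states, n_actions, smoothing=1e-3):
    """Estimate P(x'|x,a) and R(x,a) from observed (x, a, r, x') tuples."""
    counts = np.full((n_actions, n_states, n_states), smoothing)  # smoothed visit counts
    r_sum = np.zeros((n_states, n_actions))                       # summed observed rewards
    r_cnt = np.zeros((n_states, n_actions))                       # visits to each (x, a)
    for x, a, r, x_next in transitions:
        counts[a, x, x_next] += 1.0
        r_sum[x, a] += r
        r_cnt[x, a] += 1.0
    P_hat = counts / counts.sum(axis=2, keepdims=True)  # normalize rows into distributions
    R_hat = r_sum / np.maximum(r_cnt, 1.0)              # average reward per (x, a)
    return R_hat, P_hat

# Plan as if the estimated model were the true MDP
# (`experience` is a hypothetical list of (x, a, r, x') tuples):
# R_hat, P_hat = fit_model(experience, n_states=64, n_actions=4)
# V_hat, pi_hat = value_iteration(R_hat, P_hat, gamma=0.9)
```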