Markov Decision Processes (MDPs)
Machine Learning – 10701/15781
Carlos Guestrin, Carnegie Mellon University
May 2nd, 2007
©2005-2007 Carlos Guestrin

Joint Decision Space
- State space: joint state x of the entire system
- Action space: joint action a = {a1, ..., an} for all agents
- Reward function: total reward R(x,a); sometimes the reward can depend on the action
- Transition model: dynamics of the entire system, P(x'|x,a)
This is the Markov Decision Process (MDP) representation.

Policy
- Policy: π(x) = a, i.e., at state x, take action a (for all agents)
- Examples:
  - π(x0) = both peasants get wood
  - π(x1) = one peasant builds barrack, other gets gold
  - π(x2) = peasants get gold, footmen attack

Computing the value of a policy
- V^π(x0) = E_π[ R(x0) + γ R(x1) + γ^2 R(x2) + γ^3 R(x3) + γ^4 R(x4) + ... ]
- Discounted value of a state: the value of starting from x0 and continuing with policy π from then on
- A recursion!

Solving an MDP
- Solve the Bellman equation to get the optimal value V*(x) and the optimal policy π*(x):
  V*(x) = max_a [ R(x,a) + γ ∑_{x'} P(x'|x,a) V*(x') ]
- The Bellman equation is non-linear!
- Many algorithms solve the Bellman equations:
  - Policy iteration [Howard '60, Bellman '57]
  - Value iteration [Bellman '57]
  - Linear programming [Manne '60]
  - ...

Value iteration (a.k.a. dynamic programming) – the simplest of all
- Start with some guess V_0
- Iteratively set
  V_{t+1}(x) = max_a [ R(x,a) + γ ∑_{x'} P(x'|x,a) V_t(x') ]
- Stop when ||V_{t+1} - V_t||_∞ ≤ ε, which guarantees ||V* - V_{t+1}||_∞ ≤ ε/(1-γ)
- (A code sketch follows at the end of this section.)

Policy iteration – another approach for computing π*
- Start with some guess for a policy, π_0
- Iteratively:
  - evaluate the policy:
    V^{π_t}(x) = R(x, π_t(x)) + γ ∑_{x'} P(x'|x, π_t(x)) V^{π_t}(x')
  - improve the policy:
    π_{t+1}(x) = argmax_a [ R(x,a) + γ ∑_{x'} P(x'|x,a) V^{π_t}(x') ]
- Stop when the policy stops changing
  - usually happens in about 10 iterations
  - or when ||V_{t+1} - V_t||_∞ ≤ ε, which guarantees ||V* - V_{t+1}||_∞ ≤ ε/(1-γ)

Policy Iteration & Value Iteration: which is best?
- It depends:
  - Lots of actions? Choose policy iteration.
  - Already have a fair policy? Policy iteration.
  - Few actions, acyclic? Value iteration.
- Best of both worlds: modified policy iteration [Puterman], a simple mix of value iteration and policy iteration
- A third approach: linear programming

LP Solution to MDP
- The value function can be computed by linear programming [Manne '60]:
  minimize   ∑_x V(x)
  subject to V(x) ≥ R(x,a) + γ ∑_{x'} P(x'|x,a) V(x')   for every state x and action a
- One variable V(x) for each state
- One constraint for each state x and action a
- Polynomial time solution

What you need to know
- What a Markov decision process is: states, actions, transitions, rewards
- A policy; the value function for a policy; computing V^π
- The optimal value function and optimal policy: the Bellman equation
- Solving the Bellman equation with value iteration, policy iteration, and linear programming
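To make the value-iteration update above concrete, here is a minimal sketch on a small tabular MDP. The array layout (P as an S x A x S tensor, R as an S x A matrix) and the tiny two-state example at the bottom are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, eps=1e-6):
    """Tabular value iteration.

    P: shape (S, A, S), P[x, a, y] = P(x' = y | x, a)
    R: shape (S, A),    R[x, a]    = immediate reward
    Returns an (approximately) optimal value function and a greedy policy.
    """
    S, A = R.shape
    V = np.zeros(S)                              # arbitrary initial guess V_0
    while True:
        # Bellman backup: Q[x, a] = R(x, a) + gamma * sum_x' P(x'|x, a) V_t(x')
        Q = R + gamma * (P @ V)                  # shape (S, A)
        V_new = Q.max(axis=1)                    # V_{t+1}(x) = max_a Q[x, a]
        if np.max(np.abs(V_new - V)) <= eps:     # stop when ||V_{t+1} - V_t||_inf <= eps
            return V_new, Q.argmax(axis=1)       # greedy policy from the last backup
        V = V_new

# Tiny 2-state, 2-action example; the numbers are made up for illustration.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
V_star, pi_star = value_iteration(P, R, gamma=0.9)
```

The stopping test is the one on the slide: iterate until successive value functions differ by at most ε in the infinity norm.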
Reinforcement Learning
Machine Learning – 10701/15781
Carlos Guestrin, Carnegie Mellon University
May 2nd, 2007
©2005-2007 Carlos Guestrin

The Reinforcement Learning task
World: You are in state 34. Your immediate reward is 3. You have 3 possible actions.
Robot: I'll take action 2.
World: You are in state 77. Your immediate reward is -7. You have 2 possible actions.
Robot: I'll take action 1.
World: You're in state 34 (again). Your immediate reward is 3. You have 3 possible actions.

Formalizing the (online) reinforcement learning problem
- Given a set of states X and actions A
  - in some versions of the problem, the sizes of X and A are unknown
- Interact with the world at each time step t:
  - the world gives state x_t and reward r_t
  - you give the next action a_t
- Goal: (quickly) learn a policy that (approximately) maximizes long-term expected discounted reward

The "Credit Assignment" Problem
  I'm in state 43, reward = 0, action = 2
  I'm in state 39, reward = 0, action = 4
  I'm in state 22, reward = 0, action = 1
  I'm in state 21, reward = 0, action = 1
  I'm in state 21, reward = 0, action = 1
  I'm in state 13, reward = 0, action = 2
  I'm in state 54, reward = 0, action = 2
  I'm in state 26, reward = 100
Yippee! I got to a state with a big reward! But which of my actions along the way actually helped me get there?? This is the Credit Assignment problem.

Exploration-Exploitation tradeoff
- You have visited part of the state space and found a reward of 100
  - is this the best I can hope for?
- Exploitation: should I stick with what I know and find a good policy w.r.t. this knowledge?
  - at the risk of missing out on some large reward somewhere
- Exploration: should I look for a region with more reward?
  - at the risk of wasting my time or collecting a lot of negative reward

Two main reinforcement learning approaches
- Model-based approaches:
  - explore the environment → learn the model (P(x'|x,a) and R(x,a)) (almost) everywhere
  - use the model to plan a policy, MDP-style
  - leads to the strongest theoretical results
  - works quite well in practice when the state space is manageable
- Model-free approaches:
  - don't learn a model → learn the value function or policy directly
  - leads to weaker theoretical results
  - often works well when the state space is large

Rmax – a model-based approach

Given a dataset – learn model
- Dataset: states, actions and rewards observed while interacting with the world
- Learn the reward function: R(x,a)
- Learn the transition model: P(x'|x,a)
- Given data, learn the (MDP) representation (see the sketch after these notes)

Some challenges in model-based RL 1: planning with insufficient information
- Model-based approach:
  - estimate R(x,a) and P(x'|x,a)
  - obtain a policy by value or policy iteration, or linear programming
  - no credit assignment problem: when learning a model, the planning algorithm takes care of "assigning" credit
- What do you plug in when you don't have enough information about a state?
  - don't know the reward at a particular state?
    - plug in the smallest reward (Rmin)?
    - plug in the largest reward (Rmax)?
  - don't know a particular transition probability?

Some challenges in model-based RL 2: exploration-exploitation tradeoff
- A state may be very hard to reach
  - waste a lot of time trying to learn rewards and transitions for this state
  - after much effort, the state may be useless
- A strong advantage of a ...
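As a companion to the "Given a dataset – learn model" slide above, here is a hedged sketch of the estimation step: empirical means for R(x,a) and empirical frequencies for P(x'|x,a). The (x, a, r, x') tuple format of the dataset and the array shapes are assumptions for illustration; the slides do not fix a data format.

```python
import numpy as np

def estimate_model(dataset, n_states, n_actions):
    """Estimate R(x, a) and P(x'|x, a) from observed transitions.

    dataset: iterable of (x, a, r, x_next) tuples gathered while interacting
             with the world (this tuple format is an assumption, not from
             the slides).
    Returns empirical reward and transition arrays, plus visit counts.
    """
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sum = np.zeros((n_states, n_actions))
    for x, a, r, x_next in dataset:
        counts[x, a, x_next] += 1
        reward_sum[x, a] += r

    visits = counts.sum(axis=2)                  # N(x, a): times (x, a) was tried
    # Empirical mean reward and empirical transition frequencies;
    # unvisited (x, a) pairs are left at zero here (but see the Rmax sketch below).
    R_hat = np.divide(reward_sum, visits,
                      out=np.zeros_like(reward_sum), where=visits > 0)
    P_hat = np.divide(counts, visits[:, :, None],
                      out=np.zeros_like(counts), where=visits[:, :, None] > 0)
    return R_hat, P_hat, visits
```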
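The last slide above raises the question of what to plug in for states you know too little about. The Rmax idea named earlier is to be optimistic: treat any under-visited state-action pair as if it earned the largest possible reward, so that planning on the learned model drives the agent to go explore it. Below is a rough sketch under stated assumptions: the visit threshold m and the self-loop for unknown pairs are common Rmax-style simplifications, not details given on these slides; it reuses value_iteration and estimate_model from the earlier sketches.

```python
import numpy as np

def rmax_model(R_hat, P_hat, visits, r_max, m=5):
    """Build an optimistic model from the empirical one.

    Any (x, a) visited fewer than m times is treated as giving reward r_max
    and (here) looping back to x, so a planner run on this model makes
    unknown pairs look attractive and the greedy policy seeks them out.
    The threshold m and the self-loop are illustrative choices.
    """
    S, A = R_hat.shape
    R_opt, P_opt = R_hat.copy(), P_hat.copy()
    for x in range(S):
        for a in range(A):
            if visits[x, a] < m:
                R_opt[x, a] = r_max
                P_opt[x, a, :] = 0.0
                P_opt[x, a, x] = 1.0          # self-loop for unknown (x, a)
    return R_opt, P_opt

# One round of the loop: estimate a model from data, make it optimistic,
# plan with value iteration, then act greedily to gather more data.
# R_hat, P_hat, visits = estimate_model(dataset, n_states, n_actions)
# R_opt, P_opt = rmax_model(R_hat, P_hat, visits, r_max=100.0)
# V, policy = value_iteration(P_opt, R_opt, gamma=0.9)
```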