©2005-2007 Carlos Guestrin

Slide 1: Markov Decision Processes (MDPs)
Machine Learning – 10701/15781
Carlos Guestrin
Carnegie Mellon University
May 2nd, 2007

Slide 2: Joint Decision Space

Markov Decision Process (MDP) representation:
- State space: joint state x of the entire system
- Action space: joint action a = {a1, …, an} for all agents
- Reward function: total reward R(x,a); sometimes the reward can depend on the action
- Transition model: dynamics of the entire system, P(x'|x,a)

Slide 3: Policy

Policy: π(x) = a. At state x, take action a for all agents.
- π(x0) = both peasants get wood
- π(x1) = one peasant builds barrack, other gets gold
- π(x2) = peasants get gold, footmen attack

Slide 4: Computing the value of a policy

Vπ(x0) = Eπ[R(x0) + γ R(x1) + γ² R(x2) + γ³ R(x3) + γ⁴ R(x4) + ⋯]
- Discounted value of a state: value of starting from x0 and continuing with policy π from then on
- A recursion!

Slide 5: Solving an MDP

Solve the Bellman equation → optimal value V*(x) → optimal policy π*(x):

    V*(x) = max_a [ R(x,a) + γ Σ_{x'} P(x'|x,a) V*(x') ]

The Bellman equation is non-linear! Many algorithms solve the Bellman equations:
- Policy iteration [Howard '60, Bellman '57]
- Value iteration [Bellman '57]
- Linear programming [Manne '60]
- …

Slide 6: Value iteration (a.k.a. dynamic programming), the simplest of all

- Start with some guess V0
- Iteratively apply:

    V_{t+1}(x) = max_a [ R(x,a) + γ Σ_{x'} P(x'|x,a) V_t(x') ]

- Stop when ||V_{t+1} − V_t||_∞ ≤ ε; this guarantees ||V* − V_{t+1}||_∞ ≤ ε/(1−γ)

Slide 7: Policy iteration, another approach for computing π*

- Start with some guess for a policy π0
- Iteratively:
  - evaluate the policy:

        V^{π_t}(x) = R(x, π_t(x)) + γ Σ_{x'} P(x'|x, π_t(x)) V^{π_t}(x')

  - improve the policy:

        π_{t+1}(x) = argmax_a [ R(x,a) + γ Σ_{x'} P(x'|x,a) V^{π_t}(x') ]

- Stop when the policy stops changing (usually happens in about 10 iterations), or when ||V_{t+1} − V_t||_∞ ≤ ε, which guarantees ||V* − V_{t+1}||_∞ ≤ ε/(1−γ)

Slide 8: Policy Iteration & Value Iteration: which is best?

It depends.
- Lots of actions?
Choose Policy Iteration.
- Already got a fair policy? Policy Iteration
- Few actions, acyclic? Value Iteration
- Best of both worlds: Modified Policy Iteration [Puterman], a simple mix of value iteration and policy iteration
- A 3rd approach: Linear Programming

Slide 9: LP Solution to MDP

Value computed by linear programming [Manne '60]:

    minimize:   Σ_x V(x)
    subject to: V(x) ≥ R(x,a) + γ Σ_{x'} P(x'|x,a) V(x'),  ∀x, a

- One variable V(x) for each state x
- One constraint for each state x and action a
- Polynomial time solution

Slide 10: What you need to know

- What's a Markov decision process: states, actions, transitions, rewards
- A policy; the value function for a policy; computing Vπ
- Optimal value function and optimal policy; the Bellman equation
- Solving the Bellman equation with value iteration, policy iteration, and linear programming

Slide 11: Reinforcement Learning
Machine Learning – 10701/15781
Carlos Guestrin
Carnegie Mellon University
May 2nd, 2007

Slide 12: The Reinforcement Learning task

World: You are in state 34. Your immediate reward is 3. You have 3 possible actions.
Robot: I'll take action 2.
World: You are in state 77. Your immediate reward is -7. You have 2 possible actions.
Robot: I'll take action 1.
World: You're in state 34 (again). Your immediate reward is 3. You have 3 possible actions.

Slide 13: Formalizing the (online) reinforcement learning problem

- Given a set of states X and actions A (in some versions of the problem, the sizes of X and A are unknown)
- Interact with the world at each time step t: the world gives state xt and reward rt; you give the next action at
- Goal: (quickly) learn a policy that (approximately) maximizes long-term expected discounted reward

Slide 14: The "Credit Assignment" Problem

Yippee! I got to a state with a big reward!
But which of my actions along the way actually helped me get there?
This is the Credit Assignment problem.

(The slide shows a trajectory: "I'm in state 43, reward = 0, action = 2", followed by several more zero-reward steps through states 39, 22, 21, 21, 13, and 54, ending with reward = 100 in state 26.)

Slide 15: Exploration-Exploitation tradeoff

You have visited part of the state space and found a reward of 100. Is this the best I can hope for?
- Exploitation: should I stick with what I know and find a good policy w.r.t. this knowledge? (at the risk of missing out on some large reward somewhere)
- Exploration: should I look for a region with more reward? (at the risk of wasting my time or collecting a lot of negative reward)

Slide 16: Two main reinforcement learning approaches

- Model-based approaches:
  - explore the environment → learn a model (P(x'|x,a) and R(x,a)) (almost) everywhere
  - use the model to plan a policy, MDP-style
  - leads to the strongest theoretical results
  - works quite well in practice when the state space is manageable
- Model-free approaches:
  - don't learn a model → learn the value function or policy directly
  - leads to weaker theoretical results
  - often works well when the state space is large

Slide 17: Rmax – A model-based approach

Slide 18: Given a dataset – learn model

Given data, learn the (MDP) representation:
- Dataset
- Learn the reward function: R(x,a)
- Learn the transition model: P(x'|x,a)

Slide 19: Some challenges in model-based RL 1: Planning with insufficient information

- Model-based approach: estimate R(x,a) & P(x'|x,a); obtain a policy by value or policy iteration, or linear programming
- No credit assignment problem: we learn a model, and the planning algorithm takes care of "assigning" credit
- What do you plug in when you don't have enough information about a state?
  - don't know the reward at a particular state? plug in the smallest reward (Rmin)? plug in the largest reward (Rmax)?
  - don't know a particular transition probability?

Slide 20: Some challenges in model-based RL 2: Exploration-Exploitation tradeoff

- A state may be very hard to reach
  - waste a lot of time trying to learn rewards
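The model-based recipe above (estimate R(x,a) and P(x'|x,a) from data, then plan MDP-style by value iteration) can be sketched as follows. This is a minimal illustration, not code from the slides: the (x, a, r, x') dataset format, the Rmax-style optimistic fill-in for unvisited state-action pairs, and all function names are assumptions.

```python
from collections import defaultdict

def learn_model(dataset, states, actions, r_max=1.0):
    """Estimate R(x,a) and P(x'|x,a) from (x, a, r, x') transitions.
    Unvisited (x,a) pairs get an optimistic Rmax-style fill-in:
    reward r_max and a self-loop transition."""
    counts = defaultdict(lambda: defaultdict(int))
    reward_sums = defaultdict(float)
    visits = defaultdict(int)
    for x, a, r, x2 in dataset:
        counts[(x, a)][x2] += 1
        reward_sums[(x, a)] += r
        visits[(x, a)] += 1
    R, P = {}, {}
    for x in states:
        for a in actions:
            n = visits[(x, a)]
            if n == 0:  # never tried this (x,a): be optimistic
                R[(x, a)] = r_max
                P[(x, a)] = {x: 1.0}
            else:       # empirical mean reward and transition frequencies
                R[(x, a)] = reward_sums[(x, a)] / n
                P[(x, a)] = {x2: c / n for x2, c in counts[(x, a)].items()}
    return R, P

def value_iteration(R, P, states, actions, gamma=0.9, eps=1e-6):
    """V_{t+1}(x) = max_a [ R(x,a) + gamma * sum_x' P(x'|x,a) V_t(x') ],
    stopping when the sup-norm change drops below eps."""
    V = {x: 0.0 for x in states}
    while True:
        V_new = {x: max(R[(x, a)]
                        + gamma * sum(p * V[x2] for x2, p in P[(x, a)].items())
                        for a in actions)
                 for x in states}
        if max(abs(V_new[x] - V[x]) for x in states) <= eps:
            return V_new
        V = V_new
```

With the optimistic fill-in, never-tried actions look maximally rewarding, so a planner run on the learned model is drawn toward unexplored states; this is the core idea behind Rmax-style approaches to the exploration-exploitation tradeoff above.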
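The policy-iteration loop described earlier (evaluate the current policy, then greedily improve it, until the policy stops changing) can be sketched in the same style. This is a hedged sketch under assumed conventions: R and P are dicts keyed by (state, action), evaluation is done iteratively rather than by solving the linear system exactly, and the tiny deterministic MDP in the test is illustrative only.

```python
def policy_iteration(R, P, states, actions, gamma=0.9, eval_eps=1e-8):
    """Alternate policy evaluation and greedy improvement until stable."""
    pi = {x: actions[0] for x in states}  # arbitrary initial policy
    while True:
        # Evaluate: V(x) = R(x, pi(x)) + gamma * sum_x' P(x'|x, pi(x)) V(x')
        V = {x: 0.0 for x in states}
        while True:
            V_new = {x: R[(x, pi[x])]
                        + gamma * sum(p * V[x2]
                                      for x2, p in P[(x, pi[x])].items())
                     for x in states}
            done = max(abs(V_new[x] - V[x]) for x in states) <= eval_eps
            V = V_new
            if done:
                break
        # Improve: pi(x) = argmax_a [ R(x,a) + gamma * sum_x' P(x'|x,a) V(x') ]
        pi_new = {x: max(actions,
                         key=lambda a, x=x: R[(x, a)]
                             + gamma * sum(p * V[x2]
                                           for x2, p in P[(x, a)].items()))
                  for x in states}
        if pi_new == pi:      # policy stopped changing: done
            return pi, V
        pi = pi_new
```

Consistent with the slides' comparison, each iteration does more work than a value-iteration sweep (a full policy evaluation), but the outer loop typically converges in very few iterations.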