
Lecture 20 • 6.825 Techniques in Artificial Intelligence

Markov Decision Processes
• Framework
• Markov chains
• MDPs
• Value iteration
• Extensions

Now we're going to think about how to do planning in uncertain domains. It's an extension of decision theory, but focused on making long-term plans of action. We'll start by laying out the basic framework, then look at Markov chains, which are a simple case. Then we'll explore what it means to have an optimal plan for an MDP, and look at an algorithm, called value iteration, for finding optimal plans. We'll finish by looking at some of the major weaknesses of this approach and seeing how they can be addressed.

MDP Framework

A Markov decision process (known as an MDP) is a discrete-time state-transition system. It can be described formally with four components.

• S : states

First, it has a set of states. These states will play the role of outcomes in the decision-theoretic approach we saw last time, as well as providing whatever information is necessary for choosing actions. For a robot navigating through a building, the state might be the room it's in, or the x,y coordinates. For a factory controller, it might be the temperature and pressure in the boiler. In most of our discussion, we'll assume that the set of states is finite and not too big to enumerate in our computer.

• A : actions

Next, we have a set of actions. These are chosen, in the simple case, from a small finite set.

• Pr(s_{t+1} | s_t, a_t) : transition probabilities

The transition probabilities describe the dynamics of the world. They play the role of the next-state function in a problem-solving search, except that every state is thought to be a possible consequence of taking an action in a state. So, we specify, for each state s_t and action a_t, the probability that the next state will be s_{t+1}. You can think of this as being represented as a set of matrices, one for each action. Each matrix is square, indexed in both dimensions by states. If you sum, over all states s_i, the probability that s_i is the next state given a particular previous state and action, the result is 1.

Pr(s_{t+1} | s_t, a_t) = Pr(s_{t+1} | s_0, …, s_t, a_0, …, a_t)   (Markov property)

These processes are called Markov because they have what is known as the Markov property: given the current state and action, the next state is independent of all the previous states and actions. The current state captures all that is relevant about the world in order to predict what the next state will be.

• R(s) : real-valued reward

Finally, there is a real-valued reward function on states. You can think of this as a short-term utility function. How good is it, from the agent's perspective, to be in state s? That's R(s).
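As a concrete illustration of these four components, here is a minimal sketch in Python, not taken from the lecture: the state names, action names, transition probabilities, and rewards are made up for illustration. It stores one square transition matrix per action and checks that each row sums to 1, as described above.

```python
import numpy as np

# A small, made-up MDP: the components S, A, Pr(s_{t+1} | s_t, a_t), and R(s).
states = ["room1", "room2", "room3"]          # S : states
actions = ["stay", "move"]                    # A : actions

# One square |S| x |S| matrix per action; entry [i, j] is
# Pr(next state = states[j] | current state = states[i], that action).
transitions = {
    "stay": np.array([[0.9, 0.1, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.1, 0.9]]),
    "move": np.array([[0.1, 0.8, 0.1],
                      [0.1, 0.1, 0.8],
                      [0.8, 0.1, 0.1]]),
}

# R(s) : real-valued reward for being in each state.
rewards = np.array([0.0, 1.0, 10.0])

# Each row must be a probability distribution over next states:
# summing Pr(s_i | s, a) over all states s_i gives 1.
for a, P in transitions.items():
    assert np.allclose(P.sum(axis=1), 1.0), f"rows of {a} must sum to 1"
```

A policy, discussed next, is then just a choice of one of these actions for each state.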
Find a policy π : S → A

The result of classical planning was a plan. A plan was either an ordered list of actions or a partially ordered set of actions, meant to be executed without reference to the state of the environment. When we looked at conditional planning, we considered building plans with branches in them, that observed something about the state of the world and acted differently depending on the observation. In an MDP, the assumption is that you could potentially go from any state to any other state in one step. And so, to be prepared, it is typical to compute a whole policy, rather than a simple plan. A policy is a mapping from states to actions. It says: no matter what state you happen to find yourself in, here is the action that it's best to take now. Because of the Markov property, we'll find that the choice of action only needs to depend on the current state (and possibly the current time), but not on any of the previous states.

Maximize:
• Myopic: E[r_t | π, s_t] for all s

So, what is our criterion for finding a good policy? In the simplest case, we'll try to find the policy that maximizes, for each state, the expected reward of executing that policy in that state. This is a particularly easy problem, because it completely decomposes into a set of decision problems: for each state, find the single action that maximizes expected reward. This is exactly the single-action decision-theory problem that we discussed before.

• Finite horizon: E[∑_{t=0}^{k} r_t | π, s_0]

It's not usually good to be so shortsighted, though! Let's think a little bit farther ahead. We can consider policies that are "finite-horizon optimal" for a particular horizon k. That means that we should find a policy that, for every initial state s_0, results in the maximal expected sum of rewards from times 0 to k.

  – Non-stationary policy: depends on time

So, if the horizon is 2, we're maximizing over our reward today and tomorrow. If it's 300, then we're looking a lot farther ahead. We might start out by doing some actions that have very low rewards initially, because by doing them we are likely to be taken to states that will ultimately result in high reward. (Students are often familiar with the necessity of passing through low-reward states in order to get to states with higher reward!) Because we will, in general, want to choose actions differently at the very …
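To illustrate the finite-horizon criterion, here is a minimal sketch of computing a non-stationary policy by working backward from the horizon (simple backward induction, not the value-iteration algorithm the lecture develops later). It assumes the made-up states, actions, transitions, and rewards variables from the earlier sketch; the horizon k and all names are illustrative.

```python
import numpy as np

def finite_horizon_policy(states, actions, transitions, rewards, k):
    """Backward induction for the finite-horizon criterion
    E[sum_{t=0}^{k} r_t | pi, s_0]. Returns a non-stationary policy:
    policy[t][i] is the action to take in state i at time t."""
    value = rewards.copy()                   # value of each state at the final time step
    policy = [None] * k                      # one decision rule per remaining step
    for t in range(k - 1, -1, -1):           # work backward from the horizon
        # For each action a: R(s) + sum_s' Pr(s' | s, a) * V_{t+1}(s'), for every state s.
        q = np.stack([rewards + transitions[a] @ value for a in actions])
        best = q.argmax(axis=0)              # index of the best action for each state
        policy[t] = [actions[b] for b in best]
        value = q.max(axis=0)                # best achievable expected sum from each state
    return policy, value

# Example with the made-up MDP from the earlier sketch:
# policy, value = finite_horizon_policy(states, actions, transitions, rewards, k=3)
```

Note that the returned policy has a separate decision rule for each time step, which is what it means for the optimal finite-horizon policy to be non-stationary.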

