MIT 16.412J - Solving POMDPs through Macro Decomposition

Solving POMDPs through Macro Decomposition
Larry Bush, Tony Jimenez, Brian Bairstow

POMDPs are a planning technique that accounts for uncertainty in the world. While they have great potential, they are very computationally complex. Macro operators can reduce this complexity.

Outline
• Introduction to POMDPs – Brian Bairstow
• Demonstration of POMDPs – Larry Bush
• Approximating POMDPs with Macro Actions – Tony Jimenez

We will begin with an overview of MDPs and POMDPs, then give a visual demonstration of simple MDPs and POMDPs. Finally we will discuss our advanced topic in POMDPs: the approximation of POMDPs using macro actions.

Introduction to POMDPs
• Introduction to POMDPs
  – Markov Decision Processes (MDPs)
  – Value Iteration
  – Partially Observable Markov Decision Processes (POMDPs)
  – Overview of Techniques
• Demonstration of POMDPs
• Approximating POMDPs with Macro Actions

We begin with completely observable Markov Decision Processes, or MDPs, and then discuss value iteration as a method of solving MDPs. This lays the groundwork for Partially Observable Markov Decision Processes, or POMDPs. Finally there will be an overview of methods for solving POMDPs.

Navigation of a Building
• Robot
  – Knows building map
  – Wants to reach goal
  – Uncertainty in actions
• What is the best way to get there?

Imagine a robot (in the lower right of the graphic) trying to navigate a building. It wants to reach a goal (the star in the graphic). The difficulty is that there is uncertainty in its actions; for example, its wheels could slip or catch. This might cause it to run into a wall, which is undesirable. The problem, then, is to find the best way to reach the goal while dealing with uncertainty in actions. How can we solve this problem?

Markov Decision Processes
• Model
  – States, S
  – Actions, A
  – Transition probabilities, p(s, a)
  – Rewards, r(s, a)
• Process
  – Observe state s_t in S
  – Choose action a_t in A
  – Receive reward r_t = r(s_t, a_t)
  – State becomes s_{t+1} according to probabilities p(s, a)
• Goal
  – Create a policy π for choosing actions that maximizes the lifetime reward
• Discount factor γ
  – Value = r_0 + γ·r_1 + γ²·r_2 + …

We can use an MDP to solve the previous problem. An MDP consists of a model with states, actions, transitions, and expected rewards. The states in the previous problem could be positions on the map. States are discrete, so the map would have to be divided into a grid or something similar. The actions could then be to move north, east, south, or west on the map. The transition probabilities give the chance that a certain action from a certain state takes you to each of the other states. For instance, if the robot is commanded to move north, there might be a large probability that it transitions to the next state north, and small probabilities that it ends up east, west, or does not move at all. The reward function gives the expected reward received by taking an action from a state. In the previous problem this could be a large reward for reaching the goal, and a large negative reward for hitting a wall. The process of carrying out an MDP solution is to observe the state at time step t, choose an appropriate action, receive the reward corresponding to that state and action, and change the state according to the transition probabilities. Note that the robot is in exactly one state at a time, that time is discrete, and that all actions take one time step. The goal in solving an MDP is to create a policy (a method for choosing actions) that maximizes the expected lifetime reward. The lifetime reward is the sum of all rewards received, so a policy should not simply maximize immediate reward, but also plan ahead. Future rewards are discounted by a factor for each time step that they lie in the future. This follows the economic principle of the effect of time on value, and it also makes the math work out when calculating lifetime reward: without discounting, it could be possible to receive a small reward over and over to obtain an infinite reward, which is not useful in choosing a policy. A typical discount factor might be 0.9.
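To make the model and the discounted lifetime reward concrete, here is a minimal Python sketch. It is ours, not from the slides: the tiny two-state layout, the states "s0", "s1", "goal", the action "north", and the function names are purely illustrative; only the 0.9 discount factor is taken from the note above.

```python
import random

# Minimal MDP sketch (illustrative only): for each (state, action) pair,
# a list of (probability, next_state, reward) outcomes. The -1 reward on a
# slip stands in for the penalty of bumping a wall; all values are made up.
mdp = {
    ("s0", "north"): [(0.8, "s1", 0.0), (0.2, "s0", -1.0)],
    ("s1", "north"): [(1.0, "goal", 10.0)],
}

def step(state, action):
    """Sample s_{t+1} and the reward r_t = r(s_t, a_t) from the outcome list."""
    outcomes = mdp[(state, action)]
    return random.choices(outcomes, weights=[p for p, _, _ in outcomes])[0][1:]

def lifetime_reward(rewards, gamma=0.9):
    """Discounted lifetime reward: r_0 + gamma*r_1 + gamma^2*r_2 + ..."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# A fixed policy ("always go north") followed for two time steps from s0.
state, rewards = "s0", []
for _ in range(2):
    state, r = step(state, "north")
    rewards.append(r)
print(rewards, lifetime_reward(rewards))
```

A real policy would choose its action based on the current state; the point here is only the bookkeeping of states, rewards, and the discount.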
MDP Model Example
• States A, B, C
• Actions 1, 2
• Transition probabilities p and rewards r in diagram
• C is terminal state

[Figure: state-transition diagram over states A, B, C and actions 1, 2, with each arc labeled by its transition probability p and reward r.]

This is a simple example of an MDP model (unrelated to the previous robot example). There are three states and two actions. The probabilities and rewards are written in the diagram. For example, from state A, if action 1 is chosen, then there is a 70% chance of staying in state A with 0 reward, and a 30% chance of moving to state C with a reward of 1. State C is the terminal state, since there are no actions that move out of state C.

Decision Tree Representation of MDP
[Figure: one-step decision tree rooted at state A, showing the outcomes and rewards of actions 1 and 2, their expected rewards 0.3 and 0.52, and the max Q value 0.52.]

A decision tree is another form of visualization, and allows you to evaluate the values of states. Here is a one-step-horizon decision tree for the simple model shown above. From the starting point A there are two actions, 1 and 2. The action chosen gives probabilities of moving to different states. Notice that the rewards for moving to the states are listed at the leaf nodes. Expected rewards can then be calculated for taking actions; for instance, for action 1 the expected reward is 0.7(0) + 0.3(1) = 0.3. When the expected rewards for the actions are known, the highest-reward path should be followed. This means that in a one-step problem at state A, action 2 should be taken, and state A has a value of 0.52, equal to the expected reward of taking action 2. However, this is only a one-step problem. We have not considered that the state you end up in matters because of future rewards. For instance, we have just discovered that state A has a value of 0.52, but in the tree in the upper right we have evaluated it as having 0 value. Thus we need to look at a larger-horizon problem.

Decision Tree Representation of MDP
[Figure: two-step decision tree rooted at state A, with one-step subtrees at the A and B leaves (max Q values 0.52 and 0.7), total future rewards 0.664 and 0.94 for actions 1 and 2, and max Q value 0.94.]

Now we have a decision tree for a two-step horizon starting at state A. For simplicity a discount value of …
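The decision-tree figures themselves do not survive in this preview, but their numbers can be reproduced. Below is a sketch of ours for the three-state example: the (A, action 1) outcomes are stated in the notes above, while the remaining rows are read off the tree figures and should be treated as a best-effort reconstruction. With the discount set to 1 (the preview cuts off before the slide gives its value), the horizon recursion reproduces the expected rewards 0.3 and 0.52 from the one-step tree and the totals 0.664 and 0.94 from the two-step tree.

```python
# Three-state MDP example: (probability, next_state, reward) outcomes for each
# (state, action) pair. The (A, 1) row is stated in the notes; the other rows
# are reconstructed from the decision-tree figures, so treat them as assumed.
mdp = {
    ("A", 1): [(0.7, "A", 0.0), (0.3, "C", 1.0)],
    ("A", 2): [(0.6, "B", 0.2), (0.4, "C", 1.0)],
    ("B", 1): [(0.3, "A", 0.0), (0.7, "C", 1.0)],
    ("B", 2): [(0.5, "B", 0.2), (0.5, "C", 1.0)],
}
ACTIONS = (1, 2)

def value(state, horizon, gamma=1.0):
    """Best achievable expected reward over `horizon` steps (C is terminal)."""
    if horizon == 0 or state == "C":
        return 0.0
    return max(q_value(state, a, horizon, gamma) for a in ACTIONS)

def q_value(state, action, horizon, gamma=1.0):
    """Expected reward of taking `action` now, then acting optimally after."""
    return sum(p * (r + gamma * value(s_next, horizon - 1, gamma))
               for p, s_next, r in mdp[(state, action)])

# One-step horizon: expected rewards 0.3 and 0.52, so A is worth 0.52.
print([round(q_value("A", a, 1), 3) for a in ACTIONS], round(value("A", 1), 3))
# Two-step horizon: total future rewards 0.664 and 0.94, so A is worth 0.94.
print([round(q_value("A", a, 2), 3) for a in ACTIONS], round(value("A", 2), 3))
```

The same recursion, applied to every state and iterated as the horizon grows, is the idea behind value iteration, which the outline covers next.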

