Markov Decision Processes: Making Decisions in the Presence of Uncertainty
(some of) R&N 16.1-16.6 and R&N 17.1-17.4

Decision Processes: General Description
• Suppose that you own a business. At any time, you know exactly the current state of the business (finances, stock, etc.).
• At any time, you have a choice of several possible actions (advertise, introduce new products, …).
• You cannot predict with certainty the consequence of these actions given the current state of the business, but you have a guess as to the likelihood of the possible outcomes.
• How can you define a policy that guarantees you always choose the action that maximizes expected future profits?
Note: Russell & Norvig, Chapter 17.

Decision Processes: General Description
• Decide what action to take next, given:
– a probability of moving to different states, and
– a way to evaluate the reward of being in different states.
• Applications: robot path planning, travel route planning, elevator scheduling, aircraft navigation, manufacturing processes, network switching & routing.

Example
• Assume that time is discretized into discrete time steps t = 1, 2, … (for example, fiscal years).
• Suppose that the business can be in one of a finite number of states s (this is a major simplification, but let's assume it).
• Suppose that for every state s, we can anticipate a reward R(s) that the business receives for being in that state (in this example, R(s) would be the profit, possibly negative, generated by the business).
• Assume also that R(s) is bounded (R(s) < M for all s), meaning that the business cannot generate more than a certain profit threshold.
• Question: What is the total value of the reward for a particular sequence of states {s1, s2, …} over time?

Example
• Question: What is the total value of the reward for a particular sequence of states {s1, s2, …} over time?
• It is simply the sum of the rewards (possibly negative) that we will receive in the future:
U(s1, s2, …, sn, …) = R(s1) + R(s2) + … + R(sn) + …
• What is wrong with this formula?

Horizon Problem
U(s0, …, sN) = R(s0) + R(s1) + … + R(sN)
• We need to know N, the length of the sequence (finite horizon).
• The sum may be arbitrarily large depending on N.

Horizon Problem
• The problem is that we did not put any limit on the "future", so this sum can be infinite.
• For example, consider the simple case of computing the total future reward if the business remains forever in the same state:
U(s, s, …, s, …) = R(s) + R(s) + … + R(s) + …
which is clearly infinite in general.
• This definition is useless unless we consider a finite time horizon.
• But, in general, we don't have a good way to define such a time horizon.

Discounting
U(s0, …) = R(s0) + γ R(s1) + … + γ^N R(sN) + …
• Discount factor 0 < γ < 1.
• The length of the sequence is arbitrary (infinite horizon).

Discounting
• U(s0, …) = R(s0) + γ R(s1) + … + γ^N R(sN) + …
• The sum always converges if γ < 1 and R(.) is bounded.
• γ close to 0: instant gratification; pay little attention to future reward.
• γ close to 1: extremely conservative; consider profits/losses no matter how far in the future.
• The resulting model is the discounted reward model:
– it prefers expedient solutions (models impatience), and
– it compensates for uncertainty in available time (models mortality).
• Economic example:
– Being promised $10,000 next year is worth only 90% as much as receiving $10,000 right now.
– A payment n years in the future is worth only (0.9)^n of the same payment now.
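To make the discounting formula concrete, here is a minimal Python sketch (not part of the slides; the function name discounted_return and the example values are illustrative assumptions). It sums a finite prefix of R(s0) + γ R(s1) + γ^2 R(s2) + … and shows that, for a constant reward R, the sum approaches R/(1-γ) as the horizon grows:

```python
def discounted_return(rewards, gamma):
    """Sum R(s0) + gamma*R(s1) + gamma^2*R(s2) + ... over a finite reward list."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

# A business that earns a constant profit R = 10 in every state.
gamma = 0.9
print(discounted_return([10] * 5, gamma))     # 5 steps:    ~40.95
print(discounted_return([10] * 1000, gamma))  # 1000 steps: ~100.0
print(10 / (1 - gamma))                       # closed form R/(1-gamma) = 100.0
```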
Actions
• Assume that we also have a finite set of actions a.
• An action a causes a transition from a state s to a state s'.
• In the "business" example, an action may be placing advertising, or selling stock, etc.

The Basic Decision Problem
• Given:
– a set of states S = {s},
– a set of actions A = {a}, where a: S → S,
– a reward function R(.),
– a discount factor γ, and
– a starting state s1.
• Find a sequence of actions such that the resulting sequence of states maximizes the total discounted reward:
U(s0, …) = R(s0) + γ R(s1) + … + γ^N R(sN) + …
• In the "business" example: find a sequence of business decisions that maximizes future profits.

Maze Example: Utility
• Define the reward of being in a state:
– R(s) = -0.04 if s is an empty state,
– R(4,3) = +1 (maximum reward when the goal is reached),
– R(4,2) = -1 (avoid (4,2) as much as possible).
• Define the utility of a sequence of states:
– U(s0, …, sN) = R(s0) + R(s1) + … + R(sN)
• If there is no uncertainty: find the sequence of actions that maximizes the sum of the rewards of the traversed states.
[Figure: 4x3 grid maze with START at (1,1), +1 at (4,3), and -1 at (4,2).]

Maze Example: No Uncertainty
• States: locations in the maze grid.
• Actions: moves up/down and left/right.
• If there is no uncertainty: find the sequence of actions from the current state to the goal (+1) that maximizes utility. We know how to do this using earlier search techniques.

What we are looking for: Policy
• Policy = mapping from states to actions, π(s) = a: which action should I take in each state?
• In the maze example, π(s) associates a motion with a particular location on the grid.
• For any state s, we define the utility U(s) of s as the sum of discounted rewards of the sequence of states starting at s and generated by using the policy π:
U(s) = R(s) + γ R(s1) + γ^2 R(s2) + …
where we move from s to s1 by action π(s), from s1 to s2 by action π(s1), and so on.

Optimal Decision Policy
• Policy = mapping from states to actions, π(s) = a: which action should I take in each state?
• Intuition: π encodes the best action that we can take from any state to maximize future rewards.
• In the maze example, π(s) associates a motion with a particular location on the grid.
• Optimal policy = the policy π* that maximizes the expected utility U(s) of the sequence of states generated by π*, starting at s.
• In the maze example, π*(s) tells us which motion to choose at every cell of the grid to bring us closer to the goal.

Maze Example: No Uncertainty
• π*((1,1)) = UP
• π*((1,3)) = RIGHT
• π*((4,1)) = LEFT

Maze Example: With Uncertainty
• The robot may not execute exactly the action that is commanded, so the outcome of an action is no longer deterministic.
• Uncertainty:
– We know in which state we are (fully observable) …
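For the no-uncertainty maze, the utility of following a fixed policy can be computed by simply walking the grid and summing the rewards of the traversed states. The sketch below is an illustration, not code from the slides: the (column, row) coordinate convention, the helper names, and the particular route encoded in the policy are assumptions, while the reward values (-0.04 per empty state, +1 at (4,3), -1 at (4,2)) and the undiscounted utility definition follow the slides.

```python
# Deterministic (no-uncertainty) version of the 4x3 maze example.
# Coordinates are (column, row); the policy below is one plausible route
# from START = (1, 1) to the +1 goal at (4, 3), chosen for illustration.

REWARD = {(4, 3): +1.0, (4, 2): -1.0}   # terminal rewards
STEP_REWARD = -0.04                      # reward of any empty state
MOVES = {"UP": (0, 1), "DOWN": (0, -1), "LEFT": (-1, 0), "RIGHT": (1, 0)}

policy = {
    (1, 1): "UP", (1, 2): "UP", (1, 3): "RIGHT",
    (2, 3): "RIGHT", (3, 3): "RIGHT",
}

def R(s):
    return REWARD.get(s, STEP_REWARD)

def utility(start, policy, gamma=1.0):
    """U(s) = R(s) + gamma*R(s1) + gamma^2*R(s2) + ... along the policy's path."""
    s, u, t = start, 0.0, 0
    while True:
        u += (gamma ** t) * R(s)
        if s in REWARD:                  # stop once a terminal state is reached
            return u
        dx, dy = MOVES[policy[s]]
        s = (s[0] + dx, s[1] + dy)
        t += 1

print(utility((1, 1), policy))           # five -0.04 steps, then +1  ->  0.8
```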