CMU CS 10601 - lecture

10-601 Machine Learning
Markov decision processes (MDPs)

What's missing in HMMs
• HMMs cannot model important aspects of agent interactions:
- No model for rewards
- No model for actions, which can affect these rewards
• These are issues faced by many applications:
- Agents negotiating deals on the web
- A robot that interacts with its environment

Example: no actions
[State diagram: Graduate student (reward 20), Asst. prof (40), Tenured prof. (100), Google (200), On the street (0), Dead (0), with transition probabilities on the edges.]

Formal definition of MDPs
• A set of states {s_1 ... s_n}
• A set of rewards {r_1 ... r_n} (one reward for each state)
• A set of actions {a_1 ... a_m} (the number of actions can be larger than the number of states)
• Transition probabilities P^k_{i,j} = P(q_{t+1} = s_j | q_t = s_i, a_t = a_k)

Questions
• What is my expected pay if I am in state i?
• What is my expected pay if I am in state i and perform action a?

Solving MDPs
• No actions: value iteration
• With actions: value iteration, policy iteration

Value computation
• An obvious question for such models is: what is the combined expected value of each state?
• What can we expect to earn over our lifetime if we become an Asst. prof.?
• What if we go to industry?
Before we answer this question, we need to define a model for future rewards:
• The value of a current reward is higher than the value of future rewards
- Inflation, confidence
- Example: lottery

Discounted rewards
• The discounted rewards model is specified using a parameter γ
• Total reward = current reward + γ(reward at time t+1) + γ^2(reward at time t+2) + ... + γ^k(reward at time t+k) + ...
• This infinite sum converges if 0 < γ < 1

Determining the total reward in a state
• Define J*(s_i) = the expected discounted sum of rewards when starting at state s_i
• How do we compute J*(s_i)?
J*(s_i) = r_i + γ (p_{i,1} J*(s_1) + p_{i,2} J*(s_2) + ... + p_{i,n} J*(s_n))
The sum factors in the expected pay over all possible transitions out of state i. How can we solve this?

Computing J*(s_i)
J*(s_1) = r_1 + γ (p_{1,1} J*(s_1) + p_{1,2} J*(s_2) + ... + p_{1,n} J*(s_n))
J*(s_2) = r_2 + γ (p_{2,1} J*(s_1) + p_{2,2} J*(s_2) + ... + p_{2,n} J*(s_n))
...
J*(s_n) = r_n + γ (p_{n,1} J*(s_1) + p_{n,2} J*(s_2) + ... + p_{n,n} J*(s_n))
• We have n linear equations with n unknowns
• They can be solved in closed form

Iterative approaches
• Solving in closed form is possible, but may be time consuming
• It also doesn't generalize to non-linear models
• Alternatively, the problem can be solved in an iterative manner
• Let's define J^t(s_i) as the expected discounted reward after t steps
• How can we compute J^t(s_i)?
J^1(s_i) = r_i
J^2(s_i) = r_i + γ Σ_k p_{i,k} J^1(s_k)
J^{t+1}(s_i) = r_i + γ Σ_k p_{i,k} J^t(s_k)
We know how to solve this! Let's fill the dynamic programming table.
But wait ... this is a never-ending task! When do we stop?
Remember, we have a converging sequence. We can stop when
max_i |J^{t-1}(s_i) - J^t(s_i)| < ε
(the infinity norm selects the maximal element)

Example for γ = 0.9
[State diagram: Graduate student (reward 20), Asst. prof (40), Google (200), Dead (0), with transition probabilities on the edges.]

t   J^t(Gr)   J^t(P)   J^t(Goo)   J^t(D)
1   20        40       200        0
2   74        87       362        0
3   141       135      493        0
4   209       182      600        0

J^2(Gr) = 20 + 0.9*(0.6*20 + 0.2*40 + 0.2*200) = 74
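
To make the dynamic programming table above concrete, here is a minimal Python sketch of the no-action value computation (assuming NumPy is available). The abbreviated state names and the transition matrix below are assumptions reconstructed from the example diagram and the worked J^2(Gr) line; they reproduce the table rows above but are not spelled out in full on the slide. Both the closed-form solution of the n linear equations and the iterative updates are shown.

```python
# Minimal sketch: expected discounted reward per state for the no-action example.
# Assumptions (not given in full on the slide): state order Gr, P, Goo, D and the
# transition matrix below, reconstructed so that it reproduces the table above.
import numpy as np

states = ["Gr", "P", "Goo", "D"]
r = np.array([20.0, 40.0, 200.0, 0.0])       # one reward per state
gamma = 0.9                                  # discount factor
P = np.array([                               # P[i, j] = P(next state = j | current state = i)
    [0.6, 0.2, 0.2, 0.0],                    # Graduate student
    [0.0, 0.8, 0.1, 0.1],                    # Asst. prof
    [0.0, 0.0, 0.9, 0.1],                    # Google
    [0.0, 0.0, 0.0, 1.0],                    # Dead (absorbing)
])

# Closed form: the n equations J* = r + gamma * P J*  <=>  (I - gamma * P) J* = r
J_closed = np.linalg.solve(np.eye(len(r)) - gamma * P, r)
print("closed-form J*:", np.round(J_closed, 1))

# Iterative version (dynamic programming), stopping on the infinity norm
J = r.copy()                                 # J^1(s_i) = r_i
for t in range(2, 10_000):
    J_next = r + gamma * P @ J               # J^{t+1}(s_i) = r_i + gamma * sum_k p_ik J^t(s_k)
    if t <= 4:
        print(f"t={t}:", np.round(J_next))   # matches the table rows: 74, 87, 362, 0 at t=2, ...
    if np.max(np.abs(J_next - J)) < 1e-6:    # stop when max_i |J^{t+1}(s_i) - J^t(s_i)| < epsilon
        break
    J = J_next
print("iterative J*:", np.round(J_next, 1))  # agrees with the closed form
```

Running this prints 74, 87, 362, 0 at t = 2 and so on, and the iterative values converge to the closed-form J*.
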
Solving MDPs
• No actions: value iteration
• With actions: value iteration, policy iteration

Adding actions
A Markov decision process has:
• A set of states {s_1 ... s_n}
• A set of rewards {r_1 ... r_n}
• A set of actions {a_1 ... a_m}
• Transition probabilities P^k_{i,j} = P(q_{t+1} = s_j | q_t = s_i, a_t = a_k)

Example: actions
Action A: leave to Google
Action B: stay in academia
[State diagram: Graduate student (reward 20), Asst. prof (40), Tenured prof. (100), Google (200), On the street (0), Dead (0), with separate transition probabilities for actions A and B.]

Questions for MDPs
• Now we have actions
• The question changes to the following: given our current state and the possible actions, what is the best action for us in terms of long-term payoff?

Example: actions
Action A: leave to Google
Action B: stay in academia
[State diagram: Graduate student (20), Asst. prof (40), Google (200), Dead (0); the graduate student chooses between action A and action B, each with its own transition probabilities.]
So should you leave now (right after class), or should you stay at CMU?

Policy
• A policy maps states to actions
• An optimal policy leads to the highest expected returns
• Note that this does not depend on the start state
Example: Gr → B, Go → A, Asst. Pr. → A, Ten. Pr. → B

Solving MDPs with actions
• It can be shown that for every MDP there exists an optimal policy (we won't discuss the proof)
• Such a policy guarantees that no other action is expected to yield a higher payoff

Computing the optimal policy: 1. Modified value iteration
• We can compute it by modifying the value iteration method we discussed
• Define p^k_{i,j} as the probability of transitioning from state i to state j when using action k
• Then we compute, using the probabilities associated with each action k:
J^{t+1}(s_i) = r_i + γ max_k Σ_j p^k_{i,j} J^t(s_j)
• This is also known as Bellman's equation
• Run until convergence
• When the algorithm converges, we have computed the best outcome for each state
• We then associate each state with the action that maximizes its return

Value iteration for γ = 0.9
[State diagram: Graduate student (reward 20), Asst. prof (40), Google (200), Dead (0); the graduate student chooses between action A (leave to Google) and action B (stay in academia).]

t   J^t(Gr)             J^t(P)   J^t(Goo)   J^t(D)
1   20                  40       200        0
2   168 (A) / 51 (B)    87       362        0
3   311 (A) / 120 (B)   135      493        0
4   431 (A) / 189 (B)   182      600        0
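
Here is a matching Python/NumPy sketch of the modified value iteration (Bellman update) above, with policy extraction. In this example only the graduate student faces a real choice between action A (leave to Google) and action B (stay in academia); the other states keep a single set of transitions. The per-action probabilities below are assumptions reconstructed from the worked table (they reproduce the 168/51 split at t = 2), not values stated explicitly on the slide.

```python
# Minimal sketch: modified value iteration (Bellman update) with actions, plus
# policy extraction. Assumption: only the grad student's transitions depend on
# the action; the numbers below are reconstructed to reproduce the table above.
import numpy as np

states = ["Gr", "P", "Goo", "D"]
r = np.array([20.0, 40.0, 200.0, 0.0])
gamma = 0.9

# P[k][i, j] = probability of moving from state i to state j under action k
P = {
    "A": np.array([                          # A: the grad student leaves to Google
        [0.2, 0.0, 0.8, 0.0],
        [0.0, 0.8, 0.1, 0.1],
        [0.0, 0.0, 0.9, 0.1],
        [0.0, 0.0, 0.0, 1.0],
    ]),
    "B": np.array([                          # B: the grad student stays in academia
        [0.3, 0.7, 0.0, 0.0],
        [0.0, 0.8, 0.1, 0.1],
        [0.0, 0.0, 0.9, 0.1],
        [0.0, 0.0, 0.0, 1.0],
    ]),
}
actions = list(P)

J = r.copy()                                 # J^1(s_i) = r_i
for t in range(2, 10_000):
    # One backup per action, then the Bellman update takes the max over actions
    backups = np.stack([r + gamma * P[k] @ J for k in actions])  # shape (n_actions, n_states)
    J_next = backups.max(axis=0)             # J^{t+1}(s_i) = r_i + gamma * max_k sum_j p^k_ij J^t(s_j)
    if t == 2:
        for k, b in zip(actions, backups[:, 0]):
            print(f"t=2, Gr under {k}: {b:.0f}")   # ~168 for A, ~51 for B, as in the table
    if np.max(np.abs(J_next - J)) < 1e-6:    # run until convergence
        break
    J = J_next

# Associate each state with the action that achieves the max (ties are broken
# arbitrarily; for P, Goo, D the two actions are identical here).
policy = {s: actions[i] for s, i in zip(states, backups.argmax(axis=0))}
print("J*:", np.round(J_next, 1))
print("policy:", policy)                     # with these numbers, Gr -> 'A' (leave to Google)
```

With these assumed numbers the backup for action A dominates at every step shown in the table, so at convergence the extracted policy sends the graduate student to Google, which answers the "should you leave right after class?" question above, at least for this toy model.
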

