UT Dallas CS 6375 - 20.mdp

Machine Learning CS6375, Spring 2015
Markov Decision Processes
(Acknowledgment: slides based on Andrew Moore's MDP tutorial.)

Slide 2: Rewards
An assistant professor gets paid, say, 20K per year. How much in total will the A.P. earn in their life?
20 + 20 + 20 + 20 + 20 + ... = infinity
What's wrong with this argument?

Slide 3: Discounted Rewards
A reward (payment) in the future is not worth quite as much as a reward now:
- because of the chance of obliteration
- because of inflation
Example: being promised $10,000 next year is worth only 90% as much as receiving $10,000 right now. Assuming a payment n years in the future is worth only (0.9)^n of a payment now, what is the A.P.'s "future discounted sum of rewards"?

Slide 4: Discount Factors
People in economics and probabilistic decision making do this all the time. The "discounted sum of future rewards" using discount factor γ is:
(reward now) + γ (reward in 1 time step) + γ^2 (reward in 2 time steps) + γ^3 (reward in 3 time steps) + ... (an infinite sum)

Slide 5: Discounting
This sum always converges if γ < 1 and the reward function R is bounded.
- γ close to 0: "instant gratification", don't pay attention to future rewards
- γ close to 1: extremely conservative, consider profits/losses no matter how far in the future
The resulting model is the "discounted reward" model.

Slide 6: The Academic Life
(The slide shows a Markov chain over states A, B, T, S, D; the figure is not included in this preview.)
Define:
J^A = expected discounted future rewards starting in state A
J^B = expected discounted future rewards starting in state B
J^T, J^S, J^D = the same for states T, S, D
How do we compute J^A, J^B, J^T, J^S, J^D?

Slide 7: A Markov System with Rewards
- Has a set of states {S_1, S_2, ..., S_N}
- Has a transition probability matrix P = [P_ij], where P_ij = Prob(NextState = S_j | ThisState = S_i)
- Each state has a reward: {r_1, r_2, ..., r_N}
- There is a discount factor γ, 0 < γ < 1
On each time step:
0. Assume your state is S_i
1. You are given reward r_i
2. You randomly move to another state: P(NextState = S_j | ThisState = S_i) = P_ij
3. All future rewards are discounted by γ

Slide 8: Solving a Markov System
Write J*(S_i) = expected discounted sum of future rewards starting in state S_i. Then
J*(S_i) = r_i + γ (expected future rewards starting from your next state)
        = r_i + γ (P_i1 J*(S_1) + P_i2 J*(S_2) + ... + P_iN J*(S_N))
Using vector notation, write J = (J*(S_1), ..., J*(S_N))^T, R = (r_1, ..., r_N)^T, and P = [P_ij].
Question: can you get a closed-form expression for J in terms of R, P, and γ?

Slide 9: Solving a Markov System with Matrix Inversion
J = R + γ P J, so J = (I - γP)^{-1} R.
Upside: you get an exact answer. Downside: if you have 100,000 states, you are solving a 100,000 by 100,000 system of equations.

Slide 10: Value Iteration, another way to solve a Markov System
Define:
J^1(S_i) = expected discounted sum of rewards over the next 1 time step
J^2(S_i) = expected discounted sum of rewards during the next 2 steps
J^3(S_i) = expected discounted sum of rewards during the next 3 steps
...
J^k(S_i) = expected discounted sum of rewards during the next k steps
Then, with N = number of states:
J^1(S_i) = r_i
J^2(S_i) = r_i + γ Σ_{j=1}^{N} P_ij J^1(S_j)
...
J^{k+1}(S_i) = r_i + γ Σ_{j=1}^{N} P_ij J^k(S_j)

Slide 11: Let's do Value Iteration
Slide 12: Value Iteration Example
(Both slides show a worked numerical example as a figure; the figures are not included in this preview.)

Slide 13: Value Iteration for Solving Markov Systems
- Compute J^1(S_i) for each i
- Compute J^2(S_i) for each i
- ...
- Compute J^k(S_i) for each i
As k → ∞, J^k(S_i) → J*(S_i). When to stop? When max_i |J^{k+1}(S_i) - J^k(S_i)| falls below a small threshold. This is faster than matrix inversion if the transition matrix is sparse.

Slide 14: Different Types of Markov Models

                          No control (no actions)    Control (actions)
Fully observable state    Markov systems              MDPs
Hidden state              HMMs                        POMDPs
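Slides 7-13 fit in a few lines of code. Below is a minimal sketch, not taken from the slides, of the two ways of solving a Markov system with rewards: exact matrix inversion, J = (I - γP)^{-1} R, and value iteration with the stopping rule from slide 13. The 3-state transition matrix, rewards, discount factor, and tolerance are made-up illustration values.

```python
import numpy as np

# P[i, j] = Prob(NextState = S_j | ThisState = S_i); each row sums to 1.
# These numbers are invented for illustration, not the slides' example.
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.1, 0.9]])
R = np.array([0.0, 1.0, 10.0])   # r_i = reward received in state S_i
gamma = 0.9                      # discount factor

# Exact solution (slide 9): J = R + gamma*P*J  =>  (I - gamma*P) J = R
J_exact = np.linalg.solve(np.eye(len(R)) - gamma * P, R)

# Value iteration (slides 10-13): J^{k+1}(S_i) = r_i + gamma * sum_j P_ij J^k(S_j)
J = np.zeros(len(R))
for _ in range(10_000):
    J_next = R + gamma * P @ J
    if np.max(np.abs(J_next - J)) < 1e-8:   # stopping rule from slide 13
        J = J_next
        break
    J = J_next

print("matrix inversion:", J_exact)
print("value iteration: ", J)
```

With a small dense P, the direct solve is the obvious route; the iterative loop is what pays off when the state space is large and P is sparse, matching the remark on slide 13.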
Slide 15: Markov Decision Processes, An Example
I run a startup company. I can choose to either save money or spend money on advertising.
- If I advertise, I may become famous (50% prob.), but I will spend money, so I may become poor.
- If I save money, I may become rich (50% prob.), but I may also become unknown because I don't advertise.
What should I do?

Slide 16: A Markov Decision Process
You run a startup company. In every state you must choose between saving money or advertising. (The slide shows the corresponding state-transition diagram; the figure is not included in this preview.)

Slide 17: Markov Decision Processes
An MDP has:
- a set of states {S_1, ..., S_N}
- a set of actions {a_1, ..., a_M}
- a set of rewards {r_1, ..., r_N}, one for each state
- a transition probability function P_ij^k = Prob(NextState = S_j | ThisState = S_i and I use action a_k)
On each step:
0. Call the current state S_i
1. Receive reward r_i
2. Choose an action from {a_1, ..., a_M}
3. If you choose action a_k, you move to state S_j with probability P_ij^k
4. All future rewards are discounted by γ

Slide 18: What We Are Looking For: a Policy
A policy is a mapping from states to actions. How many policies are there? Which one is the best policy, and how do we compute it?

Slide 19: Optimal Decision Policy
Policy π = mapping from states to actions, π(s) = a: which action should I take in each state?
Intuition: π encodes the best action that we can take from any state to maximize future rewards.
Optimal policy π* = the policy that maximizes the expected utility J(s) of the sequence of states generated by π, starting at s.

Slide 20: Example: Finance and Business
States: status of the company (cash reserves, inventory, etc.).
Actions: business decisions (advertise, acquire other companies, roll out a product, etc.).
Uncertainty: due to all the external, uncontrollable factors (economy, shortages, consumer confidence).
Optimal policy: the policy for making business decisions that maximizes the expected future profits.

Slide 21: Example: Robotics
States are 2-D positions. Actions are commanded motions (turn by x degrees, move y meters).
Uncertainty comes from the fact that the mechanism is not perfect (slippage, etc.) and does not execute the commands exactly.
Reward: for example, a reward for avoiding forbidden regions.
Optimal policy: the policy that minimizes the cost to the goal.

Slide 22: Key Results
For every MDP there exists an optimal policy. There is no better option, in terms of expected sum of rewards, than to follow this policy. How do we compute the optimal policy?

Slide 23: Computing the Optimal Policy
Idea one: run through all possible policies and select the best. Note that for a given policy the MDP reduces to a Markov system, so we can calculate its expected discounted reward. But this is not tractable: the number of policies grows exponentially with the number of states.

Slide 24: Computing the Optimal Value Function with Value Iteration
Define J*(S_i) = expected discounted future rewards, starting from state S_i, assuming we use the optimal policy.
Define J^n(S_i) = maximum possible expected sum of discounted rewards I can get if I start at state S_i and I live for n time steps.
Note that J^1(S_i) = r_i.

Slide 25: Bellman's Equation
J^{n+1}(S_i) = max_k [ r_i + γ Σ_{j=1}^{N} P_ij^k J^n(S_j) ]
Value iteration for solving MDPs:
- Compute J^1(S_i) for all i
- Compute J^2(S_i) for all i
- ...
- Compute J^n(S_i) for all i
- ... until converged

Slide 26: Computing J^n(S_i) for our example
(The slide works through the startup example as a figure; the figure is not included in this preview.)

Slide 27: Compute J^n(S_i) for our example

n    J^n(PU)    J^n(PF)    J^n(RU)    J^n(RF)
1    0          0          10         10
2    0          4.5        14.5       19
3    2.03       8.55       16.53      25.08
4    (values not shown in this preview)

Slide 28: Convergence Results
Convergence is exponentially fast, but it becomes slower as γ increases.

Finding the … (the preview is truncated here)
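To make the Bellman backup on slide 25 concrete, here is a minimal sketch of value iteration for an MDP, following the slides' convention that the reward depends only on the current state. The two-action, three-state MDP below (P, R, gamma, and the 1e-8 stopping tolerance) is invented for illustration; it is not the startup PU/PF/RU/RF example, whose transition probabilities are not given in this preview.

```python
import numpy as np

gamma = 0.9                       # discount factor

# P[k, i, j] = Prob(NextState = S_j | ThisState = S_i, action = a_k)
# Invented numbers; each row sums to 1 for each action.
P = np.array([
    [[0.8, 0.2, 0.0],             # transitions under action a_1
     [0.1, 0.8, 0.1],
     [0.0, 0.2, 0.8]],
    [[0.5, 0.5, 0.0],             # transitions under action a_2
     [0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0]],
])
R = np.array([0.0, 1.0, 10.0])    # r_i = reward for being in state S_i

J = np.zeros(len(R))              # the first backup below yields J^1 = R
for _ in range(10_000):
    # Bellman backup (slide 25):
    # J^{n+1}(S_i) = max_k [ r_i + gamma * sum_j P_ij^k J^n(S_j) ]
    Q = R + gamma * (P @ J)       # Q[k, i] = r_i + gamma * E[J(next) | S_i, a_k]
    J_next = Q.max(axis=0)
    if np.max(np.abs(J_next - J)) < 1e-8:
        J = J_next
        break
    J = J_next

policy = Q.argmax(axis=0)         # greedy action index for each state
print("J* estimate:", J)
print("greedy policy (action index per state):", policy)
```

Taking the greedy argmax over Q at the end is one common way to read a policy off the converged values: for each state, pick the action that attains the maximum in the Bellman backup.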

