UT Dallas CS 6375 - mdp

Machine Learning
CS6375 --- Spring 2015
Markov Decision Processes
Acknowledgment: slides based on Andrew Moore's MDP tutorial

Rewards

An assistant professor gets paid, say, 20K per year.
How much, in total, will the A.P. earn in their life?
20 + 20 + 20 + 20 + 20 + ... = infinity
What's wrong with this argument?

Discounted Rewards

"A reward (payment) in the future is not worth quite as much as a reward now."
• Because of the chance of obliteration
• Because of inflation
Example: being promised $10,000 next year is worth only 90% as much as receiving $10,000 right now.
Assuming a payment n years in the future is worth only (0.9)^n of a payment now, what is the A.P.'s future discounted sum of rewards?

Discount Factors

People in economics and probabilistic decision making do this all the time.
The discounted sum of future rewards using discount factor γ is
(reward now) + γ (reward in 1 time step) + γ^2 (reward in 2 time steps) + γ^3 (reward in 3 time steps) + ... (an infinite sum)

Discounting

• Always converges if γ < 1 and the reward function R(·) is bounded
• γ close to 0 → instant gratification, don't pay attention to future reward
• γ close to 1 → extremely conservative, consider profits/losses no matter how far in the future
• The resulting model is the discounted reward

The Academic Life

Define:
J_A = expected discounted future rewards starting in state A
J_B = expected discounted future rewards starting in state B
J_T = expected discounted future rewards starting in state T
J_S = expected discounted future rewards starting in state S
J_D = expected discounted future rewards starting in state D
How do we compute J_A, J_B, J_T, J_S, J_D?

A Markov System with Rewards

• Has a set of states {S_1, S_2, ..., S_N}
• Has an N x N transition probability matrix P, with P_ij = Prob(NextState = S_j | ThisState = S_i)
• Each state has a reward: {r_1, r_2, ..., r_N}
• There's a discount factor γ, with 0 < γ < 1
On each time step:
0. Assume your state is S_i
1. You receive reward r_i
2. You randomly move to another state, with P(NextState = S_j | ThisState = S_i) = P_ij
3. All future rewards are discounted by γ

Solving a Markov System

Write J*(S_i) = expected discounted sum of future rewards starting in state S_i. Then
J*(S_i) = r_i + γ (expected future rewards starting from your next state)
        = r_i + γ (P_i1 J*(S_1) + P_i2 J*(S_2) + ... + P_iN J*(S_N))
Using vector notation, write J = (J*(S_1), ..., J*(S_N))^T, R = (r_1, ..., r_N)^T, and P for the N x N transition matrix.
Question: can you get a closed-form expression for J in terms of R, P and γ?

Solving a Markov System with Matrix Inversion

• Upside: you get an exact answer.
• Downside: if you have 100,000 states, you're solving a 100,000-by-100,000 system of equations.
J* = R + γ P J*, so J* = (I - γP)^(-1) R

Value Iteration: Another Way to Solve a Markov System

Define
J^1(S_i) = expected discounted sum of rewards over the next 1 time step
J^2(S_i) = expected discounted sum of rewards during the next 2 steps
J^3(S_i) = expected discounted sum of rewards during the next 3 steps
...
J^k(S_i) = expected discounted sum of rewards during the next k steps
These satisfy (with N = number of states):
J^1(S_i) = r_i
J^2(S_i) = r_i + γ Σ_{j=1..N} P_ij J^1(S_j)
...
J^{k+1}(S_i) = r_i + γ Σ_{j=1..N} P_ij J^k(S_j)

Let's Do Value Iteration / Value Iteration Example
(worked numerical examples on the slides)

Value Iteration for Solving Markov Systems

• Compute J^1(S_i) for each i
• Compute J^2(S_i) for each i
...
• Compute J^k(S_i) for each i
As k → ∞, J^k(S_i) → J*(S_i). When to stop? When
max_i |J^{k+1}(S_i) - J^k(S_i)| < ξ
This is faster than matrix inversion if the transition matrix is sparse.

Different Types of Markov Models

State                 No control          Control
Fully observable      Markov systems      MDPs
Hidden                HMMs                POMDPs
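A minimal sketch of the two solution methods just described, using NumPy: the exact solution of J* = R + γ P J* obtained by solving the linear system, and value iteration with the stopping test max_i |J^{k+1}(S_i) - J^k(S_i)| < ξ. The 3-state transition matrix and rewards below are invented for illustration; they are not the Academic Life example, whose numbers are not given in the text.

```python
# Sketch: solving a Markov system with rewards two ways,
#   (1) matrix inversion:  J* = (I - gamma * P)^-1 R
#   (2) value iteration:   J_{k+1} = R + gamma * P J_k
# The 3-state chain below is made up for illustration.
import numpy as np

gamma = 0.9                               # discount factor, 0 < gamma < 1
P = np.array([[0.5, 0.5, 0.0],            # P[i, j] = Prob(next = S_j | this = S_i)
              [0.2, 0.5, 0.3],
              [0.0, 0.3, 0.7]])
R = np.array([20.0, 0.0, -5.0])           # one reward per state

# (1) Exact answer by solving (I - gamma P) J = R; expensive for ~100,000 states.
J_exact = np.linalg.solve(np.eye(3) - gamma * P, R)

# (2) Value iteration: repeat J <- R + gamma * P J until the largest
#     change across states drops below the tolerance xi.
J = np.zeros(3)
xi = 1e-8
while True:
    J_next = R + gamma * P @ J
    if np.max(np.abs(J_next - J)) < xi:
        break
    J = J_next

print(J_exact)    # the two answers agree up to the tolerance
print(J_next)
```

Both methods give the same values up to the tolerance; as the slides note, value iteration tends to win when the state space is large and the transition matrix is sparse.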
Markov Decision Processes: An Example

• I run a startup company.
• I can choose to either save money or spend money on advertising.
• If I advertise, I may become famous (50% prob.), but I will spend money, so I may become poor.
• If I save money, I may become rich (50% prob.), but I may also become unknown because I don't advertise.
• What should I do?

A Markov Decision Process

You run a startup company. In every state you must choose between saving money or advertising.

Markov Decision Processes

An MDP has...
• A set of states {S_1, ..., S_N}
• A set of actions {a_1, ..., a_M}
• A set of rewards {r_1, ..., r_N} (one for each state)
• A transition probability function P^k_ij = Prob(NextState = S_j | ThisState = S_i, action = a_k)
On each step:
0. Call the current state S_i
1. Receive reward r_i
2. Choose an action from {a_1, ..., a_M}
3. If you choose action a_k, you'll move to state S_j with probability P^k_ij
4. All future rewards are discounted by γ

What We Are Looking For: a Policy

A policy is a mapping from states to actions.
• How many policies are there?
• Which one is the best policy, and how do we compute it?

Optimal Decision Policy

• Policy π = a mapping from states to actions, π(s) = a: which action should I take in each state?
• Intuition: π encodes the best action we can take from any state to maximize future rewards.
• Optimal policy = the policy π* that maximizes the expected utility J(s) of the sequence of states generated by π*, starting at s.

Example: Finance and Business

• States: the status of the company (cash reserves, inventory, etc.)
• Actions: business decisions (advertise, acquire other companies, roll out a product, etc.)
• Uncertainty comes from all the external, uncontrollable factors (the economy, shortages, consumer confidence)
• Optimal policy: the policy for making business decisions that maximizes the expected future profits

Example: Robotics

• States are 2-D positions
• Actions are commanded motions (turn by x degrees, move y meters)
• Uncertainty comes from the fact that the mechanism is not perfect (slippage, etc.) and does not execute the commands exactly
• Reward for avoiding forbidden regions (for example)
• Optimal policy: the policy that minimizes the cost to the goal

Key Results

• For every MDP, there exists an optimal policy.
• There is no better option (in terms of expected sum of rewards) than to follow this policy.
• How do we compute the optimal policy?

Computing the Optimal Policy

Idea one:
– Run through all possible policies.
– Select the best.
Note: for a given policy, the process is just a Markov system, so we can calculate its expected discounted reward.
Not tractable: with M actions and N states there are M^N deterministic policies.

Computing the Optimal Value Function with Value Iteration

Define J*(S_i) = expected discounted future rewards, starting from state S_i, assuming we use the optimal policy.
Define J^n(S_i) = maximum ...
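The preview stops at the definition of J^n(S_i). As a hedged sketch of the standard formulation (assumed here; the remaining slides are not shown), value iteration for an MDP replaces the Markov-system backup with a maximization over actions, J^{n+1}(S_i) = max_k [ r_i + γ Σ_j P^k_ij J^n(S_j) ], and the optimal policy is read off as the maximizing action in each state. The 3-state, 2-action numbers below are invented for illustration and are not the startup (save/advertise) example.

```python
# Hedged sketch: value iteration for an MDP with per-action transition
# matrices, using the standard Bellman backup
#   J_{n+1}(S_i) = max_k [ r_i + gamma * sum_j P[k, i, j] * J_n(S_j) ],
# followed by greedy extraction of the policy pi*(S_i).
# All numbers below are invented for illustration.
import numpy as np

gamma = 0.9
# P[k, i, j] = Prob(next = S_j | current = S_i, action = a_k)
P = np.array([
    [[0.9, 0.1, 0.0],        # action a_0
     [0.1, 0.8, 0.1],
     [0.0, 0.2, 0.8]],
    [[0.5, 0.5, 0.0],        # action a_1
     [0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0]],
])
R = np.array([0.0, 1.0, 10.0])   # one reward per state, as in the slides

J = np.zeros(3)
xi = 1e-8
while True:
    # Q[k, i] = value of taking action a_k in state S_i, then acting optimally
    Q = R + gamma * (P @ J)          # shape (num_actions, num_states)
    J_next = Q.max(axis=0)           # maximize over actions
    if np.max(np.abs(J_next - J)) < xi:
        break
    J = J_next

policy = Q.argmax(axis=0)            # pi*(S_i) = index of the best action per state
print(J_next, policy)
```

For a fixed policy the max over actions disappears and this reduces to the Markov-system value iteration sketched earlier.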

