CMU CS 10-701 – MDPs (lecture slides)

©2005-2007 Carlos Guestrin

Slide 1: Markov Decision Processes (MDPs)
Machine Learning – 10-701/15-781
Carlos Guestrin, Carnegie Mellon University
April 30th, 2007

Slide 2: Thus far this semester
- Regression
- Classification
- Density estimation

Slide 3: Learning to act
- Reinforcement learning
- An agent:
  - makes sensor observations
  - must select actions
  - receives rewards: positive for "good" states, negative for "bad" states
[Ng et al. '05]

Slide 4: Learning to play backgammon [Tesauro '95]
- Combines reinforcement learning with neural networks
- Played 300,000 games against itself
- Achieved grandmaster level!

Slide 5: Roadmap to learning about reinforcement learning
- When we learned about Bayes nets:
  - first the formal framework: representation, inference
  - then learning for BNs
- For reinforcement learning:
  - formal framework: Markov decision processes
  - then learning

Slide 6: Real-time strategy game
[Figure: peasant, footman, building]
- Peasants collect resources and build
- Footmen attack enemies
- Buildings train peasants and footmen

Slide 7: States and actions
- State space: joint state x of the entire system
- Action space: joint action a = {a1, …, an} for all agents

Slide 8: States change over time
- Like an HMM, the state changes over time
- The next state depends on the current state and the action selected
  - e.g., action = "build castle" likely leads to a state where you have a castle
- Transition model: dynamics of the entire system, P(x'|x,a)

Slide 9: Some states and actions are better than others
- Each state x is associated with a reward
  - positive reward for a successful attack
  - negative reward for a loss
- Reward function: total reward R(x)

Slide 10: Discounted rewards
An assistant professor gets paid, say, $20K per year. How much, in total, will the A.P. earn in their life?
  20 + 20 + 20 + 20 + 20 + … = infinity
What's wrong with this argument?

Slide 11: Discounted rewards
"A reward (payment) in the future is not worth quite as much as a reward now."
- Because of the chance of obliteration
- Because of inflation
Example: being promised $10,000 next year is worth only 90% as much as receiving $10,000 right now. Assuming a payment n years in the future is worth only (0.9)^n of a payment now, what is the A.P.'s future discounted sum of rewards?

Slide 12: Discount factors
People in economics and probabilistic decision-making do this all the time. The "discounted sum of future rewards" using discount factor γ is
  (reward now) + γ (reward in 1 time step) + γ^2 (reward in 2 time steps) + γ^3 (reward in 3 time steps) + … (infinite sum)
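
A worked version of the question posed on slides 11-12 (the arithmetic is not spelled out in this preview): with a constant reward of 20 per year and discount factor γ = 0.9, the geometric series is finite:

  V = \sum_{t=0}^{\infty} (0.9)^t \cdot 20 = \frac{20}{1 - 0.9} = 200

so the assistant professor's discounted lifetime earnings come to 200 (in $K), not infinity.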

Slide 13: The Academic Life
Define:
- V_A = expected discounted future rewards starting in state A
- V_B = expected discounted future rewards starting in state B
- V_T = expected discounted future rewards starting in state T
- V_S = expected discounted future rewards starting in state S
- V_D = expected discounted future rewards starting in state D
How do we compute V_A, V_B, V_T, V_S, V_D?
[Figure: Markov chain over states A (Assistant Prof, reward 20), B (Assoc. Prof, reward 60), T (Tenured Prof, reward 400), S (On the Street, reward 10), D (Dead, reward 0), with transition probabilities 0.2–0.7 labeling the arcs. Assume discount factor γ = 0.9.]

Slide 14: Computing the future rewards of an academic
[Same figure; assume discount factor γ = 0.9.]

Slide 15: Joint decision space
Markov Decision Process (MDP) representation:
- State space: joint state x of the entire system
- Action space: joint action a = {a1, …, an} for all agents
- Reward function: total reward R(x,a); sometimes the reward can depend on the action
- Transition model: dynamics of the entire system, P(x'|x,a)

Slide 16: Policy
Policy: π(x) = a. At state x, take action a for all agents.
- π(x0) = both peasants get wood
- π(x1) = one peasant builds barrack, other gets gold
- π(x2) = peasants get gold, footmen attack

Slide 17: Value of a policy
Value V^π(x): expected long-term reward starting from x. Starting from x0:
  V^π(x0) = E_π[ R(x0) + γ R(x1) + γ^2 R(x2) + γ^3 R(x3) + γ^4 R(x4) + … ]
Future rewards are discounted by γ ∈ [0,1).

Slide 18: Computing the value of a policy
  V^π(x0) = E_π[ R(x0) + γ R(x1) + γ^2 R(x2) + γ^3 R(x3) + γ^4 R(x4) + … ]
Discounted value of a state: the value of starting from x0 and continuing with policy π from then on. A recursion!

Slide 19: Computing the value of a policy 1 – the matrix inversion approach
Solve by simple matrix inversion (the fixed point of the recursion V^π = R + γ P^π V^π):
  V^π = (I − γ P^π)^{−1} R

Slide 20: Computing the value of a policy 2 – iteratively
If you have 1,000,000 states, inverting a 1,000,000 × 1,000,000 matrix is hard!
Can solve using a simple convergent iterative approach (a.k.a. dynamic programming):
- Start with some guess V_0
- Iterate: V_{t+1} = R + γ P^π V_t
- Stop when ||V_{t+1} − V_t||_∞ ≤ ε; this guarantees ||V^π − V_{t+1}||_∞ ≤ ε/(1−γ)
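
To make slides 19-20 concrete, here is a minimal sketch in Python/NumPy (not from the original slides) of both policy-evaluation approaches. The reward vector loosely mirrors the academic-life example, but the transition matrix P_pi below is an illustrative assumption, not the one drawn in the slides.

import numpy as np

# Hypothetical 5-state chain loosely inspired by the academic-life example
# (states: Assistant, Associate, Tenured, Street, Dead).
# This transition matrix is an assumption for illustration only.
gamma = 0.9
R = np.array([20.0, 60.0, 400.0, 10.0, 0.0])   # reward per state
P_pi = np.array([
    [0.7, 0.2, 0.0, 0.1, 0.0],
    [0.0, 0.7, 0.2, 0.1, 0.0],
    [0.0, 0.0, 0.7, 0.0, 0.3],
    [0.0, 0.0, 0.0, 0.7, 0.3],
    [0.0, 0.0, 0.0, 0.0, 1.0],
])

# Approach 1 (slide 19): solve the linear system V = R + gamma * P_pi @ V
V_exact = np.linalg.solve(np.eye(5) - gamma * P_pi, R)

# Approach 2 (slide 20): iterate V_{t+1} = R + gamma * P_pi @ V_t until convergence
eps = 1e-6
V = np.zeros(5)
while True:
    V_next = R + gamma * P_pi @ V
    if np.max(np.abs(V_next - V)) <= eps:   # infinity-norm stopping rule
        V = V_next
        break
    V = V_next

print(V_exact)
print(V)   # agrees with V_exact to within roughly eps / (1 - gamma)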
dynamic programming) Start with some guess V0 Iteratively say: Vt+1 = R + γ Pπ Vt Stop when ||Vt+1-Vt||1 · ε means that ||Vπ-Vt+1||1 · ε/(1-γ)21©2005-2007 Carlos GuestrinBut we want to learn a PolicyPolicy: π(x) = aAt state x, actiona for all agentsπ(x0) = both peasants get woodx0π(x1) = one peasant builds barrack, other gets gold x1π(x2) = peasants get gold, footmen attackx2 So far, told you how good apolicy is… But how can we choose thebest policy??? Suppose there was only onetime step: world is about to end!!! select action that maximizesreward!22©2005-2007 Carlos GuestrinAnother recursion! Two time steps: address tradeoff good reward now better reward in the future23©2005-2007 Carlos GuestrinUnrolling the recursion Choose actions that lead to best value in the long run Optimal value policy achieves optimal value V*24©2005-2007 Carlos GuestrinBellman equation [Bellman 60] Evaluating policy π: Computing the optimal value V* - Bellman equation!""+=')'(),|'(),(max)(xaxaxxaxx VPRV#25©2005-2007 Carlos GuestrinOptimal Long-term PlanOptimal Policy: π*(x)Optimal valuefunction V*(x)Optimal policy:)a,x(maxarg)x(a!!= Q"!""+=')'(),|'(),(),(xxaxxaxax VPRQ#26©2005-2007 Carlos GuestrinInteresting fact – Unique value Slightly surprising fact: There is only

