CMU CS 15780 - Markov decision processes (MDPs)

15-780: Graduate Artificial Intelligence
Markov decision processes (MDPs)

What's missing in HMMs
• HMMs cannot model an important aspect of agent interactions:
  - No model for rewards
  - No model for actions, which can affect these rewards
• These issues are faced by many applications:
  - Agents negotiating deals on the web
  - A robot which interacts with its environment

Example: No actions
[Figure: state-transition diagram over the states Graduate student (reward 20), Asst. prof (40), Tenured prof. (100), Google (200), On the street (0), and Dead (0), with transition probabilities labeling the edges.]

Formal definition of MDPs
• A set of states {s_1 ... s_n}
• A set of rewards {r_1 ... r_n} (one reward for each state)
• A set of actions {a_1 ... a_m} (the number of actions can be larger than the number of states)
• Transition probabilities
  P^k_{i,j} = P(q_{t+1} = s_j | q_t = s_i, a_t = a_k)

Questions
• What is my expected pay if I am in state i?
• What is my expected pay if I am in state i and perform action a?

Solving MDPs
• No actions: value iteration
• With actions: policy iteration

Value computation
• An obvious question for such models is: what is the combined expected value of each state?
  - What can we expect to earn over our lifetime if we become an Asst. prof.?
  - What is it if we go to industry?
• Before we answer this question, we need to define a model for future rewards:
  - The value of a current reward is higher than the value of future rewards (inflation, confidence)
  - Example: lottery payouts

Discounted rewards
• The discounted reward model is specified using a parameter γ:
  total reward = current reward + γ (reward at time t+1) + γ^2 (reward at time t+2) + ... + γ^k (reward at time t+k) + ...
• This infinite sum converges if 0 < γ < 1

Determining the total reward in a state
• Define J*(s_i) = expected discounted sum of rewards when starting at state s_i
• How do we compute J*(s_i)?
  J*(s_i) = r_i + γ X = r_i + γ (p_{i,1} J*(s_1) + p_{i,2} J*(s_2) + ... + p_{i,n} J*(s_n))
  where X is the expected discounted reward collected from the next state onward
• How can we solve this?

Computing J*(s_i)
  J*(s_1) = r_1 + γ (p_{1,1} J*(s_1) + p_{1,2} J*(s_2) + ... + p_{1,n} J*(s_n))
  J*(s_2) = r_2 + γ (p_{2,1} J*(s_1) + p_{2,2} J*(s_2) + ... + p_{2,n} J*(s_n))
  ...
  J*(s_n) = r_n + γ (p_{n,1} J*(s_1) + p_{n,2} J*(s_2) + ... + p_{n,n} J*(s_n))
• We have n equations with n unknowns
• These can be solved in closed form

Iterative approaches
• Solving in closed form is possible, but may be time consuming
• It also doesn't generalize to non-linear models
• Alternatively, this problem can be solved in an iterative manner
• Define J^t(s_i) as the expected discounted reward after t steps
• How can we compute J^t(s_i)?
  J^1(s_i) = r_i
  J^2(s_i) = r_i + γ Σ_k p_{i,k} J^1(s_k)
  J^{t+1}(s_i) = r_i + γ Σ_k p_{i,k} J^t(s_k)
• We know how to solve this! Let's fill the dynamic programming table
• But wait ... this is a never-ending task! When do we stop?
• Remember, we have a converging sequence, so we can stop when
  |J^{t+1}(s_i) - J^t(s_i)|_∞ < ε
  (the infinity norm selects the maximal element)

Example for γ = 0.9
[Figure: transition diagram over the states Graduate student (reward 20), Asst. prof (40), Google (200), and Dead (0).]

  t   J^t(Gr)   J^t(P)   J^t(Goo)   J^t(D)
  1    20        40       200        0
  2    74        87       362        0
  3   141       135       493        0
  4   209       182       600        0

(A small code sketch of this iteration is given below.)

Solving MDPs
• No actions: value iteration ✓
• With actions: policy iteration
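To make the no-actions value iteration concrete, here is a minimal Python sketch (not part of the original slides). The rewards are taken from the γ = 0.9 example; the transition matrix is my reading of the figure, chosen to be consistent with the table above, so treat the exact probabilities as assumptions.

```python
import numpy as np

# States of the γ = 0.9 example: Graduate student, Asst. prof, Google, Dead.
# Rewards come from the slide; the transition matrix P is my reading of the
# figure (row i holds the probabilities of moving from state i to each state).
states = ["Gr", "P", "Goo", "D"]
r = np.array([20.0, 40.0, 200.0, 0.0])
P = np.array([
    [0.6, 0.2, 0.2, 0.0],   # Graduate student
    [0.0, 0.8, 0.1, 0.1],   # Asst. prof
    [0.0, 0.0, 0.9, 0.1],   # Google
    [0.0, 0.0, 0.0, 1.0],   # Dead (absorbing)
])
gamma = 0.9

# Value iteration: J^1(s_i) = r_i, then J^{t+1}(s_i) = r_i + γ Σ_k p_{i,k} J^t(s_k).
J = r.copy()
for t in range(10_000):
    J_next = r + gamma * P @ J
    if np.max(np.abs(J_next - J)) < 1e-6:   # stop when the infinity norm drops below ε
        J = J_next
        break
    J = J_next

print("iterative J*  :", dict(zip(states, np.round(J, 1))))

# The same values solve the n linear equations J* = r + γ P J*, i.e. (I - γP) J* = r.
J_closed = np.linalg.solve(np.eye(len(r)) - gamma * P, r)
print("closed-form J*:", dict(zip(states, np.round(J_closed, 1))))
```

With these assumed probabilities the early iterates match the table (J^2 ≈ 74, 87, 362, 0), and the iterative and closed-form answers agree at convergence.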
Adding actions
A Markov decision process:
• A set of states {s_1 ... s_n}
• A set of rewards {r_1 ... r_n}
• A set of actions {a_1 ... a_m}
• Transition probabilities
  P^k_{i,j} = P(q_{t+1} = s_j | q_t = s_i, a_t = a_k)

Example: Actions
[Figure: the career transition diagram again (Graduate student 20, Asst. prof 40, Tenured prof. 100, Google 200, On the street 0, Dead 0), now with two actions: Action A = leave to Google, Action B = stay in academia, each with its own transition probabilities.]
• So should you leave now (right after class), or should you stay in the PhD program?

Questions for MDPs
• Now we have actions
• The question changes to the following: given our current state and the possible actions, what is the best action for us in terms of long-term payoff?

Policy
• A policy maps states to actions, for example:
  Gr        → B
  Go        → A
  Asst. Pr. → A
  Ten. Pr.  → B
• An optimal policy leads to the highest expected return
• Note that this does not depend on the start state

Solving MDPs with actions
• It can be shown that for every MDP there exists an optimal policy (we won't discuss the proof)
• Such a policy guarantees that there is no other action that is expected to yield a higher payoff

Computing the optimal policy: 1. Modified value iteration
• We can compute it by modifying the value iteration method we discussed
• Define p^k_{i,j} as the probability of transitioning from state i to state j when using action k
• Then we compute
  J^{t+1}(s_i) = max_k ( r_i + γ Σ_j p^k_{i,j} J^t(s_j) )
  also known as Bellman's equation
• Run until convergence (see the code sketch below)

Value iteration for γ = 0.9
[Figure: transition diagram over Graduate student (20), Asst. prof (40), Google (200), and Dead (0), with Action A and Action B available at the Graduate student state.]

  t   J^t(Gr)             J^t(P)   J^t(Goo)   J^t(D)
  1    20                  40       200        0
  2   168 (A),  51 (B)     87       362        0
  3   311 (A), 120 (B)    135       493        0
  4   431 (A), 189 (B)    182       600        0

Computing the optimal policy: 2. Policy iteration
• We can also compute optimal policies by revising an existing policy
• We initially select a policy at random (a mapping from states to actions)
• We re-compute the expected long-term reward at each state using the selected policy
• We select a new policy using these expected rewards and iterate until convergence

Policy iteration: algorithm
• Let π_t(s_i) be the selected policy at time t
  1. Randomly choose π_0
  2. For each state s_i, compute J*(s_i), the long-term expected reward when following policy π_t
  3. Set π_{t+1}(s_i) = argmax_k ( r_i + γ Σ_j p^k_{i,j} J*(s_j) )
  4. Converged? Yes – output the policy. No – go to 2.
  (A code sketch of this loop is also given below.)
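Returning to the modified value iteration above, here is a hedged Python sketch of the Bellman backup, again not from the slides. The two per-action transition matrices are my reconstruction of the figure (action A = leave to Google, action B = stay in academia, differing only at the Graduate student state), chosen to be consistent with the 168/51-style entries in the table; treat them as illustrative assumptions.

```python
import numpy as np

# Same four states and rewards as before; γ = 0.9 as in the slides.
states = ["Gr", "P", "Goo", "D"]
r = np.array([20.0, 40.0, 200.0, 0.0])
gamma = 0.9

# One transition matrix per action; only the Graduate-student row differs.
# Action A ≈ leave to Google, action B ≈ stay in academia.  The probabilities
# are my reconstruction of the figure, not values stated in the text.
P_A = np.array([
    [0.2, 0.0, 0.8, 0.0],   # Gr under A: mostly jump to Google
    [0.0, 0.8, 0.1, 0.1],   # Asst. prof
    [0.0, 0.0, 0.9, 0.1],   # Google
    [0.0, 0.0, 0.0, 1.0],   # Dead (absorbing)
])
P_B = P_A.copy()
P_B[0] = [0.3, 0.7, 0.0, 0.0]   # Gr under B: mostly advance to Asst. prof
P = np.stack([P_A, P_B])        # shape (num_actions, n, n)

# Modified value iteration (Bellman's equation):
#   J^{t+1}(s_i) = max_k ( r_i + γ Σ_j p^k_{i,j} J^t(s_j) )
J = r.copy()
for t in range(10_000):
    Q = r + gamma * P @ J        # Q[k, i] = value of taking action k in state i
    J_next = Q.max(axis=0)
    if np.max(np.abs(J_next - J)) < 1e-6:
        J = J_next
        break
    J = J_next

greedy = [("A", "B")[k] for k in Q.argmax(axis=0)]
print("J*:", dict(zip(states, np.round(J, 1))))
print("greedy action per state:", dict(zip(states, greedy)))
```

With these assumed numbers, the first backup at the Graduate student state gives roughly 168 for action A and 51 for action B, matching the table, and the greedy policy picks A there.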

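Finally, a minimal sketch of the policy iteration algorithm above, reusing the same assumed career MDP as the previous sketch. The policy evaluation step solves the n linear equations from the "Computing J*(s_i)" slide; everything beyond the slide content (state ordering, probabilities, the helper function name) is hypothetical.

```python
import numpy as np

def policy_iteration(P, r, gamma, rng=None):
    """Policy iteration for an MDP with P[k, i, j] = p^k_{i,j} and per-state rewards r."""
    rng = np.random.default_rng() if rng is None else rng
    num_actions, n, _ = P.shape
    pi = rng.integers(num_actions, size=n)           # 1. randomly choose π_0
    while True:
        # 2. Policy evaluation: solve J = r + γ P_π J for the current policy
        #    (the n linear equations from the "Computing J*(s_i)" slide).
        P_pi = P[pi, np.arange(n)]                   # row i is the π(s_i) row of P
        J = np.linalg.solve(np.eye(n) - gamma * P_pi, r)
        # 3. Policy improvement: π_{t+1}(s_i) = argmax_k ( r_i + γ Σ_j p^k_{i,j} J(s_j) )
        pi_next = (r + gamma * P @ J).argmax(axis=0)
        # 4. Converged?  Yes -> output the policy.  No -> go back to step 2.
        if np.array_equal(pi_next, pi):
            return pi, J
        pi = pi_next

# Usage with the same assumed career MDP as in the previous sketch:
r = np.array([20.0, 40.0, 200.0, 0.0])
P_A = np.array([[0.2, 0.0, 0.8, 0.0],
                [0.0, 0.8, 0.1, 0.1],
                [0.0, 0.0, 0.9, 0.1],
                [0.0, 0.0, 0.0, 1.0]])
P_B = P_A.copy()
P_B[0] = [0.3, 0.7, 0.0, 0.0]
pi, J = policy_iteration(np.stack([P_A, P_B]), r, gamma=0.9)
print("policy:", [("A", "B")[k] for k in pi])
print("values:", np.round(J, 1))
```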
