10-601 Machine Learning
Markov decision processes (MDPs)

What's missing in HMMs
• HMMs cannot model important aspects of agent interactions:
- No model for rewards
- No model for actions, which can affect these rewards
• These are issues faced by many applications:
- Agents negotiating deals on the web
- A robot which interacts with its environment

Example: No actions
[State diagram: Graduate student (reward 20), Asst. prof (40), Tenured prof. (100), Google (200), On the street (0), Dead (0), with fixed transition probabilities between states; Dead is absorbing.]

Formal definition of MDPs
• A set of states {s_1 ... s_n}
• A set of rewards {r_1 ... r_n}, one reward for each state
• A set of actions {a_1 ... a_m}; the number of actions could be larger than the number of states
• Transition probabilities P^k_{i,j} = P(q_{t+1} = s_j | q_t = s_i, a_t = a_k)

Questions
• What is my expected pay if I am in state i?
• What is my expected pay if I am in state i and perform action a?

Solving MDPs
• No actions: value iteration
• With actions: value iteration, policy iteration

Value computation
• An obvious question for such models: what is the combined expected value of each state?
• What can we expect to earn over our lifetime if we become an Asst. prof.?
• What if we go to industry?
• Before we answer these questions, we need a model for future rewards:
- The value of a current reward is higher than the value of future rewards
- Inflation, confidence
- Example: lottery

Discounted rewards
• The discounted rewards model is specified using a parameter γ
• Total reward = (current reward) + γ(reward at time t+1) + γ^2(reward at time t+2) + ... + γ^k(reward at time t+k) + ...
• This infinite sum converges if 0 < γ < 1

Determining the total reward of a state
• Define J*(s_i) = expected discounted sum of rewards when starting at state s_i
• How do we compute J*(s_i)?

J*(s_i) = r_i + γ(p_{i,1} J*(s_1) + p_{i,2} J*(s_2) + ... + p_{i,n} J*(s_n))

The sum factors in the expected pay over all possible transitions out of s_i. How can we solve this?

Computing J*(s_i)
J*(s_1) = r_1 + γ(p_{1,1} J*(s_1) + p_{1,2} J*(s_2) + ... + p_{1,n} J*(s_n))
J*(s_2) = r_2 + γ(p_{2,1} J*(s_1) + p_{2,2} J*(s_2) + ... + p_{2,n} J*(s_n))
...
J*(s_n) = r_n + γ(p_{n,1} J*(s_1) + p_{n,2} J*(s_2) + ... + p_{n,n} J*(s_n))
• We have n linear equations with n unknowns
• They can be solved in closed form

Iterative approaches
• Solving in closed form is possible, but may be time consuming, and it doesn't generalize to non-linear models
• Alternatively, this problem can be solved iteratively
• Define J^t(s_i) as the expected discounted reward after t steps. How can we compute J^t(s_i)?

J^1(s_i) = r_i
J^2(s_i) = r_i + γ Σ_k p_{i,k} J^1(s_k)
J^{t+1}(s_i) = r_i + γ Σ_k p_{i,k} J^t(s_k)

• We know how to solve this: fill in a dynamic programming table
• But wait ... this is a never-ending task! When do we stop?
• Remember, we have a converging function: we can stop when ||J^{t-1} - J^t|| < ε
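The no-actions iteration above can be sketched in a few lines of Python. The rewards, transition probabilities, and γ = 0.9 are taken from the lecture's four-state worked example (graduate student, assistant professor, Google, dead); the function name and the ε threshold are my own choices, so treat this as a sketch rather than the course's reference code.

```python
GAMMA = 0.9
REWARDS = [20, 40, 200, 0]          # Grad, AsstProf, Google, Dead
# P[i][j] = probability of moving from state i to state j (no actions yet)
P = [
    [0.6, 0.2, 0.2, 0.0],           # Grad: stay / promoted / leave to Google
    [0.0, 0.8, 0.1, 0.1],           # AsstProf
    [0.0, 0.0, 0.9, 0.1],           # Google
    [0.0, 0.0, 0.0, 1.0],           # Dead is absorbing
]

def value_iteration(rewards, trans, gamma, eps=1e-6):
    """Iterate J^{t+1}(s_i) = r_i + gamma * sum_k p_ik * J^t(s_k) to a fixed point."""
    J = list(rewards)               # J^1(s_i) = r_i
    while True:
        J_next = [rewards[i] + gamma * sum(trans[i][k] * J[k] for k in range(len(J)))
                  for i in range(len(J))]
        # Stop when the infinity norm of the change drops below eps
        if max(abs(a - b) for a, b in zip(J, J_next)) < eps:
            return J_next
        J = J_next
```

A single backup from J^1 gives [74.0, 86.8, 362.0, 0.0], which rounds to the t = 2 row of the worked example; at the fixed point the Google state is worth 200 / (1 - 0.81) ≈ 1052.6.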
(Here ||·|| is the infinity norm, which selects the maximal element.)

Example for γ = 0.9
[State diagram: Graduate student (reward 20), Asst. prof (40), Google (200), Dead (0). Grad: 0.6 stay, 0.2 to Asst. prof, 0.2 to Google; Asst. prof: 0.8 stay, 0.1 to Google, 0.1 to Dead; Google: 0.9 stay, 0.1 to Dead; Dead is absorbing.]

t | J^t(Gr) | J^t(P) | J^t(Goo) | J^t(D)
1 |  20     |  40    |  200     | 0
2 |  74     |  87    |  362     | 0
3 | 141     | 135    |  493     | 0
4 | 209     | 182    |  600     | 0

For example, J^2(Gr) = 20 + 0.9*(0.6*20 + 0.2*40 + 0.2*200) = 74

Solving MDPs
• No actions: value iteration
• With actions: value iteration, policy iteration

Adding actions
A Markov decision process:
• A set of states {s_1 ... s_n}
• A set of rewards {r_1 ... r_n}
• A set of actions {a_1 ... a_m}
• Transition probabilities P^k_{i,j} = P(q_{t+1} = s_j | q_t = s_i, a_t = a_k)

Example: Actions
Action A: leave to Google
Action B: stay in academia
[State diagram as before, but now each state's outgoing transition probabilities depend on the chosen action.]

Questions for MDPs
• Now we have actions
• The question changes to the following: given our current state and the possible actions, what is the best action for us in terms of long-term payoff?

Example: Actions
Action A: leave to Google
Action B: stay in academia
[Reduced diagram: Graduate student (20), Asst. prof (40), Google (200), Dead (0). From the graduate-student state, action A moves to Google with probability 0.8 and stays with probability 0.2; action B moves to Asst. prof with probability 0.7 and stays with probability 0.3.]
So should you leave now (right after class) or should you stay at CMU?

Policy
• A policy maps states to actions
• An optimal policy leads to the highest expected returns
• Note that this does not depend on the start state
Example policy: Gr → B, Go → A, Asst. Pr. → A, Ten. Pr. → B

Solving MDPs with actions
• It can be shown that for every MDP there exists an optimal policy (we won't discuss the proof)
• Such a policy guarantees that no other action is expected to yield a higher payoff

Computing the optimal policy: 1. Modified value iteration
• We can compute it by modifying the value iteration method we discussed
• Define p^k_{i,j} as the probability of transitioning from state i to state j when using action k
• Then we compute (also known as Bellman's equation; the inner sum uses the probabilities associated with action k):

J^{t+1}(s_i) = max_k [ r_i + γ Σ_j p^k_{i,j} J^t(s_j) ]

• Run until convergence
• When the algorithm converges, we have computed the best outcome for each state
• We then associate each state with the action that maximizes its return

Value iteration for γ = 0.9
[Same four-state diagram, with the two actions available in the graduate-student state.]

t | J^t(Gr)           | J^t(P) | J^t(Goo) | J^t(D)
1 |  20               |  40    |  200     | 0
2 | 168 (A) / 51 (B)  |  87    |  362     | 0
3 | 311 (A) / 120 (B) | 135    |  493     | 0
4 | 431 (A) / 189 (B) | 182    |  600     | 0
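As a sketch, the modified (Bellman) value iteration for the action example might look like the following Python. The per-action transitions for the graduate-student state are read off the example (A: 0.8 to Google, 0.2 stay; B: 0.7 to Asst. prof, 0.3 stay); giving both actions identical transitions in the other states is my simplifying assumption, since the slides only show a choice at the graduate-student state.

```python
GAMMA = 0.9
REWARDS = [20, 40, 200, 0]                 # Grad, AsstProf, Google, Dead
SHARED = [                                  # states with no real choice: both actions agree
    [0.0, 0.8, 0.1, 0.1],                   # AsstProf
    [0.0, 0.0, 0.9, 0.1],                   # Google
    [0.0, 0.0, 0.0, 1.0],                   # Dead is absorbing
]
# P[a][i][j] = probability of moving from state i to state j under action a
P = {
    "A": [[0.2, 0.0, 0.8, 0.0]] + SHARED,   # Grad leaves to Google w.p. 0.8
    "B": [[0.3, 0.7, 0.0, 0.0]] + SHARED,   # Grad stays, promoted w.p. 0.7
}

def bellman_backup(J):
    """One sweep of J^{t+1}(s_i) = max_k [ r_i + gamma * sum_j p^k_ij * J^t(s_j) ]."""
    n = len(J)
    J_next, policy = [], []
    for i in range(n):
        best_a, best_v = max(
            ((a, REWARDS[i] + GAMMA * sum(P[a][i][j] * J[j] for j in range(n)))
             for a in P),
            key=lambda pair: pair[1],
        )
        J_next.append(best_v)
        policy.append(best_a)               # action achieving the max
    return J_next, policy

J = list(REWARDS)                           # J^1(s_i) = r_i
for _ in range(100):                        # enough sweeps to converge here
    J, policy = bellman_backup(J)
print(policy[0])                            # best action for the graduate student: "A"
```

The first backup already reproduces the t = 2 row: the graduate-student value becomes max(167.6, 50.6) ≈ 168, achieved by action A.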