**Unformatted text preview:**

1April 21st, 2002Copyright © 2002, 2004, Andrew W. MooreAndrew W. MooreProfessorSchool of Computer ScienceCarnegie Mellon Universitywww.cs.cmu.edu/[email protected] Systems, Markov Decision Processes, and Dynamic ProgrammingPrediction and Search in Probabilistic WorldsNote to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew’s tutorials: http://www.cs.cmu.edu/~awm/tutorials. Comments and corrections gratefully received. Copyright © 2002, 2004, Andrew W. Moore Markov Systems: Slide 2Discounted RewardsAn assistant professor gets paid, say, 20K per year.How much, in total, will the A.P. earn in their life?20 + 20 + 20 + 20 + 20 + … = InfinityWhat’s wrong with this argument?$ $2Copyright © 2002, 2004, Andrew W. Moore Markov Systems: Slide 3Discounted Rewards“A reward (payment) in the future is not worth quite as much as a reward now.”• Because of chance of obliteration• Because of inflationExample:Being promised $10,000 next year is worth only 90% as much as receiving $10,000 right now.Assuming payment n years in future is worth only (0.9)nof payment now, what is the AP’s Future Discounted Sum of Rewards ?Copyright © 2002, 2004, Andrew W. Moore Markov Systems: Slide 4Discount FactorsPeople in economics and probabilistic decision-making do this all the time.The “Discounted sum of future rewards” using discount factor γ”is(reward now) +γ (reward in 1 time step) +γ2(reward in 2 time steps) +γ3(reward in 3 time steps) +:: (infinite sum)3Copyright © 2002, 2004, Andrew W. Moore Markov Systems: Slide 5The Academic LifeDefine:JA= Expected discounted future rewards starting in state AJB= Expected discounted future rewards starting in state BJT= “ “ “ “ “ “ “ TJS= “ “ “ “ “ “ “ SJD= “ “ “ “ “ “ “ DHow do we compute JA, JB, JT, JS, JD?A.AssistantProf20B.Assoc.Prof60S.On theStreet10D.Dead0T.TenuredProf400Assume Discount Factor γ= 0.90.70.70.60.30.20.20.20.30.60.2Copyright © 2002, 2004, Andrew W. Moore Markov Systems: Slide 6Computing the Future Rewards of an Academic4Copyright © 2002, 2004, Andrew W. Moore Markov Systems: Slide 7A Markov System with Rewards…• Has a set of states {S1S2·· SN}• Has a transition probability matrixP11P12·· P1NP= P21Pij= Prob(Next = Sj| This = Si):PN1·· PNN• Each state has a reward. {r1r2·· rN}• There’s a discount factor γ . 0 < γ < 1On Each Time Step …0. Assume your state is Si1. You get given reward ri2. You randomly move to another stateP(NextState = Sj| This = Si) = Pij3. All future rewards are discounted by γCopyright © 2002, 2004, Andrew W. Moore Markov Systems: Slide 8Solving a Markov SystemWrite J*(Si) = expected discounted sum of future rewards starting in state SiJ*(Si) = ri+ γ x (Expected future rewards starting from your next state)= ri+ γ(Pi1J*(S1)+Pi2J*(S2)+ ··· PiNJ*(SN))Using vector notation writeJ*(S1)r1P11P12·· P1NJ*(S2) r2P21 ·.J= : R=: P= :J*(SN) rNPN1PN2·· PNNQuestion: can you invent a closed form expression for Jin terms of R P and γ ?5Copyright © 2002, 2004, Andrew W. Moore Markov Systems: Slide 9Solving a Markov System with Matrix Inversion• Upside: You get an exact answer• Downside:Copyright © 2002, 2004, Andrew W. Moore Markov Systems: Slide 10Solving a Markov System with Matrix Inversion• Upside: You get an exact answer• Downside: If you have 100,000 states you’re solving a 100,000 by 100,000 system of equations.6Copyright © 2002, 2004, Andrew W. Moore Markov Systems: Slide 11Value Iteration: another way to solve a Markov SystemDefine J1(Si) = Expected discounted sum of rewards over the next 1 time step.J2(Si) = Expected discounted sum rewards during next 2 stepsJ3(Si) = Expected discounted sum rewards during next 3 steps:Jk(Si) = Expected discounted sum rewards during next k stepsJ1(Si) = (what?)J2(Si) = (what?):Jk+1(Si) = (what?)Copyright © 2002, 2004, Andrew W. Moore Markov Systems: Slide 12Value Iteration: another way to solve a Markov SystemDefine J1(Si) = Expected discounted sum of rewards over the next 1 time step.J2(Si) = Expected discounted sum rewards during next 2 stepsJ3(Si) = Expected discounted sum rewards during next 3 steps:Jk(Si) = Expected discounted sum rewards during next k stepsJ1(Si) = ri(what?)J2(Si) = (what?):Jk+1(Si) = (what?)∑=+NjjijisJpr11)(γ∑=+NjjkijisJpr1)(γN = Number of states7Copyright © 2002, 2004, Andrew W. Moore Markov Systems: Slide 13Let’s do Value Iteration54321Jk(HAIL)Jk(WIND)Jk(SUN)kSUN5+4WIND30HAIL.::.:.::-81/21/21/21/21/21/2γ = 0.5Copyright © 2002, 2004, Andrew W. Moore Markov Systems: Slide 14Let’s do Value Iteration-11.11-1.524.885-11-1.444.944-10.75-1.2553-10-152-8041Jk(HAIL)Jk(WIND)Jk(SUN)kSUN5+4WIND30HAIL.::.:.::-81/21/21/21/21/21/2γ = 0.58Copyright © 2002, 2004, Andrew W. Moore Markov Systems: Slide 15Value Iteration for solving Markov Systems• Compute J1(Si) for each j• Compute J2(Si) for each j:• Compute Jk(Si) for each jAs k→∞ Jk(Si)→J*(Si) . Why?When to stop? WhenMax Jk+1(Si) – Jk(Si) < ξiThis is faster than matrix inversion (N3style)ifthe transition matrix is sparseCopyright © 2002, 2004, Andrew W. Moore Markov Systems: Slide 16A Markov Decision Processγ = 0.9Poor &Unknown+0Rich &Unknown+10Rich &Famous+10Poor &Famous+0You run a startup company.In every state you must choose between Saving money or Advertising.SAASAASS1111/21/21/21/21/21/21/21/21/21/29Copyright © 2002, 2004, Andrew W. Moore Markov Systems: Slide 17Markov Decision ProcessesAn MDP has…• A set of states {s1··· SN}• A set of actions {a1··· aM}• A set of rewards {r1··· rN} (one for each state)• A transition probability functionkijP()kijkijaction use I and ThisNextProbP ===On each step:0. Call current state Si1. Receive reward ri2. Choose action ∈ {a1··· aM}3. If you choose action akyou’ll move to state