15-780: Graduate Artificial Intelligence
Markov decision processes (MDPs)

What's missing in HMMs
• HMMs cannot model important aspects of agent interactions:
  - No model for rewards
  - No model for actions, which can affect these rewards
• These are issues faced by many applications:
  - Agents negotiating deals on the web
  - A robot which interacts with its environment

Example: No actions
[Figure: a Markov chain over the states Graduate student (reward 20), Asst. prof (40), Tenured prof. (100), Google (200), On the street (0), and Dead (0), with transition probabilities labeling the edges.]

Formal definition of MDPs
• A set of states {s_1, …, s_n}
• A set of rewards {r_1, …, r_n} (one reward for each state)
• A set of actions {a_1, …, a_m} (the number of actions could be larger than the number of states)
• Transition probabilities
  $P_{i,j,k} = P(q_{t+1} = s_j \mid q_t = s_i,\; a_t = a_k)$

Questions
• What is my expected pay if I am in state i?
• What is my expected pay if I am in state i and perform action a?

Solving MDPs
• No actions: value iteration
• With actions: value iteration, policy iteration

Value computation
• An obvious question for such models is: what is the combined expected value of each state?
• What can we expect to earn over our lifetime if we become an Asst. prof.?
• What if we go to industry?
Before we answer this question, we need to define a model for future rewards:
• The value of a current reward is higher than the value of future rewards
  - Inflation, confidence
  - Example: lottery

Discounted rewards
• The discounted rewards model is specified using a parameter γ
• Total reward = current reward + γ (reward at time t+1) + γ^2 (reward at time t+2) + … + γ^k (reward at time t+k) + … (an infinite sum)
• The sum converges if 0 < γ < 1

Determining the total reward in a state
• Define J*(s_i) = the expected discounted sum of rewards when starting at state s_i
• How do we compute J*(s_i)?
  $J^*(s_i) = r_i + \gamma \left( p_{i,1} J^*(s_1) + p_{i,2} J^*(s_2) + \cdots + p_{i,n} J^*(s_n) \right)$
• How can we solve this?

Computing J*(s_i)
  $J^*(s_1) = r_1 + \gamma \left( p_{1,1} J^*(s_1) + p_{1,2} J^*(s_2) + \cdots + p_{1,n} J^*(s_n) \right)$
  $J^*(s_2) = r_2 + \gamma \left( p_{2,1} J^*(s_1) + p_{2,2} J^*(s_2) + \cdots + p_{2,n} J^*(s_n) \right)$
  ⋮
  $J^*(s_n) = r_n + \gamma \left( p_{n,1} J^*(s_1) + p_{n,2} J^*(s_2) + \cdots + p_{n,n} J^*(s_n) \right)$
• We have n equations with n unknowns
• They can be solved in closed form

Iterative approaches
• Solving in closed form is possible, but may be time consuming.
• It also doesn't generalize to non-linear models.
• Alternatively, this problem can be solved in an iterative manner.
• Let's define J^t(s_i) as the expected discounted reward after t steps.
• How can we compute J^t(s_i)?
  $J^1(s_i) = r_i$
  $J^2(s_i) = r_i + \gamma \sum_k p_{i,k} J^1(s_k)$
  $J^{t+1}(s_i) = r_i + \gamma \sum_k p_{i,k} J^t(s_k)$
• We know how to solve this! Let's fill the dynamic programming table.

But wait …
• This is a never-ending task! When do we stop?
• Remember, we have a converging function.
• We can stop when $\max_i |J^t(s_i) - J^{t-1}(s_i)| < \epsilon$ (the infinity norm picks out the maximal element).
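Both solution methods can be made concrete with a short numerical sketch. Everything below other than the rewards (20, 40, 200, 0) and γ = 0.9 is an assumption: the state names and the transition matrix P are my reading of the lecture's example chain, chosen so that the iterates match the table on the next slide. The script solves the linear system (I − γP) J* = r in closed form and also runs the iterative update with the infinity-norm stopping rule.

```python
import numpy as np

# Reduced example chain (no actions). Rewards are per state; the transition
# probabilities are an assumption read off the lecture figure.
states = ["Grad student", "Asst. prof", "Google", "Dead"]
r = np.array([20.0, 40.0, 200.0, 0.0])
gamma = 0.9

# P[i, j] = probability of moving from state i to state j.
P = np.array([
    [0.6, 0.2, 0.2, 0.0],   # Grad student: stay, -> Asst. prof, -> Google
    [0.0, 0.8, 0.1, 0.1],   # Asst. prof:   stay, -> Google, -> Dead
    [0.0, 0.0, 0.9, 0.1],   # Google:       stay, -> Dead
    [0.0, 0.0, 0.0, 1.0],   # Dead: absorbing
])

# Closed form: J* = r + gamma * P J*  =>  (I - gamma * P) J* = r.
J_closed = np.linalg.solve(np.eye(len(r)) - gamma * P, r)

# Value iteration: J^1 = r, J^{t+1} = r + gamma * P J^t,
# stopping once the largest per-state change drops below epsilon.
J, epsilon = r.copy(), 1e-6
for t in range(2, 10_000):
    J_next = r + gamma * (P @ J)
    if np.max(np.abs(J_next - J)) < epsilon:
        J = J_next
        break
    J = J_next

print("closed form:", np.round(J_closed, 1))
print("iterative:  ", np.round(J, 1), f"(after {t} updates)")
```

With these assumed probabilities the early iterates come out as in the table below (74, 87, 362 at t = 2, and so on), and the two methods agree on the limit.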
Example for γ = 0.9
[Figure: the reduced chain with states Graduate student (reward 20), Asst. prof (40), Google (200), and Dead (0).]

  t   J^t(Gr)   J^t(P)   J^t(Goo)   J^t(D)
  1     20        40       200        0
  2     74        87       362        0
  3    141       135       493        0
  4    209       182       600        0

Solving MDPs
• No actions: value iteration ✓
• With actions: value iteration, policy iteration

Adding actions
A Markov decision process:
• A set of states {s_1, …, s_n}
• A set of rewards {r_1, …, r_n}
• A set of actions {a_1, …, a_m}
• Transition probabilities
  $P_{i,j,k} = P(q_{t+1} = s_j \mid q_t = s_i,\; a_t = a_k)$

Example: Actions
[Figure: the chain with states Graduate student (20), Asst. prof (40), Tenured prof. (100), Google (200), On the street (0), and Dead (0). From the Graduate-student state there are two actions with different transition probabilities: Action A = leave to Google, Action B = stay in academia.]
So should you leave now (right after class), or should you stay in the PhD program?

Questions for MDPs
• Now we have actions.
• The question changes to the following: given our current state and the possible actions, what is the best action for us in terms of long-term payoff?

Policy
• A policy maps states to actions.
• An optimal policy leads to the highest expected returns.
• Note that this does not depend on the start state.

  State       Action
  Gr          B
  Go          A
  Asst. Pr.   A
  Ten. Pr.    B

Solving MDPs with actions
• It can be shown that for every MDP there exists an optimal policy (we won't discuss the proof).
• Such a policy guarantees that there is no other action that is expected to yield a higher payoff.

Computing the optimal policy: 1. Modified value iteration
• We can compute it by modifying the value iteration method we discussed.
• Define $p^k_{i,j}$ as the probability of transitioning from state i to state j when using action k.
• Then we compute
  $J^{t+1}(s_i) = \max_k \left( r_i + \gamma \sum_j p^k_{i,j} J^t(s_j) \right)$
  (also known as Bellman's equation).
• Run until convergence.
• When the algorithm converges, we have computed the best outcome for each state.
• We associate each state with the action that maximizes its return.

Value iteration for γ = 0.9
[Figure: the reduced chain with states Graduate student (20), Asst. prof (40), Google (200), and Dead (0). From the Graduate-student state, Action A leads to Google with probability 0.8 (otherwise stay), and Action B leads to Asst. prof with probability 0.7 (otherwise stay).]

  t   J^t(Gr)               J^t(P)   J^t(Goo)   J^t(D)
  1     20                    40       200        0
  2    168 (A) /  51 (B)      87       362        0
  3    311 (A) / 120 (B)     135       493        0
  4    431 (A) / 189 (B)     182       600        0

Computing the optimal policy: 2. Policy iteration
• We can also compute optimal policies by revising an existing policy.
• We initially select a policy at random (a mapping from states to actions).
• We re-compute the expected long-term reward at each state using the selected policy.
• We select a new policy using the expected rewards: in each state, pick the action with the highest expected return, and repeat until the policy no longer changes.
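Both procedures can be sketched on the same example chain. As before, the state and action names and the transition probabilities are assumptions read off the figure (they reproduce the 168-vs-51 split at t = 2 in the table above); only the rewards and γ = 0.9 come directly from the slides. The first routine is the modified value iteration (Bellman update with a max over actions); the second is policy iteration, alternating exact policy evaluation with greedy improvement.

```python
import numpy as np

# Reduced example chain with a choice of actions at the Graduate-student state.
states  = ["Grad student", "Asst. prof", "Google", "Dead"]
actions = ["A (leave to Google)", "B (stay in academia)"]
r       = np.array([20.0, 40.0, 200.0, 0.0])
gamma   = 0.9

# P[k, i, j] = probability of moving from state i to state j under action k.
# Only the Graduate-student row differs between the two actions (assumed numbers).
P = np.zeros((2, 4, 4))
P[:, 1] = [0.0, 0.8, 0.1, 0.1]      # Asst. prof (same under both actions)
P[:, 2] = [0.0, 0.0, 0.9, 0.1]      # Google
P[:, 3] = [0.0, 0.0, 0.0, 1.0]      # Dead (absorbing)
P[0, 0] = [0.2, 0.0, 0.8, 0.0]      # Grad student, action A: mostly leave to Google
P[1, 0] = [0.3, 0.7, 0.0, 0.0]      # Grad student, action B: stay in academia


def value_iteration(eps=1e-6):
    """J^{t+1}(s_i) = max_k [ r_i + gamma * sum_j p^k_ij J^t(s_j) ], run to convergence."""
    J = r.copy()
    while True:
        Q = r + gamma * (P @ J)                 # Q[k, i]: value of taking action k in state i
        J_next = Q.max(axis=0)
        if np.max(np.abs(J_next - J)) < eps:
            return J_next, Q.argmax(axis=0)     # values + greedy action per state
        J = J_next


def policy_iteration():
    """Evaluate the current policy exactly, improve it greedily, repeat until stable."""
    policy = np.zeros(len(states), dtype=int)   # arbitrary initial policy: action A everywhere
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) J = r for the chosen actions.
        P_pi = P[policy, np.arange(len(states))]
        J = np.linalg.solve(np.eye(len(states)) - gamma * P_pi, r)
        # Policy improvement: in each state, pick the action with the highest expected return.
        new_policy = (r + gamma * (P @ J)).argmax(axis=0)
        if np.array_equal(new_policy, policy):
            return J, policy
        policy = new_policy


J_vi, pi_vi = value_iteration()
J_pi, pi_pi = policy_iteration()
assert np.allclose(J_vi, J_pi, atol=1e-3)       # both methods agree on the values
for i, name in enumerate(states):
    print(f"{name:12s}  J* = {J_pi[i]:7.1f}   best action: {actions[pi_pi[i]]}")
```

The two routines agree on the values and on the greedy action in each state; policy iteration typically needs far fewer (but more expensive) sweeps, since each evaluation step solves the linear system exactly.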

