UT Dallas CS 6375 - mdp

Machine Learning
CS6375 --- Spring 2015
Markov Decision Processes
Acknowledgment: slides based on Andrew Moore's MDP tutorial

Rewards

An assistant professor gets paid, say, 20K per year.
How much, in total, will the A.P. earn in their life?
20 + 20 + 20 + 20 + 20 + ... = infinity
What's wrong with this argument?

Discounted Rewards

"A reward (payment) in the future is not worth quite as much as a reward now."
• Because of the chance of obliteration
• Because of inflation
Example: being promised $10,000 next year is worth only 90% as much as receiving $10,000 right now.
Assuming a payment n years in the future is worth only (0.9)^n of a payment now, what is the A.P.'s future discounted sum of rewards?

Discount Factors

People in economics and probabilistic decision making do this all the time.
The discounted sum of future rewards using discount factor γ is
(reward now) + γ (reward in 1 time step) + γ^2 (reward in 2 time steps) + γ^3 (reward in 3 time steps) + ... (an infinite sum)

Discounting

• Always converges if γ < 1 and the reward function R(·) is bounded
• γ close to 0 → instant gratification, don't pay attention to future reward
• γ close to 1 → extremely conservative, consider profits/losses no matter how far in the future
• The resulting model is the discounted reward

The Academic Life

Define:
J_A = expected discounted future rewards starting in state A
J_B = expected discounted future rewards starting in state B
J_T = expected discounted future rewards starting in state T
J_S = expected discounted future rewards starting in state S
J_D = expected discounted future rewards starting in state D
How do we compute J_A, J_B, J_T, J_S, J_D?

A Markov System with Rewards

• Has a set of states {S_1, S_2, ..., S_N}
• Has an N x N transition probability matrix P, with P_ij = Prob(NextState = S_j | ThisState = S_i)
• Each state has a reward: {r_1, r_2, ..., r_N}
• There's a discount factor γ, with 0 < γ < 1
On each time step:
0. Assume your state is S_i
1. You receive reward r_i
2. You randomly move to another state, with P(NextState = S_j | ThisState = S_i) = P_ij
3. All future rewards are discounted by γ

Solving a Markov System

Write J*(S_i) = expected discounted sum of future rewards starting in state S_i. Then
J*(S_i) = r_i + γ (expected future rewards starting from your next state)
        = r_i + γ (P_i1 J*(S_1) + P_i2 J*(S_2) + ... + P_iN J*(S_N))
Using vector notation, write J = (J*(S_1), ..., J*(S_N))^T, R = (r_1, ..., r_N)^T, and P for the N x N transition matrix.
Question: can you get a closed-form expression for J in terms of R, P and γ?

Solving a Markov System with Matrix Inversion

• Upside: you get an exact answer.
• Downside: if you have 100,000 states, you're solving a 100,000-by-100,000 system of equations.
J* = R + γ P J*, so J* = (I - γP)^(-1) R

Value Iteration: Another Way to Solve a Markov System

Define
J^1(S_i) = expected discounted sum of rewards over the next 1 time step
J^2(S_i) = expected discounted sum of rewards during the next 2 steps
J^3(S_i) = expected discounted sum of rewards during the next 3 steps
...
J^k(S_i) = expected discounted sum of rewards during the next k steps
These satisfy (with N = number of states):
J^1(S_i) = r_i
J^2(S_i) = r_i + γ Σ_{j=1..N} P_ij J^1(S_j)
...
J^{k+1}(S_i) = r_i + γ Σ_{j=1..N} P_ij J^k(S_j)

Let's Do Value Iteration / Value Iteration Example
(worked numerical examples on the slides)

Value Iteration for Solving Markov Systems

• Compute J^1(S_i) for each i
• Compute J^2(S_i) for each i
...
• Compute J^k(S_i) for each i
As k → ∞, J^k(S_i) → J*(S_i). When to stop? When
max_i |J^{k+1}(S_i) - J^k(S_i)| < ξ
This is faster than matrix inversion if the transition matrix is sparse.

Different Types of Markov Models

State                 No control          Control
Fully observable      Markov systems      MDPs
Hidden                HMMs                POMDPs
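A minimal sketch of the two solution methods just described, using NumPy: the exact solution of J* = R + γ P J* obtained by solving the linear system, and value iteration with the stopping test max_i |J^{k+1}(S_i) - J^k(S_i)| < ξ. The 3-state transition matrix and rewards below are invented for illustration; they are not the Academic Life example, whose numbers are not given in the text.

```python
# Sketch: solving a Markov system with rewards two ways,
#   (1) matrix inversion:  J* = (I - gamma * P)^-1 R
#   (2) value iteration:   J_{k+1} = R + gamma * P J_k
# The 3-state chain below is made up for illustration.
import numpy as np

gamma = 0.9                               # discount factor, 0 < gamma < 1
P = np.array([[0.5, 0.5, 0.0],            # P[i, j] = Prob(next = S_j | this = S_i)
              [0.2, 0.5, 0.3],
              [0.0, 0.3, 0.7]])
R = np.array([20.0, 0.0, -5.0])           # one reward per state

# (1) Exact answer by solving (I - gamma P) J = R; expensive for ~100,000 states.
J_exact = np.linalg.solve(np.eye(3) - gamma * P, R)

# (2) Value iteration: repeat J <- R + gamma * P J until the largest
#     change across states drops below the tolerance xi.
J = np.zeros(3)
xi = 1e-8
while True:
    J_next = R + gamma * P @ J
    if np.max(np.abs(J_next - J)) < xi:
        break
    J = J_next

print(J_exact)    # the two answers agree up to the tolerance
print(J_next)
```

Both methods give the same values up to the tolerance; as the slides note, value iteration tends to win when the state space is large and the transition matrix is sparse.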
Markov Decision Processes: An Example

• I run a startup company.
• I can choose to either save money or spend money on advertising.
• If I advertise, I may become famous (50% prob.), but I will spend money, so I may become poor.
• If I save money, I may become rich (50% prob.), but I may also become unknown because I don't advertise.
• What should I do?

A Markov Decision Process

You run a startup company. In every state you must choose between saving money or advertising.

Markov Decision Processes

An MDP has...
• A set of states {S_1, ..., S_N}
• A set of actions {a_1, ..., a_M}
• A set of rewards {r_1, ..., r_N} (one for each state)
• A transition probability function P^k_ij = Prob(NextState = S_j | ThisState = S_i, action = a_k)
On each step:
0. Call the current state S_i
1. Receive reward r_i
2. Choose an action from {a_1, ..., a_M}
3. If you choose action a_k, you'll move to state S_j with probability P^k_ij
4. All future rewards are discounted by γ

What We Are Looking For: a Policy

A policy is a mapping from states to actions.
• How many policies are there?
• Which one is the best policy, and how do we compute it?

Optimal Decision Policy

• Policy π = a mapping from states to actions, π(s) = a: which action should I take in each state?
• Intuition: π encodes the best action we can take from any state to maximize future rewards.
• Optimal policy = the policy π* that maximizes the expected utility J(s) of the sequence of states generated by π*, starting at s.

Example: Finance and Business

• States: the status of the company (cash reserves, inventory, etc.)
• Actions: business decisions (advertise, acquire other companies, roll out a product, etc.)
• Uncertainty comes from all the external, uncontrollable factors (the economy, shortages, consumer confidence)
• Optimal policy: the policy for making business decisions that maximizes the expected future profits

Example: Robotics

• States are 2-D positions
• Actions are commanded motions (turn by x degrees, move y meters)
• Uncertainty comes from the fact that the mechanism is not perfect (slippage, etc.) and does not execute the commands exactly
• Reward for avoiding forbidden regions (for example)
• Optimal policy: the policy that minimizes the cost to the goal

Key Results

• For every MDP, there exists an optimal policy.
• There is no better option (in terms of expected sum of rewards) than to follow this policy.
• How do we compute the optimal policy?

Computing the Optimal Policy

Idea one:
– Run through all possible policies.
– Select the best.
Note: for a given policy, the process is just a Markov system, so we can calculate its expected discounted reward.
Not tractable: with M actions and N states there are M^N deterministic policies.

Computing the Optimal Value Function with Value Iteration

Define J*(S_i) = expected discounted future rewards, starting from state S_i, assuming we use the optimal policy.
Define J^n(S_i) = maximum ...
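The preview stops at the definition of J^n(S_i). As a hedged sketch of the standard formulation (assumed here; the remaining slides are not shown), value iteration for an MDP replaces the Markov-system backup with a maximization over actions, J^{n+1}(S_i) = max_k [ r_i + γ Σ_j P^k_ij J^n(S_j) ], and the optimal policy is read off as the maximizing action in each state. The 3-state, 2-action numbers below are invented for illustration and are not the startup (save/advertise) example.

```python
# Hedged sketch: value iteration for an MDP with per-action transition
# matrices, using the standard Bellman backup
#   J_{n+1}(S_i) = max_k [ r_i + gamma * sum_j P[k, i, j] * J_n(S_j) ],
# followed by greedy extraction of the policy pi*(S_i).
# All numbers below are invented for illustration.
import numpy as np

gamma = 0.9
# P[k, i, j] = Prob(next = S_j | current = S_i, action = a_k)
P = np.array([
    [[0.9, 0.1, 0.0],        # action a_0
     [0.1, 0.8, 0.1],
     [0.0, 0.2, 0.8]],
    [[0.5, 0.5, 0.0],        # action a_1
     [0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0]],
])
R = np.array([0.0, 1.0, 10.0])   # one reward per state, as in the slides

J = np.zeros(3)
xi = 1e-8
while True:
    # Q[k, i] = value of taking action a_k in state S_i, then acting optimally
    Q = R + gamma * (P @ J)          # shape (num_actions, num_states)
    J_next = Q.max(axis=0)           # maximize over actions
    if np.max(np.abs(J_next - J)) < xi:
        break
    J = J_next

policy = Q.argmax(axis=0)            # pi*(S_i) = index of the best action per state
print(J_next, policy)
```

For a fixed policy the max over actions disappears and this reduces to the Markov-system value iteration sketched earlier.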

