Markov Decision Processes
Hal Daumé III, Computer Science, University of Maryland ([email protected])
CS 421: Introduction to Artificial Intelligence, 21 Feb 2012
Many slides courtesy of Dan Klein, Stuart Russell, or Andrew Moore

Announcements
➢ Mid-course corrections: http://u.hal3.name/ic.pl?q=midcourse

Reinforcement Learning
➢ Basic idea:
  ➢ Receive feedback in the form of rewards
  ➢ The agent's utility is defined by the reward function
  ➢ Must learn to act so as to maximize expected rewards
  ➢ Change the rewards, change the learned behavior
➢ Examples:
  ➢ Playing a game: reward at the end for winning / losing
  ➢ Vacuuming a house: reward for each piece of dirt picked up
  ➢ Automated taxi: reward for each passenger delivered

Human Reinforcement Learning

Markov Decision Processes
➢ An MDP is defined by:
  ➢ A set of states s ∈ S
  ➢ A set of actions a ∈ A
  ➢ A transition function T(s, a, s')
    ➢ The probability that a from s leads to s', i.e., P(s' | s, a)
    ➢ Also called the model
  ➢ A reward function R(s, a, s')
    ➢ Sometimes just R(s) or R(s')
  ➢ A start state (or distribution)
  ➢ Maybe a terminal state
➢ MDPs are a family of non-deterministic search problems
➢ Reinforcement learning: MDPs where we don't know the transition or reward functions

Map 0: Would you go across the top?
➢ Start in the top-right; +$1 for the top-left square, -$1 for red squares
➢ Costs N cents per step
➢ For what value of N would you risk the "high road"?
➢ Write something between 1 cent and 12 cents: u.hal3.name/ic.pl?q=map

Map 1: Would you go across the top?
➢ Same question on a second map: start in the top-right; +$1 for the top-left square, -$1 for red squares
➢ Costs N cents per step
➢ For what value of N would you risk the "high road"?
➢ Write something between 1 cent and 12 cents: u.hal3.name/ic.pl?q=map

Solving MDPs
➢ In a deterministic single-agent search problem, we want an optimal plan, or sequence of actions, from the start to a goal
➢ In an MDP, we want an optimal policy π(s)
  ➢ A policy gives an action for each state
  ➢ An optimal policy maximizes expected utility if followed
  ➢ It defines a reflex agent
[Figure: optimal grid-world policy when R(s, a, s') = -0.04 for all non-terminal states s]

Example Optimal Policies
[Figure: four grid-world policies as the per-step reward varies: R(s) = -2.0, R(s) = -0.4, R(s) = -0.03, R(s) = -0.01]

Example: High-Low
➢ Three card types: 2, 3, 4
➢ Infinite deck, twice as many 2's
➢ Start with 3 showing
➢ After each card, you say "high" or "low"
➢ A new card is flipped
  ➢ If you're right, you win the points shown on the new card
  ➢ Ties are no-ops
  ➢ If you're wrong, the game ends
➢ Differences from expectimax:
  ➢ #1: you get rewards as you go
  ➢ #2: you might play forever!

High-Low
➢ States: 2, 3, 4, done
➢ Actions: High, Low
➢ Model T(s, a, s'), e.g. from state 4:
  ➢ P(s' = done | 4, High) = 3/4
  ➢ P(s' = 2 | 4, High) = 0
  ➢ P(s' = 3 | 4, High) = 0
  ➢ P(s' = 4 | 4, High) = 1/4
  ➢ P(s' = done | 4, Low) = 0
  ➢ P(s' = 2 | 4, Low) = 1/2
  ➢ P(s' = 3 | 4, Low) = 1/4
  ➢ P(s' = 4 | 4, Low) = 1/4
  ➢ …
➢ Rewards R(s, a, s'):
  ➢ The number shown on s' if s ≠ s'
  ➢ 0 otherwise
➢ Start: 3
➢ Note: we could also choose actions with search. How? (A concrete model is sketched below.)
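Since the model is fully specified on the slide, it can be written down directly. Below is a minimal Python sketch of the High-Low transition and reward functions; the names (CARD_PROBS, STATES, ACTIONS, T, R) are my own choices, not from the lecture, and only the card distribution and the rules come from the slides.

```python
# Sketch of the High-Low MDP from the slides. Names are illustrative.

CARD_PROBS = {2: 0.5, 3: 0.25, 4: 0.25}   # twice as many 2's in the deck

STATES = [2, 3, 4, 'done']
ACTIONS = ['High', 'Low']

def T(s, a, s_next):
    """Transition model P(s' | s, a)."""
    if s == 'done':                        # absorbing state
        return 1.0 if s_next == 'done' else 0.0
    p = 0.0
    for card, prob in CARD_PROBS.items():
        if card == s:                      # tie: no-op, same card stays showing
            outcome = card
        elif (card > s) == (a == 'High'):  # correct guess: new card is shown
            outcome = card
        else:                              # wrong guess: game ends
            outcome = 'done'
        if outcome == s_next:
            p += prob
    return p

def R(s, a, s_next):
    """Reward: points on the new card if it differs from the old one, else 0."""
    if s == 'done' or s_next == 'done' or s_next == s:
        return 0.0
    return float(s_next)

# Sanity check against the slide's numbers for state 4:
assert T(4, 'High', 'done') == 0.75 and T(4, 'High', 4) == 0.25
assert T(4, 'Low', 2) == 0.5 and T(4, 'Low', 3) == 0.25 and T(4, 'Low', 4) == 0.25
```

The asserts reproduce the eight transition probabilities listed on the slide, so the two functions can stand in for the model in the algorithms that follow.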
Example: High-Low
[Figure: expectimax-style search tree from state 3: a max node choosing High or Low, then chance nodes over the new card, with outcomes labeled by transition probability and reward, e.g. T = 0.5, R = 2; T = 0.25, R = 3; T = 0, R = 4; T = 0.25, R = 0]

MDP Search Trees
➢ Each MDP state gives an expectimax-like search tree
  ➢ s is a state
  ➢ (s, a) is a q-state
  ➢ (s, a, s') is called a transition
  ➢ T(s, a, s') = P(s' | s, a)
  ➢ R(s, a, s') is the reward on that transition

Utilities of Sequences
➢ To formalize the optimality of a policy, we need to understand utilities of sequences of rewards
➢ Typically we consider stationary preferences:
  [r, r0, r1, r2, …] ≻ [r, r0', r1', r2', …]  ⇔  [r0, r1, r2, …] ≻ [r0', r1', r2', …]
➢ Theorem: there are only two ways to define stationary utilities
  ➢ Additive utility: U([r0, r1, r2, …]) = r0 + r1 + r2 + ⋯
  ➢ Discounted utility: U([r0, r1, r2, …]) = r0 + γ r1 + γ² r2 + ⋯
➢ (Assuming the reward depends only on the state for these slides!)

Infinite Utilities?!
➢ Problem: infinite sequences with infinite rewards
➢ Solutions:
  ➢ Finite horizon:
    ➢ Terminate episodes after a fixed T steps
    ➢ Gives a nonstationary policy (π depends on the time left)
  ➢ Absorbing state(s): guarantee that under every policy the agent will eventually "die" (like "done" for High-Low)
  ➢ Discounting: for 0 < γ < 1,
    U([r0, r1, r2, …]) = Σ_t γ^t r_t ≤ R_max / (1 − γ)
  ➢ A smaller γ means a smaller "horizon" – shorter-term focus

Discounting
➢ Typically we discount rewards by γ < 1 each time step
  ➢ Sooner rewards have higher utility than later rewards
  ➢ Discounting also helps the algorithms converge

Optimal Utilities
➢ Fundamental operation: compute the optimal utilities of all states s at once
➢ Why? Optimal values define optimal policies!
➢ Define the utility of a state s:
  V*(s) = the expected return starting in s and acting optimally
➢ Define the utility of a q-state (s, a):
  Q*(s, a) = the expected return starting in s, taking action a, and thereafter acting optimally
➢ Define the optimal policy:
  π*(s) = the optimal action from state s

The Bellman Equations
➢ The definition of utility leads to a simple one-step-lookahead relationship among optimal utility values: optimal rewards = maximize over the first action, then follow the optimal policy
➢ Formally:
  V*(s) = max_a Q*(s, a)
  Q*(s, a) = Σ_s' T(s, a, s') [ R(s, a, s') + γ V*(s') ]
  so V*(s) = max_a Σ_s' T(s, a, s') [ R(s, a, s') + γ V*(s') ]

Solving MDPs
➢ We want to find the optimal policy π*
➢ Proposal 1: modified expectimax search, starting from each state s and computing V*(s) with the recursive Bellman lookahead above

Why Not Search Trees?
➢ Why not solve with expectimax?
➢ Problems:
  ➢ The tree is usually infinite (why?)
  ➢ The same states appear over and over (why?)
  ➢ We would search once per state (why?)
➢ Idea: value iteration (sketched below)
  ➢ Compute optimal values for all states all at once using successive approximations
  ➢ A bottom-up dynamic program, similar in cost to memoization
  ➢ Do all planning offline; no replanning needed!

Value Estimates
➢ Calculate estimates …
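Value iteration is only named above, so here is a sketch of the successive-approximation idea, written against the T(s, a, s') / R(s, a, s') interface used throughout these slides. The function names value_iteration and extract_policy are my own, not from the lecture.

```python
# Sketch of value iteration: repeatedly apply the Bellman update
#   V(s) <- max_a sum_{s'} T(s,a,s') [ R(s,a,s') + gamma V(s') ]
# until the values stop changing. Names are illustrative.

def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    """Compute approximately optimal values V*(s) for all states at once."""
    V = {s: 0.0 for s in states}                      # V_0(s) = 0
    while True:
        V_new = {}
        for s in states:
            # Q-value of each action under the current estimate V
            qs = [sum(T(s, a, s2) * (R(s, a, s2) + gamma * V[s2])
                      for s2 in states)
                  for a in actions]
            V_new[s] = max(qs)
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new

def extract_policy(states, actions, T, R, V, gamma=0.9):
    """pi*(s) = argmax_a Q*(s, a), via one-step lookahead on V."""
    def q(s, a):
        return sum(T(s, a, s2) * (R(s, a, s2) + gamma * V[s2]) for s2 in states)
    return {s: max(actions, key=lambda a: q(s, a)) for s in states}
```

With 0 < γ < 1 the Bellman update is a contraction, so the estimates converge; on the High-Low model sketched earlier, value_iteration(STATES, ACTIONS, T, R) returns approximate optimal values and extract_policy reads off the greedy policy.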
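One standard way to make such value estimates concrete, and exactly what value iteration computes along the way, is the k-step estimate V_k(s): the best expected discounted reward achievable from s within k more steps, with V_k approaching V* as k grows. A minimal self-contained sketch; the name k_step_values is my own.

```python
# Sketch of k-step value estimates: V_0(s) = 0, and each pass of the
# Bellman update turns V_k into V_{k+1}. Names are illustrative.

def k_step_values(states, actions, T, R, gamma, k):
    V = {s: 0.0 for s in states}                      # V_0(s) = 0
    for _ in range(k):
        V = {s: max(sum(T(s, a, s2) * (R(s, a, s2) + gamma * V[s2])
                        for s2 in states)
                    for a in actions)
             for s in states}
    return V

# e.g., with the High-Low model above, watch the estimates grow toward V*:
# for k in range(6):
#     print(k, k_step_values(STATES, ACTIONS, T, R, 0.9, k))
```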