Markov Decision Processes
Hal Daumé III, Computer Science, University of Maryland ([email protected])
CS 421: Introduction to Artificial Intelligence, 21 Feb 2012
Many slides courtesy of Dan Klein, Stuart Russell, or Andrew Moore

Announcements
➢ Mid-course corrections: http://u.hal3.name/ic.pl?q=midcourse

Reinforcement Learning
➢ Basic idea:
  ➢ Receive feedback in the form of rewards
  ➢ The agent's utility is defined by the reward function
  ➢ Must learn to act so as to maximize expected rewards
  ➢ Change the rewards, change the learned behavior
➢ Examples:
  ➢ Playing a game: reward at the end for winning / losing
  ➢ Vacuuming a house: reward for each piece of dirt picked up
  ➢ Automated taxi: reward for each passenger delivered

Human Reinforcement Learning

Markov Decision Processes
➢ An MDP is defined by:
  ➢ A set of states s ∈ S
  ➢ A set of actions a ∈ A
  ➢ A transition function T(s, a, s')
    ➢ The probability that a from s leads to s', i.e., P(s' | s, a)
    ➢ Also called the model
  ➢ A reward function R(s, a, s')
    ➢ Sometimes just R(s) or R(s')
  ➢ A start state (or distribution)
  ➢ Maybe a terminal state
➢ MDPs are a family of non-deterministic search problems
➢ Reinforcement learning: MDPs where we don't know the transition or reward functions

Map 0: Would you go across the top?
➢ Start in the top-right; +$1 for the top-left square, -$1 for red squares
➢ Costs N cents per step
➢ For what value of N would you risk the "high road"?
➢ Write something between 1 cent and 12 cents: u.hal3.name/ic.pl?q=map

Map 1: Would you go across the top?
➢ Same question on a second map: start in the top-right; +$1 for the top-left square, -$1 for red squares
➢ Costs N cents per step
➢ For what value of N would you risk the "high road"?
➢ Write something between 1 cent and 12 cents: u.hal3.name/ic.pl?q=map

Solving MDPs
➢ In a deterministic single-agent search problem, we want an optimal plan, or sequence of actions, from the start to a goal
➢ In an MDP, we want an optimal policy π(s)
  ➢ A policy gives an action for each state
  ➢ An optimal policy maximizes expected utility if followed
  ➢ It defines a reflex agent
[Figure: optimal grid-world policy when R(s, a, s') = -0.04 for all non-terminal states s]

Example Optimal Policies
[Figure: four grid-world policies as the per-step reward varies: R(s) = -2.0, R(s) = -0.4, R(s) = -0.03, R(s) = -0.01]

Example: High-Low
➢ Three card types: 2, 3, 4
➢ Infinite deck, twice as many 2's
➢ Start with 3 showing
➢ After each card, you say "high" or "low"
➢ A new card is flipped
  ➢ If you're right, you win the points shown on the new card
  ➢ Ties are no-ops
  ➢ If you're wrong, the game ends
➢ Differences from expectimax:
  ➢ #1: you get rewards as you go
  ➢ #2: you might play forever!

High-Low
➢ States: 2, 3, 4, done
➢ Actions: High, Low
➢ Model T(s, a, s'), e.g. from state 4:
  ➢ P(s' = done | 4, High) = 3/4
  ➢ P(s' = 2 | 4, High) = 0
  ➢ P(s' = 3 | 4, High) = 0
  ➢ P(s' = 4 | 4, High) = 1/4
  ➢ P(s' = done | 4, Low) = 0
  ➢ P(s' = 2 | 4, Low) = 1/2
  ➢ P(s' = 3 | 4, Low) = 1/4
  ➢ P(s' = 4 | 4, Low) = 1/4
  ➢ …
➢ Rewards R(s, a, s'):
  ➢ The number shown on s' if s ≠ s'
  ➢ 0 otherwise
➢ Start: 3
➢ Note: we could also choose actions with search. How? (A concrete model is sketched below.)
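Since the model is fully specified on the slide, it can be written down directly. Below is a minimal Python sketch of the High-Low transition and reward functions; the names (CARD_PROBS, STATES, ACTIONS, T, R) are my own choices, not from the lecture, and only the card distribution and the rules come from the slides.

```python
# Sketch of the High-Low MDP from the slides. Names are illustrative.

CARD_PROBS = {2: 0.5, 3: 0.25, 4: 0.25}   # twice as many 2's in the deck

STATES = [2, 3, 4, 'done']
ACTIONS = ['High', 'Low']

def T(s, a, s_next):
    """Transition model P(s' | s, a)."""
    if s == 'done':                        # absorbing state
        return 1.0 if s_next == 'done' else 0.0
    p = 0.0
    for card, prob in CARD_PROBS.items():
        if card == s:                      # tie: no-op, same card stays showing
            outcome = card
        elif (card > s) == (a == 'High'):  # correct guess: new card is shown
            outcome = card
        else:                              # wrong guess: game ends
            outcome = 'done'
        if outcome == s_next:
            p += prob
    return p

def R(s, a, s_next):
    """Reward: points on the new card if it differs from the old one, else 0."""
    if s == 'done' or s_next == 'done' or s_next == s:
        return 0.0
    return float(s_next)

# Sanity check against the slide's numbers for state 4:
assert T(4, 'High', 'done') == 0.75 and T(4, 'High', 4) == 0.25
assert T(4, 'Low', 2) == 0.5 and T(4, 'Low', 3) == 0.25 and T(4, 'Low', 4) == 0.25
```

The asserts reproduce the eight transition probabilities listed on the slide, so the two functions can stand in for the model in the algorithms that follow.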
Example: High-Low
[Figure: expectimax-style search tree from state 3: a max node choosing High or Low, then chance nodes over the new card, with outcomes labeled by transition probability and reward, e.g. T = 0.5, R = 2; T = 0.25, R = 3; T = 0, R = 4; T = 0.25, R = 0]

MDP Search Trees
➢ Each MDP state gives an expectimax-like search tree
  ➢ s is a state
  ➢ (s, a) is a q-state
  ➢ (s, a, s') is called a transition
  ➢ T(s, a, s') = P(s' | s, a)
  ➢ R(s, a, s') is the reward on that transition

Utilities of Sequences
➢ To formalize the optimality of a policy, we need to understand utilities of sequences of rewards
➢ Typically we consider stationary preferences:
  [r, r0, r1, r2, …] ≻ [r, r0', r1', r2', …]  ⇔  [r0, r1, r2, …] ≻ [r0', r1', r2', …]
➢ Theorem: there are only two ways to define stationary utilities
  ➢ Additive utility: U([r0, r1, r2, …]) = r0 + r1 + r2 + ⋯
  ➢ Discounted utility: U([r0, r1, r2, …]) = r0 + γ r1 + γ² r2 + ⋯
➢ (Assuming the reward depends only on the state for these slides!)

Infinite Utilities?!
➢ Problem: infinite sequences with infinite rewards
➢ Solutions:
  ➢ Finite horizon:
    ➢ Terminate episodes after a fixed T steps
    ➢ Gives a nonstationary policy (π depends on the time left)
  ➢ Absorbing state(s): guarantee that under every policy the agent will eventually "die" (like "done" for High-Low)
  ➢ Discounting: for 0 < γ < 1,
    U([r0, r1, r2, …]) = Σ_t γ^t r_t ≤ R_max / (1 − γ)
  ➢ A smaller γ means a smaller "horizon" – shorter-term focus

Discounting
➢ Typically we discount rewards by γ < 1 each time step
  ➢ Sooner rewards have higher utility than later rewards
  ➢ Discounting also helps the algorithms converge

Optimal Utilities
➢ Fundamental operation: compute the optimal utilities of all states s at once
➢ Why? Optimal values define optimal policies!
➢ Define the utility of a state s:
  V*(s) = the expected return starting in s and acting optimally
➢ Define the utility of a q-state (s, a):
  Q*(s, a) = the expected return starting in s, taking action a, and thereafter acting optimally
➢ Define the optimal policy:
  π*(s) = the optimal action from state s

The Bellman Equations
➢ The definition of utility leads to a simple one-step-lookahead relationship among optimal utility values: optimal rewards = maximize over the first action, then follow the optimal policy
➢ Formally:
  V*(s) = max_a Q*(s, a)
  Q*(s, a) = Σ_s' T(s, a, s') [ R(s, a, s') + γ V*(s') ]
  so V*(s) = max_a Σ_s' T(s, a, s') [ R(s, a, s') + γ V*(s') ]

Solving MDPs
➢ We want to find the optimal policy π*
➢ Proposal 1: modified expectimax search, starting from each state s and computing V*(s) with the recursive Bellman lookahead above

Why Not Search Trees?
➢ Why not solve with expectimax?
➢ Problems:
  ➢ The tree is usually infinite (why?)
  ➢ The same states appear over and over (why?)
  ➢ We would search once per state (why?)
➢ Idea: value iteration (sketched below)
  ➢ Compute optimal values for all states all at once using successive approximations
  ➢ A bottom-up dynamic program, similar in cost to memoization
  ➢ Do all planning offline; no replanning needed!

Value Estimates
➢ Calculate estimates …
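Value iteration is only named above, so here is a sketch of the successive-approximation idea, written against the T(s, a, s') / R(s, a, s') interface used throughout these slides. The function names value_iteration and extract_policy are my own, not from the lecture.

```python
# Sketch of value iteration: repeatedly apply the Bellman update
#   V(s) <- max_a sum_{s'} T(s,a,s') [ R(s,a,s') + gamma V(s') ]
# until the values stop changing. Names are illustrative.

def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    """Compute approximately optimal values V*(s) for all states at once."""
    V = {s: 0.0 for s in states}                      # V_0(s) = 0
    while True:
        V_new = {}
        for s in states:
            # Q-value of each action under the current estimate V
            qs = [sum(T(s, a, s2) * (R(s, a, s2) + gamma * V[s2])
                      for s2 in states)
                  for a in actions]
            V_new[s] = max(qs)
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new

def extract_policy(states, actions, T, R, V, gamma=0.9):
    """pi*(s) = argmax_a Q*(s, a), via one-step lookahead on V."""
    def q(s, a):
        return sum(T(s, a, s2) * (R(s, a, s2) + gamma * V[s2]) for s2 in states)
    return {s: max(actions, key=lambda a: q(s, a)) for s in states}
```

With 0 < γ < 1 the Bellman update is a contraction, so the estimates converge; on the High-Low model sketched earlier, value_iteration(STATES, ACTIONS, T, R) returns approximate optimal values and extract_policy reads off the greedy policy.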
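One standard way to make such value estimates concrete, and exactly what value iteration computes along the way, is the k-step estimate V_k(s): the best expected discounted reward achievable from s within k more steps, with V_k approaching V* as k grows. A minimal self-contained sketch; the name k_step_values is my own.

```python
# Sketch of k-step value estimates: V_0(s) = 0, and each pass of the
# Bellman update turns V_k into V_{k+1}. Names are illustrative.

def k_step_values(states, actions, T, R, gamma, k):
    V = {s: 0.0 for s in states}                      # V_0(s) = 0
    for _ in range(k):
        V = {s: max(sum(T(s, a, s2) * (R(s, a, s2) + gamma * V[s2])
                        for s2 in states)
                    for a in actions)
             for s in states}
    return V

# e.g., with the High-Low model above, watch the estimates grow toward V*:
# for k in range(6):
#     print(k, k_step_values(STATES, ACTIONS, T, R, 0.9, k))
```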