CS 188: Artificial Intelligence
Fall 2007
Lecture 10: MDPs
9/27/2007
Dan Klein – UC Berkeley

Markov Decision Processes
- An MDP is defined by:
  - A set of states s ∈ S
  - A set of actions a ∈ A
  - A transition function T(s, a, s')
    - Prob. that a from s leads to s', i.e., P(s' | s, a)
    - Also called the model
  - A reward function R(s, a, s')
    - Sometimes just R(s) or R(s')
  - A start state (or distribution)
  - Maybe a terminal state
- MDPs are a family of non-deterministic search problems
- Reinforcement learning: MDPs where we don't know the transition or reward functions

Example: High-Low
- Three card types: 2, 3, 4
- Infinite deck, twice as many 2's
- Start with 3 showing
- After each card, you say "high" or "low"
- A new card is flipped
- If you're right, you win the points shown on the new card
- Ties are no-ops
- If you're wrong, the game ends
- Differences from expectimax:
  - #1: you get rewards as you go
  - #2: you might play forever!

High-Low
- States: 2, 3, 4, done
- Actions: High, Low
- Model: T(s, a, s'):
  - P(s'=done | 4, High) = 3/4
  - P(s'=2 | 4, High) = 0
  - P(s'=3 | 4, High) = 0
  - P(s'=4 | 4, High) = 1/4
  - P(s'=done | 4, Low) = 0
  - P(s'=2 | 4, Low) = 1/2
  - P(s'=3 | 4, Low) = 1/4
  - P(s'=4 | 4, Low) = 1/4
  - …
- Rewards: R(s, a, s'):
  - Number shown on s' if s ≠ s'
  - 0 otherwise
- Start: 3
- Note: we could choose actions with search. How?

Example: High-Low
[Figure: expectimax-like search tree rooted at state 3, branching on the actions High and Low; the expanded q-state (3, Low) has outcome branches labeled T = 0.5, R = 2; T = 0.25, R = 3; T = 0, R = 4; T = 0.25, R = 0]

MDP Search Trees
- Each MDP state gives an expectimax-like search tree
- s is a state
- (s, a) is a q-state
- (s, a, s') is called a transition, with T(s, a, s') = P(s' | s, a) and reward R(s, a, s')

Utilities of Sequences
- In order to formalize optimality of a policy, we need to understand utilities of sequences of rewards
- Typically consider stationary preferences
- Theorem: there are only two ways to define stationary utilities
  - Additive utility: U([r0, r1, r2, …]) = r0 + r1 + r2 + …
  - Discounted utility: U([r0, r1, r2, …]) = r0 + γ r1 + γ² r2 + …
- Assuming that reward depends only on state for these slides!

Infinite Utilities?!
- Problem: infinite sequences with infinite rewards
- Solutions:
  - Finite horizon: terminate after a fixed T steps; gives a nonstationary policy (π depends on the time left)
  - Absorbing state(s): guarantee that for every policy the agent will eventually "die" (like "done" for High-Low)
  - Discounting: for 0 < γ < 1; a smaller γ means a smaller "horizon" – a shorter-term focus

Discounting
- Typically discount rewards by γ < 1 each time step
- Sooner rewards have higher utility than later rewards
- Also helps the algorithms converge

Episodes and Returns
- An episode is a run of an MDP
  - A sequence of transitions (s, a, s')
  - Starts at the start state
  - Ends at a terminal state (if it ends)
  - Stochastic!
- The utility, or return, of an episode is the discounted sum of its rewards

Utilities under Policies
- Fundamental operation: compute the utility of a state s
- Define the value (utility) of a state s under a fixed policy π:
  Vπ(s) = expected return starting in s and following π
- Recursive relation (one-step look-ahead):
  Vπ(s) = Σ_s' T(s, π(s), s') [ R(s, π(s), s') + γ Vπ(s') ]

Policy Evaluation
- How do we calculate values for a fixed policy?
- Idea one: it's just a linear system, solve with Matlab (or whatever)
- Idea two: turn the recursive equation into updates
  - Vπ_i(s) = expected return over the next i transitions while following π
  - Equivalent to doing a depth-i search and plugging in zero at the leaves

Example: High-Low
- Policy: always say "high"
- Iterative updates: [DEMO]
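The lecture demos these iterative updates numerically; as a rough companion, here is a minimal Python sketch (not part of the slides) of policy evaluation on the High-Low MDP under the always-"high" policy. The names (T, R, policy_evaluation, always_high), the discount of 0.9, and the transition entries not listed on the model slide (filled in from the "twice as many 2's" deck description) are all illustrative assumptions.

```python
# Hypothetical sketch: iterative policy evaluation on High-Low.
# Transition model T[(s, a)] = list of (s_next, prob); missing keys mean terminal.
# Entries for states 2 and 3 are inferred from the game rules (assumption).
T = {
    (2, "High"): [(2, 0.50), (3, 0.25), (4, 0.25)],        # can't lose saying High at 2
    (2, "Low"):  [(2, 0.50), ("done", 0.50)],
    (3, "High"): [("done", 0.50), (3, 0.25), (4, 0.25)],
    (3, "Low"):  [(2, 0.50), (3, 0.25), ("done", 0.25)],
    (4, "High"): [("done", 0.75), (4, 0.25)],               # from the model slide
    (4, "Low"):  [(2, 0.50), (3, 0.25), (4, 0.25)],         # from the model slide
}

def R(s, a, s_next):
    """Reward: the number shown on s_next when it differs from s, else 0."""
    return s_next if isinstance(s_next, int) and s_next != s else 0

def policy_evaluation(policy, states, gamma=0.9, iterations=100):
    """Compute V^pi by repeatedly applying the one-step look-ahead update
    (a depth-i search with zeros plugged in at the leaves)."""
    V = {s: 0.0 for s in states}                 # V_0 = 0 everywhere
    for _ in range(iterations):
        V_new = {}
        for s in states:
            a = policy.get(s)
            if a is None or (s, a) not in T:     # terminal state: value 0
                V_new[s] = 0.0
                continue
            V_new[s] = sum(p * (R(s, a, s2) + gamma * V[s2])
                           for s2, p in T[(s, a)])
        V = V_new
    return V

states = [2, 3, 4, "done"]
always_high = {2: "High", 3: "High", 4: "High"}  # the policy from the demo slide
print(policy_evaluation(always_high, states))
```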
Q-Functions
- Also define a q-value, for a state and action (a q-state):
  Qπ(s, a) = expected return starting in s, taking action a, and following π thereafter

Recap: MDP Quantities
- Return = sum of future discounted rewards in one episode (stochastic)
- Vπ: expected return from a state under a policy
- Qπ: expected return from a q-state under a policy

Optimal Utilities
- Fundamental operation: compute the optimal utilities of states s
- Define the utility of a state s:
  V*(s) = expected return starting in s and acting optimally
- Define the utility of a q-state (s, a):
  Q*(s, a) = expected return starting in s, taking action a, and thereafter acting optimally
- Define the optimal policy:
  π*(s) = optimal action from state s

The Bellman Equations
- The definition of utility leads to a simple relationship amongst the optimal utility values:
  Optimal rewards = maximize over the first action and then follow the optimal policy
- Formally:
  V*(s) = max_a Q*(s, a)
  Q*(s, a) = Σ_s' T(s, a, s') [ R(s, a, s') + γ V*(s') ]

Solving MDPs
- We want to find the optimal policy π*
- Proposal 1: modified expectimax search

MDP Search Trees?
- Problems:
  - This tree is usually infinite (why?)
  - The same states appear over and over (why?)
  - There's actually one tree per state (why?)
- Ideas:
  - Compute to a finite depth (like expectimax)
  - Consider returns from sequences of increasing length
  - Cache values so we don't repeat work

Value Estimates
- Calculate estimates Vk*(s)
  - Not the optimal value of s!
  - The optimal value considering only the next k time steps (k rewards)
  - As k → ∞, it approaches the optimal value
- Why:
  - If discounting, distant rewards become negligible
  - If terminal states are reachable from everywhere, the fraction of episodes not ending becomes negligible
  - Otherwise, we can get infinite expected utility, and then this approach won't work

Memoized Recursion?
- Recurrences:
  V0*(s) = 0
  Vk*(s) = max_a Σ_s' T(s, a, s') [ R(s, a, s') + γ Vk-1*(s') ]
- Cache all function call results so you never repeat work
- What happened to the evaluation function?

Value Iteration
- Problems with the recursive computation:
  - Have to keep all the Vk*(s) around all the time
  - Don't know which depth k(s) to ask for when planning
- Solution: value iteration
  - Calculate values for all states, bottom-up
  - Keep increasing k until convergence

Value Iteration
- Idea:
  - Start with V0*(s) = 0, which we know is right (why?)
  - Given Vi*, calculate the values for all states for depth i+1:
    Vi+1*(s) = max_a Σ_s' T(s, a, s') [ R(s, a, s') + γ Vi*(s') ]
  - This is called a value update or Bellman update
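Again not from the lecture: a minimal sketch of the value-iteration loop just described, reusing the hypothetical T, R, and states from the policy-evaluation sketch above. The discount and the convergence threshold are arbitrary choices, and the last few lines show one way to read a greedy policy off the converged values (the π* of the "Optimal Utilities" slide).

```python
# Hypothetical sketch of value iteration (assumes the T, R, and states defined
# in the earlier policy-evaluation sketch). gamma and tol are arbitrary.

def value_iteration(states, actions, gamma=0.9, tol=1e-6):
    """Repeat the Bellman update V_{i+1}(s) = max_a sum_s' T(s,a,s')[R + gamma*V_i(s')]
    until the values stop changing."""
    V = {s: 0.0 for s in states}          # V_0 = 0 is correct for a horizon of 0 steps
    while True:
        V_new = {}
        for s in states:
            q_values = [sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T[(s, a)])
                        for a in actions if (s, a) in T]
            V_new[s] = max(q_values) if q_values else 0.0   # terminal states stay at 0
        if max(abs(V_new[s] - V[s]) for s in states) < tol:  # converged
            return V_new
        V = V_new

V_star = value_iteration(states, ["High", "Low"])

# Read off a greedy policy from the converged values (one-step look-ahead).
policy = {}
for s in [2, 3, 4]:
    policy[s] = max(["High", "Low"],
                    key=lambda a: sum(p * (R(s, a, s2) + 0.9 * V_star[s2])
                                      for s2, p in T[(s, a)]))
print(V_star, policy)
```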