CS 188: Artificial Intelligence
Fall 2006
Lecture 9: MDPs
9/26/2006
Dan Klein – UC Berkeley

Reinforcement Learning
[DEMOS]
Basic idea:
- Receive feedback in the form of rewards
- The agent's utility is defined by the reward function
- Must learn to act so as to maximize expected rewards
- Change the rewards, change the behavior
Examples:
- Playing a game: reward at the end for winning / losing
- Vacuuming a house: reward for each piece of dirt picked up
- Automated taxi: reward for each passenger delivered

Markov Decision Processes
Markov decision processes (MDPs):
- A set of states s ∈ S
- A model T(s, a, s') = P(s' | s, a)
  - The probability that action a in state s leads to s'
- A reward function R(s, a, s') (sometimes just R(s) for leaving a state, or R(s') for entering one)
- A start state (or distribution)
- Maybe a terminal state
MDPs are the simplest case of reinforcement learning:
- In general reinforcement learning, we don't know the model or the reward function

Example: High-Low
- Three card types: 2, 3, 4
- Infinite deck, twice as many 2's
- Start with a 3 showing
- After each card, you say "high" or "low"
- A new card is flipped
- If you're right, you win the points shown on the new card
- Ties are no-ops
- If you're wrong, the game ends

High-Low
- States: 2, 3, 4, done
- Actions: High, Low
- Model T(s, a, s'):
  - P(s' = done | 4, High) = 3/4
  - P(s' = 2 | 4, High) = 0
  - P(s' = 3 | 4, High) = 0
  - P(s' = 4 | 4, High) = 1/4
  - P(s' = done | 4, Low) = 0
  - P(s' = 2 | 4, Low) = 1/2
  - P(s' = 3 | 4, Low) = 1/4
  - P(s' = 4 | 4, Low) = 1/4
  - …
- Rewards R(s, a, s'):
  - the number shown on s' if s ≠ s'
  - 0 otherwise
- Start: 3
Note: we could choose actions with search. How?

MDP Solutions
- In deterministic single-agent search, we want an optimal sequence of actions from the start to a goal
- In an MDP, as in expectimax, we want an optimal policy π*(s)
  - A policy gives an action for each state
  - An optimal policy maximizes expected utility (i.e. expected rewards) if followed
  - It defines a reflex agent
Figure: the optimal policy when R(s, a, s') = -0.04 for all non-terminal states s

Example Optimal Policies
Figures: optimal gridworld policies for R(s) = -2.0, R(s) = -0.4, R(s) = -0.03, and R(s) = -0.01

Stationarity
- To formalize optimality of a policy, we need to understand utilities of reward sequences
- Typically we assume stationary preferences:
  [r, r_0, r_1, r_2, …] ≻ [r, r_0', r_1', r_2', …]  ⟺  [r_0, r_1, r_2, …] ≻ [r_0', r_1', r_2', …]
- Theorem: there are only two ways to define stationary utilities
  - Additive utility: U([r_0, r_1, r_2, …]) = r_0 + r_1 + r_2 + …
  - Discounted utility: U([r_0, r_1, r_2, …]) = r_0 + γ r_1 + γ² r_2 + …
- We assume that the reward depends only on the state for these slides!

Infinite Utilities?!
Problem: infinite state sequences with infinite rewards
Solutions:
- Finite horizon:
  - Terminate after a fixed T steps
  - Gives a nonstationary policy (π depends on the time left)
- Absorbing state(s): guarantee that for every policy the agent will eventually "die" (like "done" for High-Low)
- Discounting: for 0 < γ < 1,
  U([r_0, …, r_∞]) = Σ_t γ^t r_t  ≤  R_max / (1 − γ)
  - A smaller γ means a smaller horizon

How (Not) to Solve an MDP
The inefficient way:
- Enumerate policies
- For each one, calculate the expected utility (discounted rewards) from the start state
  - e.g. by simulating a bunch of runs (see the simulation sketch below)
- Choose the best policy
Might actually be reasonable for High-Low…
We'll return to a (better) idea like this later.
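
To make the "inefficient way" concrete, here is a minimal sketch, in Python, of evaluating High-Low policies by simulation: fix a policy, simulate a bunch of runs from the start state, average the (discounted) returns, and keep the best policy. The card encoding, the play_once/evaluate helpers, the run count, and the default γ = 1 are illustrative choices, not part of the lecture.

```python
import random

# High-Low from the lecture: an infinite deck of 2s, 3s, and 4s with twice as many 2s,
# so drawing uniformly from this list gives P(2) = 1/2, P(3) = P(4) = 1/4.
CARDS = [2, 2, 3, 4]

def play_once(policy, start=3, gamma=1.0):
    """Simulate one game under `policy` (a dict state -> "high"/"low"); return the discounted return."""
    s, total, discount = start, 0.0, 1.0
    while True:
        card = random.choice(CARDS)
        if card != s:                                   # ties are no-ops: 0 reward, keep playing
            correct = (card > s) if policy[s] == "high" else (card < s)
            if not correct:
                return total                            # wrong guess: the game ends
            total += discount * card                    # right guess: win the points on the new card
        discount *= gamma
        s = card

def evaluate(policy, runs=50000):
    """Estimate the expected utility of `policy` from the start state by averaging simulated runs."""
    return sum(play_once(policy) for _ in range(runs)) / runs

# Enumerate every stationary policy over the non-terminal states {2, 3, 4} and keep the best.
policies = [{2: a2, 3: a3, 4: a4}
            for a2 in ("high", "low") for a3 in ("high", "low") for a4 in ("high", "low")]
best = max(policies, key=evaluate)
print(best, evaluate(best))
```

There are only eight stationary policies here (two actions for each of the three non-terminal states), so the enumeration itself is cheap; it is the simulation-based evaluation that is noisy and slow in general.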

Utility of a State
Define the utility of a state under a policy π:
- V^π(s) = the expected total (discounted) reward starting in s and following π
Recursive definition (one-step look-ahead):
- V^π(s) = Σ_{s'} T(s, π(s), s') [ R(s, π(s), s') + γ V^π(s') ]

Policy Evaluation
- Idea one: turn the recursive equations into updates:
  V^π_{i+1}(s) ← Σ_{s'} T(s, π(s), s') [ R(s, π(s), s') + γ V^π_i(s') ]
- Idea two: it's just a linear system; solve with Matlab (or Mosek, or Cplex)

Example: High-Low
- Policy: always say "high"
- Iterative updates:

Example: GridWorld
[DEMO]

Q-Functions
- To simplify things, introduce a q-value Q^π(s, a) for a state and action under a policy π
- Q^π(s, a) = the utility of starting in state s, taking action a, and then following π thereafter:
  Q^π(s, a) = Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V^π(s') ]

Optimal Utilities
- Goal: calculate the optimal utility of each state
  - V*(s) = the expected (discounted) reward with optimal actions
- Why: given the optimal utilities, MEU tells us the optimal policy

Practice: Computing Actions
Which action should we choose from state s:
- given the optimal q-values Q*?
- given the optimal values V*?

The Bellman Equations
- The definition of utility leads to a simple relationship among the optimal utility values:
  optimal rewards = maximize over the first action and then follow the optimal policy
- Formally:
  V*(s) = max_a Q*(s, a)
  Q*(s, a) = Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V*(s') ]

Example: GridWorld

Value Iteration
Idea:
- Start with bad guesses at all utility values (e.g. V_0(s) = 0)
- Update all values simultaneously using the Bellman equation (called a value update or Bellman update):
  V_{i+1}(s) ← max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V_i(s') ]
- Repeat until convergence (a code sketch appears at the end of these notes)
Theorem: will converge to unique optimal values
- Basic idea: bad guesses get refined towards optimal values
- The policy may converge long before the values do

Example: Bellman Updates

Example: Value Iteration
- Information propagates outward from the terminal states, and eventually all states have correct value estimates
[DEMO]

Convergence*
- Define the max-norm: ||U|| = max_s |U(s)|
- Theorem: for any two approximations U and V,
  ||U_{i+1} − V_{i+1}|| ≤ γ ||U_i − V_i||
  - I.e. any two distinct approximations must get closer to each other, so, in particular, any approximation must get closer to the true U, and value iteration converges to a unique, stable, optimal solution
- Theorem:
  if ||U_{i+1} − U_i|| < ε, then ||U_{i+1} − U*|| < 2εγ / (1 − γ)
  - I.e. once the change in our approximation is small, it must also be close to correct

Policy Iteration
Alternate approach:
- Policy evaluation: calculate the utilities of a fixed policy until convergence (remember the beginning of the lecture)
- Policy improvement: update the policy based on the resulting converged utilities
- Repeat until the policy converges
This is policy iteration
- It can converge faster under some conditions

Policy Iteration
- If we have a fixed policy π, use the simplified Bellman equation to calculate utilities:
  V^π_{i+1}(s) ← Σ_{s'} T(s, π(s), s') [ R(s, π(s), s') + γ V^π_i(s') ]
- For fixed utilities, it is easy to find the best action according to one-step look-ahead:
  π_{i+1}(s) = argmax_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V^π_i(s') ]
(a policy iteration sketch follows the value iteration sketch at the end of these notes)

Comparison
In value iteration:
- Every pass (or "backup") updates both the policy (based on the current utilities) and the utilities (based on the current policy)
In policy iteration:
- Several passes update the utilities
- Occasional passes update the policy
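
As a companion to the Value Iteration slide, here is a minimal sketch of the Bellman update loop, run on the High-Low MDP defined earlier. The dictionary encoding of the model, the discount of 0.9, and the convergence tolerance are illustrative assumptions (the lecture's High-Low example itself is undiscounted).

```python
# Minimal value iteration on the High-Low MDP from the lecture.
# The model T and rewards R are encoded by transitions(); "done" is the absorbing terminal state.
GAMMA = 0.9                                     # illustrative discount; the lecture's example is undiscounted
P_CARD = {2: 0.5, 3: 0.25, 4: 0.25}             # infinite deck with twice as many 2s
STATES = [2, 3, 4]
ACTIONS = ["high", "low"]

def transitions(s, a):
    """Yield (next_state, probability, reward) for saying a ("high"/"low") while card s is showing."""
    for card, p in P_CARD.items():
        if card == s:                           # tie: no-op, stay put, no reward
            yield s, p, 0.0
        elif (card > s) == (a == "high"):       # correct guess: win the points shown on the new card
            yield card, p, float(card)
        else:                                   # wrong guess: the game ends
            yield "done", p, 0.0

def q(s, a, V):
    """One-step look-ahead: expected reward plus discounted value of the next state."""
    return sum(p * (r + GAMMA * V[s2]) for s2, p, r in transitions(s, a))

V = {s: 0.0 for s in STATES + ["done"]}         # start with bad guesses: V_0(s) = 0
while True:
    newV = {"done": 0.0}
    for s in STATES:                            # Bellman update, applied to all states simultaneously
        newV[s] = max(q(s, a, V) for a in ACTIONS)
    converged = max(abs(newV[s] - V[s]) for s in STATES) < 1e-9
    V = newV
    if converged:
        break

# MEU: the optimal policy is greedy with respect to the converged values.
policy = {s: max(ACTIONS, key=lambda a: q(s, a, V)) for s in STATES}
print(V, policy)
```

Each pass applies V_{i+1}(s) ← max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V_i(s') ] to every state; extracting the greedy policy at the end is the MEU step from the Optimal Utilities slide.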
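For comparison, here is a minimal sketch of policy iteration on the same model, alternating policy evaluation (the simplified Bellman updates for a fixed π) with one-step look-ahead policy improvement. The High-Low model is repeated so the sketch stands alone; the helper names, discount, and tolerance are again illustrative.

```python
# Minimal policy iteration on the High-Low MDP (model repeated so this sketch is self-contained).
GAMMA = 0.9
P_CARD = {2: 0.5, 3: 0.25, 4: 0.25}
STATES = [2, 3, 4]
ACTIONS = ["high", "low"]

def transitions(s, a):
    """(next_state, probability, reward): ties are no-ops, wrong guesses end the game."""
    for card, p in P_CARD.items():
        if card == s:
            yield s, p, 0.0
        elif (card > s) == (a == "high"):
            yield card, p, float(card)
        else:
            yield "done", p, 0.0

def q(s, a, V):
    """One-step look-ahead value of taking action a in state s."""
    return sum(p * (r + GAMMA * V[s2]) for s2, p, r in transitions(s, a))

def evaluate(policy, tol=1e-9):
    """Policy evaluation: iterate the simplified (no max) Bellman updates for the fixed policy."""
    V = {s: 0.0 for s in STATES + ["done"]}
    while True:
        newV = {s: q(s, policy[s], V) for s in STATES}
        newV["done"] = 0.0
        if max(abs(newV[s] - V[s]) for s in STATES) < tol:
            return newV
        V = newV

policy = {s: "high" for s in STATES}                                        # arbitrary initial policy
while True:
    V = evaluate(policy)                                                    # evaluation passes
    improved = {s: max(ACTIONS, key=lambda a: q(s, a, V)) for s in STATES}  # improvement pass
    if improved == policy:                                                  # the policy has converged
        break
    policy = improved
print(policy, V)
```

Note the trade-off summarized in the Comparison slide: the inner evaluation loop runs many utility updates per policy change, while the improvement step is an occasional pass over the states.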