CS 188: Artificial Intelligence
Spring 2006
Lecture 21: MDPs
4/6/2006
Dan Klein – UC Berkeley

Reinforcement Learning
[Demos]
- Basic idea:
  - Receive feedback in the form of rewards
  - Must learn to act so as to maximize expected rewards
  - The agent's utility is defined by the reward function
  - Change the rewards, change the behavior!
- Examples:
  - Playing a game, with a reward at the end for winning / losing
  - Vacuuming a house, with a reward for each piece of dirt picked up
  - An automated taxi, with a reward for each passenger delivered

Markov Decision Processes
- A Markov decision process (MDP) consists of:
  - A set of states s ∈ S
  - A model T(s,a,s') = P(s' | s,a): the probability that action a taken in state s leads to s'
  - A reward function R(s) (or R(s,a,s'))
- MDPs are the simplest case of reinforcement learning
- In general reinforcement learning, we don't know the model or the reward function

MDP Solutions
- In state-space search, we want an optimal sequence of actions from the start to a goal
- In an MDP, we want an optimal policy π(s)
  - A policy gives an action for each state
  - The optimal policy is the one that maximizes expected utility (i.e. expected rewards) if followed
  - It gives a reflex agent!
- Optimal policy when R(s) = -0.04 (gridworld figure on slide)

Example Optimal Policies
- Optimal gridworld policies for R(s) = -2.0, R(s) = -0.4, R(s) = -0.03, and R(s) = -0.01 (figures on slide)

Stationarity
- To formalize optimality of a policy, we need to understand utilities of reward sequences
- Typically we consider stationary preferences: if one reward sequence is preferred to another, the same preference holds after prepending an identical reward to both:
  [r, r0, r1, r2, ...] ≻ [r, r0', r1', r2', ...]  ⇔  [r0, r1, r2, ...] ≻ [r0', r1', r2', ...]
- Theorem: there are only two ways to define stationary utilities
  - Additive utility: U([r0, r1, r2, ...]) = r0 + r1 + r2 + ...
  - Discounted utility: U([r0, r1, r2, ...]) = r0 + γ r1 + γ² r2 + ...

How (Not) to Solve an MDP
- The inefficient way:
  - Enumerate policies
  - Calculate the expected utility (discounted rewards) of each, starting from the start state, e.g. by simulating a bunch of runs
  - Choose the best policy
- We'll return to a (better) idea like this later

Utilities of States
- Idea: calculate the utility (value) of each state
  - U(s) = expected (discounted) sum of rewards, assuming optimal actions
- Given the utilities of states, MEU tells us the optimal policy

Infinite Utilities?!
- Problem: infinite state sequences can have infinite rewards
- Solutions:
  - Finite horizon: terminate after a fixed T steps; gives a nonstationary policy (π depends on the time left)
  - Absorbing state(s): guarantee that for every policy, the agent will eventually "die"
  - Discounting: use a discount γ with 0 < γ < 1; smaller γ means a shorter horizon

The Bellman Equation
- The definition of state utility leads to a simple relationship among utility values:
  expected rewards = current reward + γ × expected sum of rewards after taking the best action
- Formally:
  U(s) = R(s) + γ max_a Σ_{s'} T(s,a,s') U(s')

Example: Bellman Equations
- (Worked gridworld example on slide)

Value Iteration
- Idea:
  - Start with bad guesses at the utility values (e.g. U_0(s) = 0)
  - Update using the Bellman equation (called a value update or Bellman update):
    U_{i+1}(s) ← R(s) + γ max_a Σ_{s'} T(s,a,s') U_i(s')
  - Repeat until convergence
- Theorem: this will converge to the unique optimal values
  - Basic idea: bad guesses get refined towards the optimal values
  - The policy may converge before the values do
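To make the update above concrete, here is a minimal value iteration sketch in Python. It is not code from the course; the MDP representation is an assumption chosen for illustration: states is an iterable of states, actions(s) returns the legal actions in s, T(s, a) returns (probability, next state) pairs, R(s) is the reward function, and gamma is the discount γ.

    def value_iteration(states, actions, T, R, gamma=0.9, epsilon=1e-6):
        """Return a dict mapping each state s to an estimate of U(s)."""
        U = {s: 0.0 for s in states}                  # bad initial guesses: U_0(s) = 0
        while True:
            new_U, delta = {}, 0.0
            for s in states:
                # Bellman update:
                #   U_{i+1}(s) <- R(s) + gamma * max_a sum_{s'} T(s,a,s') U_i(s')
                best = max((sum(p * U[s2] for p, s2 in T(s, a)) for a in actions(s)),
                           default=0.0)               # terminal states: no actions, no future
                new_U[s] = R(s) + gamma * best
                delta = max(delta, abs(new_U[s] - U[s]))
            U = new_U
            if delta < epsilon:                       # updates have become tiny: stop
                return U

    def greedy_policy(states, actions, T, U):
        """MEU: in each state, pick the action with the best one-step lookahead value."""
        return {s: max(actions(s), key=lambda a: sum(p * U[s2] for p, s2 in T(s, a)))
                for s in states if actions(s)}

The greedy_policy helper is the MEU step mentioned earlier: once the utilities are known, the optimal action in each state is the one whose expected next-state utility is highest.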
Example: Bellman Updates
- (Worked gridworld update on slide)

Example: Value Iteration
- Information propagates outward from the terminal states, and eventually all states have correct value estimates
[DEMO]

Convergence*
- Define the max-norm: ||U|| = max_s |U(s)|
- Theorem: for any two approximations U and V, a single Bellman update brings them closer by a factor of γ:
  ||U_{i+1} − V_{i+1}|| ≤ γ ||U_i − V_i||
  I.e. any two distinct approximations must get closer to each other; in particular, any approximation must get closer to the true U, so value iteration converges to a unique, stable, optimal solution
- Theorem: once the change in our approximation from one update to the next is small, the approximation must also be close to correct

Policy Iteration
- Alternate approach:
  - Policy evaluation: calculate the utilities for a fixed policy
  - Policy improvement: update the policy based on the resulting utilities
  - Repeat until convergence
- This is policy iteration (a minimal code sketch of the loop appears at the end of these notes)
- It can converge faster under some conditions

Policy Evaluation
- If we have a fixed policy π, use the simplified Bellman equation (no max over actions) to calculate the utilities:
  U^π(s) = R(s) + γ Σ_{s'} T(s, π(s), s') U^π(s')

Policy Improvement
- For fixed utilities, it is easy to find the best action according to one-step lookahead:
  π(s) = argmax_a Σ_{s'} T(s,a,s') U(s')

Comparison
- In value iteration:
  - Every pass (or "backup") updates both the policy (based on the current utilities) and the utilities (based on the current policy)
- In policy iteration:
  - Several passes update the utilities
  - Occasional passes update the policy
- Hybrid approaches (asynchronous policy iteration):
  - Any sequence of partial updates to either policy entries or utilities will converge if every state is visited infinitely often

Next Class
- In real reinforcement learning:
  - We don't know the reward function R(s)
  - We don't know the model T(s,a,s')
  - So we can't do Bellman updates!
- We need new techniques:
  - Q-learning
  - Model learning
- Agents actually have to interact with the environment rather than simulate it
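As a companion to the value iteration sketch, here is the minimal policy iteration sketch referenced from the Policy Iteration slide above. It alternates the two steps just described, reuses the same assumed MDP representation, and approximates policy evaluation with a fixed number of simplified Bellman sweeps rather than solving the linear system exactly; all of these are illustrative assumptions, not the course's own code.

    def policy_evaluation(pi, U, states, T, R, gamma, sweeps=50):
        """Approximate U^pi with repeated simplified Bellman updates (no max over actions)."""
        for _ in range(sweeps):
            U = {s: (R(s) + gamma * sum(p * U[s2] for p, s2 in T(s, pi[s])))
                    if s in pi else R(s)              # terminal states keep only their reward
                 for s in states}
        return U

    def policy_iteration(states, actions, T, R, gamma=0.9):
        """Alternate policy evaluation and policy improvement until the policy is stable."""
        # arbitrary initial policy (assumes actions(s) returns an indexable sequence)
        pi = {s: actions(s)[0] for s in states if actions(s)}
        U = {s: 0.0 for s in states}
        while True:
            U = policy_evaluation(pi, U, states, T, R, gamma)   # policy evaluation
            changed = False
            for s in pi:                                        # policy improvement
                best = max(actions(s),
                           key=lambda a: sum(p * U[s2] for p, s2 in T(s, a)))
                if best != pi[s]:
                    pi[s], changed = best, True
            if not changed:                                     # policy is stable: done
                return pi, U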