CS 416 Artificial Intelligence
Lecture 19: Making Complex Decisions (Chapter 17)

Robot Example
- Imagine a robot with only local sensing, traveling from A to B.
- Actions have uncertain results: the robot might move at a right angle to the desired direction.
- We want the robot to learn how to navigate in this room.

Sequential Decision Problem
- Similar to the 15-puzzle: let the robot's position be the blank tile, keep issuing movement commands, and eventually some sequence of commands will cause the robot to reach the goal.
- How is this similar to, and different from, the 15-puzzle? Our model of the world is incomplete, and uncertainty is possible.

How about other search techniques?
- Genetic algorithms: let each gene be a sequence of L, R, U, D. The length is unknown and the feedback is poor.
- Simulated annealing: again, our model of the world is incomplete.

Is this a search problem? (Isn't everything?)
- How might we acquire and store a solution?
- We want to avoid local minima, dead ends, and needless repetition.
- Key observation: if the number of states is small, consider evaluating states rather than evaluating action sequences.

Markov decision processes (MDP)
- Initial state: S0.
- Transition model: T(s, a, s').
- Reward function: R(s), for each state.
- How does Markov apply here? The transition model depends only on the current state s and action a, not on the history.

Building a policy
- Specify a solution for any initial state: construct a policy that outputs the best action for any state.
- A complete policy covers all potential input states.
- An optimal policy yields the highest expected utility.
- Why "expected"? Because transitions are stochastic.

Using a policy
- An agent in state s: s is the percept available to the agent, and the policy outputs an action that maximizes expected utility.
- π is the policy; π(s) is the action the policy recommends in state s.
- The policy is a description of a simple reflex agent.

Example solutions
- Example policies for this world (note: typos in the book).

Striking a balance
- Different policies demonstrate the balance between risk and reward.
- This is only interesting in stochastic environments, not deterministic ones, and it is characteristic of many real-world problems.
- Building the optimal policy is the hard part.

Attributes of optimality
- We wish to find the policy that maximizes the utility of the agent during its lifetime: maximize U([s0, s1, s2, ..., sn]).
- But is the length of the lifetime known?

Time horizon
- Finite horizon: the number of state transitions is known. After timestep N, nothing matters:
  U([s0, s1, ..., sN]) = U([s0, s1, ..., sN, sN+1, ..., sN+k]) for all k > 0.
- Infinite horizon: there is always an opportunity for more state transitions.
- Consider spot (3,1): let the horizon be 3, 8, 20, or infinite. Does the optimal policy change? With a finite horizon the optimal policy can be nonstationary (it depends on how much time remains).

Evaluating state sequences
- Assumption: state preferences are stationary. If I say I will prefer state a to state b tomorrow, I must also say I prefer state a to state b today.
- Additive rewards: U([a, b, c, ...]) = R(a) + R(b) + R(c) + ...
- Discounted rewards: U([a, b, c, ...]) = R(a) + γ R(b) + γ² R(c) + ..., where γ is the discount factor, between 0 and 1. What does γ mean?

Evaluating infinite horizons
- How can we compute the sum over an infinite horizon?
- If the discount factor γ is less than 1, the sum is finite:
  U([s0, s1, s2, ...]) = Σ_t γ^t R(s_t) ≤ Σ_t γ^t Rmax = Rmax / (1 - γ).
  (Note: Rmax is finite by the definition of an MDP.)
- Alternatively, if the agent is guaranteed to end up in a terminal state eventually, we will never actually have to compare infinite strings of states, and we can allow γ to be 1.

Evaluating a policy
- Each policy generates multiple state sequences, because of the uncertainty in transitions according to T(s, a, s').
- A policy's value is an expected sum of discounted rewards, taken over all possible state sequences.
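To make the additive and discounted schemes concrete, here is a small Python sketch; the reward sequence (fifty -0.04 steps followed by a +1 goal) and γ = 0.9 are made-up illustration values, not numbers from the lecture.

```python
def additive_utility(rewards):
    # U([s0, s1, s2, ...]) = R(s0) + R(s1) + R(s2) + ...
    return sum(rewards)

def discounted_utility(rewards, gamma):
    # U([s0, s1, s2, ...]) = R(s0) + gamma*R(s1) + gamma^2*R(s2) + ...
    return sum(gamma ** t * r for t, r in enumerate(rewards))

rewards = [-0.04] * 50 + [1.0]             # 50 costly steps, then the goal
print(additive_utility(rewards))            # about -1.0 (50 * -0.04, plus 1)
print(discounted_utility(rewards, 0.9))     # about -0.39: the distant +1 is heavily discounted

# With gamma < 1, even an infinite sequence has finite utility:
#   sum_t gamma^t * R(s_t) <= Rmax / (1 - gamma)
r_max, gamma = 1.0, 0.9
print(r_max / (1 - gamma))                  # 10.0, an upper bound on any such sum
```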
Building an optimal policy: Value Iteration
- Calculate the utility of each state.
- Use the state utilities to select an optimal action in each state.

Utility of states
- The utility of a state s is the expected utility of the state sequences that might follow it; the subsequent state sequence is a function of the policy π(s).
- The utility of a state given policy π is
  U^π(s) = E[ Σ_t γ^t R(s_t) | π, s0 = s ].
- Your policy is then simple: go to the state with the best utility. But your state utilities must be accurate, and through an iterative process you assign correct values to the state utilities.

Example
- Let γ = 1 and R(s) = -0.04.
- Utilities are higher near the goal, reflecting fewer -0.04 steps in the sum.

Restating the policy
- Notice I had said you go to the state with the highest utility. Actually, you go to the state with the maximum expected utility: the reachable state with the highest utility may have a low probability of being reached.
- The choice is a function of the available actions, the transition function, and the resulting states:
  π*(s) = argmax_a Σ_s' T(s, a, s') U(s').

Putting the pieces together
- We said the utility of a state was U^π(s) = E[ Σ_t γ^t R(s_t) ].
- The policy is maximum expected utility: π*(s) = argmax_a Σ_s' T(s, a, s') U(s').
- Therefore the utility of a state is the immediate reward for that state plus the expected discounted utility of the next state:
  U(s) = R(s) + γ max_a Σ_s' T(s, a, s') U(s').
- Richard Bellman invented this equation: the Bellman equation (1957).

What a deal
- The Bellman equation is much cheaper to evaluate: a one-step lookahead over actions and successor states, instead of an expectation over entire state sequences.

Example of the Bellman equation
- Revisit the 4x3 example and write out the utility at cell (1,1): consider all outcomes of all possible actions, select the best action, and assign its expected utility as the value of the next state in the Bellman equation.

Using Bellman equations to solve MDPs
- Consider a particular MDP with n possible states. There are n Bellman equations, one for each state, with n unknowns: U(s) for each state.
- n equations and n unknowns: I can solve this, right? No, because of the nonlinearity caused by the max over actions. We'll use an iterative technique instead.

Iterative solution of Bellman equations
- Start with arbitrary initial values for the state utilities.
- Update the utility of each state as a function of its neighbors.
- Repeat this process until an equilibrium is reached.

Bellman update
- Iterative updates look like this:
  U_{i+1}(s) ← R(s) + γ max_a Σ_s' T(s, a, s') U_i(s').
- After infinitely many Bellman updates, we are guaranteed to reach an equilibrium that solves the Bellman equations. The solution is unique, and the corresponding policy is optimal.
- Sanity check: utilities for states near the goal will settle quickly, and their neighbors in turn will settle. Information is propagated through the state space via local updates.

Convergence of value iteration
- How close to the optimal policy am I after i Bellman updates?
- The book shows how to calculate the error at time i as a function of the error at time i-1 and the discount factor; the update is a contraction, so the error shrinks by at least a factor of γ each iteration.
- This makes the convergence argument mathematically rigorous, via contraction functions.

Policy Iteration
- Imagine someone gave you a policy. How good is it?
- Assume we know T and R. Eyeball it? Try a few paths and see how it works? Let's be more precise.

Checking a policy
- Just for kicks, let's compute a utility for each state at this particular iteration i of the policy, according to Bellman's equation:
  U_i(s) = R(s) + γ Σ_s' T(s, π_i(s), s') U_i(s').
- But we don't know U_i(s). No problem: these are n equations with n unknowns, and because the policy fixes the action there is no max, so the equations are linear. We can solve for the n unknowns directly.
- Now we know U(s) for all s. For each s, compute argmax_a Σ_s' T(s, a, s') U(s'): this is the best action. If this action is different from the policy's action, update the policy.
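The Bellman update above is straightforward to turn into code. Below is a minimal sketch of value iteration on the 4x3 world, assuming the textbook's usual model: a wall at (2,2), terminal states worth +1 at (4,3) and -1 at (4,2), R(s) = -0.04 elsewhere, and a transition model in which the intended move succeeds with probability 0.8 and slips to either right angle with probability 0.1 each. Those specifics are not spelled out in this preview, so treat them as assumptions.

```python
# Minimal value iteration sketch for a textbook-style 4x3 grid world.
# All world details below are assumptions matching the usual AIMA example.

GAMMA = 1.0          # discount factor (the lecture's example uses gamma = 1)
STEP_REWARD = -0.04  # living reward for non-terminal states

COLS, ROWS = 4, 3
WALL = {(2, 2)}
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}
STATES = [(x, y) for x in range(1, COLS + 1) for y in range(1, ROWS + 1)
          if (x, y) not in WALL]
ACTIONS = {'U': (0, 1), 'D': (0, -1), 'L': (-1, 0), 'R': (1, 0)}
RIGHT_ANGLES = {'U': 'LR', 'D': 'LR', 'L': 'UD', 'R': 'UD'}

def move(s, a):
    """Deterministic effect of action a in state s (stay put if blocked)."""
    nxt = (s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1])
    if nxt in WALL or not (1 <= nxt[0] <= COLS and 1 <= nxt[1] <= ROWS):
        return s
    return nxt

def transitions(s, a):
    """T(s, a, s') as a list of (probability, next state) pairs."""
    return [(0.8, move(s, a))] + [(0.1, move(s, b)) for b in RIGHT_ANGLES[a]]

def reward(s):
    return TERMINALS.get(s, STEP_REWARD)

def value_iteration(eps=1e-6):
    U = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        U_new = dict(U)
        for s in STATES:
            if s in TERMINALS:
                U_new[s] = reward(s)      # a terminal's utility is its reward
                continue
            # Bellman update: R(s) + gamma * max_a sum_s' T(s,a,s') U(s')
            best = max(sum(p * U[s2] for p, s2 in transitions(s, a))
                       for a in ACTIONS)
            U_new[s] = reward(s) + GAMMA * best
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta < eps:
            return U

def greedy_policy(U):
    """pi(s) = argmax_a sum_s' T(s,a,s') U(s')."""
    return {s: max(ACTIONS, key=lambda a: sum(p * U[s2]
                                              for p, s2 in transitions(s, a)))
            for s in STATES if s not in TERMINALS}

if __name__ == '__main__':
    U = value_iteration()
    print(round(U[(1, 1)], 3))   # about 0.705 under these assumptions
    print(greedy_policy(U))
```

With γ = 1 and R(s) = -0.04 this reproduces the pattern described earlier: utilities rise toward the +1 terminal, and the greedy policy steers around the -1 state.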

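Policy iteration can be sketched the same way. To keep the example self-contained, the sketch below uses a tiny hypothetical chain MDP rather than the 4x3 grid; the states, the 'advance'/'wait' actions, the 0.8 success probability, and γ = 0.9 are all made-up illustration values. The two alternating steps are the ones from the slides: evaluate the current policy by solving n linear equations, then improve it wherever the best one-step action differs from the policy.

```python
import numpy as np

# Minimal policy iteration sketch on a small hypothetical chain MDP
# (three ordinary states plus a terminal goal). Everything here is an
# illustrative assumption, not something taken from the lecture.

STATES = ['s0', 's1', 's2', 'goal']
TERMINAL = {'goal'}
ACTIONS = ['advance', 'wait']
GAMMA = 0.9

def reward(s):
    return 1.0 if s == 'goal' else -0.04

def transitions(s, a):
    """T(s, a, s') as {next_state: probability}."""
    if s in TERMINAL:
        return {s: 1.0}
    nxt = STATES[STATES.index(s) + 1]
    if a == 'advance':
        return {nxt: 0.8, s: 0.2}    # advancing sometimes fails
    return {s: 1.0}                  # waiting always stays put

def evaluate_policy(pi):
    """Solve the n linear equations U(s) = R(s) + gamma * sum_s' T(s,pi(s),s') U(s')."""
    n = len(STATES)
    A = np.eye(n)
    b = np.zeros(n)
    for i, s in enumerate(STATES):
        b[i] = reward(s)
        if s in TERMINAL:
            continue                 # U(terminal) = R(terminal)
        for s2, p in transitions(s, pi[s]).items():
            A[i, STATES.index(s2)] -= GAMMA * p
    return dict(zip(STATES, np.linalg.solve(A, b)))

def policy_iteration():
    pi = {s: 'wait' for s in STATES if s not in TERMINAL}   # arbitrary start
    while True:
        U = evaluate_policy(pi)
        changed = False
        for s in pi:
            # Improvement: pick the action with maximum expected utility.
            best = max(ACTIONS, key=lambda a: sum(p * U[s2]
                       for s2, p in transitions(s, a).items()))
            if best != pi[s]:
                pi[s], changed = best, True
        if not changed:
            return pi, U

if __name__ == '__main__':
    pi, U = policy_iteration()
    print(pi)   # every state should choose 'advance'
    print({s: round(u, 3) for s, u in U.items()})
```

γ = 0.9 is used instead of 1 here because a policy that never reaches the terminal state (always 'wait') would make the γ = 1 linear system singular; with discounting, every policy can be evaluated exactly by one linear solve.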
