Markov Decision Processes
CS 188: Section Handout

Defining a Markov Decision Process (MDP)
• State Space: {S0, S1, S2, ...}
• Actions: {A0, A1, ...}
• Initial State: S0
• Transition Model: T(s, a, s′), the probability of going from s to s′ with action a.
• Reward Function: R(s), the reward for being in state s.¹
• Discount Factor: γ, the discount for rewards: a reward r received t steps from now is worth rγ^t now, where 0 < γ ≤ 1.

A solution to an MDP is called a policy: a function π(s) that maps states to actions. For a particular policy π, every state has exactly one chosen action.

Which of the following are MDPs?
Exercise: For each of the following tasks/games, describe an MDP formulation or explain why it is not amenable to the MDP framework.
• Blackjack (21) with no betting
• Rock, Paper, Scissors
• Person trapped in a container ship who can yell for help on sunny days
• Tightrope-walking robot

¹ Sometimes rewards have a different structure, such as R(s, a, s′): the reward for moving from s to s′ via action a.

A Very Simple Example: Golf
• State Space: {Tee, Fairway, Sand, Green}
• Actions: {Conservative, Power shot}
• Initial State: Tee
• Transition Model: T(s, a, s′), the probability of going from s to s′ with action a:

    s        a             s′       T(s, a, s′)
    Tee      Conservative  Fairway  0.9
    Tee      Conservative  Sand     0.1
    Tee      Power shot    Green    0.5
    Tee      Power shot    Sand     0.5
    Fairway  Conservative  Green    0.8
    Fairway  Conservative  Sand     0.2
    Sand     Conservative  Green    1.0

• Reward Function:

    s        R(s)
    Tee      -1
    Fairway  -1
    Sand     -2
    Green     3

Question: For the Conservative policy, what is the utility of being at the Tee? What about the Power shot policy? (A worked sketch appears at the end of this handout.)

Value iteration: an exact solution to MDPs
The quick and dirty story of value iteration:
• Solves the Bellman equation: U(s) = R(s) + γ max_a Σ_{s′} T(s, a, s′) U(s′)
• Starts with Û(s) = 0 for all s, then iterates through each state many times, updating Û(s).
• Iteration always converges to the correct answer given infinite time.

Exercise: Compute the value iteration updates for
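One way to work the Question above, as a sketch: the handout does not fix γ, so take γ = 1 and treat Green as a terminal state whose utility is just its reward; both are assumptions. For a fixed policy π, the utility satisfies

    U(s) = R(s) + γ Σ_{s′} T(s, π(s), s′) U(s′)

Under the Conservative policy:

    U(Green)   = 3
    U(Sand)    = -2 + 1.0 · 3 = 1
    U(Fairway) = -1 + (0.8 · 3 + 0.2 · 1) = 1.6
    U(Tee)     = -1 + (0.9 · 1.6 + 0.1 · 1) = 0.54

Under the Power shot policy (Power shot at the Tee; Conservative is the only action available elsewhere):

    U(Tee) = -1 + (0.5 · 3 + 0.5 · 1) = 1

So under these assumptions the Power shot policy yields the higher utility at the Tee; both numbers change if γ < 1.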
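To make the value iteration story concrete, here is a minimal Python sketch that runs the Bellman update on the golf MDP above. The dictionary encoding, the choice γ = 1, the convergence threshold, and the treatment of Green as a terminal state are illustrative assumptions, not part of the handout.

    # Minimal value iteration sketch for the golf MDP above.
    GAMMA = 1.0  # assumed; the handout leaves gamma unspecified

    # Transition model: T[s][a] is a list of (next_state, probability) pairs.
    # Green is assumed terminal (no actions available there).
    T = {
        "Tee": {
            "Conservative": [("Fairway", 0.9), ("Sand", 0.1)],
            "Power shot": [("Green", 0.5), ("Sand", 0.5)],
        },
        "Fairway": {"Conservative": [("Green", 0.8), ("Sand", 0.2)]},
        "Sand": {"Conservative": [("Green", 1.0)]},
        "Green": {},
    }

    R = {"Tee": -1, "Fairway": -1, "Sand": -2, "Green": 3}

    def value_iteration(T, R, gamma=GAMMA, tol=1e-9):
        """Iterate U(s) = R(s) + gamma * max_a sum_s' T(s,a,s') U(s')."""
        U = {s: 0.0 for s in T}  # start with U(s) = 0 for all s
        while True:
            U_new = {}
            for s in T:
                if not T[s]:  # terminal state: utility is just its reward
                    U_new[s] = R[s]
                else:
                    U_new[s] = R[s] + gamma * max(
                        sum(p * U[s2] for s2, p in T[s][a]) for a in T[s]
                    )
            if max(abs(U_new[s] - U[s]) for s in T) < tol:
                return U_new
            U = U_new

    print(value_iteration(T, R))

Under these assumptions the iteration converges to U(Tee) = 1.0, U(Fairway) = 1.6, U(Sand) = 1.0, U(Green) = 3.0, which matches the worked sketch above: Power shot is the maximizing action at the Tee. Printing U inside the loop shows the per-sweep updates the exercise asks about.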