MDPs and Spider Solitaire
CS 188: Section Handout
December 1, 2005

1 Defining a Markov Decision Process (MDP)

1.1 Fully observable MDPs

• Initial State: S₀.
• Transition Model: T(s, a, s′), the probability of going from s to s′ with action a.
• Reward Function: R(s), the reward for being in state s.¹
• Discount Factor: γ, the discount for rewards: a reward r received in t steps is worth rγ^t now.

A solution to an MDP is called a policy, which is a function π(s) that maps from states to actions. Thus, for every state, there is one best action championed by a particular policy π.

Question 1: What is the transition model for spider solitaire?

1.2 Partially observable MDPs (POMDPs)

• Observation Model: O(s, o), the probability of observing o in state s.
• Belief State Model: b(s), the probability of being in state s in belief state b.

A solution to a POMDP is also a policy, but it is a function from belief states to actions.

Question 2: What is the observation model for spider solitaire?

Question 3: What would a policy look like for spider solitaire?

Question 4: Sometimes, humans guess when they play solitaire. How can we represent guessing?

¹Sometimes rewards have different structures, such as R(s, a, s′): the reward for moving from s to s′ via action a.

2 Solving an MDP

2.1 Value iteration: an exact solution to MDPs

The quick and dirty story of value iteration (a code sketch appears at the end of this handout):

• Solves the Bellman equation: U(s) = R(s) + γ max_a Σ_{s′} T(s, a, s′) U(s′).
• Iterates through each state many times, updating U(s).
• Iteration always converges to the correct answer given infinite time.

Question 5: Given this short description, why might value iteration not work for spider solitaire?

2.2 Expectimax search: an approximate solution to MDPs

The quick and dirty story of expectimax (a sampling sketch also appears at the end of this handout):

• Expectimax yields approximate solutions to MDPs with unknown utility functions.
• Requires an approximate utility function, preferably one that can be calculated quickly.
• Similar to a two-player game-tree search, but with the other player making random moves.
• When calculating an expectation is too difficult, we can use sampling to approximate it.

[Figure 1: An expectimax tree. Max nodes choose among moves ("Move Q to K", "Move 3 to 4"); an expectation node averages over random outcomes ("Flip a 4").]

Question 6: How many samples do we need to approximate a card draw?
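Code sketch for section 2.1: a minimal value iteration loop in Python. The tiny MDP below (the states, the names transitions, reward, and value_iteration, and all the numbers) is an assumption invented purely for illustration; only the Bellman update itself comes from the handout.

GAMMA = 0.9      # discount factor gamma
EPSILON = 1e-6   # stop when the largest utility change falls below this

# T(s, a, s') encoded as: state -> action -> list of (next_state, probability).
# This toy MDP is invented for the example, not a model of spider solitaire.
transitions = {
    "stuck": {"deal": [("stuck", 0.8), ("open", 0.2)]},
    "open":  {"move": [("open", 0.5), ("won", 0.5)],
              "deal": [("stuck", 0.6), ("open", 0.4)]},
    "won":   {},  # terminal state: no actions available
}

# R(s): the reward for being in state s.
reward = {"stuck": -1.0, "open": 0.0, "won": 10.0}

def value_iteration(transitions, reward, gamma=GAMMA, epsilon=EPSILON):
    """Repeat the Bellman update U(s) = R(s) + gamma * max_a sum_s' T(s,a,s') U(s')."""
    U = {s: 0.0 for s in transitions}
    while True:
        U_next, delta = {}, 0.0
        for s, actions in transitions.items():
            if actions:
                best = max(sum(p * U[s2] for s2, p in outcomes)
                           for outcomes in actions.values())
            else:
                best = 0.0  # terminal: no future utility
            U_next[s] = reward[s] + gamma * best
            delta = max(delta, abs(U_next[s] - U[s]))
        U = U_next
        if delta < epsilon:
            return U

if __name__ == "__main__":
    for s, u in value_iteration(transitions, reward).items():
        print(f"U({s}) = {u:.3f}")

Note how the loop touches every state on every sweep; Question 5 asks why that is a problem when the state space is as large as spider solitaire's.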
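Code sketch for section 2.2 and Question 6: instead of summing over every possible face-down card at an expectation node, sample a few flips and average. The deck encoding and the estimate_utility heuristic below are assumptions for illustration only, not the course's evaluation function.

import random

def estimate_utility(rank):
    """Stand-in for a quick approximate utility of flipping a card of this
    rank; lower ranks are (arbitrarily) scored higher in this sketch."""
    return 13 - rank

def expectation_exact(deck):
    """Exact value of the expectation node: average over every face-down card."""
    return sum(estimate_utility(r) for r in deck) / len(deck)

def expectation_sampled(deck, num_samples):
    """Approximate the same expectation by sampling card flips (with replacement)."""
    total = sum(estimate_utility(random.choice(deck)) for _ in range(num_samples))
    return total / num_samples

if __name__ == "__main__":
    # A spider deck: two standard decks, so 8 cards of each rank 1..13.
    deck = [rank for rank in range(1, 14) for _ in range(8)]
    print("exact:", expectation_exact(deck))
    for n in (1, 10, 100, 10000):
        print(f"{n:>5} samples:", expectation_sampled(deck, n))

Running this with increasing sample counts shows the estimate tightening around the exact expectation, which is one empirical way to approach Question 6.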