New version page

# Berkeley CS 188 - Section Handout 5 Solutions

Pages: 5
Documents in this Course

## This preview shows page 1-2 out of 5 pages.

View Full Document

End of preview. Want to read all 5 pages?

View Full Document
Unformatted text preview:

CS 188Fall 2019 Section Handout 5 SolutionsQ1. MDPPacman is using MDPs to maximize his expected utility. In each environment:• Pacman has the standard actions {North, East, South, West} unless blocked by an outer wall• There is a reward of 1 point when eating the dot (for example, in the grid below, R(C, South, F ) = 1)• The game ends when the dot is eaten(a) Consider a the following grid where there is a single food pellet in the bottom right corner (F ). Thediscount factor is 0.5. There is no living reward. The states are simply the grid locations.A B C D E F A B C D E (i) What is the optimal policy for each state?State π(state)AEast orSouthBEast orSouthC SouthD EastE East(ii) What is the optimal value for the state of being in the upper left corner (A)? Reminder: the discountfactor is 0.5.V∗(A) = 0.25k V(A) V(B) V(C) V(D) V(E) V(F)0 0 0 0 0 0 01 0 0 1 0 1 02 0 0.5 1 0.5 1 03 0.25 0.5 1 0.5 1 04 0.25 0.5 1 0.5 1 0(iii) Using value iteration with the value of all states equal to zero at k=0, for which iteration k willVk(A) = V∗(A)?k = 3 (see above)1(b) Consider a new Pacman level that begins with cherries in locations D and F . Landing on a grid positionwith cherries is worth 5 points and then the cherries at that position disappear. There is still one dot,worth 1 point. The game still only ends when the dot is eaten.A B C D E F A B C D E F C A D B E F (i) With no discount (γ = 1) and a living reward of -1, perform one iteration of policy iteration in thislevel’s state space, starting from the fixed policy below.State A CD,FCherry=trueD,FCherry=falseE,FCherry=trueE,FCherry=falseFπiEast East North North West West WestVπi0 4 0 0 -1 -1 -2πi+1South East North North East West WestLarger state spaces with equivalent states and actions are possible too. For example with the staterepresentation of (grid, D-cherry, F-cherry), there could be up to 24 different states, where all fourwith A are the same, etc.(ii) With no discount (γ = 1), what is the range of living reward values such that Pacman eats exactlyone cherry when starting at position A?Valid range for the living reward is (-2.5,-1.25).Let x equal the living reward.The reward for eating zero cherries {A,B} is x + 1 (one step plus food).The reward for eating exactly one cherry {A,C,D,B} is 3x + 6 (three steps plus cherry plus food).The reward for eating two cherries {A,C,D,E,F,E,D,B} is 7x + 11 (seven steps plus two cherries plusfood).x must be greater than -2.5 to make eating at least one cherry worth it (3x + 6 > x + 1).x must be less than -1.25 to eat less than one cherry (3x + 6 > 7x + 11).2Q2. Markov Decision ProcessesConsider a simple MDP with two states, S1and S2, two actions, A and B, a discount factor γ of 1/2, rewardfunction R given byR(s, a, s0) =(1 if s0= S1;−1 if s0= S2;and a transition function specified by the following table.s a s0T (s, a, s0)S1A S11/2S1A S21/2S1B S12/3S1B S21/3S2A S11/2S2A S21/2S2B S11/3S2B S22/3(a) Perform a single iteration of value iteration, filling in the resultant Q-values and state values in thefollowing tables. Use the specified initial value function V0, rather than starting from all zero state values.Only compute the entries not labeled “skip”.s a Q1(s, a)S1A 1.25S1B 1.50S2A skipS2B skips V0(s) V1(s)S12 1.50S23 skipQ1(s) =Xs0T (s, a, s0)[R(s, a, s0) + γV0(s0)]V1(s) = maxaQ1(s, a)3(b) Given an arbitrary MDP with state set S, transition function T (s, a, s0), discount factor γ, and rewardfunction R(s, a, s0), and given a constant β > 0, consider a modified MDP (S, T, γ, R0) with reward functionR0(s, a, s0) = β · R(s, a, s0). Prove that the modified MDP (S, T, γ, R0) has the same set of optimal policiesas the original MDP (S, T, γ, R).Vπmodified= β · Vπoriginalsatisfies the Bellman equationβ · Vπoriginal(s) = Vπmodified(s)=Xs0T (s, π(s), s0)[R0(s, π(s), s0) + γ · Vπmodified(s0)]=Xs0T (s, π(s), s0)[β · R(s, π(s), s0) + γ · β · Vπoriginal(s0)]= β ·Xs0T (s, π(s), s0)[R(s, π(s), s0) + γ · Vπoriginal(s0)]= β · Vπoriginal(s0).It follows that for any state s, the set of policies π that maximize Vπoriginalis precisely the same set ofpolicies that maximize Vπmodified.Intuitively, you should understand that scaling the reward function does not affect the arg max thatultimately determines the policy.(c) Although in this class we have defined MDPs as having a reward function R(s, a, s0) that can depend onthe initial state s and the action a in addition to the destination state s0, MDPs are sometimes defined ashaving a reward function R(s0) that depends only on the destination state s0. Given an arbitrary MDPwith state set S, transition function T (s, a, s0), discount factor γ, and reward function R(s, a, s0) that doesdepend on the initial state s and the action a, define an equivalent MDP with state set S0, transitionfunction T0(s, a, s0), discount factor γ0, and reward function R0(s0) that depends only on the destinationstate s0.By equivalent, it is meant that there should be a one-to-one mapping between state-action sequences inthe original MDP and state-action sequences in the modified MDP (with the same value). You do notneed to give a proof of the equivalence.States: S0= S × S × A, where A is the set of actions.Transition function:T0(s, a, s0) =(T (s000, a, s0000) if s = (s00, a0, s000) and s0= (s000, a, s0000);0 otherwise.Discount factor: γ0= γReward function: R0(s0) = R(s, a, s00), where s0= (s, a, s00).4Q3. Direct EvaluationConsider the grid-world given below and an agent who is trying to learn the optimal policy. Rewards are onlyawarded for taking the Exit action from one of the shaded states. Taking this action moves the agent to theDone state, and the MDP terminates. Assume γ = 1 and α = 0.5 for all calculations. All equations need toexplicitly mention γ and α if necessary.(a) The agent starts from the top left corner and you are given the following episodes from runs of the agentthrough this grid-world. Each line in an Episode is a tuple containing (s, a, s0, r).Episode 1 Episode 2 Episode 3 Episode 4 Episode 5(1,3), S, (1,2), 0 (1,3), S, (1,2), 0 (1,3), S, (1,2), 0 (1,3), S, (1,2), 0 (1,3), S, (1,2), 0(1,2), E, (2,2), 0 (1,2), E, (2,2), 0 (1,2), E, (2,2), 0 (1,2), E, (2,2), 0 (1,2), E, (2,2), 0(2,2), E, (3,2), 0 (2,2), S, (2,1), 0 (2,2), E, (3,2), 0 (2,2), E, (3,2), 0 (2,2), E, (3,2), 0(3,2), N, (3,3), 0 (2,1), Exit, D, -100 (3,2), S, (3,1), 0 (3,2), N, (3,3), 0 (3,2), S, (3,1), 0(3,3), Exit, D, +50

View Full Document Unlocking...