**Unformatted text preview:**

CS 188Fall 2019 Exam Prep 4 Solutions1 MDPABCDEF01000001Consider the MDP above, with states represented as nodes and transitions as edges between nodes. The rewardsfor the transitions are indicated by the numbers on the edges. For example, going from state B to state A givesa reward of 10, but going from state A to itself gives a reward of 0. Some transitions are not allowed, such asfrom state A to state B. Transitions are deterministic (if there is an edge between two states, the agent canchoose to go from one to the other and will reach the other state with probability 1).(a) For this part only, suppose that the max horizon length is 15. Write down the optimal action at each stepif the discount factor is γ = 1.A: Go to AB: Go to CC: Go to DD: Go to EE: Go to FF: Go to F(b) Now suppose that the horizon is infinite. For each state, does the optimal action depend on γ? If so, foreach state, write an equation that would let you determine the value for γ at which the optimal actionchanges.A: Only staying at A is a possible action. For the other states, let n be the number of steps to B, andm be the number of steps to F . Then, the value of going left is 10γnand the value of going right isP∞k=mγk=11−γ−1−γm1−γbecause of the geometric series. Now we find the value of γ at which these areequal.10γn=11 − γ−1 − γm1 − γ=γm1 − γ10 − 10γ = γm−nγm−n+ 10γ − 10 = 0The roots of the above polynomial are the points at which the optimal action changes.1Q2. MDPs: Mini-GridsThe following problems take place in various scenarios of the gridworld MDP (as in Project 3). In all cases, Ais the start state and double-rectangle states are exit states. From an exit state, the only action available isExit, which results in the listed reward and ends the game (by moving into a terminal state X, not shown).From non-exit states, the agent can choose either Left or Right actions, which move the agent in the corre-sponding direction. There are no living rewards; the only non-zero rewards come from exiting the grid.Throughout this problem, assume that value iteration begins with initial values V0(s) = 0 for all states s.First, consider the following mini-grid. For now, the discount is γ = 1 and legal movement actions will alwayssucceed (and so the state transition function is deterministic).(a) What is the optimal value V∗(A)?10Since the discount γ = 1 and there are no rewards for any action other than exiting, a policy that simply headsto the right exit state and exits will accrue reward 10. This is the optimal policy, since the only alternativereward if 1, and so the optimal value function has value 10.(b) When running value iteration, remember that we start with V0(s) = 0 for all s. What is the first iterationk for which Vk(A) will be non-zero?2The first reward is accrued when the agent does the following actions (state transitons) in sequence: Left, Exit.Since two state transitions are necessary before any possible reward, two iterations are necessary for the valuefunction to become non-zero.(c) What will Vk(A) be when it is first non-zero?1As explained above, the first non-zero value function value will come from exiting out of the left exit cell, whichaccrues reward 1.(d) After how many iterations k will we have Vk(A) = V∗(A)? If they will never become equal, write never.4The value function will equal the optimal value function when it discovers this sequence of state transitions:Right, Right, Right, Exit. This will obviously happen in 4 iterations.Now the situation is as before, but the discount γ is less than 1.(e) If γ = 0.5, what is the optimal value V∗(A)?The optimal policy from A is Right, Right, Right, Exit. The rewards accrued by these state transitions are: 0,0, 0, 10. The discount values are γ0, γ1, γ2, γ3, which is 1,12,14,18. Therefore, V∗(A) = 0 + 0 + 0 +108.(f) For what range of values γ of the discount will it be optimal to go Right from A? Remember that0 ≤ γ ≤ 1. Write all or none if all or no legal values of γ have this property.2The best reward accrued with the policy of going left is γ1∗1. The best reward accrued with the policy of goingright is γ3∗ 10. We therefore have the inequality 10γ3≥ γ, which simplifies to γ ≥p1/10. The final answer is1/√10 ≤ γ ≤ 13Q3. Wandering PoetIn country B there are N cities. They are all connected by roads in a circular fashion. City 1 is connected withcity N and city 2. For 2 ≤ i ≤ N − 1, city i is conected with cities i − 1 and i + 1.A wandering poet is travelling around the country and staging shows in its different cities.He can choose to move from a city to a neighboring one by moving East or moving West, or stay in his currentlocation and recite poems to the masses, providing him with a reward of ri. If he chooses to travel from city i,there is a probability 1 − pithat the roads are closed because of B’s dragon infestation problem and he has tostay in his current location. The reward he is to reap is 0 during any successful travel day, and ri/2 when hefails to travel, because he loses only half of the day.(a) Let ri= 1 and pi= 0.5 for all i and let γ = 0.5. For 1 ≤ i ≤ N answer the following questions with realnumbers:Hint: Recall thatP∞j=0uj=11−ufor u ∈ (0, 1).(i) What is the value Vstay(i) under the policy that the wandering poet always chooses to stay?We have that for all i, the Bellman equations for policy evaluation are Vstay(i) = ri+ γVstay(i). Whenri= 1 and pi= 1 this reduces to Vstay(i) = 1 + 0.5Vstay(i) which yields Vstay(i) = 2.(ii) What is the value Vwest(i) of the policy where the wandering poet always chooses west?Vwest(1) = 0.5(12+ 0.5Vwest(1)) + 0.5(0.5Vwest(2)) (1)Since all starting states are equivalent, Vwest(1) = Vwest(2). Therefore Vwest(1) = Vwest(2) = ··· =12.(b) Let N be even, let pi= 1 for all i, and, for all i, let the reward for cities be given asri=(a i is evenb i is odd,where a and b are constants and a > b > 0.(i) Suppose we start at an even-numbered city. What is the range of values of the discount factor γ suchthat the optimal policy is to stay at the current city forever? Your answer may depend on a and b.For all possible values of γ, staying at an even city will be optimal.(ii) Suppose we start at an odd-numbered city. What is the range of values of the discount factor γ suchthat the optimal policy is to stay at the current city forever? Your answer may depend on a and b.4The poet should only move if

View Full Document