New version page

# Berkeley CS 188 - Section Handout 4 Solutions

Pages: 11
Documents in this Course

## This preview shows page 1-2-3-4 out of 11 pages.

View Full Document

End of preview. Want to read all 11 pages?

View Full Document
Unformatted text preview:

CS 188Fall 2019 Section Handout 4 SolutionsThe preamble is an abbreviation of the lecture notesMarkov Decision ProcessesA Markov Decision Process is defined by several properties:• A set of states S• A set of actions A.• A start state.• Possibly one or more terminal states.• Possibly a discount factor γ.• A transition function T (s, a, s0).• A reward function R(s, a, s0).The Bellman Equation• V∗(s) – the optimal value of s is the expected value of the utility an optimally-behaving agent that startsin s will receive, over the rest of the agent’s lifetime.• Q∗(s, a) - the optimal value of (s, a) is the expected value of the utility an agent receives after starting ins, taking a, and acting optimally henceforth.Using these two new quantities and the other MDP quantities discussed earlier, the Bellman equation is definedas follows:V∗(s) = maxaXs0T (s, a, s0)[R(s, a, s0) + γV∗(s0)]We can also define he equation for the optimal value of a q-state (more commonly known as an optimal q-value):Q∗(s, a) =Xs0T (s, a, s0)[R(s, a, s0) + γV∗(s0)]which allows us to reexpress the Bellman equation asV∗(s) = maxaQ∗(s, a).1Value IterationThe time-limited value for a state s with a time-limit of k timesteps is denoted Vk(s), and represents themaximum expected utility attainable from s given that the Markov decision process under consideration ter-minates in k timesteps. Equivalently, this is what a depth-k expectimax run on the search tree for a MDPreturns.Value iteration is a dynamic programming algorithm that uses an iteratively longer time limit to computetime-limited values until convergence (that is, until the V values are the same for each state as they were in thepast iteration: ∀s, Vk+1(s) = Vk(s)). It operates as follows:1. ∀s ∈ S, initialize V0(s) = 0. This should be intuitive, since setting a time limit of 0 timesteps means noactions can be taken before termination, and so no rewards can be acquired.2. Repeat the following update rule until convergence:∀s ∈ S, Vk+1(s) ← maxaXs0T (s, a, s0)[R(s, a, s0) + γVk(s0)]At iteration k of value iteration, we use the time-limited values for with limit k for each state to generatethe time-limited values with limit (k + 1). In essence, we use computed solutions to subproblems (all theVk(s)) to iteratively build up solutions to larger subproblems (all the Vk+1(s)); this is what makes valueiteration a dynamic programming algorithm.21 MDPs: Micro-BlackjackIn micro-blackjack, you repeatedly draw a card (with replacement) that is equally likely to be a 2, 3, or 4. Youcan either Draw or Stop if the total score of the cards you have drawn is less than 6. If your total score is 6 orhigher, the game ends, and you receive a utility of 0. When you Stop, your utility is equal to your total score(up to 5), and the game ends. When you Draw, you receive no utility. There is no discount (γ = 1). Let’sformulate this problem as an MDP with the following states: 0, 2, 3, 4, 5 and a Done state, for when the gameends.(a) What is the transition function and the reward function for this MDP? The transition function isT (s, Stop, Done) = 1T (0, Draw, s0) = 1/3 for s0∈ {2, 3, 4}T (2, Draw, s0) = 1/3 for s0∈ {4, 5, Done}T (3, Draw, s0) =1/3 if s0= 52/3 if s0= DoneT (4, Draw, Done) = 1T (5, Draw, Done) = 1T (s, a, s0) = 0 otherwiseThe reward function isR(s, Stop, Done) = s, s ≤ 5R(s, a, s0) = 0 otherwise(b) Fill in the following table of value iteration values for the first 4 iterations.States 0 2 3 4 5V00 0 0 0 0V10 2 3 4 5V23 3 3 4 5V310/3 3 3 4 5V410/3 3 3 4 5(c) You should have noticed that value iteration converged above. What is the optimal policy for the MDP?States 0 2 3 4 5π∗Draw Draw Stop Stop Stop(d) Perform one iteration of policy iteration for one step of this MDP, starting from the fixed policy below:States 0 2 3 4 5πiDraw Stop Draw Stop DrawVπi2 2 0 4 0πi+1Draw Stop Stop Stop Stop3Q2. MDPs: Grid-World Water Par kConsider the MDP drawn below. The state space consists of all squares in a grid-world water park. There isa single waterslide that is composed of two ladder squares and two slide squares (marked with vertical barsand squiggly lines respectively). An agent in this water park can move from any square to any neighboringsquare, unless the current square is a slide in which case it must move forward one square along the slide. Theactions are denoted by arrows between squares on the map and all deterministically move the agent in the givendirection. The agent cannot stand still: it must move on each time step. Rewards are also shown below: theagent feels great pleasure as it slides down the water slide (+2), a certain amount of discomfort as it climbs therungs of the ladder (-1), and receives rewards of 0 otherwise. The time horizon is infinite; this MDP goes onforever.(a) How many (deterministic) policies π are possible for this MDP?211(b) Fill in the blank cells of this table with values that are correct for the corresponding function, discount,and state. Hint: You should not need to do substantial calculation here.γ s = A s = EV∗3(s) 1.0 0 4V∗10(s) 1.0 2 4V∗10(s) 0.1 0 2.2Q∗1(s, west) 1.0 —— 0Q∗10(s, west) 1.0 —— 3V∗(s) 1.0 ∞ ∞V∗(s) 0.1 0 2.24V∗10(A), γ = 1: In 10 time steps with no discounting, the rewards don’t decay, so the optimal strategy is to climbthe two stairs (-1 reward each), and then slide down the two slide squares (+2 rewards each). You only havetime to do this once. Summing this up, we get −1 − 1 + 2 + 2 = 2.V∗10(E), γ = 1: No discounting, so optimal strategy is sliding down the slide. That’s all you have time for. Sumof rewards = 2 + 2 = 4.V∗10(A), γ = 0.1. The discount rate is 0.1, meaning that rewards 1 step further into the future are discountedby a factor of 0.1. Let’s assume from A, we went for the slide. Then, we would have to take the actionsA → B, B → C, C → D, D → E, E → F, F → G. We get the first -1 reward from C → D, discounted by γ2since it is two actions in the future. D → E is discounted by γ3, E → F by γ4, and F → G by γ5. Since γis low, the positive rewards you get from the slide have less of an effect as the larger negative rewards you getfrom climbing up. Hence, the sum of rewards of taking the slide path would be negative; the optimal value is 0.V∗10(E), γ = 0.1. Now, you don’t have to do the work of climbing up the stairs, and you just take the slide down.Sum of rewards would be

View Full Document Unlocking...