CS188: Artificial Intelligence, Fall 2009
Written 2: MDPs, RL, and Probability
Due: Thursday 10/15 in 283 Soda Drop Box by 11:59pm (no slip days)
Policy: Can be solved in groups (acknowledge collaborators) but must be written up individually.

1 Traffic Quest (6 points)

You are designing the AI for an automated taxi. Its route graph is made up of intersection nodes n ∈ N connected by roads r, where traversing each road r takes time c(r) when the road is clear but time b(r) when busy with traffic. For example, the freeways may be fast when clear but slow in traffic, while surface streets might always be somewhere in between. The agent cannot know whether a road it embarks on will be clear, but it does know that if one road segment is clear (busy), the next will be clear (busy) with probability p. Similarly, if a road is clear (busy), the next segment will be busy (clear) with probability 1 − p. An example is shown to the right.

(a) (2 points) Assume the agent knows whether it just experienced a busy road. Assume also that the traffic change probability is known to be 1 − p. Formulate the problem of reaching a fixed destination node g in minimal time as an MDP. You may wish to use the notation out(n) (in(n)) for the set of roads leaving (entering) a node n, and end(r) (start(r)) for the end node (start node) of a road r. Answer for the general case, not the specific graph above.

States:
Actions:
Transition function:
Reward function:
Discount:

(b) (1 point) In a general map, how many iterations of value iteration will it take before all states have their exact values? Give the smallest bound you can and justify your answer.

(c) (2 points) Imagine the taxi must reach the airport at g by the flight time f. Now assume that the agent only cares about getting to the airport on time, i.e. on or before f. The agent's utility should be +100 if it is on time and -100 if it is late. How should we alter the original MDP? State the new MDP below. Leave any unchanged aspects blank. Answer for the general case, not the specific graph above.

States:
Actions:
Transition function:
Reward function:
Discount:

(d) (1 point) MDPs which are time-sensitive often give rise to what are called non-stationary policies, where the actions depend on the time remaining. Give an example from the example above where the optimal action depends on the time remaining. Clearly state the values, actions, times, etc. involved.
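The value-iteration sketch below (Python) may be a useful reference point for part (b). It is not the formulation part (a) asks for: the dictionary encoding, the one-road example graph, its travel times, and the persistence probability p = 0.8 are all illustrative assumptions, since the actual example graph appears only in the figure.

# A generic value-iteration sketch.  An MDP is encoded as a dict:
#   mdp[state][action] = list of (probability, next_state, reward) triples,
# and a state with no actions is treated as terminal.  The one-road example
# below (clear time 2, busy time 10, persistence probability p = 0.8) is an
# illustrative assumption; the graph for Problem 1 is only given in the figure.

def value_iteration(mdp, gamma=1.0, sweeps=50):
    """Run a fixed number of Bellman backups and return the state values."""
    V = {s: 0.0 for s in mdp}
    for _ in range(sweeps):
        new_V = {}
        for s, actions in mdp.items():
            if not actions:                     # terminal state
                new_V[s] = 0.0
                continue
            new_V[s] = max(
                sum(prob * (reward + gamma * V[s2])
                    for prob, s2, reward in outcomes)
                for outcomes in actions.values())
        V = new_V
    return V

p = 0.8  # assumed probability that traffic conditions persist on the next segment
example_mdp = {
    # States pair the current node with the condition of the road just driven;
    # travel times enter as negative rewards, so maximizing value minimizes time.
    ('n1', 'clear'): {'r1': [(p,     ('g', 'clear'),  -2.0),
                             (1 - p, ('g', 'busy'),  -10.0)]},
    ('n1', 'busy'):  {'r1': [(p,     ('g', 'busy'),  -10.0),
                             (1 - p, ('g', 'clear'),  -2.0)]},
    ('g', 'clear'): {},  # destination reached: terminal
    ('g', 'busy'):  {},
}

print(value_iteration(example_mdp))

In this toy case the values are exact after a single sweep because every action leads directly to a terminal state; part (b) asks what the corresponding bound is for a general map.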
2 Reward Shaping (9 points)

In an MDP, the instantaneous rewards R from a state are only part of the value V of the state, which includes future rewards as well. States may have high outgoing rewards but low value, or vice versa, which can impede learning in a reinforcement learning setting. One common technique is called reward shaping, where an agent's instantaneous rewards are modified to encourage faster learning, usually by "forwarding" future rewards. For example, a taxi might get rewards for each mile closer it gets to its destination.

Consider a fixed MDP M with states S, actions A, transitions T, discount γ, and rewards R. An example is shown below:

In the example, the discount is 1.0 and the rewards are all at the end of the game. Your answers to the next parts should be in the format of brief tables.

(a) (1 point) Compute the value of each state under the policy which always chooses action x.

(b) (1 point) Compute the optimal value of each state.

(c) (1 point) Compute the optimal q-value of each q-state.

(d) (1 point) What is the optimal policy?

(e) (1 point) If we were q-learning and observed the episodes [a, x, 0, b, y, 6, e] and [a, y, 0, c, x, 8, f], what would the final q-values be? Assume a learning rate of 0.5.

Given any MDP, we can compute an equivalent MDP called the pushed MDP M^P by changing the rewards as follows. We define R^P(s, a, s') = Q*(s, a) − γV*(s'), where all values are from the original MDP M. Of course, computing the pushed MDP is in general no easier than solving the original MDP.

(f) (2 points) Prove that the optimal Q-values in M^P will be the same as for M (in general, not this graph in specific). You may assume that there is an ordering on states s such that transitions can only result in states later in the ordering. Hint: Begin with the Bellman equations which define Q* in M^P and look for opportunities to apply definitions of other quantities.

(g) (1 point) If we were q-learning and observed the same transitions as in (e) but in the pushed MDP, what would the modified sequence of rewards be?

Learning from a pushed MDP can be faster than learning from the original MDP, though to really improve learning it is often necessary to alter rewards in ways which change the optimal Q-values. In particular, we can sometimes "shape" rewards by giving the agent immediate points for doing something we believe to be progressing in a good way, such as making progress toward a subgoal. That is, we can modify the rewards by "front-loading" them so that moving to high value states gives large immediate rewards rather than simply access to large future rewards.

(h) (1 point) In general, if we are doing reinforcement learning, why would we prefer an agent to get front-loaded instantaneous rewards which are good indicators of total rewards as opposed to getting all rewards at the end of the game? Give an answer which holds even if the discount is 1.

3 Conformant Search, The Sequel (4 points)

Consider again an agent in a maze-like grid, as shown to the right. Initially, the agent might be in any location x (including the exit e). The agent can move in any direction (N, S, E, W). Moving into a wall is a legal action, but does not change the agent's actual position. Formally, let post(ℓ, d) be the possibly unchanged location resulting from choosing direction d from location ℓ. Similarly, let pre(ℓ, d) be the possibly empty set of locations from which choosing direction d would result in location ℓ. The agent is still trying to reach a designated exit location e where it can be rescued.

Imagine now that the agent's movement actions may fail, causing it to stay in place with probability f. In this case, the agent can never be completely sure where it is, no matter what clever actions it takes. Imagine the agent also has a new action a = Z which signals for pick-up at (it hopes) the exit. The agent can only use this action once, at which point the game ends. The utility for using Z if the agent is actually at the exit location is 0, but -1000 elsewhere.

(a) (1 point) Assuming that the agent currently believes itself to be in each
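Returning to Problem 2(e): the q-learning update Q(s, a) ← (1 − α) Q(s, a) + α (r + γ max over a' of Q(s', a')) can be replayed mechanically over the two given episodes. The Python sketch below assumes the action set is just {x, y} and treats e and f as terminal with zero future value (consistent with all rewards arriving at the end of the game); beyond that, the replay does not need the transition graph from the figure.

# Q-learning replay for episodes in the Problem 2(e) format:
#   [s0, a0, r0, s1, a1, r1, s2], i.e. alternating state, action, reward.
# Assumptions: the action set is {x, y}; e and f are terminal, so their
# future value is 0 (no Q entries are ever created for them).

from collections import defaultdict

ALPHA, GAMMA = 0.5, 1.0          # learning rate from the problem; discount 1.0
Q = defaultdict(float)           # Q[(state, action)], initialized to 0

def max_q(state, actions=('x', 'y')):
    """Current estimate of the best achievable value from state."""
    return max(Q.get((state, a), 0.0) for a in actions)

def replay(episode):
    """Apply one q-learning update per observed (s, a, r, s') transition."""
    for i in range(0, len(episode) - 1, 3):
        s, a, r, s2 = episode[i], episode[i + 1], episode[i + 2], episode[i + 3]
        sample = r + GAMMA * max_q(s2)
        Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * sample

replay(['a', 'x', 0, 'b', 'y', 6, 'e'])
replay(['a', 'y', 0, 'c', 'x', 8, 'f'])
print(dict(Q))

For part (g), the same replay would be run with each observed reward replaced by the pushed reward R^P(s, a, s') = Q*(s, a) − γV*(s') computed from the original MDP.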

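For the belief-state bookkeeping in Problem 3, the update under the noisy movement model can be sketched as below. The 2x2 wall-free grid, the failure probability f = 0.2, the move chosen, and the exit location are all illustrative assumptions; the actual maze appears only in the figure.

# Belief update for Problem 3's noisy movement model: with probability f the
# move fails and the agent stays put; with probability 1 - f it moves to
# post(loc, d), which may equal loc when d runs into a wall.  Equivalently,
#   b'(l') = f * b(l') + (1 - f) * sum of b(l) over l in pre(l', d).
# The 2x2 wall-free grid, f = 0.2, and the exit at (1, 0) are assumptions.

f = 0.2
WIDTH, HEIGHT = 2, 2
LOCATIONS = [(x, y) for x in range(WIDTH) for y in range(HEIGHT)]
MOVES = {'N': (0, 1), 'S': (0, -1), 'E': (1, 0), 'W': (-1, 0)}

def post(loc, d):
    """Location reached by choosing direction d from loc (unchanged at a wall)."""
    nxt = (loc[0] + MOVES[d][0], loc[1] + MOVES[d][1])
    return nxt if nxt in LOCATIONS else loc

def update_belief(belief, d):
    """Push the belief through one (possibly failing) move in direction d."""
    new_belief = {loc: f * belief.get(loc, 0.0) for loc in LOCATIONS}
    for loc, mass in belief.items():
        new_belief[post(loc, d)] += (1 - f) * mass
    return new_belief

# Uniform initial belief: the agent could start anywhere, including the exit.
belief = {loc: 1.0 / len(LOCATIONS) for loc in LOCATIONS}
belief = update_belief(belief, 'E')

# Expected utility of signalling Z right now, with the exit e assumed at (1, 0):
# 0 with probability belief[e], and -1000 otherwise.
exit_loc = (1, 0)
print(-1000.0 * (1.0 - belief[exit_loc]))

Note that the update conserves total probability: the f * b(ℓ') term accounts for failed moves, and the summed term accounts for successful moves arriving from pre(ℓ', d).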
