Berkeley COMPSCI 188 - MDPs, RL, and Probability (7 pages)

MDPs, RL, and Probability




School: University of California, Berkeley
Course: COMPSCI 188, Introduction to Artificial Intelligence

CS188: Artificial Intelligence, Fall 2010
Written 2: MDPs, RL, and Probability
Due Thursday 10/21 in 283 Soda Drop Box by 11:59pm (no slip days)
Policy: Can be solved in groups (acknowledge collaborators) but must be written up individually.

1. Mission to Mars (10 points)

You control a solar-powered Mars rover. It can, at any time, drive fast or slow. You get a reward for the distance crossed, so fast gives 10 while slow gives 4. Your rover can be in one of three states: cool, warm, or off. Driving fast tends to heat up the rover, while driving slow tends to cool it down. If the rover overheats, it shuts off forever. The transitions are shown in the table below. Because critical research depends on the observations of the rover, there is a discount of 0.9.

    s     a     s'    T(s, a, s')
    cool  slow  cool  1
    cool  fast  cool  1/4
    cool  fast  warm  3/4
    warm  slow  cool  1/4
    warm  slow  warm  3/4
    warm  fast  warm  7/8
    warm  fast  off   1/8

(a) (1 pt) How many possible deterministic policies are there?

(b) (1 pt) What is the value of the state cool under the policy that always goes slow?

(c) (1 pt) Fill in the following table of depth-limited values from value iteration for this MDP. Note that this part concerns optimal value iteration, not evaluation of the always-slow policy.

    s     V0(s)  V1(s)  V2(s)
    cool  0
    warm  0
    off   0      0      0

(d) (1 pt) How many rounds of value iteration will it take for the values of all states to converge exactly? State "none" if you think it will never converge.

(e) (1 pt) What is the optimal policy π*(s)?

    s     π*(s)
    cool
    warm

(f) (1 pt) What are the optimal values V*(s)?

    s     V*(s)
    cool
    warm
    off   0

(g) (1 pt) Central command tells you that the reward sequence 10, 4, 4 (here 10 is the first reward, then 4, then 4) is preferred to the sequence 4, 10, 10. What ranges of the discount would be consistent with these preferences?

(h) (1 pt) Now imagine that you do not know in advance what the thermal responses of the rover will be, so you decide to do Q-learning. You observe the following sequence of transitions (s, a, r, s'):

    1. (cool, slow, 4, cool)
    2. (cool, fast, 10, cool)
    3. (cool, fast, 10, cool)
    4. (cool, fast, 10, warm)
    5. (warm, slow, 4, cool)
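As a sanity check on parts (b)–(f), depth-limited value iteration for this MDP can be sketched in a few lines. Two assumptions in this sketch (my reading of the problem statement, not spelled out in the preview): the reward depends only on the action taken (fast = 10, slow = 4), and "off" is an absorbing terminal state with value 0.

```python
# Depth-limited value iteration for the rover MDP above (an illustrative
# sketch, not a substitute for the hand computation the assignment asks for).
GAMMA = 0.9  # discount from the problem statement

# T[(s, a)] -> list of (s', probability), copied from the transition table
T = {
    ("cool", "slow"): [("cool", 1.0)],
    ("cool", "fast"): [("cool", 0.25), ("warm", 0.75)],
    ("warm", "slow"): [("cool", 0.25), ("warm", 0.75)],
    ("warm", "fast"): [("warm", 0.875), ("off", 0.125)],
}
REWARD = {"slow": 4.0, "fast": 10.0}  # assumed: reward is a function of the action


def value_iteration(depth):
    """Return V_depth(s) for every state, starting from V_0 = 0."""
    V = {"cool": 0.0, "warm": 0.0, "off": 0.0}
    for _ in range(depth):
        new_V = {"off": 0.0}  # terminal state: no actions, no reward
        for s in ("cool", "warm"):
            # Bellman backup: max over actions of expected reward-to-go
            new_V[s] = max(
                REWARD[a] + GAMMA * sum(p * V[s2] for s2, p in T[(s, a)])
                for a in ("slow", "fast")
            )
        V = new_V
    return V


if __name__ == "__main__":
    for k in range(3):
        print(k, value_iteration(k))
```

Running this for increasing depths traces out the V0, V1, V2 columns of the table in part (c); iterating until successive value dictionaries stop changing addresses the convergence question in part (d).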

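For part (h), the standard tabular Q-learning update applied to the five observed transitions can be sketched as follows. The learning rate alpha = 0.5 is an assumption for illustration only; the preview cuts off before the assignment specifies one, so substitute whatever the full problem statement gives.

```python
# Tabular Q-learning on the observed transitions from part (h), using the
# standard update  Q(s,a) <- (1 - alpha) Q(s,a) + alpha [r + gamma max_a' Q(s',a')].
GAMMA = 0.9
ALPHA = 0.5  # assumed learning rate, not from the problem statement
ACTIONS = ("slow", "fast")


def max_q(Q, s):
    """Max over actions available in s; the terminal 'off' state contributes 0."""
    if s == "off":
        return 0.0
    return max(Q[(s, a)] for a in ACTIONS)


def q_update(Q, s, a, r, s2):
    """One Q-learning update from the sample (s, a, r, s')."""
    sample = r + GAMMA * max_q(Q, s2)
    Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * sample


# The five observed transitions (s, a, r, s') listed in part (h)
transitions = [
    ("cool", "slow", 4.0, "cool"),
    ("cool", "fast", 10.0, "cool"),
    ("cool", "fast", 10.0, "cool"),
    ("cool", "fast", 10.0, "warm"),
    ("warm", "slow", 4.0, "cool"),
]

Q = {(s, a): 0.0 for s in ("cool", "warm") for a in ACTIONS}
for s, a, r, s2 in transitions:
    q_update(Q, s, a, r, s2)
```

Note that each update uses the Q-values as they stand at that point in the sequence, so the transitions must be processed in the order observed.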

