CS188: Artificial Intelligence, Fall 2010
Written 2: MDPs, RL, and Probability
Due: Thursday 10/21 in 283 Soda Drop Box by 11:59pm (no slip days)
Policy: Can be solved in groups (acknowledge collaborators) but must be written up individually.

1 Mission to Mars (10 points)

You control a solar-powered Mars rover. It can at any time drive fast or slow. You get a reward for the distance crossed, so fast gives +10 while slow gives +4. Your rover can be in one of three states: cool, warm, or off. Driving fast tends to heat up the rover, while driving slow tends to cool it down. If the rover overheats, it shuts off, forever. The transitions are shown in the table below. Because critical research depends on the observations of the rover, there is a discount of 0.9.

    s     a     s'    T(s, a, s')
    cool  slow  cool  1
    cool  fast  cool  1/4
    cool  fast  warm  3/4
    warm  slow  cool  1/4
    warm  slow  warm  3/4
    warm  fast  warm  7/8
    warm  fast  off   1/8

(a) (1 pt) How many possible deterministic policies are there?

(b) (1 pt) What is the value of the state cool under the policy that always goes slow?

(c) (1 pt) Fill in the following table of depth-limited values from value iteration for this MDP. Note that this part concerns (optimal) value iteration, not evaluation of the always-slow policy.

    s     V0(s)  V1(s)  V2(s)
    cool  0
    warm  0
    off   0      0      0

(d) (1 pt) How many rounds of value iteration will it take for the values of all states to converge exactly? (State none if you think it will never converge.)

(e) (1 pt) What is the optimal policy?

    s     π*(s)
    cool
    warm

(f) (1 pt) What are the optimal values?

    s     V*(s)
    cool
    warm
    off   0

(g) (1 pt) Central command tells you that the reward sequence [10, 4, 4] (here 10 is the first reward, then 4, then 4) is preferred to the sequence [4, 10, 10]. What ranges of the discount would be consistent with these preferences?

(h) (1 pt) Now imagine that you do not know in advance what the thermal responses of the rover will be, so you decide to do Q-learning. You observe the following sequence of transitions:

1. (cool, slow, 4) → cool
2. (cool, fast, 10) → cool
3. (cool, fast, 10) → cool
4. (cool, fast, 10) → warm
5. (warm, slow, 4) → cool

Give the Q-values for each step in this sequence as it is processed by Q-learning, assuming a learning rate (alpha) of 0.5. For example, Q3(s, a) should be the Q-values after processing transitions 1, 2, and 3.

    s     a     Q0(s,a)  Q1(s,a)  Q2(s,a)  Q3(s,a)  Q4(s,a)  Q5(s,a)
    cool  slow  0
    cool  fast  0
    warm  slow  0
    warm  fast  0

(i) (2 pt) An epsilon-greedy policy may not be the right choice here, given that the rover, once off, is lost forever. On the other hand, it may not be optimal to never risk going fast from a warm state – perhaps the planet is very cold and there is little risk. Imagine that you know that T(cool, fast, warm) = T(warm, fast, off) for all environments. Note: this property is not true for the transitions above!

State a modified Q-learning update and procedure that exploits this knowledge and from which you will learn all optimal Q-values without ever visiting the Q-state (warm, fast), assuming you do visit all other Q-states infinitely often. Be precise (i.e. use math).

2 Of Weddings and Bears (7 points)

You are the ruler of a distant kingdom, and you have just gotten engaged to your beloved. You have set a date for the wedding, but you now need to send a message to your soon-to-be in-laws. Hopefully, they will attend the wedding (A = +a), but they may not (A = ¬a).

You dispatch your smartest and fastest messenger across the barren wastes. Unfortunately, there are a number of things that could go wrong. Your messenger could be beset by a pack of wild bears (B = +b). Also, your messenger may be captured (C = +c) by a cohort of cave trolls. It is even possible that both ills could have befallen the messenger!
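The value-iteration and Q-learning computations in Problem 1 above can be sanity-checked with a short script. This is a minimal sketch, not part of the assignment: it assumes rewards depend only on the action taken, treats off as absorbing with value 0, and uses my own state/action encoding.

```python
# Sanity-check sketch for Problem 1, parts (c) and (h). Assumptions (read off
# the problem statement, not given anywhere in the assignment): rewards depend
# only on the action taken, and 'off' is absorbing with value 0.
GAMMA = 0.9  # discount
ALPHA = 0.5  # Q-learning learning rate

# T[(s, a)] = list of (next state, probability), from the transition table.
T = {
    ('cool', 'slow'): [('cool', 1.0)],
    ('cool', 'fast'): [('cool', 1 / 4), ('warm', 3 / 4)],
    ('warm', 'slow'): [('cool', 1 / 4), ('warm', 3 / 4)],
    ('warm', 'fast'): [('warm', 7 / 8), ('off', 1 / 8)],
}
R = {'slow': 4.0, 'fast': 10.0}  # reward per action

def backup(V):
    """One round of value iteration: V_{k+1}(s) = max_a [R(a) + gamma E[V_k(s')]]."""
    out = {'off': 0.0}
    for s in ('cool', 'warm'):
        out[s] = max(R[a] + GAMMA * sum(p * V[s2] for s2, p in T[(s, a)])
                     for a in ('slow', 'fast'))
    return out

V = {'cool': 0.0, 'warm': 0.0, 'off': 0.0}  # V0
for k in (1, 2):
    V = backup(V)
    print(f"V{k}:", V)  # compare with the part (c) table

# Part (h): replay the observed episode with the standard Q-learning update
# Q(s,a) <- (1 - alpha) Q(s,a) + alpha [r + gamma max_a' Q(s',a')].
Q = {(s, a): 0.0 for s in ('cool', 'warm') for a in ('slow', 'fast')}
episode = [('cool', 'slow', 4, 'cool'), ('cool', 'fast', 10, 'cool'),
           ('cool', 'fast', 10, 'cool'), ('cool', 'fast', 10, 'warm'),
           ('warm', 'slow', 4, 'cool')]
for i, (s, a, r, s2) in enumerate(episode, start=1):
    sample = r + GAMMA * max(Q[(s2, b)] for b in ('slow', 'fast'))
    Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * sample
    print(f"Q{i}:", Q)  # compare column by column with the part (h) table
```

Running the script prints one line per value-iteration round and per processed transition, which can be compared entry by entry against the hand-filled tables.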
Using your knowledge about the dangers of bears, of cave trolls, and of your in-laws, you have tallied out the following joint probability table:

    A   B   C   P(A, B, C)
    +a  +b  +c  0.0000
    +a  +b  ¬c  0.0008
    +a  ¬b  +c  0.0396
    +a  ¬b  ¬c  0.6336
    ¬a  +b  +c  0.0020
    ¬a  +b  ¬c  0.0072
    ¬a  ¬b  +c  0.1584
    ¬a  ¬b  ¬c  0.1584

(a) (1 pt) What is the distribution P(B, C)? Your answer should be in the form of a table.

(b) (1 pt) Are B and C independent? Justify your answer using the actual probabilities computed in part (a).

(c) (1 pt) Because you are also a naturalist, you are curious as to what the probability of a bear attack is in your model. What is P(B), the marginal distribution over B given no evidence? (This is sometimes called the prior distribution, here the prior distribution of a bear attack.)

(d) (1 pt) If your in-laws do not attend the wedding (A = ¬a), what is the conditional distribution over bear attack, P(B | A = ¬a)? (This is often called a posterior distribution.) Does it make intuitive sense how the answers have shifted from part (c)?

(e) (1 pt) Suppose you further learn that your messenger was captured by cave trolls. What is the new posterior distribution P(B | A = ¬a, C = +c)? Is the conditional probability of bear attack higher or lower than in part (d)? Does this make intuitive sense? The general phenomenon where the discovery that one possible cause is true decreases belief in other possible causes is called explaining away.

Now, suppose you are making your own decision (D) about whether (+d) or not (¬d) to host the wedding at all. You have no evidence about B or C, but your decision crucially depends on A. If both you and your in-laws attend the wedding, then everyone will be happy. If only you attend, you will be greatly embarrassed, perhaps to the point that you will have to declare war on your in-laws just to save face! However, if only your in-laws show up, they will be extraordinarily embarrassed – as will you – and there will certainly be a large and costly war.
Finally, if neither of you attend the wedding, then there will only be the slight embarrassment of having to reschedule.

Based on all these considerations, you create the utility function (U) below, which relates the amount of embarrassment suffered under the possible outcomes. (Note that a 1 on the U scale is the amount of embarrassment one suffers when you talk to someone while you have lettuce in your teeth for 1 hour straight.)

    A   D   U
    +a  +d  -1.0
    +a  ¬d  -10000.0
    ¬a  +d  -5000.0
    ¬a  ¬d  -10.0

(f) (1 pt) Given this chart and the probabilities above, what is the expected utility E[U | D = +d] if you decide to attend, and the expected utility E[U | D = ¬d] if you decide not to attend?

(g) (1 pt) Now suppose that, before you must make your decision, you learn that your messenger was beset by bears (B =
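The quantities asked for in Problem 2, parts (a) through (f), can all be checked mechanically against the joint table. Below is a minimal sketch; the True/False encoding of + and ¬, and the helper `marginal`, are my own and not part of the assignment.

```python
# Sanity-check sketch for Problem 2: marginals, posteriors, and expected
# utilities computed directly from the joint table P(A, B, C).
from itertools import product

# P[(a, b, c)] with True = + and False = ¬ (my own encoding).
P = {
    (True,  True,  True):  0.0000, (True,  True,  False): 0.0008,
    (True,  False, True):  0.0396, (True,  False, False): 0.6336,
    (False, True,  True):  0.0020, (False, True,  False): 0.0072,
    (False, False, True):  0.1584, (False, False, False): 0.1584,
}

def marginal(pred):
    """Sum the joint over all assignments (a, b, c) satisfying pred."""
    return sum(p for (a, b, c), p in P.items() if pred(a, b, c))

# (a) P(B, C): marginalize out A.
P_bc = {(b, c): marginal(lambda a, b2, c2: (b2, c2) == (b, c))
        for b, c in product([True, False], repeat=2)}

# (c) prior P(+b); (d), (e) posteriors via P(X | e) = P(X, e) / P(e).
p_b = marginal(lambda a, b, c: b)
p_b_given_not_a = (marginal(lambda a, b, c: b and not a)
                   / marginal(lambda a, b, c: not a))
p_b_given_not_a_c = (marginal(lambda a, b, c: b and not a and c)
                     / marginal(lambda a, b, c: not a and c))

# (f) expected utilities: E[U | D = d] = sum_a P(a) * U(a, d).
U = {(True, True): -1.0, (True, False): -10000.0,
     (False, True): -5000.0, (False, False): -10.0}
p_a = marginal(lambda a, b, c: a)
eu = {d: p_a * U[(True, d)] + (1 - p_a) * U[(False, d)]
      for d in (True, False)}

print("P(B, C) table:", P_bc)
print("P(+b) =", p_b)
print("P(+b | ¬a) =", p_b_given_not_a)
print("P(+b | ¬a, +c) =", p_b_given_not_a_c)
print("E[U | +d] =", eu[True], " E[U | ¬d] =", eu[False])
```

Comparing the two posteriors printed above also illustrates the explaining-away effect mentioned in part (e): conditioning on the troll capture (+c) lowers the posterior probability of a bear attack.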