Reinforcement Learning
Section Handout – CS 188
December 6, 2005

1 The Cookie Game

Our agent loves cookies! She loves to smell them and eat them. However, she is strictly forbidden from enjoying her prize. If she gets caught smelling or eating a cookie, she gets locked in the dungeon for the rest of the day. Every day, however, she ends up smelling the cookies once again.

State:        Dungeon     Smell Cookie    Eat Cookie
Type:         Terminal    Start           Terminal
Reward:       -100        +1              +20
Utility(A):
Utility(B):
Approx:       -10         +5              +5

Figure 1: The cookie game as an MDP

When smelling the cookie, she can either keep smelling (K) or grab one (G). The distribution of states she will end up in is as follows:

• K: (Dungeon 0.1) (Smell Cookie 0.9)
• G: (Dungeon 0.5) (Eat Cookie 0.5)

Question: She considers two policies: always keep smelling (A) or always try to grab one (B). Which is better? What is the utility of each state under each policy?

2 Expectimax Search Decision

The agent doesn't know about value iteration. Instead, she uses depth-1 expectimax search to make her policy decision each day. She approximates the utility of each state using a simple linear feature function:

• Feature: Are there cookies? Weight: +5
• Feature: Is there a dungeon? Weight: -10

Questions: What policy does she choose? Can you adjust the weights so she chooses otherwise?

3 Temporal Difference Learning

Let's divide up our feature space so that each state has its own binary feature. Initial weights are (-10, 5, 5) for (inDungeon, Smelling, Eating). Now each weight corresponds to an approximate utility for a particular state. The key to learning quickly is the Bellman equation for a policy π:

    U^π(s) = R(s) + γ Σ_{s'} T(s, π(s), s') U^π(s')                (1)

We update our weights according to a variant of this equation that involves only the two states actually observed in some trial of the game:

    U^π(s) ← U^π(s) + α (R(s) + γ U^π(s') − U^π(s))                (2)

3.1 Passive learning

The goal in passive learning is to learn the true utility of states under a particular policy, which is provided. Passive learning doesn't explicitly change the agent's behavior.

Question: With the fixed expectimax policy of Section 2, will the agent learn correct utilities?

3.2 Active learning

When the agent actually uses the learned utilities to make a decision at each point, we have active learning.

Question: How should the agent use its learned utilities?

Question: With an active learning policy, will the agent learn correct utilities? Will it also settle upon the optimal policy?

3.3 Function approximation

For linear approximations, we have a rather simple generalization of the standard TD update:

    θ_i ← θ_i + α (R(s) + γ Û_θ(s') − Û_θ(s)) f_i(s)

where f_i(s) is the value of feature i in state s, so that Û_θ(s) = Σ_i θ_i f_i(s).
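The questions above can be checked numerically. To make the policy comparison of Section 1 concrete, here is a minimal Python sketch of policy evaluation for the cookie game. It assumes an undiscounted game (γ = 1, since the handout does not state a discount) and treats terminal utilities as equal to their rewards; the shortened state names are ours.

    GAMMA = 1.0
    REWARD = {'Dungeon': -100, 'Smell': +1, 'Eat': +20}

    # Transition model from the 'Smell' state, as given in Section 1.
    T = {
        'K': [(0.1, 'Dungeon'), (0.9, 'Smell')],   # keep smelling
        'G': [(0.5, 'Dungeon'), (0.5, 'Eat')],     # grab a cookie
    }

    def evaluate(action, sweeps=500):
        """Evaluate the policy that always takes `action` in the Smell state."""
        U = {'Dungeon': REWARD['Dungeon'], 'Eat': REWARD['Eat'], 'Smell': 0.0}
        for _ in range(sweeps):
            U['Smell'] = REWARD['Smell'] + GAMMA * sum(
                p * U[s2] for p, s2 in T[action])
        return U['Smell']

    print(evaluate('K'))   # policy A: converges to -90
    print(evaluate('G'))   # policy B: -39, so always grabbing is better here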
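A similar sketch covers the depth-1 expectimax decision of Section 2. The feature values follow the Approx row of Figure 1; the helper names (approx_utility, expectimax_depth1) are illustrative, not from the handout.

    FEATURES = {               # state -> (are there cookies?, is there a dungeon?)
        'Dungeon': (0, 1), 'Smell': (1, 0), 'Eat': (1, 0),
    }
    WEIGHTS = (+5, -10)        # (cookies, dungeon)

    T = {
        'K': [(0.1, 'Dungeon'), (0.9, 'Smell')],
        'G': [(0.5, 'Dungeon'), (0.5, 'Eat')],
    }

    def approx_utility(state):
        """Linear feature estimate; matches the Approx row of Figure 1."""
        return sum(w * f for w, f in zip(WEIGHTS, FEATURES[state]))

    def expectimax_depth1():
        """Choose the action with the highest expected approximate utility."""
        return max(T, key=lambda a: sum(p * approx_utility(s2) for p, s2 in T[a]))

    # Q(K) = 0.1*(-10) + 0.9*(+5) = +3.5 and Q(G) = 0.5*(-10) + 0.5*(+5) = -2.5,
    # so expectimax picks K (policy A), the worse policy under the γ = 1 analysis.
    print(expectimax_depth1())   # -> 'K'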
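The TD update (2) can also be simulated directly. Below is a passive-learning sketch for Section 3.1 under the fixed "always keep smelling" policy. Here γ = 1 and the learning rate α = 0.05 are our assumptions, and updating a terminal state toward its own reward is a standard convention, not something the handout specifies.

    import random

    GAMMA, ALPHA = 1.0, 0.05
    REWARD = {'Dungeon': -100, 'Smell': +1, 'Eat': +20}
    T = {'K': [(0.1, 'Dungeon'), (0.9, 'Smell')],
         'G': [(0.5, 'Dungeon'), (0.5, 'Eat')]}
    TERMINAL = ('Dungeon', 'Eat')

    def step(state, action):
        """Sample a successor state from the transition model."""
        r, total = random.random(), 0.0
        for p, s2 in T[action]:
            total += p
            if r < total:
                return s2
        return s2                    # guard against floating-point round-off

    # One binary feature per state, so each weight is that state's utility.
    U = {'Dungeon': -10.0, 'Smell': 5.0, 'Eat': 5.0}   # initial weights

    for trial in range(20000):
        s = 'Smell'
        while s not in TERMINAL:
            s2 = step(s, 'K')                          # fixed policy A
            U[s] += ALPHA * (REWARD[s] + GAMMA * U[s2] - U[s])
            s = s2
        U[s] += ALPHA * (REWARD[s] - U[s])             # terminal: target is R(s)

    # U['Smell'] hovers near -90 and U['Dungeon'] near -100, but U['Eat'] is
    # never updated: policy A never eats, which is the catch in Section 3.1.
    print(U)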
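Finally, a sketch of the linear-approximation update of Section 3.3, written as a single function. Multiplying the TD error by the feature value f_i(s) is the standard linear-TD rule; the feature function reuses the two features of Section 2, and the names are illustrative.

    ALPHA, GAMMA = 0.05, 1.0

    def features(state):
        """f(s): the two binary features of Section 2, as a tuple."""
        has_cookies = 1.0 if state in ('Smell', 'Eat') else 0.0
        has_dungeon = 1.0 if state == 'Dungeon' else 0.0
        return (has_cookies, has_dungeon)

    def u_hat(theta, state):
        """Linear approximate utility: U_theta(s) = sum_i theta_i * f_i(s)."""
        return sum(t * f for t, f in zip(theta, features(state)))

    def linear_td_update(theta, s, reward, s2):
        """theta_i <- theta_i + alpha * (R(s) + gamma*U(s') - U(s)) * f_i(s)."""
        delta = reward + GAMMA * u_hat(theta, s2) - u_hat(theta, s)
        return tuple(t + ALPHA * delta * f for t, f in zip(theta, features(s)))

    theta = (+5.0, -10.0)              # the weights of Section 2
    # One observed transition: smelled, got caught, landed in the dungeon.
    theta = linear_td_update(theta, 'Smell', +1, 'Dungeon')
    print(theta)                       # the cookie weight drops to 4.3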