Reinforcement Learning
Section Handout – CS 188
December 6, 2005

1 The Cookie Game

Our agent loves cookies! She loves to smell them and eat them. However, her prize is strictly forbidden: if she gets caught smelling or eating a cookie, she gets locked in the dungeon for the rest of the day. Every day, however, she ends up smelling the cookies once again.

State:        Dungeon     Smell Cookie    Eat Cookie
Type:         Terminal    Start           Terminal
Reward:       -100        +1              +20
Utility(A):
Approx:       -10         +5              +5
Utility(B):

Figure 1: The cookie game as an MDP

When smelling the cookie, she can either keep smelling (K) or grab one (G). The distribution of states she will end up in is as follows:

• K: (Dungeon 0.1) (Smell Cookie 0.9)
• G: (Dungeon 0.5) (Eat Cookie 0.5)

Question: She considers two policies: always keep smelling (A) or always try to grab one (B). Which is better? What is the utility of each state under each policy?

2 Expectimax Search Decision

The agent doesn't know about value iteration. Instead, she uses expectimax depth-1 search to make her policy decision each day. She approximates the utility of each state using a simple linear feature function:

• Feature: Are there cookies?    Weight: +5
• Feature: Is there a dungeon?   Weight: -10

Questions: What policy does she choose? Can you adjust the weights so she chooses otherwise?

3 Temporal Difference Learning

Let's divide up our feature space so that each state has its own binary feature. Initial weights are (-10, 5, 5) for (inDungeon, Smelling, Eating). Now each weight corresponds to an approximate utility for a particular state. The key to learning quickly is the Bellman equation for a policy \pi:

    U^\pi(s) = R(s) + \gamma \sum_{s'} T(s, \pi, s') \, U^\pi(s')                                   (1)

We update our weights according to a variant of this equation that involves only the two states actually observed in some trial of the game:

    U^\pi(s) \leftarrow U^\pi(s) + \alpha \left( R(s) + \gamma U^\pi(s') - U^\pi(s) \right)         (2)

3.1 Passive learning

The goal in passive learning is to learn the true utility of states under a particular policy, which is provided. Passive learning doesn't explicitly change the agent's behavior.

Question: With the fixed expectimax policy of Section 2, will the agent learn correct utilities?

3.2 Active learning

When the agent actually uses the learned utilities to make a decision at each point, we have active learning.

Question: How should the agent use its learned utilities?

Question: With an active learning policy, will the agent learn correct utilities? Will it also settle upon the optimal policy?

3.3 Function approximation

For linear approximations, we have a rather simple generalization of the standard TD update:

    \theta_i \leftarrow \theta_i + \alpha \left( R(s) + \gamma \hat{U}_\theta(s') - \hat{U}_\theta(s) \right) \frac{\partial \hat{U}_\theta(s)}{\partial \theta_i}    (3)
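
To make the policy-comparison question in Section 1 concrete, here is a minimal Python policy-evaluation sketch. It is not part of the handout: the state names are abbreviated, terminal utilities are taken to equal their rewards, and gamma = 1 is an assumption (no discount is given).

    # Policy evaluation for the cookie game, assuming gamma = 1 and taking a
    # terminal state's utility to be its reward. State names are abbreviated.
    REWARD = {"Dungeon": -100, "Smell": +1, "Eat": +20}
    TERMINAL = {"Dungeon", "Eat"}

    # Transition distributions out of the Smell state for each action.
    TRANSITIONS = {
        "K": {"Dungeon": 0.1, "Smell": 0.9},   # keep smelling
        "G": {"Dungeon": 0.5, "Eat": 0.5},     # grab a cookie
    }

    def evaluate_policy(action, gamma=1.0, sweeps=500):
        """Iterate U(s) = R(s) + gamma * sum_s' T(s, pi, s') U(s') to a fixed
        point for the stationary policy that always takes `action`."""
        U = {s: 0.0 for s in REWARD}
        for _ in range(sweeps):
            new_U = {}
            for s in REWARD:
                if s in TERMINAL:
                    new_U[s] = float(REWARD[s])
                else:
                    expected = sum(p * U[t] for t, p in TRANSITIONS[action].items())
                    new_U[s] = REWARD[s] + gamma * expected
            U = new_U
        return U

    print("Policy A (always K):", evaluate_policy("K"))
    print("Policy B (always G):", evaluate_policy("G"))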


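
For Section 2, the following sketch scores each action by the expected approximate utility of its successor states, which is what a depth-1 expectimax decision comes down to here. The per-state feature values are an assumption read off the Approx row of Figure 1 (cookies present in both cookie states, dungeon present only in the dungeon).

    # Depth-1 expectimax with the linear feature function from Section 2.
    WEIGHTS = {"has_cookies": +5, "in_dungeon": -10}
    FEATURES = {
        "Dungeon": {"has_cookies": 0, "in_dungeon": 1},
        "Smell":   {"has_cookies": 1, "in_dungeon": 0},
        "Eat":     {"has_cookies": 1, "in_dungeon": 0},
    }
    TRANSITIONS = {
        "K": {"Dungeon": 0.1, "Smell": 0.9},
        "G": {"Dungeon": 0.5, "Eat": 0.5},
    }

    def approx_utility(state):
        """Linear approximation: dot product of weights and binary features."""
        return sum(WEIGHTS[f] * v for f, v in FEATURES[state].items())

    def expectimax_depth1():
        """Score each action by the expected approximate utility of its
        successor distribution and return the best one."""
        scores = {a: sum(p * approx_utility(t) for t, p in dist.items())
                  for a, dist in TRANSITIONS.items()}
        return max(scores, key=scores.get), scores

    print(expectimax_depth1())

Changing the entries of WEIGHTS and rerunning is one way to explore the second question (whether the weights can be adjusted so she chooses otherwise).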
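
For Sections 3 and 3.1, here is a sketch of passive TD(0) learning under a fixed policy, applying update (2) along each observed transition of a trial. The learning rate, trial count, and the terminal-state update are assumptions; the fixed action K is used only as one possible provided policy.

    import random

    # Passive TD(0): run trials under a fixed policy and apply
    # U(s) <- U(s) + alpha * (R(s) + gamma*U(s') - U(s)) to each transition.
    REWARD = {"Dungeon": -100, "Smell": +1, "Eat": +20}
    TERMINAL = {"Dungeon", "Eat"}
    TRANSITIONS = {
        "K": {"Dungeon": 0.1, "Smell": 0.9},
        "G": {"Dungeon": 0.5, "Eat": 0.5},
    }

    def sample(dist):
        """Draw a successor state from a {state: probability} distribution."""
        states = list(dist)
        return random.choices(states, weights=[dist[s] for s in states])[0]

    def passive_td(action, alpha=0.05, gamma=1.0, trials=5000):
        U = {"Dungeon": -10.0, "Smell": 5.0, "Eat": 5.0}   # initial estimates
        for _ in range(trials):
            s = "Smell"                                    # start state
            while s not in TERMINAL:
                s2 = sample(TRANSITIONS[action])
                U[s] += alpha * (REWARD[s] + gamma * U[s2] - U[s])
                s = s2
            U[s] += alpha * (REWARD[s] - U[s])             # terminal: nudge toward its reward
        return U

    print(passive_td("K"))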
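
For Section 3.2, one common way an agent can use its learned utilities is epsilon-greedy one-step lookahead. The handout leaves the question open, so this particular scheme and the value of epsilon are assumptions, not the handout's answer.

    import random

    # Epsilon-greedy action selection from learned utility estimates U.
    TRANSITIONS = {
        "K": {"Dungeon": 0.1, "Smell": 0.9},
        "G": {"Dungeon": 0.5, "Eat": 0.5},
    }

    def choose_action(U, epsilon=0.1):
        """Mostly exploit the current utility estimates, but explore with
        probability epsilon so every transition keeps getting sampled."""
        if random.random() < epsilon:
            return random.choice(list(TRANSITIONS))        # explore
        return max(TRANSITIONS,                            # exploit: one-step lookahead
                   key=lambda a: sum(p * U[t] for t, p in TRANSITIONS[a].items()))

    print(choose_action({"Dungeon": -10.0, "Smell": 5.0, "Eat": 5.0}))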
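
For Section 3.3, a sketch of the linear-approximation TD update (3) on the weight vector theta. With one binary indicator feature per state, as in Section 3's setup, it reduces to the tabular update (2); the feature vectors and step size below are assumptions.

    # Linear TD update:
    # theta_i <- theta_i + alpha*(R(s) + gamma*U_hat(s') - U_hat(s)) * dU_hat(s)/dtheta_i,
    # where dU_hat(s)/dtheta_i = phi_i(s) for a linear approximator.
    FEATURES = {                        # phi(s): one binary indicator per state
        "Dungeon": (1, 0, 0),
        "Smell":   (0, 1, 0),
        "Eat":     (0, 0, 1),
    }
    theta = [-10.0, 5.0, 5.0]           # (inDungeon, Smelling, Eating)

    def u_hat(state):
        """Approximate utility: dot product of theta and phi(state)."""
        return sum(w * f for w, f in zip(theta, FEATURES[state]))

    def td_update(s, reward, s_next, alpha=0.05, gamma=1.0):
        """One gradient step of the linear TD update along an observed transition."""
        delta = reward + gamma * u_hat(s_next) - u_hat(s)
        for i, phi_i in enumerate(FEATURES[s]):
            theta[i] += alpha * delta * phi_i

    td_update("Smell", +1, "Eat")       # one observed transition: Smell -> Eat
    print(theta)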