Reinforcement Learning
Section Handout – CS 188
December 6, 2005

1 The Cookie Game

Our agent loves cookies! She loves to smell them and eat them. However, she is strictly forbidden from enjoying her prize. If she gets caught smelling or eating a cookie, she gets locked in the dungeon for the rest of the day. Every day, however, she ends up smelling the cookies once again.

State:        Dungeon     Smell Cookie    Eat Cookie
Type:         Terminal    Start           Terminal
Reward:       -100        +1              +20
Utility(A):
Utility(B):
Approx:       -10         +5              +5

Figure 1: The cookie game as an MDP

When smelling the cookie, she can either keep smelling (K) or grab one (G). The distribution of states she will end up in is as follows:

• K: (Dungeon 0.1) (Smell Cookie 0.9)
• G: (Dungeon 0.5) (Eat Cookie 0.5)

Question: She considers two policies: always keep smelling (A) or always try to grab one (B). Which is better? What is the utility of each state under each policy?

2 Expectimax Search Decision

The agent doesn't know about value iteration. Instead, she uses depth-1 expectimax search to make her policy decision each day. She approximates the utility of each state using a simple linear feature function:

• Feature: Are there cookies? Weight: +5
• Feature: Is there a dungeon? Weight: -10

Questions: What policy does she choose? Can you adjust the weights so she chooses otherwise?

3 Temporal Difference Learning

Let's divide up our feature space so that each state has its own binary feature. Initial weights are (-10, 5, 5) for (inDungeon, Smelling, Eating). Now each weight corresponds to an approximate utility for a particular state. The key to learning quickly is the Bellman equation for a policy π:

    U^π(s) = R(s) + γ Σ_{s'} T(s, π(s), s') U^π(s')                (1)

We update our weights according to a variant of this equation that involves only the two states actually observed in some trial of the game:

    U^π(s) ← U^π(s) + α (R(s) + γ U^π(s') − U^π(s))                (2)

3.1 Passive learning

The goal in passive learning is to learn the true utility of states under a particular policy, which is provided. Passive learning doesn't explicitly change the agent's behavior.

Question: With the fixed expectimax policy of Section 2, will the agent learn correct utilities?

3.2 Active learning

When the agent actually uses the learned utilities to make a decision at each point, we have active learning.

Question: How should the agent use its learned utilities?

Question: With an active learning policy, will the agent learn correct utilities? Will it also settle upon the optimal policy?

3.3 Function approximation

For linear approximations, we have a rather simple generalization of the standard TD update:

    θ_i ← θ_i + α (R(s) + γ Û_θ(s') − Û_θ(s)) f_i(s)

where f_i(s) is the value of feature i in state s, so that Û_θ(s) = Σ_i θ_i f_i(s).
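The questions above can be checked numerically. To make the policy comparison of Section 1 concrete, here is a minimal Python sketch of policy evaluation for the cookie game. It assumes an undiscounted game (γ = 1, since the handout does not state a discount) and treats terminal utilities as equal to their rewards; the shortened state names are ours.

    GAMMA = 1.0
    REWARD = {'Dungeon': -100, 'Smell': +1, 'Eat': +20}

    # Transition model from the 'Smell' state, as given in Section 1.
    T = {
        'K': [(0.1, 'Dungeon'), (0.9, 'Smell')],   # keep smelling
        'G': [(0.5, 'Dungeon'), (0.5, 'Eat')],     # grab a cookie
    }

    def evaluate(action, sweeps=500):
        """Evaluate the policy that always takes `action` in the Smell state."""
        U = {'Dungeon': REWARD['Dungeon'], 'Eat': REWARD['Eat'], 'Smell': 0.0}
        for _ in range(sweeps):
            U['Smell'] = REWARD['Smell'] + GAMMA * sum(
                p * U[s2] for p, s2 in T[action])
        return U['Smell']

    print(evaluate('K'))   # policy A: converges to -90
    print(evaluate('G'))   # policy B: -39, so always grabbing is better here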
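A similar sketch covers the depth-1 expectimax decision of Section 2. The feature values follow the Approx row of Figure 1; the helper names (approx_utility, expectimax_depth1) are illustrative, not from the handout.

    FEATURES = {               # state -> (are there cookies?, is there a dungeon?)
        'Dungeon': (0, 1), 'Smell': (1, 0), 'Eat': (1, 0),
    }
    WEIGHTS = (+5, -10)        # (cookies, dungeon)

    T = {
        'K': [(0.1, 'Dungeon'), (0.9, 'Smell')],
        'G': [(0.5, 'Dungeon'), (0.5, 'Eat')],
    }

    def approx_utility(state):
        """Linear feature estimate; matches the Approx row of Figure 1."""
        return sum(w * f for w, f in zip(WEIGHTS, FEATURES[state]))

    def expectimax_depth1():
        """Choose the action with the highest expected approximate utility."""
        return max(T, key=lambda a: sum(p * approx_utility(s2) for p, s2 in T[a]))

    # Q(K) = 0.1*(-10) + 0.9*(+5) = +3.5 and Q(G) = 0.5*(-10) + 0.5*(+5) = -2.5,
    # so expectimax picks K (policy A), the worse policy under the γ = 1 analysis.
    print(expectimax_depth1())   # -> 'K'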
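The TD update (2) can also be simulated directly. Below is a passive-learning sketch for Section 3.1 under the fixed "always keep smelling" policy. Here γ = 1 and the learning rate α = 0.05 are our assumptions, and updating a terminal state toward its own reward is a standard convention, not something the handout specifies.

    import random

    GAMMA, ALPHA = 1.0, 0.05
    REWARD = {'Dungeon': -100, 'Smell': +1, 'Eat': +20}
    T = {'K': [(0.1, 'Dungeon'), (0.9, 'Smell')],
         'G': [(0.5, 'Dungeon'), (0.5, 'Eat')]}
    TERMINAL = ('Dungeon', 'Eat')

    def step(state, action):
        """Sample a successor state from the transition model."""
        r, total = random.random(), 0.0
        for p, s2 in T[action]:
            total += p
            if r < total:
                return s2
        return s2                    # guard against floating-point round-off

    # One binary feature per state, so each weight is that state's utility.
    U = {'Dungeon': -10.0, 'Smell': 5.0, 'Eat': 5.0}   # initial weights

    for trial in range(20000):
        s = 'Smell'
        while s not in TERMINAL:
            s2 = step(s, 'K')                          # fixed policy A
            U[s] += ALPHA * (REWARD[s] + GAMMA * U[s2] - U[s])
            s = s2
        U[s] += ALPHA * (REWARD[s] - U[s])             # terminal: target is R(s)

    # U['Smell'] hovers near -90 and U['Dungeon'] near -100, but U['Eat'] is
    # never updated: policy A never eats, which is the catch in Section 3.1.
    print(U)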
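Finally, a sketch of the linear-approximation update of Section 3.3, written as a single function. Multiplying the TD error by the feature value f_i(s) is the standard linear-TD rule; the feature function reuses the two features of Section 2, and the names are illustrative.

    ALPHA, GAMMA = 0.05, 1.0

    def features(state):
        """f(s): the two binary features of Section 2, as a tuple."""
        has_cookies = 1.0 if state in ('Smell', 'Eat') else 0.0
        has_dungeon = 1.0 if state == 'Dungeon' else 0.0
        return (has_cookies, has_dungeon)

    def u_hat(theta, state):
        """Linear approximate utility: U_theta(s) = sum_i theta_i * f_i(s)."""
        return sum(t * f for t, f in zip(theta, features(state)))

    def linear_td_update(theta, s, reward, s2):
        """theta_i <- theta_i + alpha * (R(s) + gamma*U(s') - U(s)) * f_i(s)."""
        delta = reward + GAMMA * u_hat(theta, s2) - u_hat(theta, s)
        return tuple(t + ALPHA * delta * f for t, f in zip(theta, features(s)))

    theta = (+5.0, -10.0)              # the weights of Section 2
    # One observed transition: smelled, got caught, landed in the dungeon.
    theta = linear_td_update(theta, 'Smell', +1, 'Dungeon')
    print(theta)                       # the cookie weight drops to 4.3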