Berkeley COMPSCI 188 - Lecture 22: Reinforcement Learning II

CS 188: Artificial Intelligence, Spring 2006
Lecture 22: Reinforcement Learning II
4/13/2006
Dan Klein – UC Berkeley

Today
- Reminder: P3 lab Friday, 2-4pm, 275 Soda
- Reinforcement learning
  - Temporal-difference learning
  - Q-learning
  - Function approximation

Recap: Passive Learning
- Learning about an unknown MDP
- Simplified task:
  - You don't know the transitions T(s,a,s')
  - You don't know the rewards R(s)
  - You DO know the policy π(s)
- Goal: learn the state values (and maybe the model)
- Last time: try to learn T and R, then solve as a known MDP

Model-Free Learning
- Big idea: why bother learning T?
  - Update each time we experience a transition
  - Frequent outcomes will contribute more updates (over time)
- Temporal-difference learning (TD)
  - Policy still fixed!
  - Move values toward the value of whichever successor actually occurs

Example: Passive TD
Two observed episodes (take γ = 1, α = 0.1):
- Episode 1: (1,1) -1 up; (1,2) -1 up; (1,2) -1 up; (1,3) -1 right; (2,3) -1 right; (3,3) -1 right; (3,2) -1 up; (3,3) -1 right; (4,3) +100
- Episode 2: (1,1) -1 up; (1,2) -1 up; (1,3) -1 right; (2,3) -1 right; (3,3) -1 right; (3,2) -1 up; (4,2) -100

(Greedy) Active Learning
- In general, we want to learn the optimal policy
- Idea:
  - Learn an initial model of the environment
  - Solve for the optimal policy for this model (value or policy iteration)
  - Refine the model through experience and repeat

Example: Greedy Active Learning
- Imagine we find the lower path to the good exit first
- Some states will never be visited when following this policy from (1,1)
- We'll keep re-using this policy, because following it never visits the regions of the model we need in order to learn the optimal policy

What Went Wrong?
- Problem with following the optimal policy for the current model: we never learn about better regions of the space
- Fundamental tradeoff: exploration vs. exploitation
  - Exploration: we must take actions with suboptimal estimates to discover new rewards and increase eventual utility
  - Exploitation: once the true optimal policy is learned, exploration reduces utility
- Systems must explore in the beginning and exploit in the limit

Q-Functions
- An alternate way to learn: utilities for state-action pairs rather than states, a.k.a. Q-functions

Learning Q-Functions: MDPs
- Just like Bellman updates for state values:
  - For a fixed policy π
  - For the optimal policy
- The main advantage of Q-functions over values U is that you don't need a model for learning or action selection!

Q-Learning
- Model-free TD learning with Q-functions

Example [DEMOS]
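The Q-learning update itself appears on the slides only as an image that this text preview does not capture. As a reference point, here is a minimal sketch of the standard tabular update, Q(s,a) ← Q(s,a) + α·[r + γ·max_a' Q(s',a') − Q(s,a)], with ε-greedy action selection; the environment interface (env.reset(), env.step()) and the parameter defaults are illustrative assumptions, not part of the lecture.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration (illustrative sketch)."""
    Q = defaultdict(float)  # Q[(state, action)] -> current utility estimate

    for _ in range(episodes):
        s = env.reset()           # assumed interface: returns the start state
        done = False
        while not done:
            # Exploration vs. exploitation: with probability epsilon act randomly,
            # otherwise act greedily with respect to the current Q estimates.
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])

            s_next, r, done = env.step(a)  # assumed interface: (state, reward, done)

            # Model-free TD update toward r + gamma * max_a' Q(s', a'):
            best_next = 0.0 if done else max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next

    return Q
```

Note that no transition model T or reward model R is learned anywhere, which is the point the slides make about Q-functions: the greedy action can be read directly off Q.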
Exploration / Exploitation
- Several schemes for forcing exploration
- Simplest: random actions
  - Every time step, flip a coin
  - With probability ε, act randomly
  - With probability 1-ε, act according to the current policy
- Problems with random actions? The agent will keep taking non-optimal, longer routes to reduce the risk that stems from its own exploration actions!
- Solution: lower ε over time

Exploration Functions
- When to explore?
  - Random actions: explore a fixed amount
  - Better idea: explore areas whose badness is not (yet) established
- Exploration function: takes a value estimate and a count, and returns an optimistic utility

Function Approximation
- Problem: too slow to learn each state's utility one by one
- Solution: what we learn about one state should generalize to similar states
- Very much like supervised learning
- If states are treated entirely independently, we can only learn very small state spaces

Discretization
- Can put states into buckets of various sizes
  - E.g. all angles between 0 and 5 degrees can share the same Q estimate
- Buckets too fine ⇒ takes a long time to learn
- Buckets too coarse ⇒ learn suboptimal, often jerky control
- Real systems that use discretization usually require clever bucketing schemes
  - Adaptive sizes
  - Tile coding [DEMOS]

Linear Value Functions
- Another option: values are linear functions of features of states (or state-action pairs)
- Good if you can describe states well using a few features (e.g. board evaluations for game playing)
- Now we only have to learn a few weights rather than a value for each state
- [Figure: gridworld with example state values, omitted from this preview]

TD Updates for Linear Values
- Can use TD learning with linear values (actually, it's just like the perceptron!)
- Old Q-learning update: simply update the weights of the features in Q(s,a)
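The update rule for the linear case is cut off at this point in the preview. Under the usual reading, Q(s,a) = Σ_i w_i·f_i(s,a), and each weight is adjusted by the same TD error as in tabular Q-learning, scaled by its feature value. A minimal sketch, assuming a hypothetical features(state, action) helper that returns a dict mapping feature names to values:

```python
def linear_q_value(weights, features, state, action):
    # Q(s, a) = sum_i w_i * f_i(s, a)
    return sum(weights.get(name, 0.0) * value
               for name, value in features(state, action).items())

def linear_q_update(weights, features, state, action, reward,
                    next_state, next_actions, alpha=0.1, gamma=1.0):
    # The TD error has the same form as in tabular Q-learning...
    best_next = max((linear_q_value(weights, features, next_state, a)
                     for a in next_actions), default=0.0)
    correction = (reward + gamma * best_next
                  - linear_q_value(weights, features, state, action))
    # ...but instead of changing one table entry we nudge every active
    # feature weight, so the update generalizes to similar states.
    for name, value in features(state, action).items():
        weights[name] = weights.get(name, 0.0) + alpha * correction * value
    return weights
```

Each weight moves by the TD error times its feature value, which is the sense in which the slide compares the update to the perceptron.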

