Berkeley COMPSCI 188 - Reinforcement Learning II



CS 188: Artificial Intelligence, Fall 2009
Lecture 12: Reinforcement Learning II (10/6/2009)
Dan Klein, UC Berkeley
Many slides over the course adapted from either Stuart Russell or Andrew Moore.

Reinforcement Learning
- Still assume an MDP:
  - a set of states s ∈ S
  - a set of actions (per state) A
  - a model T(s, a, s')
  - a reward function R(s, a, s')
- Still looking for a policy π(s). [Demo]
- New twist: we don't know T or R, i.e. we don't know which states are good or what the actions do. We must actually try out actions and states to learn.

The Story So Far: MDPs and RL
- If we know the MDP (model-based dynamic programming):
  - compute V*, Q*, π* exactly: value iteration and policy iteration
  - evaluate a fixed policy π: policy evaluation
- If we don't know the MDP:
  - estimate the MDP, then solve it: model-based RL
  - estimate V^π for a fixed policy π: model-free RL (value learning)
  - estimate Q*(s, a) for the optimal policy while executing an exploration policy: Q-learning

Model-Free Learning
- Model-free (temporal difference) learning:
  - experience the world through episodes (s, a, r, s', a', r', s'', ...)
  - update estimates after each transition (s, a, r, s')
  - over time, the updates will mimic Bellman updates
- Q-value iteration (model-based) requires a known MDP; Q-learning (model-free) requires only experienced transitions.

Q-Learning [Demo: Grid Q-values]
- We would like to do Q-value updates to each Q-state:
    Q(s, a) ← Σ_s' T(s, a, s') [ R(s, a, s') + γ max_a' Q(s', a') ]
  but we can't compute this update without knowing T and R.
- Instead, compute an average as we go:
  - receive a sample transition (s, a, r, s')
  - this sample suggests Q(s, a) ≈ r + γ max_a' Q(s', a')
  - but we want to average over results from (s, a) (why?)
  - so keep a running average:
    Q(s, a) ← (1 - α) Q(s, a) + α [ r + γ max_a' Q(s', a') ]

Q-Learning Properties
- Q-learning will converge to the optimal policy if you explore enough (i.e. visit each q-state many times) and if you make the learning rate small enough. Basically, it doesn't matter how you select actions!
- Off-policy learning: Q-learning learns the optimal q-values, not the values of the policy you are following.

Exploration / Exploitation
- Several schemes for forcing exploration. Simplest: random actions (ε-greedy).
  - Every time step, flip a coin: with probability ε act randomly, with probability 1 - ε act according to the current policy.
- Regret: the expected gap between rewards during learning and rewards from optimal actions.
  - Q-learning with random actions will converge to optimal values, but possibly very slowly, and will get low rewards along the way.
  - The result will be optimal, but the regret will be large. How do we make regret small?

Exploration Functions [Demo: Crawler]
- When to explore?
  - Random actions explore a fixed amount.
  - Better idea: explore areas whose badness is not (yet) established, and explore less over time.
- One way: an exploration function that takes a value estimate u and a visit count n and returns an optimistic utility, e.g. f(u, n) = u + k/n (the exact form is not important).

Q-Learning and Generalization [Demo: Crawler Q-values]
- Q-learning produces tables of q-values.
- In realistic situations, we cannot possibly learn about every single state:
  - too many states to visit them all in training
  - too many states to hold the q-tables in memory
- Instead, we want to generalize:
  - learn about a small number of training states from experience
  - generalize that experience to new, similar states
  - This is a fundamental idea in machine learning, and we will see it over and over again.

Example: Pacman [Demo: RL Pacman]
- Suppose we discover through experience that a particular state is bad. In naive q-learning, we know nothing about a nearly identical state or its q-states, or even about a state that differs only in irrelevant details.

Feature-Based Representations
- Solution: describe a state using a vector of features (properties).
  - Features are functions from states to real numbers (often 0/1) that capture important properties of the state.
  - Example features: distance to the closest ghost, distance to the closest dot, number of ghosts, 1 / (distance to dot)^2, is Pacman in a tunnel? (0/1), etc., or even "is it the exact state on this slide?"
  - We can also describe a q-state (s, a) with features (e.g. "action moves closer to food").

Linear Feature Functions
- Using a feature representation, we can write a q-function (or value function) for any state using a few weights:
    V(s) = w_1 f_1(s) + w_2 f_2(s) + ... + w_n f_n(s)
    Q(s, a) = w_1 f_1(s, a) + w_2 f_2(s, a) + ... + w_n f_n(s, a)
- Advantage: our experience is summed up in a few powerful numbers.
- Disadvantage: states may share features but actually be very different in value!

Function Approximation [Demo: Q-Pacman]
- Q-learning with linear q-functions: on a transition (s, a, r, s'), let
    difference = [ r + γ max_a' Q(s', a') ] - Q(s, a)
  - exact Q's:        Q(s, a) ← Q(s, a) + α · difference
  - approximate Q's:  w_i ← w_i + α · difference · f_i(s, a)
- Intuitive interpretation: adjust the weights of the active features. For example, if something unexpectedly bad happens, disprefer all states with that state's features.
- Formal justification: online least squares.
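To make the updates above concrete, here is a minimal sketch in Python of tabular Q-learning with ε-greedy action selection and of the approximate, linear-feature weight update. It is an illustration under assumptions, not the CS 188 project code: the environment interface (legal actions passed in, features supplied as name-to-value dictionaries) and the constants γ = 0.9, α = 0.1, ε = 0.05 are all made up for the example.

import random
from collections import defaultdict

GAMMA, ALPHA, EPSILON = 0.9, 0.1, 0.05   # illustrative constants, not from the slides

# --- Exact (tabular) Q-learning ------------------------------------------
q_values = defaultdict(float)             # q_values[(s, a)] is the running average

def tabular_update(s, a, r, s_next, legal_actions):
    # sample = r + gamma * max_a' Q(s', a'); fold it into the running average.
    # If s_next is terminal (no legal actions), the future term is 0.
    sample = r + GAMMA * max((q_values[(s_next, a2)] for a2 in legal_actions), default=0.0)
    q_values[(s, a)] = (1 - ALPHA) * q_values[(s, a)] + ALPHA * sample

def epsilon_greedy(s, legal_actions):
    # With probability epsilon act randomly, otherwise follow the current q-values.
    if random.random() < EPSILON:
        return random.choice(legal_actions)
    return max(legal_actions, key=lambda a: q_values[(s, a)])

# --- Approximate Q-learning with linear features --------------------------
weights = defaultdict(float)              # weights[feature_name] = w_i

def q_value(features):
    # Q(s, a) = sum_i w_i * f_i(s, a); features maps names to values,
    # e.g. {"dist-to-closest-dot": 0.25, "num-ghosts": 2.0}
    return sum(weights[name] * value for name, value in features.items())

def approximate_update(features, r, next_state_features):
    # difference = [r + gamma * max_a' Q(s', a')] - Q(s, a);
    # next_state_features is a list of feature dicts, one per legal action in s'.
    best_next = max((q_value(f) for f in next_state_features), default=0.0)
    difference = (r + GAMMA * best_next) - q_value(features)
    for name, value in features.items():
        weights[name] += ALPHA * difference * value   # w_i <- w_i + alpha * difference * f_i

In a Pacman-style setting, the feature dictionary for a q-state (s, a) would contain exactly the kind of features listed above (distance to the closest dot, number of nearby ghosts, and so on), computed for the position that results from taking action a.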
Linear Regression
[Figure: 1-D and 2-D linear regression; in each case the prediction is a linear function of the input features.]

Ordinary Least Squares (OLS)
[Figure: the error (residual) at each point is the observation minus the prediction.]
- Total squared error: Σ_i ( y_i - Σ_k w_k f_k(x_i) )^2

Minimizing Error
- Imagine we had only one point x with features f(x). The squared error is
    error(w) = 1/2 ( y - Σ_k w_k f_k(x) )^2
  its derivative with respect to a weight w_m is
    ∂ error(w) / ∂ w_m = - ( y - Σ_k w_k f_k(x) ) f_m(x)
  so a gradient step is
    w_m ← w_m + α ( y - Σ_k w_k f_k(x) ) f_m(x)
- Approximate q-update explained: the update
    w_m ← w_m + α [ r + γ max_a' Q(s', a') - Q(s, a) ] f_m(s, a)
  is exactly this step with target y = r + γ max_a' Q(s', a') and prediction Q(s, a).

Overfitting
[Figure: a degree-15 polynomial fit to a handful of points; it passes through the training points but oscillates wildly between them.]

Policy Search
- Problem: often the feature-based policies that work well aren't the ones that approximate V or Q best.
  - E.g., your value functions from Project 2 were probably horrible estimates of future rewards, but they still produced good decisions.
  - We'll see this distinction between modeling and prediction again later in the course.
- Solution: learn the policy that maximizes rewards, rather than the value that predicts rewards.
- This is the idea behind policy search, such as what controlled the upside-down helicopter.

Policy Search (continued)
- Simplest policy search (a code sketch appears at the end of these notes):
  - Start with an initial linear value function or q-function.
  - Nudge each feature weight up and down and see if your policy is better than before.
- Problems:
  - How do we tell that the policy got better? We need to run many sample episodes!
  - If there are a lot of features, this can be impractical.

Policy Search (advanced)
- Write a stochastic (soft) policy.
- It turns out you can efficiently approximate the derivative of the returns with respect to the parameters w (details in the book; optional material).
- Take uphill steps, recalculate derivatives, etc.

Take a Deep Breath...
- We're done with search and planning!
- Next, we'll look at how to reason with probabilities: diagnosis, tracking objects, speech recognition, robot mapping, and lots more.
- The last part of the course: machine learning.
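As promised above, here is a small sketch of the simplest policy search loop: hill climbing on the feature weights, scoring each candidate by running sample episodes. run_episode is a hypothetical callback that plays one episode greedily with the given weights and returns its total reward; the step size and episode counts are arbitrary choices for the illustration.

def average_return(weights, run_episode, episodes=50):
    # Estimate policy quality by running many sample episodes and averaging
    # the total rewards (the expensive step noted on the slide above).
    return sum(run_episode(weights) for _ in range(episodes)) / episodes

def hill_climb(weights, run_episode, step=0.1, sweeps=20):
    # Nudge each feature weight up and down; keep a change only if the
    # resulting policy looks better than the best seen so far.
    best = average_return(weights, run_episode)
    for _ in range(sweeps):
        for name in list(weights):
            for delta in (step, -step):
                weights[name] += delta
                score = average_return(weights, run_episode)
                if score > best:
                    best = score              # keep the improvement
                    break
                weights[name] -= delta        # undo the nudge
    return weights

With many features this quickly becomes impractical, which is exactly what motivates the gradient-based ("advanced") policy search mentioned above.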

