Berkeley COMPSCI 188 - Lecture 11: Reinforcement Learning II


CS 188: Artificial Intelligence, Fall 2011
Lecture 11: Reinforcement Learning II (9/29/2011)
Dan Klein – UC Berkeley
Many slides over the course adapted from either Stuart Russell or Andrew Moore

Reinforcement Learning
- Reinforcement learning: still assume an MDP:
  - A set of states s ∈ S
  - A set of actions (per state) A
  - A model T(s,a,s')
  - A reward function R(s,a,s')
- Still looking for a policy π(s)
- New twist: we don't know T or R
  - I.e., we don't know which states are good or what the actions do
  - Must actually try out actions and states to learn
[DEMO]

The Story So Far: MDPs and RL
Things we know how to do:
- If we know the MDP:
  - Compute V*, Q*, π* exactly
  - Evaluate a fixed policy π
- If we don't know the MDP:
  - We can estimate the MDP, then solve it
  - We can estimate V for a fixed policy π
  - We can estimate Q*(s,a) for the optimal policy while executing an exploration policy
Techniques:
- Model-based DPs: value and policy iteration, policy evaluation
- Model-based RL
- Model-free RL: value learning, Q-learning

Model-Free Learning
- Model-free (temporal difference) learning:
  - Experience the world through episodes
  - Update estimates at each transition (s, a, r, s')
  - Over time, the updates will mimic Bellman updates
- Q-value iteration (model-based, requires a known MDP):
  Q_{k+1}(s,a) <- sum_{s'} T(s,a,s') [ R(s,a,s') + γ max_{a'} Q_k(s',a') ]
- Q-learning (model-free, requires only experienced transitions)

Q-Learning
- We'd like to do Q-value updates to each Q-state:
  Q(s,a) <- sum_{s'} T(s,a,s') [ R(s,a,s') + γ max_{a'} Q(s',a') ]
- But we can't compute this update without knowing T and R
- Instead, compute the average as we go:
  - Receive a sample transition (s, a, r, s')
  - This sample suggests r + γ max_{a'} Q(s',a')
  - But we want to average over results from (s,a) (why?)
  - So keep a running average:
    Q(s,a) <- (1 − α) Q(s,a) + α [ r + γ max_{a'} Q(s',a') ]
[DEMO – Grid Q's]

Q-Learning Properties
- Will converge to the optimal policy:
  - If you explore enough (i.e., visit each q-state many times)
  - If you make the learning rate small enough
  - Basically, it doesn't matter how you select actions (!)
- Off-policy learning: learns the optimal q-values, not the values of the policy you are following
[DEMO – Grid Q's]

Q-Learning
- Q-learning produces tables of q-values
[DEMO – Crawler Q's]

Exploration / Exploitation
- Several schemes for forcing exploration
- Simplest: random actions (ε-greedy)
  - Every time step, flip a coin
  - With probability ε, act randomly
  - With probability 1 − ε, act according to the current policy
- Regret: the expected gap between rewards during learning and rewards from optimal actions
  - Q-learning with random actions will converge to optimal values, but possibly very slowly, and will get low rewards along the way
  - The results will be optimal, but the regret will be large
  - How do we make regret small?

Exploration Functions
- When to explore?
  - Random actions: explore a fixed amount
  - Better idea: explore areas whose badness is not (yet) established, and explore less over time
- One way: an exploration function, which takes a value estimate u and a visit count n and returns an optimistic utility, e.g. f(u, n) = u + k/n (exact form not important)
[DEMO – Crawler]
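The tabular Q-learning loop and the exploration schemes above are compact enough to sketch directly. Below is a minimal Python sketch (not from the lecture) of the running-average update with ε-greedy action selection and an optimistic exploration bonus; the environment interface (env.reset(), env.actions(s), env.step(a)) and the constants alpha, gamma, epsilon, k are illustrative assumptions.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, k=1.0):
    """Tabular Q-learning sketch: running-average updates with epsilon-greedy
    exploration. `env` is a hypothetical interface with reset(), actions(s),
    and step(a) -> (next_state, reward, done)."""
    Q = defaultdict(float)   # Q[(s, a)] defaults to 0
    N = defaultdict(int)     # visit counts, for the exploration function

    def exploration_value(s, a):
        # One possible exploration function: f(u, n) = u + k/n (optimistic utility)
        return Q[(s, a)] + k / (N[(s, a)] + 1)

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy: flip a coin each time step
            if random.random() < epsilon:
                a = random.choice(env.actions(s))
            else:
                a = max(env.actions(s), key=lambda act: exploration_value(s, act))

            s2, r, done = env.step(a)
            N[(s, a)] += 1

            # Sample of the Bellman target: r + gamma * max_a' Q(s', a')
            if done:
                sample = r   # terminal transition: no future rewards
            else:
                sample = r + gamma * max(Q[(s2, a2)] for a2 in env.actions(s2))

            # Running average: Q(s,a) <- (1 - alpha) Q(s,a) + alpha * sample
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
            s = s2
    return Q
```

Note that only the experienced transition (s, a, r, s') appears in the update; the model T and reward function R are never needed, which is exactly the model-free point of these slides.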
Q-Learning
- In realistic situations, we cannot possibly learn about every single state!
  - Too many states to visit them all in training
  - Too many states to hold the q-tables in memory
- Instead, we want to generalize:
  - Learn about some small number of training states from experience
  - Generalize that experience to new, similar states
  - This is a fundamental idea in machine learning, and we'll see it over and over again

Example: Pacman
- Let's say we discover through experience that this state is bad:
- In naïve q-learning, we know nothing about this state or its q-states:
- Or even this one!
[DEMO – RL Pacman]

Feature-Based Representations
- Solution: describe a state using a vector of features (properties)
  - Features are functions from states to real numbers (often 0/1) that capture important properties of the state
  - Example features:
    - Distance to the closest ghost
    - Distance to the closest dot
    - Number of ghosts
    - 1 / (distance to dot)^2
    - Is Pacman in a tunnel? (0/1)
    - … etc.
    - Is it the exact state on this slide?
  - Can also describe a q-state (s, a) with features (e.g., the action moves closer to food)

Linear Feature Functions
- Using a feature representation, we can write a q-function (or value function) for any state using a few weights:
  Q(s,a) = w_1 f_1(s,a) + w_2 f_2(s,a) + … + w_n f_n(s,a)
- Advantage: our experience is summed up in a few powerful numbers
- Disadvantage: states may share features but actually be very different in value!

Function Approximation
- Q-learning with linear q-functions: on each transition (s, a, r, s'), compute
  difference = [ r + γ max_{a'} Q(s',a') ] − Q(s,a)
  - Exact Q's:        Q(s,a) <- Q(s,a) + α · difference
  - Approximate Q's:  w_i <- w_i + α · difference · f_i(s,a)
- Intuitive interpretation:
  - Adjust the weights of the active features
  - E.g., if something unexpectedly bad happens, disprefer all states with that state's features
- Formal justification: online least squares
(A code sketch of this update appears after the final slide below.)

Example: Q-Pacman
[DEMO – RL Pacman]

Linear Regression
[Figure: linear regression predictions over sample data]

Ordinary Least Squares (OLS)
[Figure: the error, or "residual", between a prediction and an observation]

Minimizing Error
- Imagine we had only one point x with features f(x); minimizing the squared error
  error(w) = 1/2 ( y − sum_k w_k f_k(x) )^2
  by gradient descent gives the update
  w_m <- w_m + α ( y − sum_k w_k f_k(x) ) f_m(x)
- Approximate q update explained: with target r + γ max_{a'} Q(s',a') and prediction Q(s,a), the same update becomes
  w_m <- w_m + α [ r + γ max_{a'} Q(s',a') − Q(s,a) ] f_m(s,a)

Overfitting
[Figure: a degree-15 polynomial fit to a handful of points, illustrating overfitting]

Policy Search
[DEMO]
- Problem: often the feature-based policies that work well aren't the ones that approximate V / Q best
  - E.g., your value functions from project 2 were probably horrible estimates of future rewards, but they still produced good decisions
  - We'll see this distinction between modeling and prediction again later in the course
- Solution: learn the policy that maximizes rewards rather than the value that predicts rewards
- This is the idea behind policy search, such as what controlled the upside-down helicopter

Policy Search
- Simplest policy search:
  - Start with an initial linear value function or q-function
  - Nudge each feature weight up and down and see whether the resulting policy is better than before
- Problems:
  - How do we tell the policy got better? We need to run many sample episodes!
  - If there are a lot of features, this can be impractical

Policy Search*
- Advanced policy search:
  - Write a stochastic (soft) policy
  - It turns out you can efficiently approximate the derivative of the returns with respect to the parameters w (optional material)
  - Take uphill steps, recalculate derivatives, etc.

Take a Deep Breath…
- We're done with search and planning!
- Next, we'll look at how to reason with probabilities:
  - Diagnosis
  - Tracking objects
  - Speech
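As referenced in the Function Approximation slide above, here is a minimal Python sketch (not from the lecture) of Q-learning with a linear q-function. The feature extractor feats(s, a) and the environment interface (env.reset(), env.actions(s), env.step(a)) are hypothetical names, and the constants are illustrative.

```python
import random

def approx_q_learning(env, feats, num_features, episodes=500,
                      alpha=0.01, gamma=0.9, epsilon=0.1):
    """Approximate Q-learning sketch with a linear q-function
    Q(s,a) = sum_i w_i * f_i(s,a). `feats(s, a)` is assumed to return a
    list of num_features real-valued features."""
    w = [0.0] * num_features

    def q(s, a):
        return sum(wi * fi for wi, fi in zip(w, feats(s, a)))

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection on the approximate q-values
            if random.random() < epsilon:
                a = random.choice(env.actions(s))
            else:
                a = max(env.actions(s), key=lambda act: q(s, act))

            s2, r, done = env.step(a)

            # difference = [r + gamma * max_a' Q(s',a')] - Q(s,a)
            if done:
                target = r
            else:
                target = r + gamma * max(q(s2, a2) for a2 in env.actions(s2))
            difference = target - q(s, a)

            # Adjust the weights of the active features:
            # w_i <- w_i + alpha * difference * f_i(s,a)
            f = feats(s, a)
            for i in range(num_features):
                w[i] += alpha * difference * f[i]
            s = s2
    return w
```

The same difference term drives both the exact table update and the weight update; only the bookkeeping changes, from one entry per q-state to a few shared weights, which is what lets the experience generalize across similar states.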

