CS 188: Artificial Intelligence
Fall 2007
Lecture 13: Reinforcement Learning
10/9/2007
Dan Klein – UC Berkeley

Reinforcement Learning
- Reinforcement learning: we still have an MDP:
  - A set of states s ∈ S
  - A set of actions (per state) A
  - A model T(s, a, s')
  - A reward function R(s, a, s')
- Still looking for a policy π(s)
- New twist: we don't know T or R
  - I.e. we don't know which states are good or what the actions do
  - Must actually try actions and states out to learn
- Quantities: V(s), Q(s, a) are expected future returns

Q-Learning
- Learn Q*(s, a) values:
  - Receive a sample (s, a, s', r)
  - Consider your old estimate: Q(s, a)
  - Consider your new sample estimate: sample = r + γ max_a' Q(s', a')
  - Nudge the old estimate towards the new sample: Q(s, a) ← (1 − α) Q(s, a) + α [sample]

Q-Learning Properties
- Will converge to the optimal policy:
  - If you explore enough
  - If you make the learning rate small enough
  - But don't decrease it too quickly!
- Neat property: learns optimal q-values regardless of action selection noise (some caveats)

Exploration / Exploitation
- Several schemes for forcing exploration
- Simplest: random actions (ε-greedy)
  - Every time step, flip a coin
  - With probability ε, act randomly
  - With probability 1 − ε, act according to the current policy
- Problems with random actions?
  - You do explore the space, but you keep thrashing around once learning is done
  - One solution: lower ε over time
  - Another solution: exploration functions

Exploration Functions
- When to explore:
  - Random actions: explore a fixed amount
  - Better idea: explore areas whose badness is not (yet) established
- Exploration function:
  - Takes a value estimate u and a visit count n, and returns an optimistic utility, e.g. f(u, n) = u + k/n (exact form not important)

Q-Learning
- Q-learning produces tables of q-values, one entry per (state, action) pair
- In realistic situations, we cannot possibly learn about every single state!
  - Too many states to visit them all in training
  - Too many states to hold the q-tables in memory
- Instead, we want to generalize:
  - Learn about some small number of training states from experience
  - Generalize that experience to new, similar states
  - This is a fundamental idea in machine learning, and we'll see it over and over again

Example: Pacman
- Let's say we discover through experience that this state is bad: [figure: a Pacman board position]
- In naïve q-learning, we know nothing about this state or its q-states: [figure: a nearly identical position]
- Or even this one! [figure]

Feature-Based Representations
- Solution: describe a state using a vector of features
  - Features are functions from states to real numbers (often 0/1) that capture important properties of the state
  - Example features:
    - Distance to the closest ghost
    - Distance to the closest dot
    - Number of ghosts
    - 1 / (distance to dot)²
    - Is Pacman in a tunnel? (0/1)
    - … etc.
  - Can also describe a q-state (s, a) with features (e.g. action moves closer to food)

Linear Feature Functions
- Using a feature representation, we can write a q-function (or value function) for any state using a few weights:
  - V(s) = w_1 f_1(s) + w_2 f_2(s) + … + w_n f_n(s)
  - Q(s, a) = w_1 f_1(s, a) + w_2 f_2(s, a) + … + w_n f_n(s, a)
- Advantage: our experience is summed up in a few powerful numbers
- Disadvantage: states may share features but be very different in value!

Function Approximation
- Q-learning with linear q-functions:
  - difference = [r + γ max_a' Q(s', a')] − Q(s, a)
  - Exact q's: Q(s, a) ← Q(s, a) + α [difference]
  - Approximate q's: w_i ← w_i + α [difference] f_i(s, a)
- Intuitive interpretation:
  - Adjust the weights of the active features
  - E.g. if something unexpectedly bad happens, disprefer all states with that state's features
- Formal justification: online least squares
- A small code sketch of this update follows.
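To make the weight update concrete, here is a minimal sketch of approximate q-learning with a linear q-function. It assumes a feature extractor featureVector(state, action) that returns a dict of feature values and that the caller supplies the legal actions; these names, the ApproximateQAgent class, and the ε-greedy action choice are illustrative, not the course project's actual API.

```python
import random
from collections import defaultdict

class ApproximateQAgent:
    """Sketch of q-learning with a linear q-function over features."""

    def __init__(self, featureVector, alpha=0.1, gamma=0.9, epsilon=0.05):
        self.featureVector = featureVector   # (state, action) -> {featureName: value}
        self.weights = defaultdict(float)    # one weight per feature, initially 0
        self.alpha = alpha                   # learning rate
        self.gamma = gamma                   # discount
        self.epsilon = epsilon               # exploration rate for epsilon-greedy

    def getQValue(self, state, action):
        # Q(s,a) = w_1 f_1(s,a) + w_2 f_2(s,a) + ... + w_n f_n(s,a)
        feats = self.featureVector(state, action)
        return sum(self.weights[f] * value for f, value in feats.items())

    def getValue(self, state, legalActions):
        # V(s) = max_a Q(s,a); terminal states (no legal actions) have value 0
        if not legalActions:
            return 0.0
        return max(self.getQValue(state, a) for a in legalActions)

    def getAction(self, state, legalActions):
        # epsilon-greedy: explore with probability epsilon, otherwise act greedily
        if random.random() < self.epsilon:
            return random.choice(legalActions)
        return max(legalActions, key=lambda a: self.getQValue(state, a))

    def update(self, state, action, nextState, reward, nextLegalActions):
        # difference = [r + gamma * max_a' Q(s',a')] - Q(s,a)
        difference = (reward
                      + self.gamma * self.getValue(nextState, nextLegalActions)
                      - self.getQValue(state, action))
        # w_i <- w_i + alpha * difference * f_i(s,a): only the active features move
        for f, value in self.featureVector(state, action).items():
            self.weights[f] += self.alpha * difference * value
```

With feature dicts like {'dist-to-dot': 0.5, 'ghost-one-step-away': 1.0}, a single bad surprise lowers the weights of exactly the features that were active in that state, which is the "disprefer all states with that state's features" behavior described above.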
Example: Q-Pacman
- [figure and worked numerical example of the approximate q-update on a Pacman state]

Linear Regression
- Given example points (x, y), predict the value y at a new point x
- [figures: fitted lines through example points]

Ordinary Least Squares (OLS)
- [figure: an observation, its prediction, and the error or "residual" between them]

Minimizing Error
- For one point, error(w) = ½ (y − Σ_k w_k f_k(x))²
- A gradient step on this error gives w_m ← w_m + α (y − Σ_k w_k f_k(x)) f_m(x)
- Value update explained: with the sample r + γ max_a' Q(s', a') as the target y and f_k(s, a) as the features, this is exactly the approximate q-update above

Overfitting
- [figure: a degree 15 polynomial fit to the training points] [DEMO]

Policy Search
- Problem: often the feature-based policies that work well aren't the ones that approximate V / Q best
  - E.g. your value functions from project 2 were probably horrible estimates of future rewards, but they still produced good decisions
  - We'll see this distinction between modeling and prediction again later in the course
- Solution: learn the policy that maximizes rewards rather than the value that predicts rewards
- This is the idea behind policy search, such as what controlled the upside-down helicopter

Policy Search
- Simplest policy search:
  - Start with an initial linear value function or q-function
  - Nudge each feature weight up and down and see if your policy is better than before (a small sketch appears at the end of these notes)
- Problems:
  - How do we tell the policy got better? Need to run many sample episodes!
  - If there are a lot of features, this can be impractical

Policy Search*
- Advanced policy search:
  - Write a stochastic (soft) policy, e.g. a softmax over the linear q-values
  - It turns out you can efficiently approximate the derivative of the returns with respect to the parameters w (details in the book, but you don't have to know them)
  - Take uphill steps, recalculate derivatives, etc.

Take a Deep Breath…
- We're done with search and planning!
- Next, we'll look at how to reason with probabilities:
  - Diagnosis
  - Tracking objects
  - Speech recognition
  - Robot mapping
  - … lots more!
- Last part of the course: machine learning
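Below is the toy hill-climbing sketch referenced in the Policy Search notes above. It nudges one feature weight at a time and keeps the change only if the resulting policy plays better. runEpisodes(weights, n) is an assumed helper that plays n episodes with a policy greedy in the linear q-function given by weights and returns the average reward; it, and the other names here, are hypothetical rather than part of any course codebase.

```python
import random

def hillClimbPolicySearch(weights, runEpisodes,
                          stepSize=0.05, episodesPerEval=100, iterations=50):
    """Toy hill climbing over the weights of a linear q-function policy."""
    best = dict(weights)
    bestScore = runEpisodes(best, episodesPerEval)   # expensive: many sample episodes
    for _ in range(iterations):
        # nudge one randomly chosen feature weight up or down
        feature = random.choice(list(best))
        candidate = dict(best)
        candidate[feature] += random.choice([-stepSize, stepSize])
        score = runEpisodes(candidate, episodesPerEval)
        # keep the change only if the resulting policy actually plays better
        if score > bestScore:
            best, bestScore = candidate, score
    return best
```

Note that every candidate evaluation costs episodesPerEval whole episodes, which is exactly the "need to run many sample episodes" problem noted in the Policy Search slide.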