CS 188: Artificial Intelligence
Fall 2008
Lecture 12: Reinforcement Learning
10/7/2008
Dan Klein – UC Berkeley
Many slides over the course adapted from either Stuart Russell or Andrew Moore

Reinforcement Learning
- Still have an MDP:
  - A set of states s ∈ S
  - A set of actions (per state) A
  - A model T(s, a, s')
  - A reward function R(s, a, s')
- Still looking for a policy π(s)
- New twist: don't know T or R
  - I.e., we don't know which states are good or what the actions do
  - Must actually try out actions and states to learn
[DEMO]

Model-Free Learning
- Temporal difference learning
  - Update each time we experience a transition from s via π(s) to s':
    Vπ(s) ← (1−α) Vπ(s) + α [R(s, π(s), s') + γ Vπ(s')]
  - Frequent outcomes will contribute more updates (over time)

Q-Learning
- Learn Q*(s, a) values
  - Receive a sample (s, a, s', r)
  - Consider your old estimate: Q(s, a)
  - Consider your new sample estimate: sample = r + γ max_a' Q(s', a')
  - Incorporate the new estimate into a running average:
    Q(s, a) ← (1−α) Q(s, a) + α · sample
[DEMO – Grid Q's]

Q-Learning Properties
- Will converge to the optimal policy
  - If you explore enough
  - If you make the learning rate small enough
  - … but don't decrease it too quickly!
  - Basically, it doesn't matter how you select actions (!)
- Neat property: learns optimal q-values regardless of action selection noise (some caveats)
[DEMO – Grid Q's]
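To make the running-average update concrete, here is a minimal sketch of tabular Q-learning in Python. This is not the course's project code: the names (q_update, legal_actions) and the constant values are illustrative assumptions.

    from collections import defaultdict

    # Q-table mapping (state, action) pairs to estimated q-values.
    Q = defaultdict(float)

    ALPHA = 0.1  # learning rate (illustrative)
    GAMMA = 0.9  # discount (illustrative)

    def q_update(s, a, r, s_prime, legal_actions):
        """Fold one observed sample (s, a, s', r) into the running average.
        legal_actions(state) is an assumed helper returning available actions."""
        nxt = legal_actions(s_prime)
        # New sample estimate: reward plus discounted value of the best next
        # action (just the reward if s' is terminal).
        sample = r if not nxt else r + GAMMA * max(Q[(s_prime, a2)] for a2 in nxt)
        # Running average of the old estimate and the new sample.
        Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * sample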
Exploration / Exploitation
- Several schemes for forcing exploration
- Simplest: random actions (ε-greedy)
  - Every time step, flip a coin
  - With probability ε, act randomly
  - With probability 1−ε, act according to the current policy
- Problems with random actions?
  - You do explore the space, but keep thrashing around once learning is done
  - One solution: lower ε over time
  - Another solution: exploration functions
[DEMO – RL Pacman]

Exploration Functions
- When to explore
  - Random actions: explore a fixed amount
  - Better idea: explore areas whose badness is not (yet) established
- Exploration function
  - Takes a value estimate u and a visit count n, and returns an optimistic utility, e.g. f(u, n) = u + k/n (exact form not important)

Q-Learning
- Q-learning produces tables of q-values:
[DEMO – Crawler Q's]

Q-Learning
- In realistic situations, we cannot possibly learn about every single state!
  - Too many states to visit them all in training
  - Too many states to hold the q-tables in memory
- Instead, we want to generalize:
  - Learn about some small number of training states from experience
  - Generalize that experience to new, similar states
  - This is a fundamental idea in machine learning, and we'll see it over and over again

Example: Pacman
- Let's say we discover through experience that this state is bad:
- In naïve q-learning, we know nothing about this state or its q-states:
- Or even this one!
[figures: three nearly identical Pacman states]

Feature-Based Representations
- Solution: describe a state using a vector of features
  - Features are functions from states to real numbers (often 0/1) that capture important properties of the state
  - Example features:
    - Distance to closest ghost
    - Distance to closest dot
    - Number of ghosts
    - 1 / (distance to dot)²
    - Is Pacman in a tunnel? (0/1)
    - … etc.
    - Is it the exact state on this slide?
  - Can also describe a q-state (s, a) with features (e.g. action moves closer to food)

Linear Feature Functions
- Using a feature representation, we can write a q-function (or value function) for any state using a few weights:
  V(s) = w1·f1(s) + w2·f2(s) + … + wn·fn(s)
  Q(s, a) = w1·f1(s, a) + w2·f2(s, a) + … + wn·fn(s, a)
- Advantage: our experience is summed up in a few powerful numbers
- Disadvantage: states may share features but be very different in value!

Function Approximation
- Q-learning with linear q-functions:
  - On each transition (s, a, r, s'), compute
    difference = [r + γ max_a' Q(s', a')] − Q(s, a)
  - Then update every weight:
    wi ← wi + α · difference · fi(s, a)
- Intuitive interpretation:
  - Adjust weights of active features
  - E.g., if something unexpectedly bad happens, disprefer all states with that state's features
- Formal justification: online least squares

Example: Q-Pacman
[figure: a worked numerical update of a linear q-function on a single Pacman transition]
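A minimal sketch of this linear-feature update, under stated assumptions: features(s, a) is a hypothetical feature extractor returning a dict of feature values for a q-state, and all names and constants are illustrative rather than the project's API.

    ALPHA = 0.05  # learning rate (illustrative)
    GAMMA = 0.9   # discount (illustrative)

    # weights maps feature names to learned weights.
    weights = {}

    def q_value(s, a, features):
        # Linear q-function: Q(s, a) = sum_i w_i * f_i(s, a)
        return sum(weights.get(name, 0.0) * v for name, v in features(s, a).items())

    def approx_q_update(s, a, r, s_prime, legal_actions, features):
        """One transition's worth of Q-learning with a linear q-function."""
        nxt = legal_actions(s_prime)
        target = r if not nxt else r + GAMMA * max(
            q_value(s_prime, a2, features) for a2 in nxt)
        difference = target - q_value(s, a, features)  # "target" minus "prediction"
        # Adjust the weights of the active features:
        # w_i <- w_i + alpha * difference * f_i(s, a)
        for name, v in features(s, a).items():
            weights[name] = weights.get(name, 0.0) + ALPHA * difference * v

Note that the loop only changes weights of features that are active (nonzero) for (s, a), which is exactly the slide's intuition about adjusting the weights of active features.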
Linear Regression
- Given examples (xi, yi), predict y for a new query point x
[figures: a line fit to 2D points; a plane fit to 3D points]

Linear Regression
- Prediction with one feature: ŷ = w0 + w1·x
- More generally, with features fk: ŷ = Σk wk·fk(x)

Ordinary Least Squares (OLS)
[figure: observations vs. a fitted line; the vertical gap between an observation and its prediction is the error or "residual"]

Minimizing Error
- For a single point x with target y, the squared error is
  error(w) = ½ (y − Σk wk·fk(x))²
- Gradient descent on the weights gives the update
  wm ← wm + α (y − Σk wk·fk(x)) · fm(x)
- Approximate q update explained: it is exactly this step, with target y = r + γ max_a' Q(s', a') and prediction Q(s, a)

Overfitting
[figure: a degree-15 polynomial fit to a handful of points, oscillating wildly between them]
[DEMO]

Policy Search
[DEMO – Helicopter]

Policy Search
- Problem: often the feature-based policies that work well aren't the ones that approximate V / Q best
  - E.g. your value functions from project 2 were probably horrible estimates of future rewards, but they still produced good decisions
  - We'll see this distinction between modeling and prediction again later in the course
- Solution: learn the policy that maximizes rewards, rather than the value that predicts rewards
- This is the idea behind policy search, such as what controlled the upside-down helicopter

Policy Search
- Simplest policy search (a rough sketch appears at the end of these notes):
  - Start with an initial linear value function or q-function
  - Nudge each feature weight up and down and see if your policy is better than before
- Problems:
  - How do we tell the policy got better?
  - Need to run many sample episodes!
  - If there are a lot of features, this can be impractical

Policy Search*
- Advanced policy search:
  - Write a stochastic (soft) policy, e.g. πw(a|s) ∝ e^Qw(s,a)
  - Turns out you can efficiently approximate the derivative of the returns with respect to the parameters w (details in the book, optional material)
  - Take uphill steps, recalculate derivatives, etc.

Take a Deep Breath…
- We're done with search and planning!
- Next, we'll look at how to reason with probabilities
  - Diagnosis
  - Tracking objects
  - Speech recognition
  - Robot mapping
  - … lots more!
- Last part of course: machine learning
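As a concrete companion to the "simplest policy search" slide above, here is a rough hill-climbing sketch. Everything in it is an assumed placeholder, not course code; in particular, evaluate_policy(weights) stands in for the expensive step of running many sample episodes and averaging their returns.

    def hill_climb_policy_search(weights, evaluate_policy, step=0.1, rounds=100):
        """Crude policy search: nudge one feature weight at a time and keep
        any change that improves the estimated returns."""
        best_score = evaluate_policy(weights)
        for _ in range(rounds):
            for name in list(weights):
                for delta in (+step, -step):
                    weights[name] += delta
                    score = evaluate_policy(weights)
                    if score > best_score:
                        best_score = score        # keep the improving nudge
                    else:
                        weights[name] -= delta    # revert it
        return weights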