Slide 1: Reinforcement Learning
Alan Fern (based in part on slides by Daniel Weld)

Slide 2: So Far ...
- Given an MDP model, we know how to find optimal policies (for moderately-sized MDPs): Value Iteration or Policy Iteration.
- Given just a simulator of an MDP, we know how to select actions: Monte-Carlo Planning.
- What if we don't have a model or a simulator? Like when we were babies, and like in many real-world applications, all we can do is wander around the world observing what happens, getting rewarded and punished.
- Enter reinforcement learning.

Slide 3: Reinforcement Learning
- The agent has no knowledge of the environment; it can only act in the world and observe states and rewards.
- Many factors make RL difficult:
  - Actions have non-deterministic effects, which are initially unknown.
  - Rewards and punishments are infrequent, often arriving only at the end of long sequences of actions. How do we determine which action(s) were really responsible for a reward or punishment? (This is the credit assignment problem.)
  - The world is large and complex.
- Nevertheless, the learner must decide what actions to take.
- We will assume the world behaves as an MDP.

Slide 4: Pure Reinforcement Learning vs. Monte-Carlo Planning
- In pure reinforcement learning, the agent begins with no knowledge and wanders around the world observing outcomes.
- In Monte-Carlo planning, the agent also begins with no declarative knowledge of the world, but it has an interface to a world simulator that allows observing the outcome of taking any action in any state.
- The simulator gives the agent the ability to "teleport" to any state, at any time, and then apply any action.
- A pure RL agent cannot teleport; it can only observe the outcomes in states it happens to reach.

Slide 5: Pure Reinforcement Learning vs. Monte-Carlo Planning
- MC planning is sometimes called RL with a "strong simulator": a simulator whose current state we can set to any state at any moment.
- Pure RL is sometimes called RL with a "weak simulator": a simulator whose state we cannot set.
- A strong simulator can emulate a weak simulator, so pure RL methods can be used in the MC planning framework, but not vice versa.
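To make the strong/weak distinction concrete, here is a minimal Python sketch of the two simulator interfaces. The class and method names (WeakSimulator, StrongSimulator, step, set_state) are illustrative assumptions, not from the slides: a strong simulator adds a "teleport" operation on top of a weak one.

```python
import random

class WeakSimulator:
    """Weak simulator: can only step forward from its current state."""
    def __init__(self, transition_fn, reward_fn, start_state):
        self._transition_fn = transition_fn  # (state, action) -> (next_states, probabilities)
        self._reward_fn = reward_fn          # state -> reward
        self._state = start_state

    def step(self, action):
        """Apply an action in the current state; observe the next state and reward."""
        next_states, probs = self._transition_fn(self._state, action)
        self._state = random.choices(next_states, weights=probs)[0]
        return self._state, self._reward_fn(self._state)

class StrongSimulator(WeakSimulator):
    """Strong simulator: additionally supports 'teleporting' to any state."""
    def set_state(self, state):
        self._state = state
```

A strong simulator emulates a weak one by simply never calling set_state, which is why any pure RL algorithm also runs in the MC planning framework, but not vice versa.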
Slide 6: Passive vs. Active Learning
- Passive learning: the agent has a fixed policy and tries to learn the utilities of states by observing the world go by.
  - Analogous to policy evaluation.
  - Often serves as a component of active learning algorithms, and often inspires them.
- Active learning: the agent attempts to find an optimal (or at least good) policy by acting in the world.
  - Analogous to solving the underlying MDP, but without first being given the MDP model.

Slide 7: Model-Based vs. Model-Free RL
- Model-based approach: learn the MDP model, or an approximation of it, and use it for policy evaluation or to find the optimal policy.
- Model-free approach: derive the optimal policy without explicitly learning the model. Useful when the model is difficult to represent and/or learn.
- We will consider both types of approaches.

Slide 8: Small vs. Huge MDPs
- We will first cover RL methods for small MDPs, i.e. MDPs where the number of states and actions is reasonably small. These algorithms will inspire more advanced methods.
- Later we will cover algorithms for huge MDPs:
  - Function Approximation Methods
  - Policy Gradient Methods
  - Least-Squares Policy Iteration

Slide 9: Example: Passive RL
- Suppose we are given a stationary policy (shown by arrows in the grid-world figure).
- Actions can stochastically lead to an unintended grid cell.
- We want to determine how good the policy is.

Slide 10: Objective: Value Function
[figure only in the original slide]

Slide 11: Passive RL
- Estimate V^π(s). We are given neither the transition matrix nor the reward function!
- Follow the policy for many epochs, giving training sequences such as:
    (1,1) -> (1,2) -> (1,3) -> (1,2) -> (1,3) -> (2,3) -> (3,3) -> (3,4)  +1
    (1,1) -> (1,2) -> (1,3) -> (2,3) -> (3,3) -> (3,2) -> (3,3) -> (3,4)  +1
    (1,1) -> (2,1) -> (3,1) -> (3,2) -> (4,2)  -1
- Assume that after entering the +1 or -1 state the agent enters a zero-reward terminal state, so we don't bother showing those transitions.

Slide 12: Approach 1: Direct Estimation
- Direct estimation (also called Monte Carlo): estimate V^π(s) as the average total reward of epochs containing s (calculated from s to the end of the epoch).
- The "reward to go" of a state s is the sum of the (discounted) rewards from that state until a terminal state is reached.
- Key idea: use the observed reward to go of a state as direct evidence of the actual expected utility of that state.
- Averaging the reward-to-go samples converges to the true value of the state. (A minimal implementation sketch appears at the end of this excerpt.)

Slide 13: Direct Estimation
- Converges very slowly to the correct utility values (requires a lot of sequences).
- Doesn't exploit the Bellman constraints on policy values:
    V^π(s) = R(s) + γ Σ_{s'} T(s, a, s') V^π(s'),  where a = π(s)
  Direct estimation is happy to consider value-function estimates that badly violate this property.
- How can we incorporate the Bellman constraints?

Slide 14: Approach 2: Adaptive Dynamic Programming (ADP)
- ADP is a model-based approach:
  - Follow the policy for a while.
  - Estimate the transition model based on observations.
  - Learn the reward function.
  - Use the estimated model to compute the utility of the policy, i.e. solve the Bellman equation above with the learned T and R.
- How can we estimate the transition model T(s,a,s')? Simply as the fraction of times we see s' after taking a in state s.
- NOTE: we can bound the error of these estimates with Chernoff bounds if we want. (A sketch of ADP also appears at the end of this excerpt.)

Slide 15: ADP Learning Curves
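The implementation sketch referenced on slide 12: a minimal Python version of direct (Monte Carlo) estimation. It assumes each training epoch is given as a list of (state, reward) pairs, with all non-terminal rewards taken as zero for illustration, since the slides only show the terminal +1/-1; the data format and function name are illustrative choices.

```python
from collections import defaultdict

def direct_estimation(epochs, gamma=1.0):
    """Estimate V(s) as the average observed reward-to-go over every
    occurrence of s in the training epochs (every-visit averaging)."""
    totals = defaultdict(float)  # sum of reward-to-go samples per state
    counts = defaultdict(int)    # number of samples per state
    for epoch in epochs:
        reward_to_go = 0.0
        # Walk the epoch backwards so reward-to-go accumulates in one pass.
        for state, reward in reversed(epoch):
            reward_to_go = reward + gamma * reward_to_go
            totals[state] += reward_to_go
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}

# The three training sequences from slide 11:
epochs = [
    [((1,1),0), ((1,2),0), ((1,3),0), ((1,2),0), ((1,3),0),
     ((2,3),0), ((3,3),0), ((3,4),1)],
    [((1,1),0), ((1,2),0), ((1,3),0), ((2,3),0), ((3,3),0),
     ((3,2),0), ((3,3),0), ((3,4),1)],
    [((1,1),0), ((2,1),0), ((3,1),0), ((3,2),0), ((4,2),-1)],
]
print(direct_estimation(epochs))  # e.g. V(1,1) = (1 + 1 - 1) / 3
```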
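And the sketch referenced on slide 14: a minimal model-based (ADP-style) passive learner. It estimates T(s,a,s') as the observed fraction of s' outcomes, averages observed rewards to get R(s), and then evaluates the fixed policy by repeatedly applying the Bellman equation over the learned model. The function names, the (s, a, r, s') observation format, and the simple fixed-point solver are assumptions for illustration, not prescribed by the slides.

```python
from collections import defaultdict

def estimate_model(observations):
    """observations: list of (s, a, r, s') tuples gathered while following
    the policy, where r is the reward received in state s.
    Returns learned T[(s, a)] = {s': probability} and R[s]."""
    counts = defaultdict(lambda: defaultdict(int))
    rewards = defaultdict(list)
    for s, a, r, s2 in observations:
        counts[(s, a)][s2] += 1
        rewards[s].append(r)
    T = {}
    for (s, a), outcomes in counts.items():
        n = sum(outcomes.values())
        T[(s, a)] = {s2: c / n for s2, c in outcomes.items()}  # empirical fractions
    R = {s: sum(rs) / len(rs) for s, rs in rewards.items()}
    return T, R

def evaluate_policy(policy, T, R, gamma=0.9, iterations=500):
    """Fixed-point iteration of V(s) <- R(s) + gamma * sum_s' T(s,pi(s),s') V(s')
    over the learned model; unvisited states default to value 0."""
    V = defaultdict(float)
    for _ in range(iterations):
        for s, a in policy.items():
            outcomes = T.get((s, a), {})
            V[s] = R.get(s, 0.0) + gamma * sum(p * V[s2] for s2, p in outcomes.items())
    return dict(V)
```

Because each learned T(s,a,s') entry is an empirical fraction, a Chernoff bound quantifies how far it can be from the true transition probability, as slide 14 notes.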