OSU CS 533 - Reinforcement Learning

Contents
- PowerPoint Presentation
- So far ….
- Reinforcement Learning
- Pure Reinforcement Learning vs. Monte-Carlo Planning
- Slide 5
- Passive vs. Active learning
- Model-Based vs. Model-Free RL
- Small vs. Huge MDPs
- Example: Passive RL
- Objective: Value Function
- Passive RL
- Approach 1: Direct Estimation
- Direct Estimation
- Approach 2: Adaptive Dynamic Programming (ADP)
- ADP learning curves
- Approach 3: Temporal Difference Learning (TD)
- Aside: Online Mean Estimation
- Slide 18
- Slide 19
- Slide 20
- Slide 21
- The TD learning curve
- Passive RL: Comparisons
- Between ADP and TD
- Active Reinforcement Learning
- Naïve Model-Based Approach
- Revision of Naïve Approach
- Exploration versus Exploitation
- ADP-based (model-based) RL
- Explore/Exploit Policies
- Slide 31
- Slide 32
- The Impact of Temperature
- Alternative Model-Based Approach: Optimistic Exploration
- Optimistic Exploration
- Optimistic Value Iteration
- Optimistic Exploration: Review
- Another View of Optimistic Exploration: The Rmax Algorithm
- Rmax: Optimistic Model
- Slide 40
- TD-based Active RL
- Slide 42
- TD-Based Active Learning
- Q-Learning: Model-Free RL
- Slide 45
- Q-Learning
- Q-Learning: Speedup for Goal-Based Problems
- Slide 48
- Q-Learning: Suggestions for Mini Project 2
- Active Reinforcement Learning Summary
- ADP vs. TD vs. Q
- What about large state spaces?

Slide 1: Reinforcement Learning
Alan Fern
* Based in part on slides by Daniel Weld

Slide 2: So far ….
- Given an MDP model we know how to find optimal policies (for moderately-sized MDPs)
  - Value Iteration or Policy Iteration
- Given just a simulator of an MDP we know how to select actions
  - Monte-Carlo Planning
- What if we don't have a model or simulator?
  - Like when we were babies . . .
  - Like in many real-world applications
- All we can do is wander around the world observing what happens, getting rewarded and punished
- Enters reinforcement learning

Slide 3: Reinforcement Learning
- No knowledge of environment
  - Can only act in the world and observe states and reward
- Many factors make RL difficult:
  - Actions have non-deterministic effects, which are initially unknown
  - Rewards / punishments are infrequent
    - Often at the end of long sequences of actions
    - How do we determine what action(s) were really responsible for reward or punishment? (credit assignment)
  - World is large and complex
- Nevertheless the learner must decide what actions to take
- We will assume the world behaves as an MDP

Slide 4: Pure Reinforcement Learning vs. Monte-Carlo Planning
- In pure reinforcement learning:
  - the agent begins with no knowledge
  - wanders around the world observing outcomes
- In Monte-Carlo planning:
  - the agent begins with no declarative knowledge of the world
  - has an interface to a world simulator that allows observing the outcome of taking any action in any state
- The simulator gives the agent the ability to "teleport" to any state, at any time, and then apply any action
- A pure RL agent does not have the ability to teleport
  - Can only observe the outcomes that it happens to reach

Slide 5: Pure Reinforcement Learning vs. Monte-Carlo Planning
- MC planning is sometimes called RL with a "strong simulator"
  - I.e. a simulator where we can set the current state to any state at any moment
- Pure RL is sometimes called RL with a "weak simulator"
  - I.e. a simulator where we cannot set the state
- A strong simulator can emulate a weak simulator
  - So pure RL can be used in the MC planning framework
  - But not vice versa
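To make the strong/weak simulator distinction concrete, here is a minimal Python sketch (the class and method names are my own, not from the slides): a strong simulator lets the caller sample the outcome of any action from any chosen state, while a weak simulator only lets the agent act from whatever state it currently happens to be in. Wrapping the strong simulator also illustrates why a strong simulator can emulate a weak one.

```python
import random

class StrongSimulator:
    """MC-planning interface: the caller can 'teleport' to any state."""
    def __init__(self, transition, reward):
        self.transition = transition  # dict: (s, a) -> list of (prob, next_state)
        self.reward = reward          # dict: state -> reward received on entering it

    def sample(self, s, a):
        """Sample the outcome of taking action a in ANY chosen state s."""
        probs, next_states = zip(*self.transition[(s, a)])
        s2 = random.choices(next_states, weights=probs)[0]
        return s2, self.reward.get(s2, 0.0)

class WeakSimulator:
    """Pure-RL interface: the agent can only act from its current state."""
    def __init__(self, strong_sim, start_state):
        self._sim = strong_sim
        self._state = start_state     # cannot be reset from outside

    def act(self, a):
        """Take action a in the current state; observe next state and reward."""
        self._state, r = self._sim.sample(self._state, a)
        return self._state, r
```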
Active learningPassive learningThe agent has a fixed policy and tries to learn the utilities of states by observing the world go byAnalogous to policy evaluationOften serves as a component of active learning algorithmsOften inspires active learning algorithmsActive learningThe agent attempts to find an optimal (or at least good) policy by acting in the worldAnalogous to solving the underlying MDP, but without first being given the MDP model7Model-Based vs. Model-Free RLModel based approach to RL: learn the MDP model, or an approximation of ituse it for policy evaluation or to find the optimal policyModel free approach to RL:derive the optimal policy without explicitly learning the model useful when model is difficult to represent and/or learnWe will consider both types of approaches8Small vs. Huge MDPsWe will first cover RL methods for small MDPsMDPs where the number of states and actions is reasonably smallThese algorithms will inspire more advanced methodsLater we will cover algorithms for huge MDPsFunction Approximation MethodsPolicy Gradient MethodsLeast-Squares Policy Iteration9Example: Passive RLSuppose given a stationary policy (shown by arrows)Actions can stochastically lead to unintended grid cellWant to determine how good it is10Objective: Value Function11Passive RLEstimate V(s)Not given transition matrix, nor reward function!Follow the policy for many epochs giving training sequences.Assume that after entering +1 or -1 state the agent enters zero reward terminal stateSo we don’t bother showing those transitions(1,1)(1,2)(1,3)(1,2)(1,3) (2,3)(3,3) (3,4) +1(1,1)(1,2)(1,3)(2,3)(3,3) (3,2)(3,3)(3,4) +1(1,1)(2,1)(3,1)(3,2)(4,2) -112Approach 1: Direct EstimationDirect estimation (also called Monte Carlo)Estimate V(s) as average total reward of epochs containing s (calculating from s to end of epoch)Reward to go of a state s the sum of the (discounted) rewards from that state until a terminal state is reachedKey: use observed reward to go of the state as the direct evidence of the actual expected utility of that stateAveraging the reward-to-go samples will converge to true value at state13Direct EstimationConverge very slowly to correct utilities values (requires a lot of sequences)Doesn’t exploit Bellman constraints on policy valuesIt is happy to consider value function estimates that violate this property badly.)'()',,()()('sVsasTsRsVsHow can we incorporate the Bellman constraints?14Approach 2: Adaptive Dynamic Programming (ADP)ADP is a model based approachFollow the policy for awhileEstimate transition model based on observationsLearn reward functionUse estimated model to compute utility of policyHow can we estimate transition model T(s,a,s’)?Simply the fraction of times we see s’ after taking a in state s.NOTE: Can bound error with Chernoff bounds if we want )'()',,()()('sVsasTsRsVslearned15ADP learning

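And a minimal sketch of the ADP idea from slide 14, under the same illustrative assumptions (the function names and data layout are mine): count observed transitions to estimate T(s,a,s') as a fraction, average observed rewards to estimate R(s), then repeatedly apply the fixed-policy Bellman equation with the learned model.

```python
from collections import defaultdict

def learn_model(trajectories):
    """Estimate T(s,a,s') as the fraction of times s' followed (s, a),
    and R(s) as the average reward observed in s."""
    counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
    reward_sum, reward_n = defaultdict(float), defaultdict(int)
    for traj in trajectories:                       # traj: list of (s, a, r, s')
        for s, a, r, s2 in traj:
            counts[(s, a)][s2] += 1
            reward_sum[s] += r
            reward_n[s] += 1
    T = {sa: {s2: c / sum(succ.values()) for s2, c in succ.items()}
         for sa, succ in counts.items()}
    R = {s: reward_sum[s] / reward_n[s] for s in reward_sum}
    return T, R

def evaluate_policy(policy, T, R, gamma=0.9, sweeps=200):
    """Iterate V(s) <- R(s) + gamma * sum_{s'} T(s, policy[s], s') * V(s')
    with the learned model until (approximately) converged."""
    V = defaultdict(float)  # unseen / terminal states default to value 0
    for _ in range(sweeps):
        for s in R:
            succ = T.get((s, policy[s]), {})
            V[s] = R[s] + gamma * sum(p * V[s2] for s2, p in succ.items())
    return dict(V)
```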
