Slide 1: Reinforcement Learning
Alan Fern (based in part on slides by Daniel Weld)

Slide 2: So Far ...
- Given an MDP model, we know how to find optimal policies (for moderately-sized MDPs): Value Iteration or Policy Iteration.
- Given just a simulator of an MDP, we know how to select actions: Monte-Carlo Planning.
- What if we don't have a model or a simulator? Like when we were babies, and like in many real-world applications, all we can do is wander around the world observing what happens, getting rewarded and punished.
- Enter reinforcement learning.

Slide 3: Reinforcement Learning
- The agent has no knowledge of the environment; it can only act in the world and observe states and rewards.
- Many factors make RL difficult:
  - Actions have non-deterministic effects, which are initially unknown.
  - Rewards and punishments are infrequent, often arriving only at the end of long sequences of actions. How do we determine which action(s) were really responsible for a reward or punishment? (This is the credit assignment problem.)
  - The world is large and complex.
- Nevertheless, the learner must decide what actions to take.
- We will assume the world behaves as an MDP.

Slide 4: Pure Reinforcement Learning vs. Monte-Carlo Planning
- In pure reinforcement learning, the agent begins with no knowledge and wanders around the world observing outcomes.
- In Monte-Carlo planning, the agent also begins with no declarative knowledge of the world, but it has an interface to a world simulator that allows observing the outcome of taking any action in any state.
- The simulator gives the agent the ability to "teleport" to any state, at any time, and then apply any action.
- A pure RL agent cannot teleport; it can only observe the outcomes in states it happens to reach.

Slide 5: Pure Reinforcement Learning vs. Monte-Carlo Planning
- MC planning is sometimes called RL with a "strong simulator": a simulator whose current state we can set to any state at any moment.
- Pure RL is sometimes called RL with a "weak simulator": a simulator whose state we cannot set.
- A strong simulator can emulate a weak simulator, so pure RL methods can be used in the MC planning framework, but not vice versa.
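To make the strong/weak distinction concrete, here is a minimal Python sketch of the two simulator interfaces. The class and method names (WeakSimulator, StrongSimulator, step, set_state) are illustrative assumptions, not from the slides: a strong simulator adds a "teleport" operation on top of a weak one.

```python
import random

class WeakSimulator:
    """Weak simulator: can only step forward from its current state."""
    def __init__(self, transition_fn, reward_fn, start_state):
        self._transition_fn = transition_fn  # (state, action) -> (next_states, probabilities)
        self._reward_fn = reward_fn          # state -> reward
        self._state = start_state

    def step(self, action):
        """Apply an action in the current state; observe the next state and reward."""
        next_states, probs = self._transition_fn(self._state, action)
        self._state = random.choices(next_states, weights=probs)[0]
        return self._state, self._reward_fn(self._state)

class StrongSimulator(WeakSimulator):
    """Strong simulator: additionally supports 'teleporting' to any state."""
    def set_state(self, state):
        self._state = state
```

A strong simulator emulates a weak one by simply never calling set_state, which is why any pure RL algorithm also runs in the MC planning framework, but not vice versa.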
Slide 6: Passive vs. Active Learning
- Passive learning: the agent has a fixed policy and tries to learn the utilities of states by observing the world go by.
  - Analogous to policy evaluation.
  - Often serves as a component of active learning algorithms, and often inspires them.
- Active learning: the agent attempts to find an optimal (or at least good) policy by acting in the world.
  - Analogous to solving the underlying MDP, but without first being given the MDP model.

Slide 7: Model-Based vs. Model-Free RL
- Model-based approach: learn the MDP model, or an approximation of it, and use it for policy evaluation or to find the optimal policy.
- Model-free approach: derive the optimal policy without explicitly learning the model. Useful when the model is difficult to represent and/or learn.
- We will consider both types of approaches.

Slide 8: Small vs. Huge MDPs
- We will first cover RL methods for small MDPs, i.e. MDPs where the number of states and actions is reasonably small. These algorithms will inspire more advanced methods.
- Later we will cover algorithms for huge MDPs:
  - Function Approximation Methods
  - Policy Gradient Methods
  - Least-Squares Policy Iteration

Slide 9: Example: Passive RL
- Suppose we are given a stationary policy (shown by arrows in the grid-world figure).
- Actions can stochastically lead to an unintended grid cell.
- We want to determine how good the policy is.

Slide 10: Objective: Value Function
[figure only in the original slide]

Slide 11: Passive RL
- Estimate V^π(s). We are given neither the transition matrix nor the reward function!
- Follow the policy for many epochs, giving training sequences such as:
    (1,1) -> (1,2) -> (1,3) -> (1,2) -> (1,3) -> (2,3) -> (3,3) -> (3,4)  +1
    (1,1) -> (1,2) -> (1,3) -> (2,3) -> (3,3) -> (3,2) -> (3,3) -> (3,4)  +1
    (1,1) -> (2,1) -> (3,1) -> (3,2) -> (4,2)  -1
- Assume that after entering the +1 or -1 state the agent enters a zero-reward terminal state, so we don't bother showing those transitions.

Slide 12: Approach 1: Direct Estimation
- Direct estimation (also called Monte Carlo): estimate V^π(s) as the average total reward of epochs containing s (calculated from s to the end of the epoch).
- The "reward to go" of a state s is the sum of the (discounted) rewards from that state until a terminal state is reached.
- Key idea: use the observed reward to go of a state as direct evidence of the actual expected utility of that state.
- Averaging the reward-to-go samples converges to the true value of the state. (A minimal implementation sketch appears at the end of this excerpt.)

Slide 13: Direct Estimation
- Converges very slowly to the correct utility values (requires a lot of sequences).
- Doesn't exploit the Bellman constraints on policy values:
    V^π(s) = R(s) + γ Σ_{s'} T(s, a, s') V^π(s'),  where a = π(s)
  Direct estimation is happy to consider value-function estimates that badly violate this property.
- How can we incorporate the Bellman constraints?

Slide 14: Approach 2: Adaptive Dynamic Programming (ADP)
- ADP is a model-based approach:
  - Follow the policy for a while.
  - Estimate the transition model based on observations.
  - Learn the reward function.
  - Use the estimated model to compute the utility of the policy, i.e. solve the Bellman equation above with the learned T and R.
- How can we estimate the transition model T(s,a,s')? Simply as the fraction of times we see s' after taking a in state s.
- NOTE: we can bound the error of these estimates with Chernoff bounds if we want. (A sketch of ADP also appears at the end of this excerpt.)

Slide 15: ADP Learning Curves
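The implementation sketch referenced on slide 12: a minimal Python version of direct (Monte Carlo) estimation. It assumes each training epoch is given as a list of (state, reward) pairs, with all non-terminal rewards taken as zero for illustration, since the slides only show the terminal +1/-1; the data format and function name are illustrative choices.

```python
from collections import defaultdict

def direct_estimation(epochs, gamma=1.0):
    """Estimate V(s) as the average observed reward-to-go over every
    occurrence of s in the training epochs (every-visit averaging)."""
    totals = defaultdict(float)  # sum of reward-to-go samples per state
    counts = defaultdict(int)    # number of samples per state
    for epoch in epochs:
        reward_to_go = 0.0
        # Walk the epoch backwards so reward-to-go accumulates in one pass.
        for state, reward in reversed(epoch):
            reward_to_go = reward + gamma * reward_to_go
            totals[state] += reward_to_go
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}

# The three training sequences from slide 11:
epochs = [
    [((1,1),0), ((1,2),0), ((1,3),0), ((1,2),0), ((1,3),0),
     ((2,3),0), ((3,3),0), ((3,4),1)],
    [((1,1),0), ((1,2),0), ((1,3),0), ((2,3),0), ((3,3),0),
     ((3,2),0), ((3,3),0), ((3,4),1)],
    [((1,1),0), ((2,1),0), ((3,1),0), ((3,2),0), ((4,2),-1)],
]
print(direct_estimation(epochs))  # e.g. V(1,1) = (1 + 1 - 1) / 3
```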
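And the sketch referenced on slide 14: a minimal model-based (ADP-style) passive learner. It estimates T(s,a,s') as the observed fraction of s' outcomes, averages observed rewards to get R(s), and then evaluates the fixed policy by repeatedly applying the Bellman equation over the learned model. The function names, the (s, a, r, s') observation format, and the simple fixed-point solver are assumptions for illustration, not prescribed by the slides.

```python
from collections import defaultdict

def estimate_model(observations):
    """observations: list of (s, a, r, s') tuples gathered while following
    the policy, where r is the reward received in state s.
    Returns learned T[(s, a)] = {s': probability} and R[s]."""
    counts = defaultdict(lambda: defaultdict(int))
    rewards = defaultdict(list)
    for s, a, r, s2 in observations:
        counts[(s, a)][s2] += 1
        rewards[s].append(r)
    T = {}
    for (s, a), outcomes in counts.items():
        n = sum(outcomes.values())
        T[(s, a)] = {s2: c / n for s2, c in outcomes.items()}  # empirical fractions
    R = {s: sum(rs) / len(rs) for s, rs in rewards.items()}
    return T, R

def evaluate_policy(policy, T, R, gamma=0.9, iterations=500):
    """Fixed-point iteration of V(s) <- R(s) + gamma * sum_s' T(s,pi(s),s') V(s')
    over the learned model; unvisited states default to value 0."""
    V = defaultdict(float)
    for _ in range(iterations):
        for s, a in policy.items():
            outcomes = T.get((s, a), {})
            V[s] = R.get(s, 0.0) + gamma * sum(p * V[s2] for s2, p in outcomes.items())
    return dict(V)
```

Because each learned T(s,a,s') entry is an empirical fraction, a Chernoff bound quantifies how far it can be from the true transition probability, as slide 14 notes.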