OSU CS 533 - RL for Large State Spaces: Policy Gradient


Contents: RL via Policy Gradient Search; Aside: Gradient Ascent; RL via Policy Gradient Ascent; Parameterized Policies; Policy Gradient Ascent; Gradient Estimation; Example: Discontinuous ρ(θ); Probabilistic Policies; Empirical Gradient Estimation; Likelihood Ratio Gradient Estimation; General Likelihood Ratio Gradient Estimate; Application to Policy Gradient; Basic Policy Gradient Algorithm; Toward Online Algorithm; Online Policy Gradient (OLPOMDP); Interpretation; Computing the Gradient of Policy; Controlling Helicopters; Quadruped Locomotion; Proactive Security; Policy Gradient Recap

Slide 1: RL for Large State Spaces: Policy Gradient (Alan Fern)

Slide 2: RL via Policy Gradient Search
- So far all of our RL techniques have tried to learn an exact or approximate utility function or Q-function, i.e. the optimal "value" of being in a state or of taking an action from a state.
- Value functions can often be much more complex to represent than the corresponding policy. Do we really care about knowing Q(s, left) = 0.3554 and Q(s, right) = 0.533, or just that "right is better than left in state s"?
- This motivates searching directly in a parameterized policy space: bypass learning a value function and "directly" optimize the value of a policy.

Slide 3: Aside: Gradient Ascent
- Given a function f(θ1, ..., θn) of n real values θ = (θ1, ..., θn), suppose we want to maximize f with respect to θ.
- A common approach is gradient ascent.
- The gradient of f at a point θ, denoted ∇_θ f(θ), is the n-dimensional vector that points in the direction in which f increases most steeply at θ.
- Vector calculus tells us that ∇_θ f(θ) is just the vector of partial derivatives:

      \nabla_\theta f(\theta) = \left( \frac{\partial f(\theta)}{\partial \theta_1}, \ldots, \frac{\partial f(\theta)}{\partial \theta_n} \right),
      \qquad
      \frac{\partial f(\theta)}{\partial \theta_i} = \lim_{\Delta \to 0} \frac{f(\theta_1, \ldots, \theta_{i-1}, \theta_i + \Delta, \theta_{i+1}, \ldots, \theta_n) - f(\theta)}{\Delta}

Slide 4: Aside: Gradient Ascent
- Gradient ascent iteratively follows the gradient direction starting from some initial point:
  - Initialize θ to a random value.
  - Repeat until a stopping condition holds:  \theta \leftarrow \theta + \alpha \nabla_\theta f(\theta)
  (a short code sketch of this loop follows below)
- [Figure: gradient ascent steps over the (θ1, θ2) plane climbing toward a local optimum of f.]
- With proper decay of the learning rate α, gradient ascent is guaranteed to converge to a local optimum.
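As a concrete illustration of the update rule on Slide 4, here is a minimal gradient ascent sketch in Python. It is not part of the slides: the example function f, its gradient, the step size alpha, and the number of steps are all assumptions chosen for illustration.

    import numpy as np

    # Hypothetical example: f(theta) = -(theta_1 - 1)^2 - (theta_2 + 2)^2,
    # a concave function with a single maximum at theta = (1, -2).
    def grad_f(theta):
        return np.array([-2.0 * (theta[0] - 1.0), -2.0 * (theta[1] + 2.0)])

    def gradient_ascent(grad_f, theta0, alpha=0.1, num_steps=500):
        """Repeatedly apply the update theta <- theta + alpha * grad_f(theta)."""
        theta = np.array(theta0, dtype=float)
        for _ in range(num_steps):
            theta = theta + alpha * grad_f(theta)
        return theta

    print(gradient_ascent(grad_f, theta0=[0.0, 0.0]))  # approaches (1, -2)

For a concave f like this one the iterates approach the global maximum; in general, as the slides note, gradient ascent only reaches a local optimum of f.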
Slide 5: RL via Policy Gradient Ascent
- The policy gradient approach has the following schema:
  1. Select a space of parameterized policies.
  2. Compute the gradient of the value of the current policy with respect to the parameters.
  3. Move the parameters in the direction of the gradient.
  4. Repeat these steps until we reach a local maximum.
  5. Possibly also add in tricks for dealing with bad local maxima (e.g. random restarts).
- So we must answer the following questions: How should we represent parameterized policies? How can we compute the gradient?

Slide 6: Parameterized Policies
- One example of a space of parametric policies is

      \pi_\theta(s) = \arg\max_a \hat{Q}_\theta(s, a),

  where \hat{Q}_\theta(s, a) may be a linear function, e.g.

      \hat{Q}_\theta(s, a) = \theta_0 + \theta_1 f_1(s, a) + \theta_2 f_2(s, a) + \cdots + \theta_n f_n(s, a)

- The goal is to learn parameters θ that give a good policy.
- Note that it is not important that Q̂_θ(s, a) be close to the actual Q-function; we only require that Q̂_θ(s, a) is good at ranking actions in order of goodness.

Slide 7: Policy Gradient Ascent
- For simplicity we will make the following assumptions: each run/trajectory of a policy starts from a fixed initial state, and each run/trajectory always reaches a terminal state in a finite number of steps.
- Let ρ(θ) be the expected value of policy π_θ at the initial state; ρ(θ) is just the expected discounted total reward of a trajectory of π_θ.
- Our objective is to find a θ that maximizes ρ(θ).

Slide 8: Policy Gradient Ascent
- Policy gradient ascent tells us to iteratively update the parameters via

      \theta \leftarrow \theta + \alpha \nabla_\theta \rho(\theta)

- Problem: ρ(θ) is generally very complex, and it is rare that we can compute a closed form for its gradient, even if we have an exact model of the system.
- Key idea: estimate the gradient based on experience.

Slide 9: Gradient Estimation
- Concern: computing or estimating the gradient of a discontinuous function can be problematic.
- For our example parametric policy π_θ(s) = argmax_a Q̂_θ(s, a), is ρ(θ) continuous? No.
- There are values of θ where arbitrarily small changes cause the policy to change.
- Since different policies can have different values, changing θ can cause a discontinuous jump in ρ(θ).

Slide 10: Example: Discontinuous ρ(θ)
- Consider a problem with initial state s and two actions a1 and a2: a1 leads to a very large terminal reward R1, and a2 leads to a very small terminal reward R2.
- Fixing θ2 to a constant, we can plot the ranking assigned to each action by Q̂_θ and the corresponding value ρ(θ).
- [Figure: Q̂_θ(s, a1) = θ1 f(s, a1) and Q̂_θ(s, a2) plotted against θ1, together with ρ(θ), which jumps between R1 and R2; there is a discontinuity in ρ(θ) where the ordering of a1 and a2 changes.]

Slide 11: Probabilistic Policies
- We would like to avoid policies that change drastically with small parameter changes, leading to discontinuities.
- A probabilistic policy π takes a state as input and returns a distribution over actions: given a state s, π(s, a) is the probability that π selects action a in s.
- ρ(θ) is still well defined for probabilistic policies; now the uncertainty over trajectories comes from both the environment and the policy.
- Importantly, if π_θ(s, a) is continuous with respect to θ, then ρ(θ) is also continuous with respect to θ.
- A common form for probabilistic policies is the softmax (Boltzmann exploration) function (a short implementation sketch appears after these slides):

      \pi_\theta(s, a) = \Pr(a \mid s) = \frac{\exp\big(\hat{Q}_\theta(s, a)\big)}{\sum_{a' \in A} \exp\big(\hat{Q}_\theta(s, a')\big)}

Slide 12: Empirical Gradient Estimation
- Our first approach to estimating ∇_θ ρ(θ) is to simply compute empirical gradient estimates (see the finite-difference sketch after these slides).
- Recall that θ = (θ1, ..., θn), so we can compute the gradient by empirically estimating each partial derivative:

      \frac{\partial \rho(\theta)}{\partial \theta_i} = \lim_{\Delta \to 0} \frac{\rho(\theta_1, \ldots, \theta_{i-1}, \theta_i + \Delta, \theta_{i+1}, \ldots, \theta_n) - \rho(\theta)}{\Delta}

- So for small Δ we can estimate each partial derivative by

      \frac{\rho(\theta_1, \ldots, \theta_{i-1}, \theta_i + \Delta, \theta_{i+1}, \ldots, \theta_n) - \rho(\theta)}{\Delta}

- This requires estimating n + 1 values: ρ(θ) and ρ(θ1, ..., θ_{i-1}, θi + Δ, θ_{i+1}, ..., θn) for i = 1, ..., n.

Slide 13: Empirical Gradient Estimation
- How do we estimate these quantities? For each set of parameters, simply execute the policy for N trials/episodes and …
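The softmax policy of Slide 11 is easy to implement on top of the linear Q̂_θ of Slide 6 (the constant term θ0 cancels in the softmax, so it is omitted here). Below is a minimal Python sketch, not taken from the slides; the function name softmax_policy, the feature layout (one row of f(s, a) per available action), and the random example values are assumptions made for illustration.

    import numpy as np

    def softmax_policy(theta, action_features):
        """Boltzmann/softmax policy: Pr(a | s) proportional to exp(Q_hat(s, a)).

        action_features[a] holds the feature vector f(s, a) for action a in the
        current state, so Q_hat(s, a) is the linear score action_features[a] @ theta.
        """
        scores = action_features @ theta      # Q_hat(s, a) for every available action
        scores = scores - scores.max()        # shift scores for numerical stability
        exp_scores = np.exp(scores)
        return exp_scores / exp_scores.sum()  # normalize into a distribution

    # Hypothetical usage: 3 actions, 4 features per (state, action) pair.
    rng = np.random.default_rng(0)
    theta = rng.normal(size=4)
    action_features = rng.normal(size=(3, 4))   # rows are f(s, a1), f(s, a2), f(s, a3)
    probs = softmax_policy(theta, action_features)
    action = rng.choice(len(probs), p=probs)    # sample a ~ pi_theta(s, .)
    print(probs, action)

Because the action probabilities vary smoothly with θ, small parameter changes produce small changes in behavior, which is exactly the continuity property Slide 11 asks for.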

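Slides 12-13 estimate ∇_θ ρ(θ) by finite differences, with each ρ value estimated by averaging returns over repeated episodes. The following is a minimal sketch of that estimator, again not from the slides: the helper names (estimate_rho, empirical_gradient, run_episode), the one-step two-action stand-in environment with rewards R1 = 10 and R2 = 1, and the hyperparameters (Δ, N, step size, iteration count) are all illustrative assumptions.

    import numpy as np

    def estimate_rho(theta, run_episode, n_episodes=200):
        """Monte Carlo estimate of rho(theta): average return over n_episodes trials."""
        return np.mean([run_episode(theta) for _ in range(n_episodes)])

    def empirical_gradient(theta, run_episode, delta=0.5, n_episodes=200):
        """Finite-difference estimate of grad rho(theta).

        Uses the n + 1 value estimates from Slide 12: rho(theta) plus
        rho(theta with theta_i + delta) for each i = 1, ..., n.
        """
        base = estimate_rho(theta, run_episode, n_episodes)
        grad = np.zeros_like(theta)
        for i in range(len(theta)):
            perturbed = theta.copy()
            perturbed[i] += delta
            grad[i] = (estimate_rho(perturbed, run_episode, n_episodes) - base) / delta
        return grad

    # Hypothetical stand-in for the two-action example of Slide 10: a single step,
    # a1 gives R1 = 10, a2 gives R2 = 1, and the action is drawn from a softmax
    # policy with indicator features, so Q_hat(s, a_i) = theta_i.
    rng = np.random.default_rng(0)
    REWARDS = np.array([10.0, 1.0])

    def run_episode(theta):
        probs = np.exp(theta - theta.max())
        probs /= probs.sum()
        return REWARDS[rng.choice(len(REWARDS), p=probs)]

    theta = np.zeros(2)
    for _ in range(50):                                   # policy gradient ascent loop
        theta = theta + 0.1 * empirical_gradient(theta, run_episode)
    print(theta)  # theta[0] should end up well above theta[1], favoring the R1 action

Note the cost: every gradient estimate needs n + 1 separate policy evaluations, each built from N episodes, and the estimate is noisy. The likelihood-ratio methods listed later in the outline avoid this by estimating the gradient from trajectories of the current policy alone.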
