# OSU CS 533 (Intelligent Agents and Decision Making) - RL for Large State Spaces: Policy Gradient

*Alan Fern*

## RL via Policy Gradient Search

So far, all of our RL techniques have tried to learn an exact or approximate utility function or Q-function, i.e. the optimal value of being in a state or of taking an action from a state. But value functions can often be much more complex to represent than the corresponding policy. Do we really care about knowing that $Q(s, \text{left}) = 0.3554$ and $Q(s, \text{right}) = 0.533$, or just that *right* is better than *left* in state $s$? This motivates searching directly in a parameterized policy space: bypass learning a value function and directly optimize the value of the policy.

## Aside: Gradient Ascent

Given a function $f(\theta_1, \ldots, \theta_n)$ of $n$ real values $\theta = (\theta_1, \ldots, \theta_n)$, suppose we want to maximize $f$ with respect to $\theta$. A common approach is gradient ascent. The gradient of $f$ at a point $\theta$, denoted $\nabla_\theta f(\theta)$, is the $n$-dimensional vector that points in the direction in which $f$ increases most steeply at $\theta$. Vector calculus tells us that $\nabla_\theta f(\theta)$ is just the vector of partial derivatives:

$$\nabla_\theta f(\theta) = \left[ \frac{\partial f(\theta)}{\partial \theta_1}, \ldots, \frac{\partial f(\theta)}{\partial \theta_n} \right]$$

where

$$\frac{\partial f(\theta)}{\partial \theta_i} = \lim_{\epsilon \to 0} \frac{f(\theta_1, \ldots, \theta_i + \epsilon, \ldots, \theta_n) - f(\theta_1, \ldots, \theta_n)}{\epsilon}$$

Gradient ascent iteratively follows the gradient direction, starting at some initial point:

- Initialize $\theta$ to a random value.
- Repeat until a stopping condition holds: $\theta \leftarrow \theta + \alpha \nabla_\theta f(\theta)$

With proper decay of the learning rate $\alpha$, gradient ascent is guaranteed to converge to a local optimum of $f$.

*[Figure: local optima of $f$ over the $(\theta_1, \theta_2)$ plane.]*

## RL via Policy Gradient Ascent

The policy gradient approach has the following schema:

1. Select a space of parameterized policies $\pi_\theta$.
2. Compute the gradient of the value of the current policy with respect to the parameters $\theta$.
3. Move the parameters in the direction of the gradient.
4. Repeat these steps until we reach a local maximum.
5. Possibly also add in tricks for dealing with bad local maxima, e.g. random restarts.

So we must answer the following questions: how should we represent parameterized policies, and how can we compute the gradient?

## Parameterized Policies

One example of a space of parametric policies is

$$\pi_\theta(s) = \arg\max_a Q_\theta(s, a)$$

where $Q_\theta(s, a)$ may be a linear function, e.g.

$$Q_\theta(s, a) = \theta_0 + \theta_1 f_1(s, a) + \theta_2 f_2(s, a) + \cdots + \theta_n f_n(s, a)$$

The goal is …
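The gradient ascent loop described above can be sketched in a few lines of plain Python on a toy concave objective; the objective function, step size, and iteration count below are illustrative choices, not anything from the slides.

```python
# Gradient ascent sketch on a toy concave objective (an assumption for
# illustration): f(theta) = -(theta[0] - 2)^2 - (theta[1] + 1)^2,
# whose unique maximum is at theta = (2, -1).

def grad_f(theta):
    """Analytic gradient of the toy objective."""
    return [-2.0 * (theta[0] - 2.0), -2.0 * (theta[1] + 1.0)]

def gradient_ascent(theta, alpha=0.1, steps=200):
    """Repeat: theta <- theta + alpha * grad f(theta)."""
    for _ in range(steps):
        g = grad_f(theta)
        theta = [t + alpha * gi for t, gi in zip(theta, g)]
    return theta

theta = gradient_ascent([0.0, 0.0])
```

With a fixed step size the iterates happen to contract toward the maximizer geometrically on this objective; in general a decaying $\alpha$ is what gives the local-convergence guarantee mentioned above.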

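The five-step policy-gradient-ascent schema can also be sketched end to end if we treat the policy value $V(\theta)$ as a black box and estimate its gradient by finite differences, one simple (if sample-hungry) way to realize step 2 when no analytic gradient is available. The `policy_value` function below is a made-up concave stand-in, not a real return estimate from any environment.

```python
# Hedged sketch of the policy-gradient-ascent schema.  policy_value is
# a toy stand-in (an assumption) for "expected return of pi_theta",
# concave with maximum at theta = (1.0, -0.5).

def policy_value(theta):
    """Stand-in for V(theta); in real RL this would be estimated from rollouts."""
    return -(theta[0] - 1.0) ** 2 - (theta[1] + 0.5) ** 2

def finite_diff_grad(v, theta, eps=1e-4):
    """Forward finite-difference estimate of the gradient of v at theta (step 2)."""
    grad = []
    for i in range(len(theta)):
        bumped = list(theta)
        bumped[i] += eps
        grad.append((v(bumped) - v(theta)) / eps)
    return grad

def policy_gradient_ascent(theta, alpha=0.1, steps=300):
    """Steps 3-4: move theta along the estimated gradient, repeat."""
    for _ in range(steps):
        g = finite_diff_grad(policy_value, theta)
        theta = [t + alpha * gi for t, gi in zip(theta, g)]
    return theta

theta = policy_gradient_ascent([0.0, 0.0])
```

Step 5 (random restarts) would simply wrap this loop in an outer loop over several random initial $\theta$ and keep the best result.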
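A minimal sketch of the parameterized policy $\pi_\theta(s) = \arg\max_a Q_\theta(s, a)$ with a linear $Q_\theta$; the feature function and the two-action domain (a 1-D position with actions left and right) are invented for illustration.

```python
# Sketch of pi_theta(s) = argmax_a Q_theta(s, a), where
# Q_theta(s, a) = sum_i theta_i * f_i(s, a) is linear in theta.
# The features and action set below are hypothetical.

def features(s, a):
    """Invented features: s is a 1-D position, a is -1 (left) or +1 (right)."""
    return [1.0, s * a, 1.0 if a > 0 else 0.0]

def q_value(theta, s, a):
    """Linear Q: dot product of parameters with features."""
    return sum(t * f for t, f in zip(theta, features(s, a)))

def policy(theta, s, actions=(-1, 1)):
    """Greedy parameterized policy: pick the action with the highest Q."""
    return max(actions, key=lambda a: q_value(theta, s, a))
```

Note the point from the opening section: the policy depends only on which action's Q-value is larger, never on the exact values, so many different $\theta$ induce identical behavior.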