# OSU CS 533 (Intelligent Agents and Decision Making) - RL for Large State Spaces: Policy Gradient

*Alan Fern*

## RL via Policy Gradient Search

So far, all of our RL techniques have tried to learn an exact or approximate utility function or Q-function, i.e. the optimal value of being in a state or of taking an action from a state. But value functions can often be much more complex to represent than the corresponding policy. Do we really care about knowing that $Q(s, \text{left}) = 0.3554$ and $Q(s, \text{right}) = 0.533$, or just that *right* is better than *left* in state $s$? This motivates searching directly in a parameterized policy space: bypass learning a value function and directly optimize the value of the policy.

## Aside: Gradient Ascent

Given a function $f(\theta_1, \ldots, \theta_n)$ of $n$ real values $\theta = (\theta_1, \ldots, \theta_n)$, suppose we want to maximize $f$ with respect to $\theta$. A common approach is gradient ascent. The gradient of $f$ at a point $\theta$, denoted $\nabla_\theta f(\theta)$, is the $n$-dimensional vector that points in the direction in which $f$ increases most steeply at $\theta$. Vector calculus tells us that $\nabla_\theta f(\theta)$ is just the vector of partial derivatives:

$$\nabla_\theta f(\theta) = \left[ \frac{\partial f(\theta)}{\partial \theta_1}, \ldots, \frac{\partial f(\theta)}{\partial \theta_n} \right]$$

where

$$\frac{\partial f(\theta)}{\partial \theta_i} = \lim_{\epsilon \to 0} \frac{f(\theta_1, \ldots, \theta_i + \epsilon, \ldots, \theta_n) - f(\theta_1, \ldots, \theta_n)}{\epsilon}$$

Gradient ascent iteratively follows the gradient direction, starting at some initial point:

- Initialize $\theta$ to a random value.
- Repeat until a stopping condition holds: $\theta \leftarrow \theta + \alpha \nabla_\theta f(\theta)$

With proper decay of the learning rate $\alpha$, gradient ascent is guaranteed to converge to a local optimum of $f$.

*[Figure: local optima of $f$ over the $(\theta_1, \theta_2)$ plane.]*

## RL via Policy Gradient Ascent

The policy gradient approach has the following schema:

1. Select a space of parameterized policies $\pi_\theta$.
2. Compute the gradient of the value of the current policy with respect to the parameters $\theta$.
3. Move the parameters in the direction of the gradient.
4. Repeat these steps until we reach a local maximum.
5. Possibly also add in tricks for dealing with bad local maxima, e.g. random restarts.

So we must answer the following questions: how should we represent parameterized policies, and how can we compute the gradient?

## Parameterized Policies

One example of a space of parametric policies is

$$\pi_\theta(s) = \arg\max_a Q_\theta(s, a)$$

where $Q_\theta(s, a)$ may be a linear function, e.g.

$$Q_\theta(s, a) = \theta_0 + \theta_1 f_1(s, a) + \theta_2 f_2(s, a) + \cdots + \theta_n f_n(s, a)$$

The goal is …
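The gradient ascent loop described above can be sketched in a few lines of plain Python on a toy concave objective; the objective function, step size, and iteration count below are illustrative choices, not anything from the slides.

```python
# Gradient ascent sketch on a toy concave objective (an assumption for
# illustration): f(theta) = -(theta[0] - 2)^2 - (theta[1] + 1)^2,
# whose unique maximum is at theta = (2, -1).

def grad_f(theta):
    """Analytic gradient of the toy objective."""
    return [-2.0 * (theta[0] - 2.0), -2.0 * (theta[1] + 1.0)]

def gradient_ascent(theta, alpha=0.1, steps=200):
    """Repeat: theta <- theta + alpha * grad f(theta)."""
    for _ in range(steps):
        g = grad_f(theta)
        theta = [t + alpha * gi for t, gi in zip(theta, g)]
    return theta

theta = gradient_ascent([0.0, 0.0])
```

With a fixed step size the iterates happen to contract toward the maximizer geometrically on this objective; in general a decaying $\alpha$ is what gives the local-convergence guarantee mentioned above.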

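The five-step policy-gradient-ascent schema can also be sketched end to end if we treat the policy value $V(\theta)$ as a black box and estimate its gradient by finite differences, one simple (if sample-hungry) way to realize step 2 when no analytic gradient is available. The `policy_value` function below is a made-up concave stand-in, not a real return estimate from any environment.

```python
# Hedged sketch of the policy-gradient-ascent schema.  policy_value is
# a toy stand-in (an assumption) for "expected return of pi_theta",
# concave with maximum at theta = (1.0, -0.5).

def policy_value(theta):
    """Stand-in for V(theta); in real RL this would be estimated from rollouts."""
    return -(theta[0] - 1.0) ** 2 - (theta[1] + 0.5) ** 2

def finite_diff_grad(v, theta, eps=1e-4):
    """Forward finite-difference estimate of the gradient of v at theta (step 2)."""
    grad = []
    for i in range(len(theta)):
        bumped = list(theta)
        bumped[i] += eps
        grad.append((v(bumped) - v(theta)) / eps)
    return grad

def policy_gradient_ascent(theta, alpha=0.1, steps=300):
    """Steps 3-4: move theta along the estimated gradient, repeat."""
    for _ in range(steps):
        g = finite_diff_grad(policy_value, theta)
        theta = [t + alpha * gi for t, gi in zip(theta, g)]
    return theta

theta = policy_gradient_ascent([0.0, 0.0])
```

Step 5 (random restarts) would simply wrap this loop in an outer loop over several random initial $\theta$ and keep the best result.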
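A minimal sketch of the parameterized policy $\pi_\theta(s) = \arg\max_a Q_\theta(s, a)$ with a linear $Q_\theta$; the feature function and the two-action domain (a 1-D position with actions left and right) are invented for illustration.

```python
# Sketch of pi_theta(s) = argmax_a Q_theta(s, a), where
# Q_theta(s, a) = sum_i theta_i * f_i(s, a) is linear in theta.
# The features and action set below are hypothetical.

def features(s, a):
    """Invented features: s is a 1-D position, a is -1 (left) or +1 (right)."""
    return [1.0, s * a, 1.0 if a > 0 else 0.0]

def q_value(theta, s, a):
    """Linear Q: dot product of parameters with features."""
    return sum(t * f for t, f in zip(theta, features(s, a)))

def policy(theta, s, actions=(-1, 1)):
    """Greedy parameterized policy: pick the action with the highest Q."""
    return max(actions, key=lambda a: q_value(theta, s, a))
```

Note the point from the opening section: the policy depends only on which action's Q-value is larger, never on the exact values, so many different $\theta$ induce identical behavior.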