Slide 1: RL for Large State Spaces: Value Function Approximation
Alan Fern
(Based in part on slides by Daniel Weld)

Slide 2: Large State Spaces
- When a problem has a large state space, we can no longer represent the V or Q functions as explicit tables.
- Even if we had enough memory:
  - Never enough training data!
  - Learning takes too long.
- What to do??

Slide 3: Function Approximation
- Never enough training data! We must generalize what is learned from one situation to other "similar" new situations.
- Idea: instead of using a large table to represent V or Q, use a parameterized function.
  - The number of parameters should be small compared to the number of states (generally exponentially fewer parameters).
  - Learn the parameters from experience.
  - When we update the parameters based on observations in one state, our V or Q estimate also changes for other similar states. That is, the parameterization facilitates generalization of experience.

Slide 4: Linear Function Approximation
- Define a set of state features f₁(s), ..., fₙ(s).
  - The features are used as our representation of states.
  - States with similar feature values will be considered to be similar.
- A common approximation is to represent V(s) as a weighted sum of the features (i.e. a linear approximation):

    V̂(s) = θ₀ + θ₁ f₁(s) + θ₂ f₂(s) + ... + θₙ fₙ(s)

- The approximation accuracy is fundamentally limited by the information provided by the features.
- Can we always define features that allow for a perfect linear approximation?
  - Yes: assign each state an indicator feature. (That is, the i'th feature is 1 iff the i'th state is the current state, and θᵢ represents the value of the i'th state.)
  - Of course, this requires far too many features and gives no generalization.

Slide 5: Example
- Grid with no obstacles, deterministic actions U/D/L/R, no discounting, reward -1 everywhere except +10 at the goal.
- Features for state s = (x, y): f₁(s) = x, f₂(s) = y (just 2 features).
- V̂(s) = θ₀ + θ₁ x + θ₂ y
- Is there a good linear approximation?
  - Yes: θ₀ = 10, θ₁ = -1, θ₂ = -1 (note: the upper right is the origin).
  - V̂(s) = 10 - x - y subtracts the Manhattan distance to the goal from the goal reward of 10.
[Figure: grid world with the +10 goal at the origin in the upper-right corner]

Slide 6: But What If We Change the Reward...
- V̂(s) = θ₀ + θ₁ x + θ₂ y
- Is there a good linear approximation?
  - No.
[Figure: the same grid, but with the +10 goal moved to the interior]

Slide 7: But What If...
- V̂(s) = θ₀ + θ₁ x + θ₂ y + θ₃ z
- Include a new feature z = |3 - x| + |3 - y|: z is the distance to the goal location.
- Does this allow a good linear approximation?
  - Yes: θ₀ = 10, θ₁ = θ₂ = 0, θ₃ = -1.
[Figure: the grid with the goal at (3, 3)]

Slide 8: Linear Function Approximation
- Define a set of features f₁(s), ..., fₙ(s).
  - The features are used as our representation of states.
  - States with similar feature values will be treated similarly.
  - More complex functions require more complex features.

    V̂(s) = θ₀ + θ₁ f₁(s) + θ₂ f₂(s) + ... + θₙ fₙ(s)

- Our goal is to learn good parameter values (i.e. feature weights) that approximate the value function well.
  - How can we do this?
  - Use TD-based RL and somehow update the parameters based on each experience.

Slide 9: TD-based RL for Linear Approximators
1. Start with initial parameter values.
2. Take an action according to an explore/exploit policy (should converge to the greedy policy, i.e. GLIE).
3. Update the estimated model (if a model is not available).
4. Perform the TD update for each parameter.
5. Go to 2.

What is a "TD update" for a parameter θᵢ?
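Before the gradient-descent aside below derives that update, here is a minimal sketch of the linear approximator itself in Python, using the grid example's features. The function names and the (3, 3) goal location are illustrative assumptions; only the weighted-sum formula comes from the slides.

```python
# Minimal sketch of the linear approximator V_hat(s) = theta . f(s),
# using the grid example's features. Names and the (3, 3) goal are
# illustrative; only the formula itself comes from the slides.

def features(s):
    """Feature vector f(s) for state s = (x, y); the leading 1.0
    makes theta[0] play the role of the bias weight theta_0."""
    x, y = s
    z = abs(3 - x) + abs(3 - y)   # distance-to-goal feature from slide 7
    return [1.0, x, y, z]

def v_hat(theta, s):
    """V_hat(s) = theta_0 + theta_1*f1(s) + ... + theta_n*fn(s)."""
    return sum(t * f for t, f in zip(theta, features(s)))

theta = [10.0, 0.0, 0.0, -1.0]   # the exact weights from slide 7
print(v_hat(theta, (0, 0)))      # 10 - (|3-0| + |3-0|) = 4.0
```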
Slide 10: Aside: Gradient Descent
- Given a function f(θ₁, ..., θₙ) of n real values Θ = (θ₁, ..., θₙ), suppose we want to minimize f with respect to Θ.
- A common approach to doing this is gradient descent.
- The gradient of f at point Θ, denoted ∇f(Θ), is an n-dimensional vector that points in the direction in which f increases most steeply at Θ.
- Vector calculus tells us that ∇f(Θ) is just the vector of partial derivatives:

    ∇f(Θ) = ( ∂f(Θ)/∂θ₁, ..., ∂f(Θ)/∂θₙ )

  where ∂f(Θ)/∂θᵢ = lim (ε→0) [ f(θ₁, ..., θᵢ + ε, ..., θₙ) - f(θ₁, ..., θᵢ, ..., θₙ) ] / ε

- We can decrease f by moving in the negative gradient direction.

Slide 11: Aside: Gradient Descent for Squared Error
- Suppose we have a sequence of states and target values for each state: <s₁, v(s₁)>, <s₂, v(s₂)>, ...
  - E.g., produced by the TD-based RL loop.
- Our goal is to minimize the sum of squared errors between our estimated function and each target value. The squared error of example j is

    Eⱼ(Θ) = 1/2 ( V̂(sⱼ) - v(sⱼ) )²

  where V̂(sⱼ) is our estimated value for the j'th state and v(sⱼ) is the target value for the j'th state.
- After seeing the j'th state, the gradient descent rule tells us that we can decrease the error by updating the parameters by

    θᵢ ← θᵢ - α ∂Eⱼ(Θ)/∂θᵢ

  where α is the learning rate.

Slide 12: Aside: Continued
- By the chain rule,

    ∂Eⱼ(Θ)/∂θᵢ = ( V̂(sⱼ) - v(sⱼ) ) ∂V̂(sⱼ)/∂θᵢ

  where the second factor depends on the form of the approximator.
- For a linear approximation function V̂(s) = θ₀ + θ₁ f₁(s) + ... + θₙ fₙ(s), we have ∂V̂(sⱼ)/∂θᵢ = fᵢ(sⱼ).
- Thus the update becomes:

    θᵢ ← θᵢ + α ( v(sⱼ) - V̂(sⱼ) ) fᵢ(sⱼ)

- For linear functions this update is guaranteed to converge to the best approximation for a suitable learning rate schedule.

Slide 13: TD-based RL for Linear Approximators
1. Start with initial parameter values.
2. Take an action according to an explore/exploit policy (should converge to the greedy policy, i.e. GLIE). Transition from s to s'.
3. Update the estimated model.
4. Perform the TD update for each parameter: θᵢ ← θᵢ + α ( v(s) - V̂(s) ) fᵢ(s).
5. Go to 2.

What should we use for the "target value" v(s)?
- Use the TD prediction based on the next state s': v(s) = R(s) + β V̂(s').
- This is the same as the previous TD method, only with approximation.

Slide 14: TD-based RL for Linear Approximators
1. Start with initial parameter values.
2. Take an action according to an explore/exploit policy (should converge to the greedy policy, i.e. GLIE).
3. Update the estimated model.
4. Perform the TD update for each parameter: θᵢ ← θᵢ + α ( R(s) + β V̂(s') - V̂(s) ) fᵢ(s).
5. Go to 2.

- Step 2 requires a model to select the greedy action.
- For some applications (e.g. Backgammon, as we will see later) it is easy to get a model (but not easy to get a policy).
- For others it is difficult to get a good model.

Slide 15: Q-function Approximation
- Define a set of features over state-action pairs: f₁(s, a), ..., fₙ(s, a).
  - State-action pairs with similar feature values will be treated similarly.
  - More complex functions require more complex features.
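The slides cut off here, before giving the Q update itself. As a forward-looking sketch, the per-parameter update from slide 14 carries over directly to a linear Q approximator. Everything below that is not on the slides is an assumption for illustration: the environment interface (reset/step/actions), the feature map, and ε-greedy standing in for a full GLIE policy.

```python
import random

# Sketch: Q-learning with a linear Q approximator, combining the
# per-parameter TD update from slide 14 with the state-action
# features of slide 15. The env interface and feature map are
# assumptions for illustration, not from the slides.

def q_hat(theta, feats):
    """Q_hat(s, a) = sum_i theta_i * f_i(s, a)."""
    return sum(t * f for t, f in zip(theta, feats))

def q_learning(env, features, n_feats, episodes=1000,
               alpha=0.05, beta=0.95, epsilon=0.1):
    theta = [0.0] * n_feats
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Explore/exploit (epsilon-greedy stands in for GLIE here).
            if random.random() < epsilon:
                a = random.choice(env.actions(s))
            else:
                a = max(env.actions(s),
                        key=lambda act: q_hat(theta, features(s, act)))
            s2, r, done = env.step(a)
            # TD target: r + beta * max_a' Q_hat(s', a'), or just r at the end.
            target = r if done else r + beta * max(
                q_hat(theta, features(s2, a2)) for a2 in env.actions(s2))
            # Slide-14-style update, per parameter:
            # theta_i += alpha * (target - Q_hat(s, a)) * f_i(s, a)
            f = features(s, a)
            delta = target - q_hat(theta, f)
            for i in range(n_feats):
                theta[i] += alpha * delta * f[i]
            s = s2
    return theta
```

Note that, unlike the supervised setting on slide 12, this bootstrapped variant is not guaranteed to converge in general; the sketch only illustrates the shape of the update.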