Slide 1: RL for Large State Spaces: Value Function Approximation
Alan Fern
(Based in part on slides by Daniel Weld)

Slide 2: Large State Spaces
- When a problem has a large state space, we can no longer represent the V or Q functions as explicit tables.
- Even if we had enough memory:
  - Never enough training data!
  - Learning takes too long.
- What to do??

Slide 3: Function Approximation
- Never enough training data! We must generalize what is learned from one situation to other "similar" new situations.
- Idea: instead of using a large table to represent V or Q, use a parameterized function.
  - The number of parameters should be small compared to the number of states (generally exponentially fewer parameters).
  - Learn the parameters from experience.
  - When we update the parameters based on observations in one state, our V or Q estimate also changes for other similar states. That is, the parameterization facilitates generalization of experience.

Slide 4: Linear Function Approximation
- Define a set of state features f₁(s), ..., fₙ(s).
  - The features are used as our representation of states.
  - States with similar feature values will be considered to be similar.
- A common approximation is to represent V(s) as a weighted sum of the features (i.e. a linear approximation):

    V̂(s) = θ₀ + θ₁ f₁(s) + θ₂ f₂(s) + ... + θₙ fₙ(s)

- The approximation accuracy is fundamentally limited by the information provided by the features.
- Can we always define features that allow for a perfect linear approximation?
  - Yes: assign each state an indicator feature. (That is, the i'th feature is 1 iff the i'th state is the current state, and θᵢ represents the value of the i'th state.)
  - Of course, this requires far too many features and gives no generalization.

Slide 5: Example
- Grid with no obstacles, deterministic actions U/D/L/R, no discounting, reward -1 everywhere except +10 at the goal.
- Features for state s = (x, y): f₁(s) = x, f₂(s) = y (just 2 features).
- V̂(s) = θ₀ + θ₁ x + θ₂ y
- Is there a good linear approximation?
  - Yes: θ₀ = 10, θ₁ = -1, θ₂ = -1 (note: the upper right is the origin).
  - V̂(s) = 10 - x - y subtracts the Manhattan distance to the goal from the goal reward of 10.
[Figure: grid world with the +10 goal at the origin in the upper-right corner]

Slide 6: But What If We Change the Reward...
- V̂(s) = θ₀ + θ₁ x + θ₂ y
- Is there a good linear approximation?
  - No.
[Figure: the same grid, but with the +10 goal moved to the interior]

Slide 7: But What If...
- V̂(s) = θ₀ + θ₁ x + θ₂ y + θ₃ z
- Include a new feature z = |3 - x| + |3 - y|: z is the distance to the goal location.
- Does this allow a good linear approximation?
  - Yes: θ₀ = 10, θ₁ = θ₂ = 0, θ₃ = -1.
[Figure: the grid with the goal at (3, 3)]

Slide 8: Linear Function Approximation
- Define a set of features f₁(s), ..., fₙ(s).
  - The features are used as our representation of states.
  - States with similar feature values will be treated similarly.
  - More complex functions require more complex features.

    V̂(s) = θ₀ + θ₁ f₁(s) + θ₂ f₂(s) + ... + θₙ fₙ(s)

- Our goal is to learn good parameter values (i.e. feature weights) that approximate the value function well.
  - How can we do this?
  - Use TD-based RL and somehow update the parameters based on each experience.

Slide 9: TD-based RL for Linear Approximators
1. Start with initial parameter values.
2. Take an action according to an explore/exploit policy (should converge to the greedy policy, i.e. GLIE).
3. Update the estimated model (if a model is not available).
4. Perform the TD update for each parameter.
5. Go to 2.

What is a "TD update" for a parameter θᵢ?
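Before the gradient-descent aside below derives that update, here is a minimal sketch of the linear approximator itself in Python, using the grid example's features. The function names and the (3, 3) goal location are illustrative assumptions; only the weighted-sum formula comes from the slides.

```python
# Minimal sketch of the linear approximator V_hat(s) = theta . f(s),
# using the grid example's features. Names and the (3, 3) goal are
# illustrative; only the formula itself comes from the slides.

def features(s):
    """Feature vector f(s) for state s = (x, y); the leading 1.0
    makes theta[0] play the role of the bias weight theta_0."""
    x, y = s
    z = abs(3 - x) + abs(3 - y)   # distance-to-goal feature from slide 7
    return [1.0, x, y, z]

def v_hat(theta, s):
    """V_hat(s) = theta_0 + theta_1*f1(s) + ... + theta_n*fn(s)."""
    return sum(t * f for t, f in zip(theta, features(s)))

theta = [10.0, 0.0, 0.0, -1.0]   # the exact weights from slide 7
print(v_hat(theta, (0, 0)))      # 10 - (|3-0| + |3-0|) = 4.0
```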
Slide 10: Aside: Gradient Descent
- Given a function f(θ₁, ..., θₙ) of n real values Θ = (θ₁, ..., θₙ), suppose we want to minimize f with respect to Θ.
- A common approach to doing this is gradient descent.
- The gradient of f at point Θ, denoted ∇f(Θ), is an n-dimensional vector that points in the direction in which f increases most steeply at Θ.
- Vector calculus tells us that ∇f(Θ) is just the vector of partial derivatives:

    ∇f(Θ) = ( ∂f(Θ)/∂θ₁, ..., ∂f(Θ)/∂θₙ )

  where ∂f(Θ)/∂θᵢ = lim (ε→0) [ f(θ₁, ..., θᵢ + ε, ..., θₙ) - f(θ₁, ..., θᵢ, ..., θₙ) ] / ε

- We can decrease f by moving in the negative gradient direction.

Slide 11: Aside: Gradient Descent for Squared Error
- Suppose we have a sequence of states and target values for each state: <s₁, v(s₁)>, <s₂, v(s₂)>, ...
  - E.g., produced by the TD-based RL loop.
- Our goal is to minimize the sum of squared errors between our estimated function and each target value. The squared error of example j is

    Eⱼ(Θ) = 1/2 ( V̂(sⱼ) - v(sⱼ) )²

  where V̂(sⱼ) is our estimated value for the j'th state and v(sⱼ) is the target value for the j'th state.
- After seeing the j'th state, the gradient descent rule tells us that we can decrease the error by updating the parameters by

    θᵢ ← θᵢ - α ∂Eⱼ(Θ)/∂θᵢ

  where α is the learning rate.

Slide 12: Aside: Continued
- By the chain rule,

    ∂Eⱼ(Θ)/∂θᵢ = ( V̂(sⱼ) - v(sⱼ) ) ∂V̂(sⱼ)/∂θᵢ

  where the second factor depends on the form of the approximator.
- For a linear approximation function V̂(s) = θ₀ + θ₁ f₁(s) + ... + θₙ fₙ(s), we have ∂V̂(sⱼ)/∂θᵢ = fᵢ(sⱼ).
- Thus the update becomes:

    θᵢ ← θᵢ + α ( v(sⱼ) - V̂(sⱼ) ) fᵢ(sⱼ)

- For linear functions this update is guaranteed to converge to the best approximation for a suitable learning rate schedule.

Slide 13: TD-based RL for Linear Approximators
1. Start with initial parameter values.
2. Take an action according to an explore/exploit policy (should converge to the greedy policy, i.e. GLIE). Transition from s to s'.
3. Update the estimated model.
4. Perform the TD update for each parameter: θᵢ ← θᵢ + α ( v(s) - V̂(s) ) fᵢ(s).
5. Go to 2.

What should we use for the "target value" v(s)?
- Use the TD prediction based on the next state s': v(s) = R(s) + β V̂(s').
- This is the same as the previous TD method, only with approximation.

Slide 14: TD-based RL for Linear Approximators
1. Start with initial parameter values.
2. Take an action according to an explore/exploit policy (should converge to the greedy policy, i.e. GLIE).
3. Update the estimated model.
4. Perform the TD update for each parameter: θᵢ ← θᵢ + α ( R(s) + β V̂(s') - V̂(s) ) fᵢ(s).
5. Go to 2.

- Step 2 requires a model to select the greedy action.
- For some applications (e.g. Backgammon, as we will see later) it is easy to get a model (but not easy to get a policy).
- For others it is difficult to get a good model.

Slide 15: Q-function Approximation
- Define a set of features over state-action pairs: f₁(s, a), ..., fₙ(s, a).
  - State-action pairs with similar feature values will be treated similarly.
  - More complex functions require more complex features.
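The slides cut off here, before giving the Q update itself. As a forward-looking sketch, the per-parameter update from slide 14 carries over directly to a linear Q approximator. Everything below that is not on the slides is an assumption for illustration: the environment interface (reset/step/actions), the feature map, and ε-greedy standing in for a full GLIE policy.

```python
import random

# Sketch: Q-learning with a linear Q approximator, combining the
# per-parameter TD update from slide 14 with the state-action
# features of slide 15. The env interface and feature map are
# assumptions for illustration, not from the slides.

def q_hat(theta, feats):
    """Q_hat(s, a) = sum_i theta_i * f_i(s, a)."""
    return sum(t * f for t, f in zip(theta, feats))

def q_learning(env, features, n_feats, episodes=1000,
               alpha=0.05, beta=0.95, epsilon=0.1):
    theta = [0.0] * n_feats
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Explore/exploit (epsilon-greedy stands in for GLIE here).
            if random.random() < epsilon:
                a = random.choice(env.actions(s))
            else:
                a = max(env.actions(s),
                        key=lambda act: q_hat(theta, features(s, act)))
            s2, r, done = env.step(a)
            # TD target: r + beta * max_a' Q_hat(s', a'), or just r at the end.
            target = r if done else r + beta * max(
                q_hat(theta, features(s2, a2)) for a2 in env.actions(s2))
            # Slide-14-style update, per parameter:
            # theta_i += alpha * (target - Q_hat(s, a)) * f_i(s, a)
            f = features(s, a)
            delta = target - q_hat(theta, f)
            for i in range(n_feats):
                theta[i] += alpha * delta * f[i]
            s = s2
    return theta
```

Note that, unlike the supervised setting on slide 12, this bootstrapped variant is not guaranteed to converge in general; the sketch only illustrates the shape of the update.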