**Unformatted text preview:**

CS 188Fall 2019 Section Handout 6 SolutionsTemporal Difference LearningTemporal difference learning (TD learning) uses the idea of learning from every experience, rather than simplykeeping track of total rewards and number of times states are visited and learning at the end as direct evaluationdoes. In policy evaluation, we used the system of equations generated by our fixed policy and the Bellmanequation to determine the values of states under that policy (or used iterative updates like with value iteration).Vπ(s) =Xs0T (s, π(s), s0)[R(s, π(s), s0) + γVπ(s0)]Each of these equations equates the value of one state to the weighted average over the discounted values of thatstate’s successors plus the rewards reaped in transitioning to them. TD learning tries to answer the questionof how to compute this weighted average without the weights, cleverly doing so with an exponential movingaverage. We begin by initializing ∀s, Vπ(s) = 0. At each timestep, an agent takes an action π(s) from a states, transitions to a state s0, and receives a reward R(s, π(s), s0). We can obtain a sample value by summingthe received reward with the discounted current value of s0under π:sample = R(s, π(s), s0) + γVπ(s0)This sample is a new estimate for Vπ(s). The next step is to incorporate this sampled estimate into our existingmodel for Vπ(s) with the exponential moving average, which adheres to the following update rule:Vπ(s) ← (1 − α)Vπ(s) + α · sampleAbove, α is a parameter constrained by 0 ≤ α ≤ 1 known as the learning rate that specifies the weight wewant to assign our existing model for Vπ(s), 1 − α, and the weight we want to assign our new sampled estimate,α. It’s typical to start out with learning rate of α = 1, accordingly assigning Vπ(s) to whatever the first samplehappens to be, and slowly shrinking it towards 0, at which point all subsequent samples will be zeroed out andstop affecting our model of Vπ(s).Let’s stop and analyze the update rule for a minute. Annotating the state of our model at different points intime by defining Vπk(s) and samplekas the estimated value of state s after the kthupdate and the kthsamplerespectively, we can reexpress our update rule:Vπk(s) ← (1 − α)Vπk−1(s) + α · samplekThis recursive definition for Vπk(s) happens to be very interesting to expand:Vπk(s) ← (1 − α)Vπk−1(s) + α · samplekVπk(s) ← (1 − α)[(1 − α)Vπk−2(s) + α · samplek−1] + α · samplekVπk(s) ← (1 − α)2Vπk−2(s) + (1 − α) · α · samplek−1+ α · samplek...Vπk(s) ← (1 − α)kVπ0(s) + α · [(1 − α)k−1· sample1+ . . . + (1 − α) · samplek−1+ samplek]Vπk(s) ← α · [(1 − α)k−1· sample1+ . . . + (1 − α) · samplek−1+ samplek]Because 0 ≤ (1 − α) ≤ 1, as we raise the quantity (1 − α) to increasingly larger powers, it grows closer andcloser to 0. By the update rule expansion we derived, this means that older samples are given exponentially less1weight, exactly what we want since these older samples are computed using older (and hence worse) versions ofour model for Vπ(s)! This is the beauty of temporal difference learning - with a single straightfoward updaterule, we are able to:• learn at every timestep, hence using information about state transitions as we get them since we’re usingiteratively updating versions of Vπ(s0) in our samples rather than waiting until the end to perform anycomputation.• give exponentially less weight to older, potentially less accurate samples.• converge to learning true state values much faster with fewer episodes than direct evaluation.Q-LearningBoth direct evaluation and TD learning will eventually learn the true value of all states under the policy theyfollow. However, they both have a major inherent issue - we want to find an optimal policy for our agent,which requires knowledge of the q-values of states. To compute q-values from the values we have, we require atransition function and reward function as dictated by the Bellman equation.Q∗(s, a) =Xs0T (s, a, s0)[R(s, a, s0) + γV∗(s0)]Resultingly, TD learning or direct evaluation are typically used in tandem with some model-based learning toacquire estimates of T and R in order to effectively update the policy followed by the learning agent. Thisbecame avoidable by a revolutionary new idea known as Q-learning, which proposed learning the q-values ofstates directly, bypassing the need to ever know any values, transition functions, or reward functions. As aresult, Q-learning is entirely model-free. Q-learning uses the following update rule to perform what’s known asq-value iteration:Qk+1(s, a) ←Xs0T (s, a, s0)[R(s, a, s0) + γ maxa0Qk(s0, a0)]Note that this update is only a slight modification over the update rule for value iteration. Indeed, the only realdifference is that the position of the max operator over actions has been changed since we select an action beforetransitioning when we’re in a state, but we transition before selecting a new action when we’re in a q-state.With this new update rule under our belt, Q-learning is derived essentially the same way as TD learning, byacquiring q-value samples:sample = R(s, a, s0) + γ maxa0Q(s0, a0)and incoporating them into an exponential moving average.Q(s, a) ← (1 − α)Q(s, a) + α · sampleAs long as we spend enough time in exploration and decrease the learning rate α at an appropriate pace,Q-learning learns the optimal q-values for every q-state. This is what makes Q-learning so revolutionary -while TD learning and direct evaluation learn the values of states under a policy by following the policy beforedetermining policy optimality via other techniques, Q-learning can learn the optimal policy directly even bytaking suboptimal or random actions. This is called off-policy learning (contrary to direct evaluation andTD learning, which are examples of on-policy learning).2Q1. Pacman with Feature-Based Q-LearningWe would like to use a Q-learning agent for Pacman, but the size of the state space for a large grid is toomassive to hold in memory. To solve this, we will switch to feature-based representation of Pacman’s state.(a) We will have two features, Fgand Fp, defined as follows:Fg(s, a) = A(s) + B(s, a) + C(s, a)Fp(s, a) = D(s) + 2E(s, a)whereA(s) = number of ghosts within 1 step of state sB(s, a) = number of ghosts Pacman touches after taking action a from state sC(s, a) = number of ghosts within 1 step of the

View Full Document