Reinforcement Learning and the Temporal Difference Algorithm
By John Lenz

1. Introduction to Reinforcement Learning

Reinforcement learning is learning to maximize a reward signal by exploring many possible actions. The agent is not told the correct actions; instead it explores the possible actions and remembers the reward it receives. With supervised learning, an agent takes an action and is then told what the correct action was. For example, the agent will classify a picture as the number "3" and the teacher will explain that the picture is the number "8". In reinforcement learning, the agent takes an action and then receives a reward based on that action; there is no teacher to give the correct action. In some problems, such as games like checkers or chess, the "correct" action isn't even known.

Reinforcement learning can be applied to many control problems where there is no expert knowledge about the task. Reinforcement learning attempts to mimic one of the major ways humans learn. Instead of being told what to do, we learn through experience; in our interaction with the environment, we feel pain and pleasure punishing or rewarding us for our actions. In a similar way, a reinforcement learning agent learns to interact with an unknown and unspecified environment. Reinforcement learning can be applied to any goal-directed decision-making problem; specific knowledge about the environment and expert teaching are not required.

The reinforcement learning problem is modeled as an agent continuously interacting with an environment. The agent and the environment interact in a sequence of time steps. At each time step t, the agent receives the state of the environment and a scalar numerical reward for the previous action, and then the agent selects an action. A time sequence t = 1, 2, 3, ... can either form an episode, where the state is reset to the initial state and t is reset to 1 after a specific terminal state is reached, or time can continue marching toward infinity to form a continuing task. A game like checkers or tic-tac-toe is an example of an episodic task, where the agent plays until the game is won or lost, while a robot-controlled vacuum tasked with keeping a room clean is a continuing task.

[Figure: the agent-environment interaction loop. The agent sends an action to the environment and receives back the new state and a reward.]

Formally, the goal of the agent is to maximize the expected return, that is, the sum of all future rewards discounted by a factor $\gamma$ between zero and one:

    $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$

For episodic tasks like games, all future rewards past the end of the episode are defined to be zero. The goal of the agent is to learn a policy, which is how the agent selects actions; it maps states to actions.

All reinforcement learning algorithms are based on estimating the value function, which is the expected return starting from state $s_t$ and following policy $\pi$:

    $V^{\pi}(s) = E_{\pi}\{ R_t \mid s_t = s \}$

The value function is an estimate of how good it is for an agent choosing actions according to the policy $\pi$ to be in a given state. Similarly, the state-action value function is the expected return starting from state s, taking action a, and thereafter following policy $\pi$, defined as

    $Q^{\pi}(s, a) = E_{\pi}\{ R_t \mid s_t = s, a_t = a \}$

The reinforcement learning problem is to find an optimal policy and optimal value function. A policy $\pi$ is defined to be better than a policy $\pi'$ if its expected return is greater than or equal to that of $\pi'$ for all states. That is, $\pi \geq \pi'$ if and only if $V^{\pi}(s) \geq V^{\pi'}(s)$ for all s. There is always at least one policy that is greater than or equal to all other policies, and these form the set of optimal policies, denoted $\pi^*$. They all share the same value function, called the optimal value function, denoted $V^*$ and defined as

    $V^*(s) = \max_{\pi} V^{\pi}(s)$

The optimal state-action value function is defined as

    $Q^*(s, a) = \max_{\pi} Q^{\pi}(s, a) = E\{ r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a \}$
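To make these definitions concrete, the following short Python sketch (not part of the original paper) computes the discounted return of a small episode and reads a greedy policy out of a tabular state-action value function. The reward list and the Q table are made-up illustrative values.

    # Illustrative only: the reward sequence and Q table below are invented numbers,
    # used to show how the return R_t and a greedy policy look in code.

    def discounted_return(rewards, gamma=0.9):
        """R_t = sum_k gamma^k * r_{t+k+1} for a finite (episodic) list of rewards."""
        total = 0.0
        for k, r in enumerate(rewards):
            total += (gamma ** k) * r
        return total

    def greedy_policy(Q):
        """pi(s) = argmax_a Q(s, a) for a tabular state-action value function."""
        return {s: max(actions, key=actions.get) for s, actions in Q.items()}

    rewards = [0.0, 0.0, 1.0]          # r_{t+1}, r_{t+2}, r_{t+3} of a short episode
    print(discounted_return(rewards))  # 0.81, i.e. 0.9**2 * 1.0

    Q = {"s0": {"left": 0.2, "right": 0.7},
         "s1": {"left": 0.5, "right": 0.1}}
    print(greedy_policy(Q))            # {'s0': 'right', 's1': 'left'}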
2. Temporal Difference Learning

An obvious approach to learning the value function is to update the estimate of the value function once the actual return is known:

    $V(s_t) \leftarrow V(s_t) + \alpha [ R_t - V(s_t) ]$

This method is called the constant-α Monte Carlo method, where α is a learning parameter between 0 and 1. Since the actual return is the sum of all future rewards, this algorithm must wait until the end of the episode, when the actual return is known, before the value function is updated. Richard Sutton proposed to instead estimate the expected return by the next reward plus the value of the next state. The update to the value function takes the difference of successive estimates of the value function, hence the name temporal difference. The simplest method, known as TD(0), is given by

    $V_t(s_t) \leftarrow V_t(s_t) + \alpha [ r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t) ]$

More generally, the expected return can be estimated by the next n rewards,

    $R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \ldots + \gamma^{n-1} r_{t+n} + \gamma^n V_t(s_{t+n})$

    $V_t(s_t) \leftarrow V_t(s_t) + \alpha [ R_t^{(n)} - V_t(s_t) ]$

Richard Sutton proposed the TD(λ) algorithm, which mixes TD(0) and Monte Carlo methods and uses a weighted average of n-step returns. The resulting estimate of the expected return, called the λ-return, is defined as

    $R_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}$

and the corresponding update is

    $V_t(s_t) \leftarrow V_t(s_t) + \alpha [ R_t^{\lambda} - V_t(s_t) ]$

If λ = 0, this reduces to $R_t^{(1)}$, which is the TD(0) algorithm. If λ = 1, this reduces to $R_t^{(\infty)}$, or just $R_t$, which is the constant-α Monte Carlo method. The parameter λ can vary between 0 and 1 and provides a trade-off between updating based on the final result and updating based only on the next estimate.

The definition given above is called the forward view of TD(λ), since the weight update requires knowledge about the future. Richard Sutton also proved an equivalent view of TD(λ), called the backward view. In the backward view of TD(λ), every state has an eligibility trace at time t, denoted $e_t(s)$. On each step, the eligibility traces of all states are decayed by γλ and the trace of the current state is incremented by one:

    $e_t(s) = \gamma \lambda \, e_{t-1}(s) + \mathbf{1}[s = s_t]$

The TD(λ) algorithm becomes

    $\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)$
    $V_{t+1}(s) = V_t(s) + \alpha \, \delta_t \, e_t(s)$ for all s

In the backward view, the value function of all recently visited states is updated by $\delta_t$, where "recently visited" is defined in terms of the state's eligibility trace.

Temporal difference learning can be easily extended to the control problem, that is, learning the optimal policy $\pi^*$. One policy-learning method is to use an on-line ε-greedy policy while training: instead of learning V(s), learn Q(s, a), and then at every time step select the action with the largest value of Q(s, a) 1 - ε percent of the time and a random action the remaining ε percent of the time.
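As a rough illustration of the backward view and the ε-greedy control idea described above, here is a Python sketch of tabular Sarsa(λ), which keeps eligibility traces over state-action pairs rather than over states. The environment interface (env.reset() returning a state, env.step(action) returning the next state, a reward, and a done flag) and the ACTIONS list are assumptions made for this sketch, not something defined in the paper.

    import random
    from collections import defaultdict

    ACTIONS = ["left", "right"]   # assumed action set for illustration

    def epsilon_greedy(Q, state, epsilon):
        """Pick argmax_a Q(state, a) with probability 1 - epsilon, a random action otherwise."""
        if random.random() < epsilon:
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: Q[(state, a)])

    def sarsa_lambda(env, episodes=500, alpha=0.1, gamma=0.9, lam=0.8, epsilon=0.1):
        Q = defaultdict(float)                 # tabular state-action value estimates
        for _ in range(episodes):
            traces = defaultdict(float)        # eligibility traces, reset each episode
            state = env.reset()
            action = epsilon_greedy(Q, state, epsilon)
            done = False
            while not done:
                next_state, reward, done = env.step(action)
                next_action = epsilon_greedy(Q, next_state, epsilon)
                # Temporal-difference error: delta = r + gamma * Q(s', a') - Q(s, a),
                # with the bootstrap term dropped at the end of an episode.
                bootstrap = 0.0 if done else gamma * Q[(next_state, next_action)]
                delta = reward + bootstrap - Q[(state, action)]
                # Decay every trace by gamma * lambda, then bump the current pair by one.
                for key in list(traces):
                    traces[key] *= gamma * lam
                traces[(state, action)] += 1.0
                # Every recently visited state-action pair moves toward the TD target.
                for key, trace in traces.items():
                    Q[key] += alpha * delta * trace
                state, action = next_state, next_action
        return Q

Setting lam to 0 recovers one-step Sarsa, the control analogue of TD(0), while values closer to 1 weight the eventual outcome of the episode more heavily, mirroring the λ trade-off described above.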

