UW-Madison ECE 539 - Temporal Difference Learning

Temporal Difference Learning
By John Lenz

Reinforcement Learning
• An agent interacts with an environment.
• The agent receives a reward signal based on its previous action.
• The goal is to maximize the reward signal over time.
• Epoch (episodic) vs. continuous tasks.
• No expert teacher.

Value Function
• The actual return, Rt, is the sum of all future rewards.
• The value function Vπ(s) = E{Rt | st = s} is the expected return starting from state s and following policy π.
• The state-action value function Qπ(s,a) is the expected return starting from state s, taking action a, and then following policy π.

Temporal Difference Learning
• Constant-α Monte Carlo method: update weights only when the actual return is known, ΔV(st) = α[Rt − V(st)].
• Must wait until the actual return is known (the end of the epoch) to update weights.
• Instead, estimate the actual return as the next reward plus the value of the next state.

Forward view of TD(λ)
• Estimate the actual return by the n-step return
  Rt(n) = rt+1 + γ rt+2 + … + γ^(n−1) rt+n + γ^n V(st+n)
• TD(λ) is a weighted average of the n-step returns:
  Rt^λ = (1 − λ) Σn λ^(n−1) Rt(n)
• λ = 0: look only at the next reward (one-step TD).
• λ = 1: constant-α Monte Carlo.
  (A numeric sketch of the λ-return appears after the slides.)

Backward view of TD(λ)
• Each state has an eligibility trace. The trace of the current state is incremented by one, and all traces decay by γλ each step.
• Then, on every time step, the value of every recently visited state is updated; the eligibility trace determines which states have been recently visited.

Backward view continued
• et(s) = γλ et−1(s)        if s ≠ st
• et(s) = γλ et−1(s) + 1    if s = st
• δt = rt+1 + γ Vt(st+1) − Vt(st)
• ΔVt(s) = α δt et(s)   for all s
  (A tabular code sketch of this update appears after the slides.)

Control Problem: Sarsa(λ)
• ε-greedy policy using Q(s,a) instead of V(s).
• Each state-action pair has an eligibility trace e(s,a).
  (A code sketch appears after the slides.)

Function Approximation
• Until now, V(s) and Q(s,a) were assumed to be implemented as a huge table.
• Instead, approximate the value function V(s) and the state-action value function Q(s,a) with any supervised learning method:
• radial basis networks, support vector machines, artificial neural networks, clustering, etc.

My Implementation
• C++.
• Uses a two-layer feed-forward artificial neural network for function approximation.
• The agent, which implements the learning algorithm, is built on top of the neural network.
• The agent is independent of the environment.
  (A sketch of TD(λ) with a small feed-forward network appears after the slides.)

Hill Climbing Problem
• The goal is to reach the top of the hill, but the car cannot accelerate all the way to the top.
• The car must first move away from the goal to gain the momentum needed to reach the top.

Hill Climbing Results
• The initial run took 65,873 steps, but by the ninth epoch the car reached the goal in 186 steps.

Games
• N-in-a-row games such as tic-tac-toe, and board games like checkers, backgammon, and chess.
• Use the after-state value function to select moves.
• TD-Gammon plays at the level of the best human players.
• Can learn through self-play.

Implementation
• I implemented tic-tac-toe and checkers.
• After around 30,000 games of self-play, the tic-tac-toe program learned to play a decent game.
• Checkers was less successful: even after 400,000 self-play games, the agent could not beat a traditional …
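
A numeric sketch of the forward view, in C++ to match the implementation language on the "My Implementation" slide. It builds the n-step returns Rt(n) and combines them with weights (1 − λ) λ^(n−1), checking that λ = 0 reproduces the one-step TD target and λ = 1 the constant-α Monte Carlo return. This is an illustrative reading of the formula, not the author's code; the rewards, value estimates, and three-step episode are made-up numbers.

    // Forward view of TD(lambda): compute the lambda-return from n-step returns.
    #include <cmath>
    #include <cstdio>
    #include <vector>

    // lambda-return for time step t of an episode that terminates after the
    // last reward; rewards[k] is r_{k+1}, values[k] is V(s_k).
    double lambda_return(const std::vector<double>& rewards,
                         const std::vector<double>& values,
                         int t, double gamma, double lambda) {
        const int T = static_cast<int>(rewards.size());   // terminal time
        double result = 0.0;
        double weight = 1.0 - lambda;                      // (1-lambda) * lambda^(n-1)
        double nstep = 0.0, discount = 1.0;
        for (int n = 1; t + n <= T; ++n) {
            nstep += discount * rewards[t + n - 1];        // add gamma^(n-1) * r_{t+n}
            discount *= gamma;
            if (t + n < T) {
                // Interior n-step return bootstraps from V(s_{t+n}).
                result += weight * (nstep + discount * values[t + n]);
                weight *= lambda;
            } else {
                // Final term: the full Monte Carlo return gets the remaining
                // lambda^(T-t-1) weight.
                result += std::pow(lambda, T - t - 1) * nstep;
            }
        }
        return result;
    }

    int main() {
        std::vector<double> rewards{0.0, 0.0, 1.0};        // r_1, r_2, r_3 (made up)
        std::vector<double> values{0.1, 0.2, 0.5};         // V(s_0), V(s_1), V(s_2)
        double gamma = 0.9;
        std::printf("lambda=0.0: %.4f (one-step TD target)\n",
                    lambda_return(rewards, values, 0, gamma, 0.0));
        std::printf("lambda=1.0: %.4f (Monte Carlo return)\n",
                    lambda_return(rewards, values, 0, gamma, 1.0));
        std::printf("lambda=0.8: %.4f\n",
                    lambda_return(rewards, values, 0, gamma, 0.8));
    }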
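
A minimal tabular sketch of the backward view: one value entry and one eligibility trace per state, using exactly the rules from the "Backward view continued" slide (increment e(st) by one, decay all traces by γλ, δ = r + γV(s′) − V(s), ΔV(s) = α δ e(s)). The 7-state random-walk environment, step size, and episode count are assumptions chosen for illustration, not the slides' hill-climbing task.

    // Backward view of TD(lambda) for value prediction with a tabular V(s).
    #include <array>
    #include <cstdio>
    #include <random>

    int main() {
        constexpr int kStates = 7;          // states 0..6; 0 and 6 are terminal
        constexpr double alpha = 0.1;       // step size
        constexpr double gamma = 1.0;       // discount
        constexpr double lambda = 0.8;      // trace decay

        std::array<double, kStates> V{};    // value estimates, start at 0
        std::mt19937 rng(0);
        std::bernoulli_distribution coin(0.5);

        for (int episode = 0; episode < 1000; ++episode) {
            std::array<double, kStates> e{};    // eligibility traces, reset each epoch
            int s = 3;                          // start in the middle
            while (s != 0 && s != kStates - 1) {
                int s_next = coin(rng) ? s + 1 : s - 1;
                double r = (s_next == kStates - 1) ? 1.0 : 0.0;  // reward 1 at right end
                bool terminal = (s_next == 0 || s_next == kStates - 1);

                // TD error: delta = r + gamma * V(s') - V(s)
                double delta = r + (terminal ? 0.0 : gamma * V[s_next]) - V[s];

                e[s] += 1.0;  // accumulate the trace of the current state
                // Update every state in proportion to its eligibility, then decay.
                for (int i = 0; i < kStates; ++i) {
                    V[i] += alpha * delta * e[i];
                    e[i] *= gamma * lambda;
                }
                s = s_next;
            }
        }
        for (int i = 1; i < kStates - 1; ++i) std::printf("V(%d) = %.3f\n", i, V[i]);
    }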
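
A sketch of Sarsa(λ) as outlined on the "Control Problem" slide: an ε-greedy policy over Q(s,a) and an eligibility trace per state-action pair. The short corridor environment, the small step cost, and all parameter values are hypothetical stand-ins; only the trace and update rules follow the slides.

    // Sarsa(lambda): tabular Q(s,a) with eligibility traces e(s,a) and an
    // epsilon-greedy policy, on a small corridor where the goal is to walk right.
    #include <array>
    #include <cstdio>
    #include <random>

    constexpr int kStates = 10;   // state 9 is the goal
    constexpr int kActions = 2;   // 0 = left, 1 = right

    int main() {
        constexpr double alpha = 0.1, gamma = 0.95, lambda = 0.9, epsilon = 0.1;
        std::array<std::array<double, kActions>, kStates> Q{};
        std::mt19937 rng(0);
        std::uniform_real_distribution<double> unif(0.0, 1.0);
        std::uniform_int_distribution<int> rand_action(0, kActions - 1);

        auto greedy = [&](int s) { return Q[s][1] > Q[s][0] ? 1 : 0; };
        auto choose = [&](int s) {
            return unif(rng) < epsilon ? rand_action(rng) : greedy(s);
        };

        for (int episode = 0; episode < 500; ++episode) {
            std::array<std::array<double, kActions>, kStates> e{};  // traces
            int s = 0;
            int a = choose(s);
            while (s != kStates - 1) {
                int s_next = (a == 1) ? s + 1 : (s > 0 ? s - 1 : 0);
                double r = (s_next == kStates - 1) ? 1.0 : -0.01;   // small step cost
                int a_next = choose(s_next);

                bool terminal = (s_next == kStates - 1);
                double target = terminal ? r : r + gamma * Q[s_next][a_next];
                double delta = target - Q[s][a];

                e[s][a] += 1.0;  // accumulate the trace for the pair just taken
                for (int i = 0; i < kStates; ++i)
                    for (int j = 0; j < kActions; ++j) {
                        Q[i][j] += alpha * delta * e[i][j];
                        e[i][j] *= gamma * lambda;
                    }
                s = s_next;
                a = a_next;
            }
        }
        std::printf("greedy action in state 0: %d (expect 1 = right)\n", greedy(0));
    }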
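
The slides state that the author's implementation approximates V(s) with a two-layer feed-forward neural network but do not show the update rule. A common way to combine TD(λ) with such a network, sketched here purely as an assumption rather than the author's method, is to keep one eligibility trace per weight: e ← γλ e + ∂V/∂w, then w ← w + α δ e. The network size, tanh hidden units, and the 4-dimensional state features below are made up for illustration.

    // TD(lambda) with a small two-layer feed-forward network as V(s),
    // using one eligibility trace per weight.
    #include <cmath>
    #include <cstdio>
    #include <random>
    #include <vector>

    struct ValueNet {
        int n_in, n_hidden;
        std::vector<double> w1, b1, w2;   // hidden weights/biases, output weights
        double b2 = 0.0;
        std::vector<double> h;            // last hidden activations (for the gradient)

        ValueNet(int in, int hidden, std::mt19937& rng)
            : n_in(in), n_hidden(hidden),
              w1(in * hidden), b1(hidden), w2(hidden), h(hidden) {
            std::uniform_real_distribution<double> u(-0.1, 0.1);
            for (auto& w : w1) w = u(rng);
            for (auto& w : w2) w = u(rng);
        }

        // Forward pass: V(s) = w2 . tanh(W1 s + b1) + b2
        double value(const std::vector<double>& s) {
            double v = b2;
            for (int j = 0; j < n_hidden; ++j) {
                double z = b1[j];
                for (int i = 0; i < n_in; ++i) z += w1[j * n_in + i] * s[i];
                h[j] = std::tanh(z);
                v += w2[j] * h[j];
            }
            return v;
        }

        // Gradient of V(s) w.r.t. all weights, flattened as (w1, b1, w2, b2);
        // assumes value(s) was just called on the same s.
        std::vector<double> grad(const std::vector<double>& s) {
            std::vector<double> g(w1.size() + b1.size() + w2.size() + 1);
            int k = 0;
            for (int j = 0; j < n_hidden; ++j) {
                double dh = w2[j] * (1.0 - h[j] * h[j]);  // through tanh
                for (int i = 0; i < n_in; ++i) g[k++] = dh * s[i];
            }
            for (int j = 0; j < n_hidden; ++j) g[k++] = w2[j] * (1.0 - h[j] * h[j]);
            for (int j = 0; j < n_hidden; ++j) g[k++] = h[j];
            g[k] = 1.0;
            return g;
        }

        // Apply w += alpha * delta * e, in the same flattened order as grad().
        void apply(const std::vector<double>& e, double alpha, double delta) {
            int k = 0;
            for (auto& w : w1) w += alpha * delta * e[k++];
            for (auto& b : b1) b += alpha * delta * e[k++];
            for (auto& w : w2) w += alpha * delta * e[k++];
            b2 += alpha * delta * e[k];
        }
    };

    // One TD(lambda) update with per-weight traces.
    void td_step(ValueNet& net, std::vector<double>& e,
                 const std::vector<double>& s, double r,
                 const std::vector<double>& s_next, bool terminal,
                 double alpha, double gamma, double lambda) {
        double v_next = terminal ? 0.0 : net.value(s_next);
        double v = net.value(s);                   // also caches hidden activations
        double delta = r + gamma * v_next - v;     // TD error
        std::vector<double> g = net.grad(s);
        for (size_t k = 0; k < e.size(); ++k)
            e[k] = gamma * lambda * e[k] + g[k];   // decay and accumulate the trace
        net.apply(e, alpha, delta);
    }

    int main() {
        std::mt19937 rng(0);
        ValueNet net(4, 8, rng);
        std::vector<double> trace(4 * 8 + 8 + 8 + 1, 0.0);
        // Single illustrative transition with made-up features and reward.
        std::vector<double> s{0.1, -0.2, 0.3, 0.0}, s_next{0.2, -0.1, 0.3, 0.1};
        td_step(net, trace, s, /*r=*/0.5, s_next, /*terminal=*/false,
                /*alpha=*/0.01, /*gamma=*/0.99, /*lambda=*/0.8);
        std::printf("V(s) after one update: %.4f\n", net.value(s));
    }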

