UW-Madison ECE 539 - Temporal Difference Learning

Temporal Difference Learning
By John Lenz

Reinforcement Learning
• An agent interacts with an environment.
• The agent receives a reward signal based on its previous action.
• The goal is to maximize the reward signal over time.
• Epoch (episodic) vs. continuous tasks.
• No expert teacher.

Value Function
• The actual return, Rt, is the sum of all future rewards.
• The value function Vπ(s) = E{Rt | st = s} is the expected return starting from state s and following policy π.
• The state-action value function Qπ(s,a) is the expected return starting from state s, taking action a, and then following policy π.

Temporal Difference Learning
• Constant-α Monte Carlo method: update weights only when the actual return is known, ΔV(st) = α[Rt − V(st)].
• Must wait until the actual return is known (the end of the epoch) to update weights.
• Instead, estimate the actual return as the next reward plus the value of the next state.

Forward view of TD(λ)
• Estimate the actual return by the n-step return
  Rt(n) = rt+1 + γ rt+2 + … + γ^(n−1) rt+n + γ^n V(st+n)
• TD(λ) is a weighted average of the n-step returns:
  Rt^λ = (1 − λ) Σn λ^(n−1) Rt(n)
• λ = 0: look only at the next reward (one-step TD).
• λ = 1: constant-α Monte Carlo.
  (A numeric sketch of the λ-return appears after the slides.)

Backward view of TD(λ)
• Each state has an eligibility trace. The trace of the current state is incremented by one, and all traces decay by γλ each step.
• Then, on every time step, the value of every recently visited state is updated; the eligibility trace determines which states have been recently visited.

Backward view continued
• et(s) = γλ et−1(s)        if s ≠ st
• et(s) = γλ et−1(s) + 1    if s = st
• δt = rt+1 + γ Vt(st+1) − Vt(st)
• ΔVt(s) = α δt et(s)   for all s
  (A tabular code sketch of this update appears after the slides.)

Control Problem: Sarsa(λ)
• ε-greedy policy using Q(s,a) instead of V(s).
• Each state-action pair has an eligibility trace e(s,a).
  (A code sketch appears after the slides.)

Function Approximation
• Until now, V(s) and Q(s,a) were assumed to be implemented as a huge table.
• Instead, approximate the value function V(s) and the state-action value function Q(s,a) with any supervised learning method:
• radial basis networks, support vector machines, artificial neural networks, clustering, etc.

My Implementation
• C++.
• Uses a two-layer feed-forward artificial neural network for function approximation.
• The agent, which implements the learning algorithm, is built on top of the neural network.
• The agent is independent of the environment.
  (A sketch of TD(λ) with a small feed-forward network appears after the slides.)

Hill Climbing Problem
• The goal is to reach the top of the hill, but the car cannot accelerate all the way to the top.
• The car must first move away from the goal to gain the momentum needed to reach the top.

Hill Climbing Results
• The initial run took 65,873 steps, but by the ninth epoch the car reached the goal in 186 steps.

Games
• N-in-a-row games such as tic-tac-toe, and board games like checkers, backgammon, and chess.
• Use the after-state value function to select moves.
• TD-Gammon plays at the level of the best human players.
• Can learn through self-play.

Implementation
• I implemented tic-tac-toe and checkers.
• After around 30,000 games of self-play, the tic-tac-toe program learned to play a decent game.
• Checkers was less successful: even after 400,000 self-play games, the agent could not beat a traditional …
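
A numeric sketch of the forward view, in C++ to match the implementation language on the "My Implementation" slide. It builds the n-step returns Rt(n) and combines them with weights (1 − λ) λ^(n−1), checking that λ = 0 reproduces the one-step TD target and λ = 1 the constant-α Monte Carlo return. This is an illustrative reading of the formula, not the author's code; the rewards, value estimates, and three-step episode are made-up numbers.

    // Forward view of TD(lambda): compute the lambda-return from n-step returns.
    #include <cmath>
    #include <cstdio>
    #include <vector>

    // lambda-return for time step t of an episode that terminates after the
    // last reward; rewards[k] is r_{k+1}, values[k] is V(s_k).
    double lambda_return(const std::vector<double>& rewards,
                         const std::vector<double>& values,
                         int t, double gamma, double lambda) {
        const int T = static_cast<int>(rewards.size());   // terminal time
        double result = 0.0;
        double weight = 1.0 - lambda;                      // (1-lambda) * lambda^(n-1)
        double nstep = 0.0, discount = 1.0;
        for (int n = 1; t + n <= T; ++n) {
            nstep += discount * rewards[t + n - 1];        // add gamma^(n-1) * r_{t+n}
            discount *= gamma;
            if (t + n < T) {
                // Interior n-step return bootstraps from V(s_{t+n}).
                result += weight * (nstep + discount * values[t + n]);
                weight *= lambda;
            } else {
                // Final term: the full Monte Carlo return gets the remaining
                // lambda^(T-t-1) weight.
                result += std::pow(lambda, T - t - 1) * nstep;
            }
        }
        return result;
    }

    int main() {
        std::vector<double> rewards{0.0, 0.0, 1.0};        // r_1, r_2, r_3 (made up)
        std::vector<double> values{0.1, 0.2, 0.5};         // V(s_0), V(s_1), V(s_2)
        double gamma = 0.9;
        std::printf("lambda=0.0: %.4f (one-step TD target)\n",
                    lambda_return(rewards, values, 0, gamma, 0.0));
        std::printf("lambda=1.0: %.4f (Monte Carlo return)\n",
                    lambda_return(rewards, values, 0, gamma, 1.0));
        std::printf("lambda=0.8: %.4f\n",
                    lambda_return(rewards, values, 0, gamma, 0.8));
    }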
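
A minimal tabular sketch of the backward view: one value entry and one eligibility trace per state, using exactly the rules from the "Backward view continued" slide (increment e(st) by one, decay all traces by γλ, δ = r + γV(s′) − V(s), ΔV(s) = α δ e(s)). The 7-state random-walk environment, step size, and episode count are assumptions chosen for illustration, not the slides' hill-climbing task.

    // Backward view of TD(lambda) for value prediction with a tabular V(s).
    #include <array>
    #include <cstdio>
    #include <random>

    int main() {
        constexpr int kStates = 7;          // states 0..6; 0 and 6 are terminal
        constexpr double alpha = 0.1;       // step size
        constexpr double gamma = 1.0;       // discount
        constexpr double lambda = 0.8;      // trace decay

        std::array<double, kStates> V{};    // value estimates, start at 0
        std::mt19937 rng(0);
        std::bernoulli_distribution coin(0.5);

        for (int episode = 0; episode < 1000; ++episode) {
            std::array<double, kStates> e{};    // eligibility traces, reset each epoch
            int s = 3;                          // start in the middle
            while (s != 0 && s != kStates - 1) {
                int s_next = coin(rng) ? s + 1 : s - 1;
                double r = (s_next == kStates - 1) ? 1.0 : 0.0;  // reward 1 at right end
                bool terminal = (s_next == 0 || s_next == kStates - 1);

                // TD error: delta = r + gamma * V(s') - V(s)
                double delta = r + (terminal ? 0.0 : gamma * V[s_next]) - V[s];

                e[s] += 1.0;  // accumulate the trace of the current state
                // Update every state in proportion to its eligibility, then decay.
                for (int i = 0; i < kStates; ++i) {
                    V[i] += alpha * delta * e[i];
                    e[i] *= gamma * lambda;
                }
                s = s_next;
            }
        }
        for (int i = 1; i < kStates - 1; ++i) std::printf("V(%d) = %.3f\n", i, V[i]);
    }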
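
A sketch of Sarsa(λ) as outlined on the "Control Problem" slide: an ε-greedy policy over Q(s,a) and an eligibility trace per state-action pair. The short corridor environment, the small step cost, and all parameter values are hypothetical stand-ins; only the trace and update rules follow the slides.

    // Sarsa(lambda): tabular Q(s,a) with eligibility traces e(s,a) and an
    // epsilon-greedy policy, on a small corridor where the goal is to walk right.
    #include <array>
    #include <cstdio>
    #include <random>

    constexpr int kStates = 10;   // state 9 is the goal
    constexpr int kActions = 2;   // 0 = left, 1 = right

    int main() {
        constexpr double alpha = 0.1, gamma = 0.95, lambda = 0.9, epsilon = 0.1;
        std::array<std::array<double, kActions>, kStates> Q{};
        std::mt19937 rng(0);
        std::uniform_real_distribution<double> unif(0.0, 1.0);
        std::uniform_int_distribution<int> rand_action(0, kActions - 1);

        auto greedy = [&](int s) { return Q[s][1] > Q[s][0] ? 1 : 0; };
        auto choose = [&](int s) {
            return unif(rng) < epsilon ? rand_action(rng) : greedy(s);
        };

        for (int episode = 0; episode < 500; ++episode) {
            std::array<std::array<double, kActions>, kStates> e{};  // traces
            int s = 0;
            int a = choose(s);
            while (s != kStates - 1) {
                int s_next = (a == 1) ? s + 1 : (s > 0 ? s - 1 : 0);
                double r = (s_next == kStates - 1) ? 1.0 : -0.01;   // small step cost
                int a_next = choose(s_next);

                bool terminal = (s_next == kStates - 1);
                double target = terminal ? r : r + gamma * Q[s_next][a_next];
                double delta = target - Q[s][a];

                e[s][a] += 1.0;  // accumulate the trace for the pair just taken
                for (int i = 0; i < kStates; ++i)
                    for (int j = 0; j < kActions; ++j) {
                        Q[i][j] += alpha * delta * e[i][j];
                        e[i][j] *= gamma * lambda;
                    }
                s = s_next;
                a = a_next;
            }
        }
        std::printf("greedy action in state 0: %d (expect 1 = right)\n", greedy(0));
    }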
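
The slides state that the author's implementation approximates V(s) with a two-layer feed-forward neural network but do not show the update rule. A common way to combine TD(λ) with such a network, sketched here purely as an assumption rather than the author's method, is to keep one eligibility trace per weight: e ← γλ e + ∂V/∂w, then w ← w + α δ e. The network size, tanh hidden units, and the 4-dimensional state features below are made up for illustration.

    // TD(lambda) with a small two-layer feed-forward network as V(s),
    // using one eligibility trace per weight.
    #include <cmath>
    #include <cstdio>
    #include <random>
    #include <vector>

    struct ValueNet {
        int n_in, n_hidden;
        std::vector<double> w1, b1, w2;   // hidden weights/biases, output weights
        double b2 = 0.0;
        std::vector<double> h;            // last hidden activations (for the gradient)

        ValueNet(int in, int hidden, std::mt19937& rng)
            : n_in(in), n_hidden(hidden),
              w1(in * hidden), b1(hidden), w2(hidden), h(hidden) {
            std::uniform_real_distribution<double> u(-0.1, 0.1);
            for (auto& w : w1) w = u(rng);
            for (auto& w : w2) w = u(rng);
        }

        // Forward pass: V(s) = w2 . tanh(W1 s + b1) + b2
        double value(const std::vector<double>& s) {
            double v = b2;
            for (int j = 0; j < n_hidden; ++j) {
                double z = b1[j];
                for (int i = 0; i < n_in; ++i) z += w1[j * n_in + i] * s[i];
                h[j] = std::tanh(z);
                v += w2[j] * h[j];
            }
            return v;
        }

        // Gradient of V(s) w.r.t. all weights, flattened as (w1, b1, w2, b2);
        // assumes value(s) was just called on the same s.
        std::vector<double> grad(const std::vector<double>& s) {
            std::vector<double> g(w1.size() + b1.size() + w2.size() + 1);
            int k = 0;
            for (int j = 0; j < n_hidden; ++j) {
                double dh = w2[j] * (1.0 - h[j] * h[j]);  // through tanh
                for (int i = 0; i < n_in; ++i) g[k++] = dh * s[i];
            }
            for (int j = 0; j < n_hidden; ++j) g[k++] = w2[j] * (1.0 - h[j] * h[j]);
            for (int j = 0; j < n_hidden; ++j) g[k++] = h[j];
            g[k] = 1.0;
            return g;
        }

        // Apply w += alpha * delta * e, in the same flattened order as grad().
        void apply(const std::vector<double>& e, double alpha, double delta) {
            int k = 0;
            for (auto& w : w1) w += alpha * delta * e[k++];
            for (auto& b : b1) b += alpha * delta * e[k++];
            for (auto& w : w2) w += alpha * delta * e[k++];
            b2 += alpha * delta * e[k];
        }
    };

    // One TD(lambda) update with per-weight traces.
    void td_step(ValueNet& net, std::vector<double>& e,
                 const std::vector<double>& s, double r,
                 const std::vector<double>& s_next, bool terminal,
                 double alpha, double gamma, double lambda) {
        double v_next = terminal ? 0.0 : net.value(s_next);
        double v = net.value(s);                   // also caches hidden activations
        double delta = r + gamma * v_next - v;     // TD error
        std::vector<double> g = net.grad(s);
        for (size_t k = 0; k < e.size(); ++k)
            e[k] = gamma * lambda * e[k] + g[k];   // decay and accumulate the trace
        net.apply(e, alpha, delta);
    }

    int main() {
        std::mt19937 rng(0);
        ValueNet net(4, 8, rng);
        std::vector<double> trace(4 * 8 + 8 + 8 + 1, 0.0);
        // Single illustrative transition with made-up features and reward.
        std::vector<double> s{0.1, -0.2, 0.3, 0.0}, s_next{0.2, -0.1, 0.3, 0.1};
        td_step(net, trace, s, /*r=*/0.5, s_next, /*terminal=*/false,
                /*alpha=*/0.01, /*gamma=*/0.99, /*lambda=*/0.8);
        std::printf("V(s) after one update: %.4f\n", net.value(s));
    }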

