CMU CS 10701 - Notes


Machine Learning 10-701
Tom M. Mitchell, Machine Learning Department, Carnegie Mellon University
April 26, 2011

Today:
- Learning of control policies
- Markov Decision Processes
- Temporal difference learning
- Q learning

Readings:
- Mitchell, Chapter 13
- Kaelbling et al., "Reinforcement Learning: A Survey"

Thanks to Aarti Singh for several slides.


Reinforcement Learning

[Sutton and Barto, 1981; Samuel, 1957. Illustrative figures omitted from the preview.]


Reinforcement Learning: Backgammon (Tesauro, 1995)

- Learning task: choose moves at arbitrary board states
- Training signal: final win or loss
- Training: played 300,000 games against itself
- Algorithm: reinforcement learning + neural network
- Result: world-class Backgammon player


Outline

- Learning control strategies
  - Credit assignment and delayed reward
  - Discounted rewards
- Markov Decision Processes
  - Solving a known MDP
- Online learning of control strategies
  - When the next-state function is known: value function V*(s)
  - When the next-state function is unknown: learning Q(s, a)
- Role in modeling reward learning in animals


Markov Decision Process: Reinforcement Learning Setting

- Set of states S
- Set of actions A
- At each time t, the agent observes state s_t in S, then chooses action a_t in A
- It then receives reward r_t, and the state changes to s_{t+1}
- Markov assumption: P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, ...) = P(s_{t+1} | s_t, a_t)
- Also assume the reward is Markov: P(r_t | s_t, a_t, s_{t-1}, a_{t-1}, ...) = P(r_t | s_t, a_t)

The task: learn a policy π : S → A for choosing actions that maximizes the expected discounted sum of rewards

  E[r_0 + γ r_1 + γ^2 r_2 + ...],  where 0 ≤ γ < 1,

for every possible starting state s_0.

[Two figures contrasting HMMs / Markov processes with Markov Decision Processes are omitted from the preview.]


Reinforcement Learning Task for an Autonomous Agent

- Execute actions in the environment, observe the results, and learn a control policy π : S → A that maximizes E[r_t + γ r_{t+1} + γ^2 r_{t+2} + ...] from every state s in S
- Example: robot grid world with deterministic reward r(s, a)
- Yikes: the function to be learned is π : S → A, but the training examples are not of the form <s, a>. They are instead of the form <<s, a>, r>.


Value Function for each Policy

- Given a policy π : S → A, define

    V^π(s) ≡ E[r_t + γ r_{t+1} + γ^2 r_{t+2} + ... | s_t = s]

  assuming the action sequence is chosen according to π, starting at state s.
- Then we want the optimal policy π*, where

    π* = argmax_π V^π(s)   (for every possible starting state s)

- For any MDP such a policy exists.
- We'll abbreviate V^{π*}(s) as V*(s).
- Note: if we have V*(s) and P(s_{t+1} | s_t, a), we can compute π*(s).

Value Function: what are the V*(s) values?

[Grid-world figures showing the immediate rewards r(s, a) and the resulting state values V*(s) are omitted from the preview.]


Recursive definition for V*(s)

Assuming actions are chosen according to the optimal policy π*,

  V*(s) = E[r(s, π*(s))] + γ Σ_{s'} P(s' | s, π*(s)) V*(s')
        = max_a [ E[r(s, a)] + γ Σ_{s'} P(s' | s, a) V*(s') ]


Value Iteration for learning V* (assumes P(s_{t+1} | s_t, a) is known)

  Initialize V(s) arbitrarily
  Loop until policy good enough:
    Loop for s in S:
      Loop for a in A:
        Q(s, a) ← E[r(s, a)] + γ Σ_{s'} P(s' | s, a) V(s')
      V(s) ← max_a Q(s, a)

V(s) converges to V*(s). This is dynamic programming.


Value Iteration (continued)

- Interestingly, value iteration works even if we randomly traverse the environment instead of looping through each state and action methodically, provided we still visit each state infinitely often on an infinite run. For details see Bertsekas (1989).
- Implication: online learning as the agent randomly roams.
- If the maximum difference over states between two successive value function estimates is less than ε, then the value of the greedy policy differs from the optimal policy by no more than 2εγ / (1 − γ).
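
To make the value iteration loop above concrete, here is a minimal tabular sketch in Python. The tiny deterministic chain world (the states, next-state table delta, and reward table), the discount factor, and the stopping threshold are illustrative assumptions for this sketch, not part of the original lecture.

# Minimal tabular value iteration sketch (illustrative; the tiny chain-world
# MDP below is an assumption for demonstration, not the lecture's grid world).

GAMMA = 0.9        # discount factor
EPSILON = 1e-6     # stop when successive estimates differ by less than this

# States, actions, deterministic next-state function delta(s, a), reward r(s, a).
states = ["s0", "s1", "s2", "goal"]
actions = ["left", "right"]
delta = {
    ("s0", "right"): "s1", ("s0", "left"): "s0",
    ("s1", "right"): "s2", ("s1", "left"): "s0",
    ("s2", "right"): "goal", ("s2", "left"): "s1",
    ("goal", "right"): "goal", ("goal", "left"): "goal",
}
reward = {(s, a): 0.0 for s in states for a in actions}
reward[("s2", "right")] = 100.0   # reward for entering the absorbing goal state

V = {s: 0.0 for s in states}      # initialize V(s) arbitrarily (here, to 0)

while True:
    max_diff = 0.0
    for s in states:
        # Bellman backup: V(s) <- max_a [ r(s,a) + gamma * V(delta(s,a)) ]
        new_v = max(reward[(s, a)] + GAMMA * V[delta[(s, a)]] for a in actions)
        max_diff = max(max_diff, abs(new_v - V[s]))
        V[s] = new_v
    if max_diff < EPSILON:
        break

print(V)  # approaches V*: V(s2) ~ 100, V(s1) ~ 90, V(s0) ~ 81, V(goal) = 0

Because this example's next-state function delta(s, a) is deterministic, each backup uses a single successor state; with stochastic transitions the update would instead sum P(s' | s, a) V(s') over successors, as in the pseudocode above.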
So far we have been learning the optimal policy when we know P(s_{t+1} | s_t, a_t). What if we don't?


Q learning

- Define a new function, closely related to V*:

    Q(s, a) ≡ E[r(s, a)] + γ Σ_{s'} P(s' | s, a) V*(s')

- If the agent learns Q(s, a), it can choose the optimal action without knowing P(s_{t+1} | s_t, a):

    π*(s) = argmax_a Q(s, a),   V*(s) = max_a Q(s, a)

- And it can learn Q without knowing P(s_{t+1} | s_t, a).

[Grid-world figures showing the immediate rewards r(s, a), state values V*(s), and state-action values Q(s, a) are omitted from the preview.]


Bellman equation

Consider first the case where P(s' | s, a) is deterministic, with next-state function δ(s, a). Then

  Q(s, a) = r(s, a) + γ max_{a'} Q(δ(s, a), a')

which suggests the Q-learning training rule: after executing action a in state s and observing reward r and resulting state s', update the estimate Q̂ by

  Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')

[Several slides of worked Q-learning examples and the convergence argument for the deterministic case, which uses the general fact that |max_a f1(a) − max_a f2(a)| ≤ max_a |f1(a) − f2(a)|, are omitted from the preview.]


MDPs and RL: What You Should Know

- Learning to choose optimal actions A
  - from delayed reward
  - by learning evaluation functions like V(S), Q(S, A)

Key ideas:
- If the next-state function S_t × A_t → S_{t+1} is known, we can use dynamic programming to learn V(S); once it is learned, choose the action A_t that maximizes V(S_{t+1}).
- If the next-state function S_t × A_t → S_{t+1} is unknown, learn Q(S_t, A_t), which incorporates E[V(S_{t+1})]; to learn it, sample S_t × A_t → S_{t+1} in the actual world; once it is learned, choose the action A_t that maximizes Q(S_t, A_t).


MDPs and Reinforcement Learning: Further Issues

- What strategy for choosing actions will optimize
  - the learning rate? (explore uninvestigated states)
  - the obtained reward? (exploit what you know so far)
- Partially Observable Markov Decision Processes
  - the state is not fully observable
  - maintain a probability distribution over the possible states you're in
- Convergence guarantees with function approximators
  - our proof assumed a tabular representation for Q, V
  - some types of function approximators still converge, e.g. nearest neighbor (Gordon, 1999)
- Correspondence to human learning
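
As a companion to the deterministic training rule above, here is a minimal tabular Q-learning sketch in Python. The environment interface step(s, a), the epsilon-greedy exploration scheme, and the episode count are illustrative assumptions, not something specified in the notes; the learner only ever sees sampled rewards and next states, never the transition function itself.

import random
from collections import defaultdict

GAMMA = 0.9      # discount factor
EPSILON = 0.1    # probability of taking a random exploratory action

def q_learning(states, actions, step, start_state, goal_state, episodes=1000):
    """Tabular Q-learning for a deterministic MDP with an absorbing goal.

    step(s, a) is an assumed environment function returning (reward, next_state);
    the transition probabilities themselves are never needed.
    """
    Q = defaultdict(float)  # Q-hat, initialized to 0 for every (s, a)

    for _ in range(episodes):
        s = start_state
        while s != goal_state:
            # epsilon-greedy: mostly exploit the current estimate, sometimes explore
            if random.random() < EPSILON:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])

            r, s_next = step(s, a)  # observe reward and resulting state

            # Deterministic-case training rule:
            #   Q-hat(s, a) <- r + gamma * max_a' Q-hat(s', a')
            Q[(s, a)] = r + GAMMA * max(Q[(s_next, a2)] for a2 in actions)
            s = s_next
    return Q

Run against the small chain world from the earlier sketch (step = lambda s, a: (reward[(s, a)], delta[(s, a)])), the learned table approaches Q, and the greedy policy argmax_a Q[(s, a)] recovers π*, provided every state-action pair keeps being visited; the nondeterministic case would replace the direct assignment with a decaying learning rate.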

