Rutgers University CS 536 - Introduction to Machine Learning


INTRODUCTIONTOMachineLearningETHEM ALPAYDIN© The MIT Press, [email protected]://www.cmpe.boun.edu.tr/~ethem/i2mlLecture Slides forCHAPTER16:ReinforcementLearningLecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.0)3Introduction Game-playing: Sequence of moves to win a game Robot in a maze: Sequence of actions to find a goal Agent has a state in an environment, takes an action and sometimes receives reward and the state changes Credit-assignment Learn a policyLecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.0)4Single State: K-armed Bandit Among K levers, choose the one that pays bestQ(a): value of action aReward is raSet Q(a) = raChoose a*if Q(a*)=maxaQ(a) Rewards stochastic (keep an expected reward):Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.0)5Elements of RL (Markov Decision Processes) st: State of agent at time t at: Action taken at time t In st, action atis taken, clock ticks and reward rt+1is received and state changes to st+1 Next state prob: P (st+1| st, at ) Reward prob: p (rt+1| st, at ) Initial state(s), goal state(s) Episode (trial) of actions from initial state to goal (Sutton and Barto, 1998; Kaelbling et al., 1996)Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.0)6Policy and Cumulative Reward Policy, Value of a policy, Finite-horizon: Infinite horizon:Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.0)7Value of atin stBellman’s equationLecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.0)8Model-Based Learning Environment, P (st+1| st, at ), p (rt+1| st, at ), is known There is no need for exploration Can be solved using dynamic programming Solve for Optimal policyLecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.0)9Value IterationLecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.0)10Policy IterationLecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.0)11Temporal Difference Learning Environment, P (st+1| st, at ), p (rt+1| st, at ), is not known; model-free learning There is need for exploration to sample from P (st+1| st, at ) and p (rt+1| st, at ) Use the reward received in the next time step to update the value of current state (action) The temporal difference between the value of the current action and the value discounted from the next stateLecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.0)12Exploration Strategies ε-greedy: With pr ε,choose one action at random uniformly; and choose the best action with pr 1-ε Probabilistic: Move smoothly from exploration/exploitation. 
 Decrease ε AnnealingLecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.0)13Deterministic Rewards and Actions Deterministic: single possible reward and next stateused as an update rule (backup)Starting at zero, Q values increase, never decreaseLecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.0)14Consider the value of action marked by ‘*’:If path A is seen first, Q(*)=0.9*max(0,81)=73Then B is seen, Q(*)=0.9*max(100,81)=90Or,If path B is seen first, Q(*)=0.9*max(100,0)=90Then A is seen, Q(*)=0.9*max(100,81)=90Q values increase but never decreaseγ=0.9Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.0)15Nondeterministic Rewards and Actions When next states and rewards are nondeterministic (there is an opponent or randomness in the environment), we keep averages (expected values) instead as assignments Q-learning (Watkins and Dayan, 1992): Off-policy vs on-policy (Sarsa) Learning V (TD-learning: Sutton, 1988)backupLecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.0)16Q-learningLecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.0)17SarsaLecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.0)18Eligibility Traces Keep a record of previously visited states (actions)Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.0)19Sarsa (λ)Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.0)20Generalization Tabular: Q (s , a) or V (s) stored in a table Regressor: Use a learner to estimate Q (s , a) or V (s) EligibilityLecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.0)21Partially Observable States The agent does not know its state but receives an observation which can be used to infer a belief about states Partially observable

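Sarsa(λ) with eligibility traces needs only a small change to such a loop: keep a trace matrix e and update every state-action pair in proportion to its trace. This is a sketch under the same assumptions; the env argument is expected to expose the same toy interface as ChainEnv above, and λ = 0.8 and the other constants are arbitrary illustrative choices.

```python
import numpy as np

def sarsa_lambda(env, n_episodes=500, gamma=0.9, eta=0.1, lam=0.8, epsilon=0.1, seed=0):
    """On-policy Sarsa(lambda) with accumulating eligibility traces.
    `env` is assumed to expose n_states, n_actions, reset() -> s, and
    step(a) -> (s_next, reward, done), like the toy ChainEnv in the previous sketch."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.n_states, env.n_actions))

    def eps_greedy(s):
        return rng.integers(env.n_actions) if rng.random() < epsilon else int(Q[s].argmax())

    for _ in range(n_episodes):
        e = np.zeros_like(Q)              # eligibility traces, reset every episode
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)
            # TD error uses the action actually taken next (on-policy)
            target = r if done else r + gamma * Q[s_next, a_next]
            delta = target - Q[s, a]
            e[s, a] += 1.0                # mark the current pair as eligible
            Q += eta * delta * e          # update all pairs in proportion to their traces
            e *= gamma * lam              # traces decay by gamma * lambda each step
            s, a = s_next, a_next
    return Q

# Usage (with the ChainEnv from the Q-learning sketch): Q = sarsa_lambda(ChainEnv())
```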

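For the Generalization slide, Q(s, a) can be estimated by a regressor instead of a table. The sketch below uses a linear approximator Q(s, a) = w·φ(s, a) with a semi-gradient Q-learning update; the one-hot features (which make it behave like the tabular case) and the toy environment interface are again assumptions chosen only for illustration.

```python
import numpy as np

def one_hot(s, a, n_states, n_actions):
    """Feature vector phi(s, a); one-hot features stand in for any
    hand-crafted or learned features (a purely illustrative choice)."""
    x = np.zeros(n_states * n_actions)
    x[s * n_actions + a] = 1.0
    return x

def q_learning_approx(env, n_episodes=500, gamma=0.9, eta=0.1, epsilon=0.1, seed=0):
    """Q-learning where Q(s, a) = w . phi(s, a) is a linear regressor instead of a table.
    `env` is assumed to have the same toy interface as ChainEnv above."""
    rng = np.random.default_rng(seed)
    w = np.zeros(env.n_states * env.n_actions)

    def q(s, a):
        return w @ one_hot(s, a, env.n_states, env.n_actions)

    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            qs = [q(s, a) for a in range(env.n_actions)]
            a = rng.integers(env.n_actions) if rng.random() < epsilon else int(np.argmax(qs))
            s_next, r, done = env.step(a)
            target = r if done else r + gamma * max(q(s_next, b) for b in range(env.n_actions))
            # semi-gradient update: w <- w + eta * (target - Q(s,a)) * grad_w Q(s,a)
            w += eta * (target - q(s, a)) * one_hot(s, a, env.n_states, env.n_actions)
            s = s_next
    return w

# Usage (with the ChainEnv from the Q-learning sketch): w = q_learning_approx(ChainEnv())
```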