Rutgers University CS 536 - INTRODUCTION TO Machine Learning - D2466737

Home> Schools> Rutgers University- The State University of New Jersey> (CS) > CS 536> INTRODUCTION TO Machine Learning

Rutgers University CS 536 - INTRODUCTION TO Machine Learning

School name Rutgers University- The State University of New Jersey

Pages 21

Download Save

Unformatted text preview:

INTRODUCTIONTOMachineLearningETHEM ALPAYDIN© The MIT Press, [email protected]://www.cmpe.boun.edu.tr/~ethem/i2mlLecture Slides forCHAPTER16:ReinforcementLearningLecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.0)3Introduction Game-playing: Sequence of moves to win a game Robot in a maze: Sequence of actions to find a goal Agent has a state in an environment, takes an action and sometimes receives reward and the state changes Credit-assignment Learn a policyLecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.0)4Single State: K-armed Bandit Among K levers, choose the one that pays bestQ(a): value of action aReward is raSet Q(a) = raChoose a*if Q(a*)=maxaQ(a) Rewards stochastic (keep an expected reward):Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.0)5Elements of RL (Markov Decision Processes) st: State of agent at time t at: Action taken at time t In st, action atis taken, clock ticks and reward rt+1is received and state changes to st+1 Next state prob: P (st+1| st, at ) Reward prob: p (rt+1| st, at ) Initial state(s), goal state(s) Episode (trial) of actions from initial state to goal (Sutton and Barto, 1998; Kaelbling et al., 1996)Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.0)6Policy and Cumulative Reward Policy, Value of a policy, Finite-horizon: Infinite horizon:Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.0)7Value of atin stBellman’s equationLecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.0)8Model-Based Learning Environment, P (st+1| st, at ), p (rt+1| st, at ), is known There is no need for exploration Can be solved using dynamic programming Solve for Optimal policyLecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.0)9Value IterationLecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.0)10Policy IterationLecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.0)11Temporal Difference Learning Environment, P (st+1| st, at ), p (rt+1| st, at ), is not known; model-free learning There is need for exploration to sample from P (st+1| st, at ) and p (rt+1| st, at ) Use the reward received in the next time step to update the value of current state (action) The temporal difference between the value of the current action and the value discounted from the next stateLecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.0)12Exploration Strategies ε-greedy: With pr ε,choose one action at random uniformly; and choose the best action with pr 1-ε Probabilistic: Move smoothly from exploration/exploitation.  Decrease ε AnnealingLecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.0)13Deterministic Rewards and Actions Deterministic: single possible reward and next stateused as an update rule (backup)Starting at zero, Q values increase, never decreaseLecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.0)14Consider the value of action marked by ‘*’:If path A is seen first, Q(*)=0.9*max(0,81)=73Then B is seen, Q(*)=0.9*max(100,81)=90Or,If path B is seen first, Q(*)=0.9*max(100,0)=90Then A is seen, Q(*)=0.9*max(100,81)=90Q values increase but never decreaseγ=0.9Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.0)15Nondeterministic Rewards and Actions When next states and rewards are nondeterministic (there is an opponent or randomness in the environment), we keep averages (expected values) instead as assignments Q-learning (Watkins and Dayan, 1992): Off-policy vs on-policy (Sarsa) Learning V (TD-learning: Sutton, 1988)backupLecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.0)16Q-learningLecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.0)17SarsaLecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.0)18Eligibility Traces Keep a record of previously visited states (actions)Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.0)19Sarsa (λ)Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.0)20Generalization Tabular: Q (s , a) or V (s) stored in a table Regressor: Use a learner to estimate Q (s , a) or V (s) EligibilityLecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.0)21Partially Observable States The agent does not know its state but receives an observation which can be used to infer a belief about states Partially observable

View Full Document


School:
Email:
New Password:
Confirm Password:

Rutgers University CS 536 - INTRODUCTION TO Machine Learning

Sign up for free to view:

Please select your school