Artificial Intelligence: Representation and Problem Solving
15-381, May 3, 2007
Reinforcement Learning 2
Michael S. Lewicki, Carnegie Mellon

Announcements again
• Don't forget to fill out FCEs: www.cmu.edu/fce
• Review session for the final: Tuesday, May 8, 6:00 - 8:00 pm, Wean 4623

Reinforcement learning so far
• The goal is the same as in MDPs: discover the policy that maximizes reward.
• But, unlike in MDPs, we don't know the state space (world), the transition model, or the reward function.
  - We only know what state we are in and the available actions.
• We need to explore the state space to discover the best policy.
• Model estimation:
  - run a set of trials (each runs until a terminal state is reached)
  - estimate the rewards and the transition probabilities using the counts across trials
• Then use the estimates in MDP policy estimation techniques.
[Grid-world figure: 4x3 maze with unknown transition model T(s, a, s') = ? and reward function R(s) = ?]

Evaluating the value function from estimates
• Problem: how long do we wait before estimating V(s) (in order to estimate policies)?
• If we want to adapt our policy, or use it for more efficient exploration, we need an up-to-date estimate.
• Estimating V(s) is expensive if done at every iteration.
• Ideas for improvement:
  - "one backup": update the value estimate only for state s_i (i.e., one state back) instead of for all states
  - prioritized sweeping: use a priority queue to update the states with the largest potential for change

Direct estimation of the value function
• Note that we could also use the trials to estimate the value function directly:
  - at the end of each trial, calculate the observed future reward for each state
  - estimate V(s) for each state by averaging over trials, and over multiple visits within a trial
• This is also called Monte Carlo estimation, since it samples randomly using the policy π(s).
• The utility of each state equals its own reward plus the expected utility of its successor states:

  V(s) = R(s) + \gamma \sum_{s'} T(s, \pi(s), s') V(s')

• But this is less efficient than the other methods, and learning occurs only at the end of each trial.

Temporal difference learning
• After each trial we have noisy estimates of the true values V(s).
• Consider the transition (3,2) → (3,3) and the current (noisy) estimates of the two states, V(3,2) = 0.85 and V(3,3) = 0.94.
• What value do we expect for V(3,2)?

  V(3,2) = -0.04 + V(3,3) = -0.04 + 0.94 = 0.90

• This disagrees with the current estimate V(3,2) = 0.85: the two estimates are inconsistent (V(3,2) is too low, or V(3,3) is too high).
• Why don't we adjust V(3,2) to make it more consistent with the new observation?

  True values of V(s) in the 4x3 grid world (row 3 is the top row):
    row 3:  0.812   0.868   0.912    +1
    row 2:  0.762    --     0.660    -1
    row 1:  0.705   0.655   0.611   0.388

• For any successor state s', we can adjust V(s) so that it agrees better with the constraint equations, i.e., is closer to the expected future reward.
• Note: this does not involve the transitions T(s, a, s'). We only update the values V(s).
• It is not quite as accurate as the earlier methods, but it is much simpler and requires much less computation.
• The new estimate is a weighted sum of the current value and the expected future reward:

  \hat{V}(s) \leftarrow (1 - \alpha) \hat{V}(s) + \alpha (R(s) + \gamma \hat{V}(s'))
             = \hat{V}(s) + \alpha (R(s) + \gamma \hat{V}(s') - \hat{V}(s))

  i.e., the current value plus α times the difference between the current value and the new estimate obtained after transitioning to s'.
• Why "temporal"? Written in terms of a state sequence, this is the TD (temporal difference) equation:

  V(s_t) \leftarrow V(s_t) + \alpha [R_t + \gamma V(s_{t+1}) - V(s_t)]
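As a quick illustration of the update just derived, here is a minimal Python sketch (not from the lecture) that applies a single TD step to the (3,2) → (3,3) example above. The function name td_update, the learning rate α = 0.5, and the discount γ = 1 (which the slide's arithmetic implies) are assumptions made only for this illustration.

```python
# Minimal sketch of one TD update, using the slide's numbers.
# Assumptions: gamma = 1.0 and a hypothetical learning rate alpha = 0.5.

def td_update(v_s, v_next, reward, alpha, gamma):
    """Return the updated estimate V(s) after observing a transition to s'."""
    target = reward + gamma * v_next        # R(s) + gamma * V(s')
    return v_s + alpha * (target - v_s)     # move V(s) toward the target

v_32, v_33 = 0.85, 0.94                     # current noisy estimates
reward = -0.04                              # per-step reward in the grid world

new_v_32 = td_update(v_32, v_33, reward, alpha=0.5, gamma=1.0)
print(new_v_32)   # 0.875: V(3,2) moves halfway toward the target 0.90
```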
TD algorithm
1. initialize V(s) arbitrarily
2. for each trial:
3.   initialize s
4.   for each step of the trial:
5.     observe reward R(s); take action a = \pi(s); observe the next state s'
6.     V(s) \leftarrow V(s) + \alpha [R(s) + \gamma V(s') - V(s)]
7.     s \leftarrow s'
8.   until s is terminal
(A runnable sketch of this loop, with a decaying learning rate, is given at the end of these notes.)

Example: driving home (from Sutton and Barto)

  State                          Elapsed time (min)   Predicted time to go   Predicted total time
  leaving office, Friday at 6            0                    30                     30
  reach car, raining                     5                    35                     40
  exiting highway                       20                    15                     35
  secondary road, behind truck          30                    10                     40
  entering home street                  40                     3                     43
  arrive home                           43                     0                     43

[Figure: predicted total travel time at each situation, with each prediction shifted toward the actual outcome of 43 minutes]

• Rewards are the elapsed times.
• You update your predicted total time at each state.
• TD learning shifts each estimate toward the next estimate.
• Each error is proportional to the temporal differences in the predictions.

Compare TD to the direct method
• Compare the TD update equation (assuming policy π(s)):

  V(s) \leftarrow V(s) + \alpha (R(s) + \gamma V(s') - V(s))
       = (1 - \alpha) V(s) + \alpha (R(s) + \gamma V(s'))
       = R(s) + \gamma V(s')    for \alpha \to 1

• to the direct method (assuming the same policy):

  V(s) = R(s) + \gamma \sum_{s'} T(s, \pi(s), s') V(s')

• The TD update can be viewed as an approximation that updates toward the single observed successor s', whereas the direct method weights all possible successors according to their probability.
• Over a large number of transitions the two agree, since the successor states occur in proportion to their transition probabilities.
[Diagram: a linear sequence of states, with s followed by its observed successor s']

The temporal difference equation

  V(s) \leftarrow V(s) + \alpha (R(s) + \gamma V(s') - V(s))

• How do we choose the learning rate α?
• It defines a weighted sum, so 0 < α < 1.
• Too small: V(s) converges slowly, biased toward the current estimate.
• Too large: V(s) changes too quickly, biased toward the new estimate.
• Decaying learning rate strategy: start large while we are unsure, then reduce as we become more confident in our estimates, e.g. α_t = 1/t.
• Theorem: for V(s) to converge to the correct value, the sequence α_t must satisfy

  \sum_{t=1}^{\infty} \alpha_t = \infty    and    \sum_{t=1}^{\infty} \alpha_t^2 < \infty

What about policies?
• Previously, we estimated the policy using MDP approaches, e.g. via value iteration.
• But this involves estimating the transition model …
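To tie the TD algorithm slide and the decaying-learning-rate discussion together, here is a minimal runnable sketch (not the lecture's code) of tabular TD(0) prediction. It uses a per-state learning rate α = 1/N(s), a per-state version of the α_t = 1/t schedule above, which satisfies the two convergence conditions. The five-state random-walk environment and the names run_trial and td0 are hypothetical, chosen only for illustration.

```python
# Sketch of tabular TD(0) with a decaying per-state learning rate,
# on a hypothetical 5-state random walk (states 0 and 4 are terminal).

import random
from collections import defaultdict

TERMINALS = {0: -1.0, 4: +1.0}   # terminal states and the rewards for entering them

def run_trial(start=2):
    """Generate one trial as a list of (s, r, s_next) steps under a fixed random policy."""
    s, steps = start, []
    while s not in TERMINALS:
        s_next = s + random.choice((-1, 1))   # move left or right at random
        r = TERMINALS.get(s_next, 0.0)        # reward received on this step
        steps.append((s, r, s_next))
        s = s_next
    return steps

def td0(num_trials=5000, gamma=1.0):
    """Tabular TD(0) prediction with learning rate alpha = 1/N(s)."""
    V = defaultdict(float)   # arbitrary initialization: V(s) = 0; terminals stay 0
    N = defaultdict(int)     # visit counts, used for the decaying learning rate
    for _ in range(num_trials):
        for s, r, s_next in run_trial():
            N[s] += 1
            alpha = 1.0 / N[s]                               # alpha_t = 1/t for this state
            V[s] += alpha * (r + gamma * V[s_next] - V[s])   # the TD update
    return V

print(td0())   # V(1), V(2), V(3) approach -0.5, 0.0, +0.5 for this random walk
```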

