Artificial Intelligence: Representation and Problem Solving
15-381, May 3, 2007
Reinforcement Learning 2
Michael S. Lewicki, Carnegie Mellon

Announcements again
• Don't forget to fill out FCEs: www.cmu.edu/fce
• Review session for the final: Tuesday, May 8, 6:00 - 8:00 pm, Wean 4623

Reinforcement learning so far
• The goal is the same as in MDPs: discover the policy that maximizes reward.
• But, unlike in MDPs, we don't know the state space (world), the transition model, or the reward function.
  - We only know what state we are in and the available actions.
• We need to explore the state space to discover the best policy.
• Model estimation:
  - run a set of trials (each runs until a terminal state is reached)
  - estimate the rewards and the transition probabilities using the counts across trials
• Then use the estimates in MDP policy estimation techniques.
[Grid-world figure: 4x3 maze with unknown transition model T(s, a, s') = ? and reward function R(s) = ?]

Evaluating the value function from estimates
• Problem: how long do we wait before estimating V(s) (in order to estimate policies)?
• If we want to adapt our policy, or use it for more efficient exploration, we need an up-to-date estimate.
• Estimating V(s) is expensive if done at every iteration.
• Ideas for improvement:
  - "one backup": update the value estimate only for state s_i (i.e., one state back) instead of for all states
  - prioritized sweeping: use a priority queue to update the states with the largest potential for change

Direct estimation of the value function
• Note that we could also use the trials to estimate the value function directly:
  - at the end of each trial, calculate the observed future reward for each state
  - estimate V(s) for each state by averaging over trials, and over multiple visits within a trial
• This is also called Monte Carlo estimation, since it samples randomly using the policy π(s).
• The utility of each state equals its own reward plus the expected utility of its successor states:

  V(s) = R(s) + \gamma \sum_{s'} T(s, \pi(s), s') V(s')

• But this is less efficient than the other methods, and learning occurs only at the end of each trial.

Temporal difference learning
• After each trial we have noisy estimates of the true values V(s).
• Consider the transition (3,2) → (3,3) and the current (noisy) estimates of the two states, V(3,2) = 0.85 and V(3,3) = 0.94.
• What value do we expect for V(3,2)?

  V(3,2) = -0.04 + V(3,3) = -0.04 + 0.94 = 0.90

• This disagrees with the current estimate V(3,2) = 0.85: the two estimates are inconsistent (V(3,2) is too low, or V(3,3) is too high).
• Why don't we adjust V(3,2) to make it more consistent with the new observation?

  True values of V(s) in the 4x3 grid world (row 3 is the top row):
    row 3:  0.812   0.868   0.912    +1
    row 2:  0.762    --     0.660    -1
    row 1:  0.705   0.655   0.611   0.388

• For any successor state s', we can adjust V(s) so that it agrees better with the constraint equations, i.e., is closer to the expected future reward.
• Note: this does not involve the transitions T(s, a, s'). We only update the values V(s).
• It is not quite as accurate as the earlier methods, but it is much simpler and requires much less computation.
• The new estimate is a weighted sum of the current value and the expected future reward:

  \hat{V}(s) \leftarrow (1 - \alpha) \hat{V}(s) + \alpha (R(s) + \gamma \hat{V}(s'))
             = \hat{V}(s) + \alpha (R(s) + \gamma \hat{V}(s') - \hat{V}(s))

  i.e., the current value plus α times the difference between the current value and the new estimate obtained after transitioning to s'.
• Why "temporal"? Written in terms of a state sequence, this is the TD (temporal difference) equation:

  V(s_t) \leftarrow V(s_t) + \alpha [R_t + \gamma V(s_{t+1}) - V(s_t)]
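As a quick illustration of the update just derived, here is a minimal Python sketch (not from the lecture) that applies a single TD step to the (3,2) → (3,3) example above. The function name td_update, the learning rate α = 0.5, and the discount γ = 1 (which the slide's arithmetic implies) are assumptions made only for this illustration.

```python
# Minimal sketch of one TD update, using the slide's numbers.
# Assumptions: gamma = 1.0 and a hypothetical learning rate alpha = 0.5.

def td_update(v_s, v_next, reward, alpha, gamma):
    """Return the updated estimate V(s) after observing a transition to s'."""
    target = reward + gamma * v_next        # R(s) + gamma * V(s')
    return v_s + alpha * (target - v_s)     # move V(s) toward the target

v_32, v_33 = 0.85, 0.94                     # current noisy estimates
reward = -0.04                              # per-step reward in the grid world

new_v_32 = td_update(v_32, v_33, reward, alpha=0.5, gamma=1.0)
print(new_v_32)   # 0.875: V(3,2) moves halfway toward the target 0.90
```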
TD algorithm
1. initialize V(s) arbitrarily
2. for each trial:
3.   initialize s
4.   for each step of the trial:
5.     observe reward R(s); take action a = \pi(s); observe the next state s'
6.     V(s) \leftarrow V(s) + \alpha [R(s) + \gamma V(s') - V(s)]
7.     s \leftarrow s'
8.   until s is terminal
(A runnable sketch of this loop, with a decaying learning rate, is given at the end of these notes.)

Example: driving home (from Sutton and Barto)

  State                          Elapsed time (min)   Predicted time to go   Predicted total time
  leaving office, Friday at 6            0                    30                     30
  reach car, raining                     5                    35                     40
  exiting highway                       20                    15                     35
  secondary road, behind truck          30                    10                     40
  entering home street                  40                     3                     43
  arrive home                           43                     0                     43

[Figure: predicted total travel time at each situation, with each prediction shifted toward the actual outcome of 43 minutes]

• Rewards are the elapsed times.
• You update your predicted total time at each state.
• TD learning shifts each estimate toward the next estimate.
• Each error is proportional to the temporal differences in the predictions.

Compare TD to the direct method
• Compare the TD update equation (assuming policy π(s)):

  V(s) \leftarrow V(s) + \alpha (R(s) + \gamma V(s') - V(s))
       = (1 - \alpha) V(s) + \alpha (R(s) + \gamma V(s'))
       = R(s) + \gamma V(s')    for \alpha \to 1

• to the direct method (assuming the same policy):

  V(s) = R(s) + \gamma \sum_{s'} T(s, \pi(s), s') V(s')

• The TD update can be viewed as an approximation that updates toward the single observed successor s', whereas the direct method weights all possible successors according to their probability.
• Over a large number of transitions the two agree, since the successor states occur in proportion to their transition probabilities.
[Diagram: a linear sequence of states, with s followed by its observed successor s']

The temporal difference equation

  V(s) \leftarrow V(s) + \alpha (R(s) + \gamma V(s') - V(s))

• How do we choose the learning rate α?
• It defines a weighted sum, so 0 < α < 1.
• Too small: V(s) converges slowly, biased toward the current estimate.
• Too large: V(s) changes too quickly, biased toward the new estimate.
• Decaying learning rate strategy: start large while we are unsure, then reduce as we become more confident in our estimates, e.g. α_t = 1/t.
• Theorem: for V(s) to converge to the correct value, the sequence α_t must satisfy

  \sum_{t=1}^{\infty} \alpha_t = \infty    and    \sum_{t=1}^{\infty} \alpha_t^2 < \infty

What about policies?
• Previously, we estimated the policy using MDP approaches, e.g. via value iteration.
• But this involves estimating the transition model …
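To tie the TD algorithm slide and the decaying-learning-rate discussion together, here is a minimal runnable sketch (not the lecture's code) of tabular TD(0) prediction. It uses a per-state learning rate α = 1/N(s), a per-state version of the α_t = 1/t schedule above, which satisfies the two convergence conditions. The five-state random-walk environment and the names run_trial and td0 are hypothetical, chosen only for illustration.

```python
# Sketch of tabular TD(0) with a decaying per-state learning rate,
# on a hypothetical 5-state random walk (states 0 and 4 are terminal).

import random
from collections import defaultdict

TERMINALS = {0: -1.0, 4: +1.0}   # terminal states and the rewards for entering them

def run_trial(start=2):
    """Generate one trial as a list of (s, r, s_next) steps under a fixed random policy."""
    s, steps = start, []
    while s not in TERMINALS:
        s_next = s + random.choice((-1, 1))   # move left or right at random
        r = TERMINALS.get(s_next, 0.0)        # reward received on this step
        steps.append((s, r, s_next))
        s = s_next
    return steps

def td0(num_trials=5000, gamma=1.0):
    """Tabular TD(0) prediction with learning rate alpha = 1/N(s)."""
    V = defaultdict(float)   # arbitrary initialization: V(s) = 0; terminals stay 0
    N = defaultdict(int)     # visit counts, used for the decaying learning rate
    for _ in range(num_trials):
        for s, r, s_next in run_trial():
            N[s] += 1
            alpha = 1.0 / N[s]                               # alpha_t = 1/t for this state
            V[s] += alpha * (r + gamma * V[s_next] - V[s])   # the TD update
    return V

print(td0())   # V(1), V(2), V(3) approach -0.5, 0.0, +0.5 for this random walk
```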

