Rutgers University CS 536 - Chapter 13: Reinforcement Learning

Chapter 13: Reinforcement Learning
CS 536: Machine Learning
Littman (Wu, TA)

Administration
Midterms due

Reinforcement Learning
[Read Chapter 13]
• [Exercises 13.1, 13.2, 13.4]
• Control learning
• Control policies that choose optimal actions
• Q learning
• Convergence

Control Learning
Consider learning to choose actions, e.g.,
• Robot learning to dock on battery charger
• Learning to choose actions to optimize factory output
• Learning to play Backgammon

Problem Characteristics
Note several problem characteristics:
• Delayed reward
• Opportunity for active exploration
• Possibility that the state is only partially observable
• Possible need to learn multiple tasks with the same sensors/effectors

One Example: TD-Gammon [Tesauro, 1995]
Learn to play Backgammon.
Immediate reward:
• +100 if win
• -100 if lose
• 0 for all other states
Trained by playing 1.5 million games against itself.
Now approximately equal to the best human player.

The RL Problem
Goal: Learn to choose actions that maximize
  r_0 + γ r_1 + γ^2 r_2 + ...,  where 0 ≤ γ < 1

Markov Decision Processes
Assume
• a finite set of states S and a set of actions A
• at each discrete time step the agent observes state s_t ∈ S and chooses action a_t ∈ A
• it then receives immediate reward r_t, and the state changes to s_{t+1}
• Markov assumption: r_t = r(s_t, a_t) and s_{t+1} = δ(s_t, a_t) depend only on the current state and action
  – δ and r may be nondeterministic
  – δ and r are not necessarily known to the agent

Agent's Learning Task
Execute actions in the environment, observe the results, and
• learn an action policy π: S → A that maximizes
  E[r_t + γ r_{t+1} + γ^2 r_{t+2} + …]
  from any starting state in S
• here 0 ≤ γ < 1 is the discount factor for future rewards

Different Learning Problem
Note something new:
• The target function is π: S → A
• but we have no training examples of the form <s, a>
• training examples are of the form <<s, a>, r>

Value Function
To begin, consider deterministic worlds...
For each possible policy π the agent might adopt, we can define an evaluation function over states
  V^π(s) ≡ r_t + γ r_{t+1} + γ^2 r_{t+2} + …
         ≡ Σ_{i=0}^∞ γ^i r_{t+i}
where r_t, r_{t+1}, … are generated by following policy π starting at state s.
Restated, the task is to learn the optimal policy π*:
  π* ≡ argmax_π V^π(s), (∀s)

Example MDP
[figure of an example MDP; not reproduced in the text preview]

What to Learn
We might try to have the agent learn the evaluation function V^{π*} (which we write as V*).
It could then do a one-step lookahead search to choose the best action from any state s, because
  π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]
A problem:
• This works well if the agent knows δ: S×A → S and r: S×A → ℝ
• But when it doesn't, it can't choose actions this way

Q Function
Define a new function very similar to V*:
  Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))
If the agent learns Q, it can choose the optimal action even without knowing δ!
  π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]
  π*(s) = argmax_a Q(s, a)
Q is the evaluation function the agent will learn.

Training Rule to Learn Q
Note that Q and V* are closely related:
  V*(s) = max_{a'} Q(s, a')
which allows us to write Q recursively as
  Q(s_t, a_t) = r(s_t, a_t) + γ V*(δ(s_t, a_t))
              = r(s_t, a_t) + γ max_{a'} Q(s_{t+1}, a')
Nice! Let Q̂ denote the learner's current approximation to Q. Use the training rule
  Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
where s' is the state resulting from applying action a in state s.

Q Learning in Deterministic Case
For each s, a initialize the table entry Q̂(s, a) ← 0
Observe the current state s
Do forever:
• Select an action a and execute it
• Receive immediate reward r
• Observe the new state s'
• Update the table entry for Q̂(s, a) via
  Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
• s ← s'
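For concreteness, here is a minimal Python sketch of the deterministic Q-learning loop above. The environment interface (a `reset()` method and a `step(a)` method returning the next state, reward, and a done flag), the episodic structure, and the ε-greedy action choice are assumptions added for the sketch; the slide itself only says to select an action and execute it.

```python
import random
from collections import defaultdict

def q_learning(env, actions, gamma=0.9, episodes=1000, epsilon=0.1):
    """Tabular Q-learning for a deterministic world, as on the slide above."""
    Q = defaultdict(float)              # Q-hat table; entries start at 0

    for _ in range(episodes):
        s = env.reset()                 # observe the current state s (assumed API)
        done = False
        while not done:
            # Select an action a and execute it (epsilon-greedy choice is an
            # added assumption; the slide just says "select an action").
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])

            s_next, r, done = env.step(a)   # immediate reward r, new state s'

            # Training rule: Q-hat(s, a) <- r + gamma * max_a' Q-hat(s', a')
            Q[(s, a)] = r + gamma * max(Q[(s_next, act)] for act in actions)

            s = s_next                  # s <- s'
    return Q
```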
Updating Q̂
  Q̂(s_1, a_right) ← r + γ max_{a'} Q̂(s_2, a')
                   ← 0 + 0.9 max{63, 81, 100} = 90
Notice that if rewards are non-negative, then
  (∀ s, a, n)  Q̂_{n+1}(s, a) ≥ Q̂_n(s, a)
and
  (∀ s, a, n)  0 ≤ Q̂_n(s, a) ≤ Q(s, a)

Convergence Proof
Q̂ converges to Q. Consider the case of a deterministic world where each <s, a> is visited infinitely often.
Proof: Define a full interval to be an interval during which each <s, a> is visited. During each full interval the largest error in the Q̂ table is reduced by a factor of γ.
Let Q̂_n be the table after n updates, and Δ_n be the maximum error in Q̂_n; that is,
  Δ_n = max_{s,a} |Q̂_n(s, a) − Q(s, a)|

Proof Continued
For a table entry Q̂_n(s, a) updated on iteration n+1, the error in the revised estimate Q̂_{n+1}(s, a) is
  |Q̂_{n+1}(s, a) − Q(s, a)|
    = |(r + γ max_{a'} Q̂_n(s', a')) − (r + γ max_{a'} Q(s', a'))|
    = γ |max_{a'} Q̂_n(s', a') − max_{a'} Q(s', a')|
    ≤ γ max_{a'} |Q̂_n(s', a') − Q(s', a')|
    ≤ γ max_{a', s''} |Q̂_n(s'', a') − Q(s'', a')|
so
  |Q̂_{n+1}(s, a) − Q(s, a)| ≤ γ Δ_n
Note that we used the fact that
  |max_a f_1(a) − max_a f_2(a)| ≤ max_a |f_1(a) − f_2(a)|

Nondeterministic Case
What if the reward and the next state are nondeterministic?
We redefine V and Q by taking expected values:
  V^π(s) ≡ E[r_t + γ r_{t+1} + γ^2 r_{t+2} + …]
         ≡ E[Σ_{i=0}^∞ γ^i r_{t+i}]
  Q(s, a) ≡ E[r(s, a) + γ V*(δ(s, a))]

Nondeterministic Case (continued)
Q learning generalizes to nondeterministic worlds.
Alter the training rule to
  Q̂_n(s, a) ← (1 − α_n) Q̂_{n−1}(s, a) + α_n [r + γ max_{a'} Q̂_{n−1}(s', a')]
where
  α_n = 1 / (1 + visits_n(s, a))
One can still prove convergence of Q̂ to Q [Watkins and Dayan, 1992]. (A code sketch of this update appears after the final slide.)

Temporal Difference Learning
Q learning: reduce the discrepancy between successive Q estimates.
One-step time difference:
  Q^(1)(s_t, a_t) ≡ r_t + γ max_a Q̂(s_{t+1}, a)
Why not 2 steps?
  Q^(2)(s_t, a_t) ≡ r_t + γ r_{t+1} + γ^2 max_a Q̂(s_{t+2}, a)
Or n?
  Q^(n)(s_t, a_t) ≡ r_t + … + γ^{n−1} r_{t+n−1} + γ^n max_a Q̂(s_{t+n}, a)
Blend all of these:
  Q^λ(s_t, a_t) ≡ (1 − λ) [Q^(1)(s_t, a_t) + λ Q^(2)(s_t, a_t) + …]

Temporal Difference Learning (continued)
  Q^λ(s_t, a_t) ≡ (1 − λ) [Q^(1)(s_t, a_t) + λ Q^(2)(s_t, a_t) + …]
Equivalent expression:
  Q^λ(s_t, a_t) ≡ r_t + γ [(1 − λ) max_a Q̂(s_{t+1}, a) + λ Q^λ(s_{t+1}, a_{t+1})]
The TD(λ) algorithm uses the above training rule.
• Sometimes converges faster than Q learning (not understood in the control case)
• Converges for learning V for any 0 ≤ λ ≤ 1 (Dayan, 1992)
• Tesauro's TD-Gammon uses this algorithm to estimate the value function via self-play

Subtleties & Ongoing Research
• Replace the Q̂ table with a neural net or other generalizer
• Handle the case where the state is only partially observable
• Design optimal exploration strategies
• Extend to continuous actions and states
• Learn and use δ̂: S×A → S
• Relationship to dynamic programming and heuristic search
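As referenced on the Nondeterministic Case slide, here is a small Python sketch of a single Q̂ update with the decaying learning rate α_n = 1/(1 + visits_n(s, a)). The table and visit-count data structures are illustrative choices, and counting the current visit before computing α_n is an assumption of the sketch, not a detail fixed by the slide.

```python
from collections import defaultdict

Q = defaultdict(float)       # Q-hat table; entries default to 0
visits = defaultdict(int)    # visit counts for each (s, a) pair

def q_update_nondeterministic(s, a, r, s_next, actions, gamma=0.9):
    """One update with the decaying learning rate alpha_n = 1 / (1 + visits_n(s, a))."""
    # Count this visit first (whether visits_n includes the current visit is a
    # detail the slide leaves open; this is one reasonable choice).
    visits[(s, a)] += 1
    alpha = 1.0 / (1.0 + visits[(s, a)])

    # Q-hat_n(s,a) <- (1 - alpha_n) Q-hat_{n-1}(s,a)
    #                 + alpha_n [r + gamma * max_a' Q-hat_{n-1}(s', a')]
    target = r + gamma * max(Q[(s_next, act)] for act in actions)
    Q[(s, a)] = (1.0 - alpha) * Q[(s, a)] + alpha * target
```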

