Chapter 13: Reinforcement Learning
CS 536: Machine Learning
Littman (Wu, TA)

Administration
Midterms due

Reinforcement Learning
[Read Chapter 13]
• [Exercises 13.1, 13.2, 13.4]
• Control learning
• Control policies that choose optimal actions
• Q learning
• Convergence

Control Learning
Consider learning to choose actions, e.g.,
• Robot learning to dock on a battery charger
• Learning to choose actions to optimize factory output
• Learning to play Backgammon

Problem Characteristics
Note several problem characteristics:
• Delayed reward
• Opportunity for active exploration
• Possibility that the state is only partially observable
• Possible need to learn multiple tasks with the same sensors/effectors

One Example: TD-Gammon [Tesauro, 1995]
Learn to play Backgammon
Immediate reward:
• +100 if win
• -100 if lose
• 0 for all other states
Trained by playing 1.5 million games against itself
Now approximately equal to the best human player

The RL Problem
Goal: Learn to choose actions that maximize
    r_0 + γ r_1 + γ² r_2 + … , where 0 ≤ γ < 1

Markov Decision Processes
Assume
• a finite set of states S and a set of actions A
• at each discrete time step the agent observes state s_t ∈ S and chooses action a_t ∈ A
• it then receives immediate reward r_t, and the state changes to s_{t+1}
• Markov assumption: r_t = r(s_t, a_t) and s_{t+1} = δ(s_t, a_t) depend only on the current state and action
  – δ and r may be nondeterministic
  – δ and r are not necessarily known to the agent

Agent's Learning Task
Execute actions in the environment, observe the results, and
• learn an action policy π : S → A that maximizes
      E[r_t + γ r_{t+1} + γ² r_{t+2} + …]
  from any starting state in S
• here 0 ≤ γ < 1 is the discount factor for future rewards

Different Learning Problem
Note something new:
• The target function is π : S → A
• but we have no training examples of the form <s, a>
• training examples are of the form <<s, a>, r>

Value Function
To begin, consider deterministic worlds...
For each possible policy π the agent might adopt, we can define an evaluation function over states:
    V^π(s) ≡ r_t + γ r_{t+1} + γ² r_{t+2} + … ≡ Σ_{i=0}^{∞} γ^i r_{t+i}
where r_t, r_{t+1}, … are generated by following policy π starting at state s.
Restated, the task is to learn the optimal policy π*:
    π* ≡ argmax_π V^π(s), (∀s)

Example MDP
[Figure: a small deterministic MDP with labeled states, actions, and immediate rewards; the Q̂ update example below refers to it.]

What to Learn
We might try to have the agent learn the evaluation function V^{π*} (which we write as V*).
It could then do a one-step lookahead search to choose the best action from any state s, because
    π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]
A problem:
• This works well if the agent knows δ : S × A → S and r : S × A → ℝ
• But when it doesn't, it can't choose actions this way

Q Function
Define a new function very similar to V*:
    Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))
If the agent learns Q, it can choose the optimal action even without knowing δ!
    π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]
    π*(s) = argmax_a Q(s, a)
Q is the evaluation function the agent will learn.
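Acting optimally then reduces to a table lookup over learned Q values. Below is a minimal Python sketch of this greedy choice; the dict-of-dicts table layout, names, and toy numbers are illustrative assumptions, not part of the slides.

```python
# Minimal sketch: greedy action selection from a learned tabular Q.
# No knowledge of delta or r is needed -- only the Q table itself.

def greedy_action(Q, s):
    """Return argmax_a Q[s][a] for state s."""
    actions = Q[s]                        # dict: action -> estimated Q value
    return max(actions, key=actions.get)  # pick the action with the largest value

# Example with a tiny, made-up table for one state and three actions.
Q = {"s0": {"a0": 0.0, "a1": 0.9, "a2": 0.5}}
print(greedy_action(Q, "s0"))             # -> "a1"
```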
Training Rule to Learn Q
Note that Q and V* are closely related:
    V*(s) = max_{a'} Q(s, a')
which allows us to write Q recursively as
    Q(s_t, a_t) = r(s_t, a_t) + γ V*(δ(s_t, a_t))
                = r(s_t, a_t) + γ max_{a'} Q(s_{t+1}, a')
Nice! Let Q̂ denote the learner's current approximation to Q. Use the training rule
    Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
where s' is the state resulting from applying action a in state s.

Q Learning in Deterministic Case
For each s, a initialize the table entry Q̂(s, a) ← 0
Observe the current state s
Do forever:
• Select an action a and execute it
• Receive immediate reward r
• Observe the new state s'
• Update the table entry for Q̂(s, a) via
      Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
• s ← s'

Updating Q̂
    Q̂(s1, a_right) ← r + γ max_{a'} Q̂(s2, a')
                    = 0 + 0.9 · max{63, 81, 100} = 90
Notice that if rewards are non-negative, then
    (∀ s, a, n)  Q̂_{n+1}(s, a) ≥ Q̂_n(s, a)
and
    (∀ s, a, n)  0 ≤ Q̂_n(s, a) ≤ Q(s, a)

Convergence Proof
Q̂ converges to Q. Consider the case of a deterministic world where each <s, a> is visited infinitely often.
Proof: Define a full interval to be an interval during which each <s, a> is visited. During each full interval, the largest error in the Q̂ table is reduced by a factor of γ.
Let Q̂_n be the table after n updates, and Δ_n the maximum error in Q̂_n; that is,
    Δ_n = max_{s,a} |Q̂_n(s, a) - Q(s, a)|

Proof Continued
For a table entry Q̂_n(s, a) updated on iteration n+1, the error in the revised estimate Q̂_{n+1}(s, a) is
    |Q̂_{n+1}(s, a) - Q(s, a)|
      = |(r + γ max_{a'} Q̂_n(s', a')) - (r + γ max_{a'} Q(s', a'))|
      = γ |max_{a'} Q̂_n(s', a') - max_{a'} Q(s', a')|
      ≤ γ max_{a'} |Q̂_n(s', a') - Q(s', a')|
      ≤ γ max_{s'', a'} |Q̂_n(s'', a') - Q(s'', a')|
so
    |Q̂_{n+1}(s, a) - Q(s, a)| ≤ γ Δ_n
Note that we used the fact that
    |max_a f1(a) - max_a f2(a)| ≤ max_a |f1(a) - f2(a)|

Nondeterministic Case
What if the reward and the next state are nondeterministic?
We redefine V and Q by taking expected values:
    V^π(s) ≡ E[r_t + γ r_{t+1} + γ² r_{t+2} + …] ≡ E[Σ_{i=0}^{∞} γ^i r_{t+i}]
    Q(s, a) ≡ E[r(s, a) + γ V*(δ(s, a))]

Nondeterministic Case
Q learning generalizes to nondeterministic worlds.
Alter the training rule to
    Q̂_n(s, a) ← (1 - α_n) Q̂_{n-1}(s, a) + α_n [r + γ max_{a'} Q̂_{n-1}(s', a')]
where
    α_n = 1 / (1 + visits_n(s, a))
Can still prove convergence of Q̂ to Q [Watkins and Dayan, 1992].

Temporal Difference Learning
Q learning: reduce the discrepancy between successive Q estimates.
One-step time difference:
    Q^(1)(s_t, a_t) ≡ r_t + γ max_a Q̂(s_{t+1}, a)
Why not two steps?
    Q^(2)(s_t, a_t) ≡ r_t + γ r_{t+1} + γ² max_a Q̂(s_{t+2}, a)
Or n?
    Q^(n)(s_t, a_t) ≡ r_t + γ r_{t+1} + … + γ^{n-1} r_{t+n-1} + γ^n max_a Q̂(s_{t+n}, a)
Blend all of these:
    Q^λ(s_t, a_t) ≡ (1 - λ) [Q^(1)(s_t, a_t) + λ Q^(2)(s_t, a_t) + λ² Q^(3)(s_t, a_t) + …]

Temporal Difference Learning
    Q^λ(s_t, a_t) ≡ (1 - λ) [Q^(1)(s_t, a_t) + λ Q^(2)(s_t, a_t) + …]
Equivalent recursive expression:
    Q^λ(s_t, a_t) = r_t + γ [(1 - λ) max_a Q̂(s_{t+1}, a) + λ Q^λ(s_{t+1}, a_{t+1})]
The TD(λ) algorithm uses the above training rule:
• Sometimes converges faster than Q learning (not well understood in the control case)
• Converges for learning V* for any 0 ≤ λ ≤ 1 (Dayan, 1992)
• Tesauro's TD-Gammon uses this algorithm to estimate the value function via self-play

Subtleties & Ongoing Research
• Replace the Q̂ table with a neural net or other generalizer
• Handle the case where the state is only partially observable
• Design optimal exploration strategies
• Extend to continuous actions and states
• Learn and use δ̂ : S × A → S
• Relationship to dynamic programming and heuristic search
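To tie the preceding slides together, here is a minimal Python sketch of the deterministic tabular Q-learning loop with the training rule Q̂(s, a) ← r + γ max_{a'} Q̂(s', a'). The environment interface (reset/step/actions) and the ε-greedy exploration choice are illustrative assumptions; the slides leave the action-selection strategy open.

```python
import random

def q_learning(env, gamma=0.9, epsilon=0.1, episodes=1000):
    """Deterministic tabular Q-learning sketch; env interface is assumed."""
    Q = {}                                           # Q-hat table: (state, action) -> value
    def q(s, a):
        return Q.get((s, a), 0.0)                    # entries implicitly initialized to 0

    for _ in range(episodes):
        s = env.reset()                              # observe current state s
        done = False
        while not done:
            # Select an action a and execute it (greedy with occasional exploration).
            if random.random() < epsilon:
                a = random.choice(env.actions(s))
            else:
                a = max(env.actions(s), key=lambda act: q(s, act))
            r, s_next, done = env.step(a)            # receive reward r, observe new state s'
            # Q-hat(s, a) <- r + gamma * max_a' Q-hat(s', a')
            best_next = max((q(s_next, a2) for a2 in env.actions(s_next)), default=0.0)
            Q[(s, a)] = r + gamma * best_next
            s = s_next                               # s <- s'
    return Q
```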
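For the nondeterministic case, the same loop would replace the hard assignment with the decaying-α blend given earlier, α_n = 1/(1 + visits_n(s, a)). A short sketch, with the data structures again assumed:

```python
from collections import defaultdict

Q = defaultdict(float)         # Q-hat table, entries default to 0
visits = defaultdict(int)      # visit counts per (state, action) pair

def q_update(s, a, r, s_next, next_actions, gamma=0.9):
    """One nondeterministic-case update:
    Q_n(s,a) <- (1 - alpha_n) Q_{n-1}(s,a) + alpha_n [r + gamma * max_a' Q_{n-1}(s',a')]."""
    visits[(s, a)] += 1
    alpha = 1.0 / (1 + visits[(s, a)])
    best_next = max((Q[(s_next, a2)] for a2 in next_actions), default=0.0)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)
```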