U of M CS 5541 - Reinforcement Learning

Reinforcement Learning
• Control learning
• Control policies that choose optimal actions
• Q learning
• Convergence

Control Learning
Consider learning to choose actions, e.g.,
• Robot learning to dock on battery charger
• Learning to choose actions to optimize factory output
• Learning to play Backgammon
Note several problem characteristics:
• Delayed reward
• Opportunity for active exploration
• Possibility that the state is only partially observable
• Possible need to learn multiple tasks with the same sensors/effectors

One Example: TD-Gammon
Tesauro, 1995. Learn to play Backgammon.
Immediate reward:
• +100 if win
• -100 if lose
• 0 for all other states
Trained by playing 1.5 million games against itself.
Now approximately equal to the best human player.

Reinforcement Learning Problem
[Figure: the agent observes the environment's state, chooses an action, and receives a reward, producing the sequence s0, a0, r0, s1, a1, r1, s2, a2, r2, ...]
Goal: learn to choose actions that maximize
  r0 + γ r1 + γ² r2 + ..., where 0 ≤ γ < 1

Markov Decision Process
Assume
• a finite set of states S
• a set of actions A
• at each discrete time step the agent observes state s_t ∈ S and chooses action a_t ∈ A
• then receives immediate reward r_t
• and the state changes to s_{t+1}
• Markov assumption: s_{t+1} = δ(s_t, a_t) and r_t = r(s_t, a_t)
  – i.e., r_t and s_{t+1} depend only on the current state and action
  – the functions δ and r may be nondeterministic
  – the functions δ and r are not necessarily known to the agent

Agent's Learning Task
Execute actions in the environment, observe the results, and
• learn an action policy π : S → A that maximizes
    E[r_t + γ r_{t+1} + γ² r_{t+2} + ...]
  from any starting state in S
• here 0 ≤ γ < 1 is the discount factor for future rewards
Note something new:
• the target function is π : S → A
• but we have no training examples of the form <s, a>
• training examples are of the form <<s, a>, r>
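The interaction loop and discounted return above are easy to sketch in code. The following is a minimal illustration added to this preview, not part of the slides; the environment interface (reset() and step() returning (state, reward, done)) and the episode cap are assumptions made only for the example.

```python
# Minimal sketch of the agent-environment loop and the discounted return
# r0 + gamma*r1 + gamma^2*r2 + ...
# The env/policy interface here is hypothetical, chosen just for illustration.

def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**i * r_i over a finite list of rewards."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

def run_episode(env, policy, gamma=0.9, max_steps=100):
    """Follow `policy` in `env` and return the discounted return from s0."""
    s = env.reset()                  # observe initial state s0
    rewards = []
    for _ in range(max_steps):
        a = policy(s)                # choose action a_t = pi(s_t)
        s, r, done = env.step(a)     # receive reward r_t, observe s_{t+1}
        rewards.append(r)
        if done:                     # stop at an absorbing/terminal state
            break
    return discounted_return(rewards, gamma)
```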
Value Function
To begin, consider deterministic worlds ...
For each possible policy π the agent might adopt, we can define an evaluation function over states:
  V^π(s) ≡ r_t + γ r_{t+1} + γ² r_{t+2} + ... ≡ Σ_{i=0}^∞ γ^i r_{t+i}
where r_t, r_{t+1}, ... are generated by following policy π starting at state s.
Restated, the task is to learn the optimal policy π*:
  π* ≡ argmax_π V^π(s), (∀s)

[Figure: a simple grid world with absorbing goal state G, showing the r(s,a) (immediate reward) values (100 for actions entering G, 0 otherwise), the resulting Q(s,a) values (90, 81, 72, 100, ...), the V*(s) values (100, 90, 81, ...), and one optimal policy.]

What to Learn
We might try to have the agent learn the evaluation function V^{π*} (which we write as V*).
We could then do a lookahead search to choose the best action from any state s, because
  π*(s) ≡ argmax_a [r(s,a) + γ V*(δ(s,a))]
A problem:
• This works well if the agent knows δ : S × A → S and r : S × A → ℝ
• But when it doesn't, we can't choose actions this way

Q Function
Define a new function very similar to V*:
  Q(s,a) ≡ r(s,a) + γ V*(δ(s,a))
If the agent learns Q, it can choose the optimal action even without knowing δ!
  π*(s) ≡ argmax_a [r(s,a) + γ V*(δ(s,a))] = argmax_a Q(s,a)
Q is the evaluation function the agent will learn.

Training Rule to Learn Q
Note that Q and V* are closely related:
  V*(s) = max_{a'} Q(s,a')
which allows us to write Q recursively as
  Q(s_t, a_t) = r(s_t, a_t) + γ V*(δ(s_t, a_t)) = r(s_t, a_t) + γ max_{a'} Q(s_{t+1}, a')
Let Q̂ denote the learner's current approximation to Q. Consider the training rule
  Q̂(s,a) ← r + γ max_{a'} Q̂(s',a')
where s' is the state resulting from applying action a in state s.

Q Learning for Deterministic Worlds
For each s,a initialize the table entry Q̂(s,a) ← 0.
Observe the current state s.
Do forever:
• Select an action a and execute it
• Receive immediate reward r
• Observe the new state s'
• Update the table entry for Q̂(s,a) as follows:
    Q̂(s,a) ← r + γ max_{a'} Q̂(s',a')
• s ← s'

Updating
[Figure: taking action a_right moves the agent from initial state s1 to next state s2 in the grid world above.]
  Q̂(s1, a_right) ← r + γ max_{a'} Q̂(s2, a')
                 ← 0 + 0.9 max{63, 81, 100}
                 ← 90
Notice that if rewards are non-negative, then
  ∀ (s, a, n):  Q̂_{n+1}(s,a) ≥ Q̂_n(s,a)
and
  ∀ (s, a, n):  0 ≤ Q̂_n(s,a) ≤ Q(s,a)

Convergence
Q̂ converges to Q. Consider the case of a deterministic world where each <s,a> is visited infinitely often.
Proof: define a full interval to be an interval during which each <s,a> is visited. During each full interval the largest error in the Q̂ table is reduced by a factor of γ.
Let Q̂_n be the table after n updates, and Δ_n be the maximum error in Q̂_n; that is,
  Δ_n = max_{s,a} |Q̂_n(s,a) − Q(s,a)|

Convergence (cont)
For any table entry Q̂_n(s,a) updated on iteration n+1, the error in the revised estimate Q̂_{n+1}(s,a) is
  |Q̂_{n+1}(s,a) − Q(s,a)| = |(r + γ max_{a'} Q̂_n(s',a')) − (r + γ max_{a'} Q(s',a'))|
                          = γ |max_{a'} Q̂_n(s',a') − max_{a'} Q(s',a')|
                          ≤ γ max_{a'} |Q̂_n(s',a') − Q(s',a')|
                          ≤ γ max_{s'',a'} |Q̂_n(s'',a') − Q(s'',a')|
                          = γ Δ_n
Note that we used the general fact that
  |max_a f1(a) − max_a f2(a)| ≤ max_a |f1(a) − f2(a)|

Nondeterministic Case
What if reward and next state are ...
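The deterministic Q-learning algorithm above maps almost line for line onto a small tabular implementation. The sketch below is an added illustration, not part of the slides; it assumes the world is given as explicit dictionaries for δ and r (used by the simulator, not consulted by the learner when choosing actions), that actions are picked uniformly at random, that the agent restarts from a random state after reaching an absorbing goal, and γ = 0.9 as in the grid-world example.

```python
import random
from collections import defaultdict

def q_learning(delta, reward, actions, goal_states, gamma=0.9, steps=10000):
    """Tabular Q-learning for a deterministic world.

    delta[(s, a)]  -> next state s'      (transition function, simulator side)
    reward[(s, a)] -> immediate reward r
    actions[s]     -> list of actions available in state s
    """
    Q = defaultdict(float)                     # table entries Q̂(s,a) initialized to 0
    s = random.choice(list(actions))           # observe current state s
    for _ in range(steps):
        a = random.choice(actions[s])          # select an action a and execute it
        r = reward[(s, a)]                     # receive immediate reward r
        s_next = delta[(s, a)]                 # observe the new state s'
        # Q̂(s,a) <- r + γ max_{a'} Q̂(s',a')
        best_next = max((Q[(s_next, a2)] for a2 in actions.get(s_next, [])),
                        default=0.0)
        Q[(s, a)] = r + gamma * best_next
        s = s_next
        if s in goal_states:                   # restart once the absorbing goal is reached
            s = random.choice(list(actions))
    return Q
```

With random action selection every <s,a> pair keeps getting visited, which is exactly the condition the convergence argument above relies on.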


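The convergence proof leans on the general fact that a difference of maxima is bounded by the maximum of the differences. A quick numerical spot check of that inequality (again an added illustration, not from the slides):

```python
import random

# Spot check: |max_a f1(a) - max_a f2(a)| <= max_a |f1(a) - f2(a)|
for _ in range(1000):
    n = random.randint(1, 10)
    f1 = [random.uniform(-100, 100) for _ in range(n)]
    f2 = [random.uniform(-100, 100) for _ in range(n)]
    lhs = abs(max(f1) - max(f2))
    rhs = max(abs(x - y) for x, y in zip(f1, f2))
    assert lhs <= rhs + 1e-9
print("inequality held on all random trials")
```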