Reinforcement Learning
Manuela Veloso
see "Machine Learning" – Tom Mitchell, chapter 13 on RL
15-381, Fall 2009

Different Aspects of "Machine Learning"
• Supervised learning
– Classification - concept learning
– Learning from labeled data
– Function approximation
• Unsupervised learning
– Data is not labeled
– Data needs to be grouped, clustered
– We need a distance metric
• Control and action model learning
– Learning to select actions efficiently
– Feedback: goal achievement, failure, reward
– Control learning, reinforcement learning

Goal Achievement - Rewards
• "Reward" today versus future (promised) reward
• Future rewards are not worth as much as current rewards.
• $100K + $100K + $100K + ... is an INFINITE sum.
• Assume reality: a discount factor, say γ.
• $100K + γ $100K + γ² $100K + ... CONVERGES (to $100K / (1 − γ) when 0 ≤ γ < 1).

Reinforcement Learning Problem
Goal: Learn to choose actions that maximize
r_0 + γ r_1 + γ² r_2 + ..., where 0 ≤ γ < 1

Learning Conditions
• Assume the world can be modeled as a Markov Decision Process, with rewards as a function of state and action.
• Markov assumption: new states and rewards are a function only of the current state and action, i.e.,
– s_{t+1} = δ(s_t, a_t)
– r_t = r(s_t, a_t)
• Unknown and uncertain environment: the functions δ and r may be nondeterministic and are not necessarily known to the learner.

Control Learning Task
• Execute actions in the world,
• Observe the state of the world,
• Learn an action policy π : S → A
• Maximize the expected reward
E[r_t + γ r_{t+1} + γ² r_{t+2} + ...]
from any starting state in S.
– 0 ≤ γ < 1 is the discount factor for future rewards.

Statement of Learning Problem
• We have a target function to learn: π : S → A
• We have no training examples of the form ⟨s, a⟩
• We have training examples of the form ⟨⟨s, a⟩, r⟩, i.e., immediate reward values r(s, a) (rewards can be any real number)

Policies
• There are many possible policies, of course not necessarily optimal, i.e., with maximum expected reward.
• There can also be several OPTIMAL policies.

Value Function
• For each possible policy π, define an evaluation function over states (deterministic world):
V^π(s) ≡ r_t + γ r_{t+1} + γ² r_{t+2} + ... ≡ Σ_{i=0}^∞ γ^i r_{t+i}
where r_t, r_{t+1}, ... are generated by following policy π starting at state s.
• Learning task: learn the OPTIMAL policy
π* ≡ argmax_π V^π(s), (∀s)
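To make the discounted-return definition above concrete, here is a minimal Python sketch (not from the lecture; the function name discounted_return is made up for illustration) that sums γ^i r_i over a finite reward trace and checks that a constant $100K reward stream approaches the geometric-series limit $100K / (1 − γ):

    # Discounted return: r_0 + gamma*r_1 + gamma^2*r_2 + ...
    def discounted_return(rewards, gamma):
        """Sum of gamma**i * rewards[i] over a finite reward trace."""
        return sum((gamma ** i) * r for i, r in enumerate(rewards))

    gamma = 0.9
    truncated = discounted_return([100_000] * 200, gamma)  # constant $100K stream, 200 steps
    limit = 100_000 / (1 - gamma)                          # geometric-series limit
    print(round(truncated), round(limit))                  # both are about 1,000,000

With γ = 0.9 the discounted sum converges to about $1M, whereas the undiscounted sum would diverge.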
Learn Value Function
• Learn the evaluation function V^{π*}, written V*.
• Select the optimal action from any state s, i.e., have an optimal policy, by using V* with one-step lookahead:
π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]

Optimal Value to Optimal Policy
A problem:
• This works well if the agent knows δ : S × A → S and r : S × A → ℜ.
• When it doesn't, it can't choose actions this way.

Q Function
• Define a new function very similar to V*:
Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))
• Learn the Q function – Q-learning.
• If the agent learns Q, it can choose the optimal action even without knowing δ or r:
π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))] = argmax_a Q(s, a)

Q-Learning
Note that Q and V* are closely related:
V*(s) = max_{a′} Q(s, a′)
which allows us to write Q recursively as
Q(s_t, a_t) = r(s_t, a_t) + γ V*(δ(s_t, a_t)) = r(s_t, a_t) + γ max_{a′} Q(s_{t+1}, a′)
Q-learning actively generates examples. It "processes" examples by updating its Q values. While learning, the Q values are approximations.

Training Rule to Learn Q
Let Q̂ denote the current approximation to Q. Then Q-learning uses the following training rule:
Q̂(s, a) ← r + γ max_{a′} Q̂(s′, a′)
where s′ is the state resulting from applying action a in state s, and r is the reward that is returned.

Example - Updating Q̂
Q̂(s_1, a_right) ← r + γ max_{a′} Q̂(s_2, a′)
← 0 + 0.9 max{63, 81, 100}
← 90

Q Learning for Deterministic Worlds
For each s, a initialize the table entry Q̂(s, a) ← 0.
Observe the current state s.
Do forever:
• Select an action a and execute it.
• Receive the immediate reward r.
• Observe the new state s′.
• Update the table entry for Q̂(s, a) as follows:
Q̂(s, a) ← r + γ max_{a′} Q̂(s′, a′)
• s ← s′

Q Learning Iterations
Starts at the bottom left corner – moves clockwise around the perimeter; initially Q̂(s, a) = 0; γ = 0.8.
Q̂(s, a) ← r + γ max_{a′} Q̂(s′, a′)

Problem - Deterministic
How many possible policies are there in this 3-state, 2-action deterministic world?
A robot starts in the state MILD. It moves for 4 steps, choosing actions West, East, East, West. The initial values of its Q-table are 0 and the discount factor is γ = 0.5.
Initial state: MILD; Action: West → New state: HOT; Action: East → New state: MILD; Action: East → New state: COLD; Action: West → New state: MILD.
Q-table (East / West entries) before the first step and after each of the four updates (a code sketch replaying this episode follows the table):

            Initial        Step 1         Step 2         Step 3         Step 4
            East West      East West      East West      East West      East West
    HOT       0   0          0   0          5   0          5   0          5   0
    MILD      0   0          0  10          0  10          0  10          0  10
    COLD      0   0          0   0          0   0          0   0          0  -5
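The preview does not include the reward function for this world (it was presumably in a figure), but values consistent with the Q-table above can be back-solved; the rewards used below (r(MILD, West) = 10, r(HOT, East) = 0, r(MILD, East) = 0, r(COLD, West) = −10) are therefore an assumption for illustration only. A minimal Python sketch of the deterministic training rule, replaying the four-step episode:

    # Deterministic Q-learning update: Q(s,a) <- r + gamma * max_a' Q(s',a')
    GAMMA = 0.5
    STATES, ACTIONS = ("HOT", "MILD", "COLD"), ("East", "West")

    # Transition function taken from the printed trace; rewards are back-solved
    # from the Q-table (an assumption -- the original figure is not in the preview).
    delta = {("MILD", "West"): "HOT", ("HOT", "East"): "MILD",
             ("MILD", "East"): "COLD", ("COLD", "West"): "MILD"}
    reward = {("MILD", "West"): 10, ("HOT", "East"): 0,
              ("MILD", "East"): 0, ("COLD", "West"): -10}

    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    s = "MILD"
    for a in ["West", "East", "East", "West"]:
        s_next, r = delta[(s, a)], reward[(s, a)]
        Q[(s, a)] = r + GAMMA * max(Q[(s_next, b)] for b in ACTIONS)  # training rule
        s = s_next

    for s in STATES:
        print(s, {a: Q[(s, a)] for a in ACTIONS})

Under these assumed rewards the script reproduces the final column of the table: Q(HOT, East) = 5, Q(MILD, West) = 10, Q(COLD, West) = −5, and all other entries 0.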
Another Deterministic Example
[figure not included in this text preview]

Nondeterministic Case
What if reward and next state are non-deterministic?
We redefine V, Q by taking expected values:
V^π(s) ≡ E[r_t + γ r_{t+1} + γ² r_{t+2} + ...] ≡ E[Σ_{i=0}^∞ γ^i r_{t+i}]
Q(s, a) ≡ E[r(s, a) + γ V*(δ(s, a))]

Nondeterministic Case
Q learning generalizes to nondeterministic worlds.
Alter training rule to
Q̂_n(s, a) ← (1 − α_n) Q̂_{n−1}(s, a) + α_n [r + γ max_{a′} Q̂_{n−1}(s′, a′)]
where α_n = 1 / (1 + visits_n(s, a)), and s′ = δ(s, a).
Q̂ still converges to Q* (Watkins and Dayan, 1992).

Nondeterministic Example
[figure not included in this text preview]

Nondeterministic Example
π*(s) = D, for any s = S1, S2, S3, S4; γ = 0.9.
-------------------------------------------------------------
V*(S2) = r(S2, D) + 0.9 (1.0 V*(S2))
V*(S2) = 100 + 0.9 V*(S2)
V*(S2) = 1000.

V*(S1) = r(S1, D) + 0.9 (1.0 V*(S2))
V*(S1) = 0 + 0.9 × 1000
V*(S1) = 900.

V*(S3) = r(S3, D) + 0.9 (0.9 V*(S2) + 0.1 V*(S3))
V*(S3) = 0 + 0.9 (0.9 × 1000 + 0.1 V*(S3))
V*(S3) = 81000/91.

V*(S4) = r(S4, D) + 0.9 (0.9 V*(S2) + 0.1 V*(S4))
V*(S4) = 40 + 0.9 (0.9 × 1000 + 0.1 V*(S4))
V*(S4) = 85000/91.

Nondeterministic Example
What is the Q-value, Q(S2, R)?
Q(S2, R) = r(S2, R) + 0.9 (0.9 V*(S1) + 0.1 V*(S2))
Q(S2, R) = 100 + 0.9 (0.9 × 900 + 0.1 × 1000)
Q(S2, R) = 100 + 0.9 (810 + 100)
Q(S2, R) = 100 + 0.9 × 910
Q(S2, R) = 919.

Discussion
• How should the learning agent use the intermediate Q values?
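The altered training rule above maps directly onto a small helper. This is a minimal sketch, not code from the lecture: the visit-count learning rate α_n = 1 / (1 + visits_n(s, a)) follows the slide, while the environment interface (a `step` callable returning a stochastic reward and next state) is a hypothetical placeholder.

    # Nondeterministic Q-learning update:
    #   Q_n(s,a) <- (1 - alpha_n) * Q_{n-1}(s,a)
    #               + alpha_n * (r + gamma * max_a' Q_{n-1}(s',a'))
    # with alpha_n = 1 / (1 + visits_n(s,a)).
    from collections import defaultdict

    def q_update(Q, visits, s, a, r, s_next, actions, gamma=0.9):
        """Apply one decayed-learning-rate Q update and return the new estimate."""
        visits[(s, a)] += 1
        alpha = 1.0 / (1.0 + visits[(s, a)])
        target = r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] = (1.0 - alpha) * Q[(s, a)] + alpha * target
        return Q[(s, a)]

    Q, visits = defaultdict(float), defaultdict(int)   # unseen (s,a) pairs default to 0
    # Usage sketch (the environment's `step` function is assumed, not given here):
    # r, s_next = step(s, a)                           # stochastic reward and next state
    # q_update(Q, visits, s, a, r, s_next, actions=["R", "D"])

As the number of visits to (s, a) grows, α_n shrinks, so each noisy sample moves the estimate less; this averaging is what allows Q̂ to converge to Q* despite the randomness in rewards and transitions.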

