Reinforcement Learning
Manuela Veloso
15-381, Fall 2009
See "Machine Learning" – Tom Mitchell, Chapter 13 on RL

Different Aspects of "Machine Learning"
• Supervised learning
  – Classification – concept learning
  – Learning from labeled data
  – Function approximation
• Unsupervised learning
  – Data is not labeled
  – Data needs to be grouped, clustered
  – We need a distance metric
• Control and action model learning
  – Learning to select actions efficiently
  – Feedback: goal achievement, failure, reward
  – Control learning, reinforcement learning

Goal Achievement – Rewards
• "Reward" today versus future (promised) reward
• Future rewards are not worth as much as current ones.
• $100K + $100K + $100K + ... is an INFINITE sum.
• Assume reality: a discount factor, say γ.
• $100K + γ·$100K + γ²·$100K + ... CONVERGES.

Reinforcement Learning Problem
Goal: Learn to choose actions that maximize
  r_0 + γ r_1 + γ² r_2 + ..., where 0 ≤ γ < 1

Learning Conditions
• Assume the world can be modeled as a Markov Decision Process, with rewards as a function of state and action.
• Markov assumption: new states and rewards are a function only of the current state and action, i.e.,
  – s_{t+1} = δ(s_t, a_t)
  – r_t = r(s_t, a_t)
• Unknown and uncertain environment: the functions δ and r may be nondeterministic and are not necessarily known to the learner.

Control Learning Task
• Execute actions in the world,
• Observe the state of the world,
• Learn an action policy π : S → A
• Maximize the expected reward
  E[r_t + γ r_{t+1} + γ² r_{t+2} + ...]
  from any starting state in S.
  – 0 ≤ γ < 1 is the discount factor for future rewards.

Statement of Learning Problem
• We have a target function to learn, π : S → A.
• We have no training examples of the form ⟨s, a⟩.
• We have training examples of the form ⟨⟨s, a⟩, r⟩ — immediate reward values r(s, a), where rewards can be any real number.

Policies
• There are many possible policies, of course not necessarily optimal, i.e., with maximum expected reward.
• There can also be several OPTIMAL policies.
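The convergence claim from the "Goal Achievement – Rewards" slide can be checked numerically. A minimal sketch, using the slide's $100K-per-step reward and an illustrative discount factor γ = 0.9 (the slide does not fix a value):

```python
# Discounted sum of a constant reward: R + g*R + g^2*R + ... = R / (1 - g).
# R = 100 ($100K) and gamma = 0.9 are illustrative values.
R, gamma = 100, 0.9

partial_sum = sum(R * gamma**t for t in range(1000))  # truncated series
closed_form = R / (1 - gamma)                          # geometric-series limit

print(partial_sum, closed_form)  # both approximately 1000
```

Without discounting (γ = 1) the same sum grows without bound, which is exactly the "INFINITE sum" problem the slide points out.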
Value Function
• For each possible policy π, define an evaluation function over states (deterministic world):
  Vπ(s) ≡ r_t + γ r_{t+1} + γ² r_{t+2} + ... ≡ Σ_{i=0}^{∞} γ^i r_{t+i}
  where r_t, r_{t+1}, ... are generated by following policy π starting at state s.
• Learning task: learn the OPTIMAL policy
  π* ≡ argmax_π Vπ(s), (∀s)

Learn Value Function
• Learn the evaluation function Vπ*, denoted V*.
• Select the optimal action from any state s, i.e., have an optimal policy, by using V* with one-step lookahead:
  π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]

Optimal Value to Optimal Policy
A problem:
• This works well if the agent knows δ : S × A → S and r : S × A → ℜ.
• When it doesn't, it can't choose actions this way.

Q Function
• Define a new function very similar to V*:
  Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))
• Learn the Q function – Q-learning.
• If the agent learns Q, it can choose the optimal action even without knowing δ or r:
  π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))] = argmax_a Q(s, a)

Q-Learning
Note that Q and V* are closely related:
  V*(s) = max_{a′} Q(s, a′)
which allows us to write Q recursively as
  Q(s_t, a_t) = r(s_t, a_t) + γ V*(δ(s_t, a_t)) = r(s_t, a_t) + γ max_{a′} Q(s_{t+1}, a′)
Q-learning actively generates examples. It "processes" examples by updating its Q values. While learning, the Q values are approximations.

Training Rule to Learn Q
Let Q̂ denote the current approximation to Q. Then Q-learning uses the following training rule:
  Q̂(s, a) ← r + γ max_{a′} Q̂(s′, a′)
where s′ is the state resulting from applying action a in state s, and r is the reward that is returned.

Example – Updating Q̂
  Q̂(s1, a_right) ← r + γ max_{a′} Q̂(s2, a′)
                ← 0 + 0.9 max{63, 81, 100}
                ← 90

Q Learning for Deterministic Worlds
For each s, a initialize the table entry Q̂(s, a) ← 0.
Observe the current state s.
Do forever:
• Select an action a and execute it.
• Receive the immediate reward r.
• Observe the new state s′.
• Update the table entry for Q̂(s, a) as follows:
  Q̂(s, a) ← r + γ max_{a′} Q̂(s′, a′)
• s ← s′

Q Learning Iterations
Starts at the bottom left corner – moves clockwise around the perimeter; initially Q̂(s, a) = 0; γ = 0.8.
  Q̂(s, a) ← r + γ max_{a′} Q̂(s′, a′)

Problem – Deterministic
How many possible policies are there in this 3-state, 2-action deterministic world?
A robot starts in the state Mild. It moves for 4 steps choosing actions West, East, East, West. The initial values of its Q-table are 0 and the discount factor is γ = 0.5.

Initial state: MILD
Action: West → new state: HOT
Action: East → new state: MILD
Action: East → new state: COLD
Action: West → new state: MILD

Q-table after each step:

        Initial      Step 1       Step 2       Step 3       Step 4
        East  West   East  West   East  West   East  West   East  West
HOT     0     0      0     0      5     0      5     0      5     0
MILD    0     0      0     10     0     10     0     10     0     10
COLD    0     0      0     0      0     0      0     0      0     -5

Another Deterministic Example
(Figure only in the original slides.)

Nondeterministic Case
What if the reward and next state are non-deterministic? We redefine V and Q by taking expected values:
  Vπ(s) ≡ E[r_t + γ r_{t+1} + γ² r_{t+2} + ...] ≡ E[Σ_{i=0}^{∞} γ^i r_{t+i}]
  Q(s, a) ≡ E[r(s, a) + γ V*(δ(s, a))]

Nondeterministic Case
Q-learning generalizes to nondeterministic worlds. Alter the training rule to
  Q̂_n(s, a) ← (1 − α_n) Q̂_{n−1}(s, a) + α_n [r + γ max_{a′} Q̂_{n−1}(s′, a′)]
where α_n = 1 / (1 + visits_n(s, a)) and s′ = δ(s, a).
Q̂ still converges to Q* (Watkins and Dayan, 1992).

Nondeterministic Example
(Figure only in the original slides.)

Nondeterministic Example
π*(s) = D, for any s = S1, S2, S3, and S4; γ = 0.9.
-------------------------------------------------------------
V*(S2) = r(S2, D) + 0.9 (1.0 V*(S2))
V*(S2) = 100 + 0.9 V*(S2)
V*(S2) = 1000.
V*(S1) = r(S1, D) + 0.9 (1.0 V*(S2))
V*(S1) = 0 + 0.9 × 1000
V*(S1) = 900.
V*(S3) = r(S3, D) + 0.9 (0.9 V*(S2) + 0.1 V*(S3))
V*(S3) = 0 + 0.9 (0.9 × 1000 + 0.1 V*(S3))
V*(S3) = 81000/91.
V*(S4) = r(S4, D) + 0.9 (0.9 V*(S2) + 0.1 V*(S4))
V*(S4) = 40 + 0.9 (0.9 × 1000 + 0.1 V*(S4))
V*(S4) = 85000/91.

Nondeterministic Example
What is the Q-value Q(S2, R)?
Q(S2, R) = r(S2, R) + 0.9 (0.9 V*(S1) + 0.1 V*(S2))
Q(S2, R) = 100 + 0.9 (0.9 × 900 + 0.1 × 1000)
Q(S2, R) = 100 + 0.9 (810 + 100)
Q(S2, R) = 100 + 0.9 × 910
Q(S2, R) = 919.

Discussion
• How should the learning agent use the intermediate Q̂ values?
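One concrete way to see the intermediate Q̂ values at work is to replay the deterministic training rule on the Hot/Mild/Cold trace from the earlier slide. A small sketch: the per-step rewards below are inferred from that Q-table trace (they are not stated explicitly on the slide), and only the transitions the robot actually takes are modeled.

```python
# Replay the 4-step Hot/Mild/Cold trace with the deterministic update
# Q(s,a) <- r + gamma * max_a' Q(s',a'), gamma = 0.5, Q initialized to 0.
gamma = 0.5
Q = {(s, a): 0.0 for s in ("HOT", "MILD", "COLD") for a in ("East", "West")}

# (state, action) -> (reward, next state); rewards inferred from the trace.
model = {
    ("MILD", "West"): (10, "HOT"),
    ("HOT",  "East"): (0, "MILD"),
    ("MILD", "East"): (0, "COLD"),
    ("COLD", "West"): (-10, "MILD"),
}

s = "MILD"
for a in ("West", "East", "East", "West"):  # the robot's 4 chosen actions
    r, s_next = model[(s, a)]
    Q[(s, a)] = r + gamma * max(Q[(s_next, b)] for b in ("East", "West"))
    s = s_next

print(Q[("MILD", "West")], Q[("HOT", "East")], Q[("COLD", "West")])
# -> 10.0 5.0 -5.0, matching the final column of the slide's Q-table
```

Note how step 4 already uses an intermediate estimate: Q̂(COLD, West) = −10 + 0.5 · Q̂(MILD, West) = −5 relies on the value learned at step 1.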
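The closed-form values in the nondeterministic worked example can be cross-checked by iterating the same expected-value Bellman equations to their fixed point. A sketch, with the transition probabilities read off the equations on the slides:

```python
# Iterate the Bellman equations for the fixed policy pi*(s) = D (gamma = 0.9).
# Each update is a contraction with factor <= gamma, so iteration converges.
gamma = 0.9
V = {"S1": 0.0, "S2": 0.0, "S3": 0.0, "S4": 0.0}
for _ in range(500):
    V = {
        "S1": 0   + gamma * V["S2"],                           # S1 -D-> S2
        "S2": 100 + gamma * V["S2"],                           # S2 -D-> S2
        "S3": 0   + gamma * (0.9 * V["S2"] + 0.1 * V["S3"]),   # 0.9 to S2, 0.1 stays
        "S4": 40  + gamma * (0.9 * V["S2"] + 0.1 * V["S4"]),   # 0.9 to S2, 0.1 stays
    }

# One-step lookahead for the alternative action R in S2.
q_s2_r = 100 + gamma * (0.9 * V["S1"] + 0.1 * V["S2"])

print(round(V["S2"], 3), round(V["S1"], 3))  # 1000.0 900.0
print(round(V["S3"], 3), round(V["S4"], 3))  # 890.11 934.066 (= 81000/91, 85000/91)
print(round(q_s2_r, 3))                      # 919.0
```

The fixed point reproduces V*(S2) = 1000, V*(S1) = 900, V*(S3) = 81000/91, V*(S4) = 85000/91, and Q(S2, R) = 919 from the slides.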