CMU CS 15381 - Reinforcement Learning

Reinforcement Learning
• R&N Chapter 21
• Demos and data contributions from Vivek Mehta ([email protected]) and Rohit Kelkar ([email protected])

Reinforcement Learning
• Same (fully observable) MDP as before, except:
– We don't know the model of the environment
– We don't know T(.,.,.)
– We don't know R(.)
• The task is still the same: find an optimal policy
[Figure: the 4x3 grid world with terminal rewards +1 and -1. The intended action a succeeds with probability 0.8 and slips to either side with probability 0.1 each, defining T(s,a,s').]

General Problem
• All we can do is try to execute actions and record the resulting rewards:
– World: You are in state 102; you have a choice of 4 actions
– Robot: I'll take action 2
– World: You get a reward of 1 and you are now in state 63; you have a choice of 3 actions
– Robot: I'll take action 3
– World: You get a reward of -10 and you are now in state 12; you have a choice of 4 actions
– …
• Learning from experience.

Classes of Techniques
• Model-based: try to learn an explicit model of T(.,.,.) and R(.)
• Model-free: recover an optimal policy without ever estimating a model

Model-Based
• If we knew a good estimate T_est(.,.,.) of T(.,.,.) and R(.), we could evaluate the optimal policy by solving the fundamental MDP relations:
U_est(s) = R(s) + γ max_a Σ_s' T_est(s,a,s') U_est(s')
π*(s) = argmax_a Σ_s' T_est(s,a,s') U_est(s')

Model Estimation
• I observed a trajectory during which, when I moved Up from s = (1,1), I ended up in
– s1 = (1,2) 10 times
– s2 = (2,1) 2 times
• So:
T(s, Up, s1) ≈ 10/(10+2) = 0.83
T(s, Up, s2) ≈ 2/(10+2) = 0.17
[Figure: the same grid world, marking s = (1,1) and its observed successors s1 = (1,2) and s2 = (2,1).]

Model-Based
• Move through the environment by executing a sequence of actions
• Evaluate T and R:
– R(s) = reward received when visiting state s
– T_est(s,a,s') ≈ (# times we moved from s to s' on action a) / (# times we applied action a from s)
• This gives us an estimated model of the Markov system

Model-Based
• Given T_est and R, we can now estimate the value at each state:
U_est(s) = R(s) + γ max_a Σ_s' T_est(s,a,s') U_est(s')
– Value iteration
– Policy iteration
• This can be expensive if we do it at each step:
– May require a matrix inversion (size = number of states), or
– Many iterations of value iteration
• (Certainty Equivalent learning)

Best Policy?
[Figure: a small two-state example; with the estimated model, π*(S1) = a2 and π*(S2) = a2.]

Problems
• Separates learning the model from using the model (not on-line learning)
• Expensive, because the entire set of equations is solved to find U_est
• How should the environment be explored? No guidance until the model is built
• Cannot handle changing environments

Solution
• Update U_est for state s only, instead of solving for all the states:
U_est(s) ← R(s) + γ max_a Σ_s' T_est(s,a,s') U_est(s')
• Similar to one step of value iteration
• Terminology: backup step
• Advantages:
– Computation is interleaved with exploration
– Less computation at each step

Example: Model-Based Learning
• Update the current estimate of U(s) = expected sum of future discounted rewards, using the estimated T(.,.,.):
U_est(s) ← R(s) + γ max_a Σ_s' T_est(s,a,s') U_est(s')
π(s) = argmax_a Σ_s' T_est(s,a,s') U_est(s')

Two Problems
• Which states to update? U_est may have already converged for some states, so that the update does not make any difference
• How to explore the environment? We have not said how we generate the actions a

Which State to Update: Prioritized Sweeping
• Idea: update the predecessors of the states that yield the largest change in U_est
• Suppose states s1 and s2 lead into s via actions a1 and a2, and an update changes U_est(s) from 5 to 100. U_est(s) has changed a lot, and U_est(s1) = R(s1) + … + T(s1,a1,s) U_est(s), so U_est(s1) is probably going to change a lot too; we should update it right away.
• Conversely, if the update only moves U_est(s) from 5 to 5.2, U_est(s) has not changed much, so U_est(s1) is probably not going to change much either, and it is not useful to waste time updating it.

Prioritized Sweeping
• For each state s, remember Pred(s) = {visited state-action pairs (s', a) such that a moves from s' to s}
• Store P = a priority queue with the "most promising" state first
1. s = top of the queue; U_old = current value of U_est(s)
2. Update the state value: U_est(s) ← R(s) + γ max_a Σ_s' T_est(s,a,s') U_est(s')
3. Δ = |U_old − U_est(s)|
4. For every predecessor (sp, ap) in Pred(s), add sp to P with priority T_est(sp,ap,s) Δ
• Δ small = boring state, no change in the value; Δ large = interesting state, new information is discovered
• The value of sp is likely to change if:
1. The value of its successor s has changed a lot (Δ is large), and
2. Some action is likely to move from sp to s
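The prioritized sweeping procedure above is only sketched in bullets, so here is a minimal Python sketch of how the pieces could fit together: count-based estimates of T_est and R, the one-step backup, and a priority queue over predecessors. The class name PrioritizedSweepingLearner, the dictionary-of-counts model representation, the fixed number of backups per real step, and the priority cutoff theta are illustrative assumptions, not part of the lecture.

```python
import heapq
import itertools
from collections import defaultdict


class PrioritizedSweepingLearner:
    """Model-based RL sketch: count-based estimates of T and R, one-step backups
    of U_est, and a priority queue deciding which state to back up next."""

    def __init__(self, gamma=0.9, theta=1e-3, backups_per_step=5):
        self.gamma = gamma                      # discount factor
        self.theta = theta                      # ignore priorities smaller than this (assumption)
        self.backups_per_step = backups_per_step
        self.counts = defaultdict(lambda: defaultdict(int))  # counts[(s, a)][s2]
        self.tried = defaultdict(set)           # actions tried so far in each state
        self.reward = defaultdict(float)        # R(s) = reward received when visiting s
        self.U = defaultdict(float)             # U_est(s)
        self.pred = defaultdict(set)            # Pred(s) = {(sp, ap) observed to reach s}
        self.queue = []                         # min-heap of (-priority, tie, state)
        self._tie = itertools.count()           # tie-breaker so states are never compared

    def t_est(self, s, a, s2):
        """T_est(s, a, s2) = (# moves s -> s2 on a) / (# times a applied in s)."""
        total = sum(self.counts[(s, a)].values())
        return self.counts[(s, a)][s2] / total if total else 0.0

    def backup(self, s):
        """One-step backup: R(s) + gamma * max_a sum_s2 T_est(s, a, s2) * U_est(s2)."""
        q_values = [
            sum(self.t_est(s, a, s2) * self.U[s2] for s2 in self.counts[(s, a)])
            for a in self.tried[s]
        ]
        return self.reward[s] + self.gamma * (max(q_values) if q_values else 0.0)

    def observe(self, s, a, r2, s2):
        """Record one real transition (s, a) -> s2, with reward r2 received at s2."""
        self.counts[(s, a)][s2] += 1
        self.tried[s].add(a)
        self.reward[s2] = r2
        self.pred[s2].add((s, a))
        heapq.heappush(self.queue, (float("-inf"), next(self._tie), s))  # back up s first
        self.sweep()

    def sweep(self):
        """Pop the most promising states, back them up, and push their predecessors."""
        for _ in range(self.backups_per_step):
            if not self.queue:
                break
            _, _, s = heapq.heappop(self.queue)
            u_old = self.U[s]
            self.U[s] = self.backup(s)
            delta = abs(u_old - self.U[s])                 # how much did U_est(s) move?
            for sp, ap in self.pred[s]:
                priority = self.t_est(sp, ap, s) * delta   # T_est(sp, ap, s) * delta
                if priority > self.theta:
                    heapq.heappush(self.queue, (-priority, next(self._tie), sp))

    def policy(self, s):
        """pi(s) = argmax_a sum_s2 T_est(s, a, s2) * U_est(s2), over tried actions."""
        if not self.tried[s]:
            return None
        return max(self.tried[s],
                   key=lambda a: sum(self.t_est(s, a, s2) * self.U[s2]
                                     for s2 in self.counts[(s, a)]))
```

Only actions that have actually been tried in a state enter the max and the argmax, since T_est is undefined for the others, and pushing the just-left state with top priority is one simple way of interleaving a backup with every real step of exploration, as the "Solution" slide suggests.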
Exploration Strategy
• In principle, we can compute a current estimate of the best policy:
π*(s) = argmax_a Σ_s' T_est(s,a,s') U_est(s')
• What, then, is the best strategy for exploration?
– Greedy: always use π*(s) when in state s?
– Random
– Mixed: sometimes use the best action and sometimes a random one

Why Not Obvious?
• The N-armed bandit problem: we have N slot machines, each of which can yield $1 with some probability (different for each machine)
• In what order should we try the machines?
– Stay with the machine with the highest payoff probability so far?
– Random?
– Something else?

Possible Solutions
• ε-greedy:
– Choose the (current) best machine with probability 1 − ε
– Choose another one uniformly at random with probability ε/(number of machines − 1)
• Boltzmann exploration:
– Choose machine k with probability proportional to e^(P_k/T), where P_k is its estimated payoff
– Decrease the temperature T as time passes
• Remember the lectures on randomized search.

Maze Example
• Suppose the current optimal action for this state is Up:
– Move Up with probability 1 − ε
– Move Right with probability ε/3
– Move Left with probability ε/3
– Move Down with probability ε/3

Model-Free
• We are not interested in T(.,.,.); we are only interested in the resulting values and policies
• Can we compute something without an explicit model of T(.,.,.)?
• First, let's fix a …
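The ε-greedy and Boltzmann rules above are described only in words, so here is a small Python sketch of both selection rules in the N-armed bandit setting. The function names, the running-average payoff estimates, the 1/sqrt(t) temperature schedule, and the toy payoff probabilities are assumptions made for illustration, not part of the lecture.

```python
import math
import random


def epsilon_greedy(estimates, epsilon=0.1):
    """Pick the machine with the best estimated payoff with probability 1 - epsilon,
    otherwise pick one of the remaining machines uniformly at random."""
    best = max(range(len(estimates)), key=lambda k: estimates[k])
    if len(estimates) == 1 or random.random() < 1 - epsilon:
        return best
    others = [k for k in range(len(estimates)) if k != best]
    return random.choice(others)  # each chosen with probability epsilon / (N - 1)


def boltzmann(estimates, temperature=1.0):
    """Pick machine k with probability proportional to exp(P_k / T): nearly uniform
    for large T, nearly greedy as T shrinks."""
    weights = [math.exp(p / temperature) for p in estimates]
    return random.choices(range(len(estimates)), weights=weights, k=1)[0]


if __name__ == "__main__":
    true_p = [0.2, 0.5, 0.8]          # hypothetical payoff probabilities, unknown to the learner
    estimates = [0.0] * len(true_p)
    pulls = [0] * len(true_p)
    temperature = 1.0
    for t in range(1, 2001):
        k = boltzmann(estimates, temperature)        # or: epsilon_greedy(estimates, 0.1)
        reward = 1.0 if random.random() < true_p[k] else 0.0
        pulls[k] += 1
        estimates[k] += (reward - estimates[k]) / pulls[k]   # running average of payoffs
        temperature = max(0.05, 1.0 / math.sqrt(t))          # decrease T as time passes
    print("estimated payoffs:", [round(p, 2) for p in estimates])
```

With the temperature held high, Boltzmann exploration stays close to uniform random choice; it is the decreasing schedule that gradually shifts the behavior toward the greedy policy.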

