CMU CS 10701 - Reinforcement Learning

Outline
- Reinforcement Learning
- Announcements
- Formalizing the (online) reinforcement learning problem
- The "Credit Assignment" Problem
- Exploration-Exploitation tradeoff
- Two main reinforcement learning approaches
- Rmax – A model-based approach
- Given a dataset – learn model
- Some challenges in model-based RL 1: Planning with insufficient information
- Some challenges in model-based RL 2: Exploration-Exploitation tradeoff
- A surprisingly simple approach for model-based RL – The Rmax algorithm [Brafman & Tennenholtz]
- Understanding Rmax
- Implicit Exploration-Exploitation Lemma
- The Rmax algorithm
- Visit enough times to estimate P(x'|x,a)?
- Putting it all together
- Problems with the model-based approach
- TD-Learning and Q-learning – Model-free approaches
- Value of Policy
- A simple Monte-Carlo policy evaluation
- Problems with the Monte-Carlo approach
- Reusing trajectories
- Simple fix: Temporal Difference (TD) Learning [Sutton '84]
- TD converges (can take a long time!)
- Using TD for Control
- Problems with TD
- Another model-free RL approach: Q-learning [Watkins & Dayan '92]
- Recall Value Iteration
- Q-learning
- Q-learning convergence
- The curse of dimensionality: A significant challenge in MDPs and RL
- Addressing the curse!
- What you need to know about RL
- Big Picture
- What you have learned this semester
- What next?

Reinforcement Learning
Machine Learning – 10701/15781
Carlos Guestrin, Carnegie Mellon University
May 3rd, 2006
Reading: Kaelbling et al. 1996 (see class website)

Announcements
- Project poster session: Friday, May 5th, 2-5pm, NSH Atrium
  - please arrive a little early to set up; posterboards, easels, and pins are provided
  - the class is divided into two shifts so you can see the other posters
- FCEs: please give us your feedback, it helps us improve the class! http://www.cmu.edu/fce

Formalizing the (online) reinforcement learning problem
- Given a set of states X and actions A
  - in some versions of the problem, the sizes of X and A are unknown
- Interact with the world; at each time step t:
  - the world gives state x_t and reward r_t
  - you give the next action a_t
- Goal: (quickly) learn a policy that (approximately) maximizes long-term expected discounted reward
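
To make this interaction protocol concrete, here is a minimal sketch of the online loop in Python. It is illustrative only: the environment interface (reset() returning a state, step(action) returning a (next state, reward) pair), the RandomAgent placeholder, and all names are assumptions of this sketch, not something the slides specify.

import random

class RandomAgent:
    """Placeholder agent that picks actions uniformly at random."""
    def __init__(self, actions):
        self.actions = list(actions)

    def act(self, state):
        return random.choice(self.actions)

    def observe(self, state, action, reward, next_state):
        pass  # a learning agent would update its model or value estimates here

def run(env, agent, horizon=1000, gamma=0.99):
    """Interact with the world for `horizon` steps; return the discounted reward collected."""
    x = env.reset()
    total, discount = 0.0, 1.0
    for t in range(horizon):
        a = agent.act(x)                  # you give the next action a_t
        x_next, r = env.step(a)           # the world gives state x_{t+1} and reward r_t
        agent.observe(x, a, r, x_next)
        total += discount * r
        discount *= gamma
        x = x_next
    return total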
this knowledge? at the risk of missing out on some large reward somewhere Exploration: should I look for a region with more reward? at the risk of wasting my time or collecting a lot of negative reward6Two main reinforcement learning approaches  Model-based approaches: explore environment → learn model (P(x’|x,a) and R(x,a)) (almost) everywhere use model to plan policy, MDP-style approach leads to strongest theoretical results  works quite well in practice when state space is manageable  Model-free approach: don’t learn a model → learn value function or policy directly leads to weaker theoretical results often works well when state space is large7Brafman & Tennenholtz 2002(see class website)Rmax – A model-based approach8Given a dataset – learn model Given data, learn (MDP) Representation: Dataset: Learn reward function:  R(x,a) Learn transition model:  P(x’|x,a)9Some challenges in model-based RL 1:Planning with insufficient information  Model-based approach: estimate R(x,a) & P(x’|x,a)  obtain policy by value or policy iteration, or linear programming No credit assignment problem → learning model, planning algorithm takes care of “assigning” credit What do you plug in when you don’t have enough information about a state?  don’t reward at a particular state plug in smallest reward (Rmin)? plug in largest reward (Rmax)? don’t know a particular transition probability?10Some challenges in model-based RL 2:Exploration-Exploitation tradeoff A state may be very hard to reach waste a lot of time trying to learn rewards and transitions for this state after a much effort, state may be useless A strong advantage of a model-based approach: you know which states estimate for rewards and transitions are bad can (try) to plan to reach these states have a good estimate of how long it takes to get there11A surprisingly simple approach for model based RL – The Rmax algorithm[Brafman & Tennenholtz] Optimism in the face of uncertainty!!!! heuristic shown to be useful long before theory was done (e.g., Kaelbling ’90)  If you don’t know reward for a particular state-action pair, set it to Rmax!!! If you don’t know the transition probabilities P(x’|x,a) from some some state action pair x,aassume you go to a magic, fairytale new state x0!!! R(x0,a) = Rmax P(x0|x0,a) = 112Understanding Rmax With Rmaxyou either: explore – visit a state-action pair you don’t know much about because it seems to have lots of potential exploit – spend all your time on known states even if unknown states were amazingly good, it’s not worth it Note: you never know if you are exploring or exploiting!!!Implicit Exploration-Exploitation Lemma13 Lemma: every T time steps, either: Exploits: achieves near-optimal reward for these T-steps, or Explores: with high probability, the agent visits an unknown state-action pair learns a little about an unknown state T is related to mixing time of Markov chain defined by MDP time it takes to (approximately) forget where you started14The Rmax algorithm Initialization:  Add state x0 to MDP R(x,a) = Rmax, ∀x,a P(x0|x,a) = 1, ∀x,a all states (except for x0) are unknown Repeat obtain policy for current MDP and Execute policy for any visited state-action pair, set reward function to appropriate value if visited some state-action pair x,a enough times to estimate P(x’|x,a)  update transition probs. P(x’|x,a) for x,a using MLE recompute policy15Visit enough times to estimate P(x’|x,a)?

Some challenges in model-based RL 1: Planning with insufficient information
- Model-based approach:
  - estimate R(x,a) and P(x'|x,a)
  - obtain a policy by value iteration, policy iteration, or linear programming
- No credit assignment problem: once the model is learned, the planning algorithm takes care of "assigning" credit
- What do you plug in when you don't have enough information about a state?
  - if you don't know the reward at a particular state: plug in the smallest reward (Rmin)? plug in the largest reward (Rmax)?
  - what if you don't know a particular transition probability?

Some challenges in model-based RL 2: Exploration-Exploitation tradeoff
- A state may be very hard to reach
  - you can waste a lot of time trying to learn its rewards and transitions
  - after much effort, the state may turn out to be useless
- A strong advantage of the model-based approach:
  - you know which states have poor estimates of rewards and transitions
  - you can (try to) plan to reach those states
  - you have a good estimate of how long it takes to get there

A surprisingly simple approach for model-based RL – The Rmax algorithm [Brafman & Tennenholtz]
- Optimism in the face of uncertainty!
  - a heuristic shown to be useful long before the theory was worked out (e.g., Kaelbling '90)
- If you don't know the reward for a particular state-action pair, set it to Rmax!
- If you don't know the transition probabilities P(x'|x,a) from some state-action pair (x,a), assume you go to a magic, fairytale new state x0!
  - R(x0,a) = Rmax
  - P(x0|x0,a) = 1

Understanding Rmax
- With Rmax you either:
  - explore: visit a state-action pair you don't know much about, because it seems to have lots of potential
  - exploit: spend all your time on known states; even if the unknown states were amazingly good, it's not worth visiting them
- Note: you never know whether you are exploring or exploiting!

Implicit Exploration-Exploitation Lemma
- Lemma: every T time steps, the agent either:
  - Exploits: achieves near-optimal reward for these T steps, or
  - Explores: with high probability, visits an unknown state-action pair and learns a little about it
- T is related to the mixing time of the Markov chain defined by the MDP
  - the time it takes to (approximately) forget where you started

The Rmax algorithm
- Initialization:
  - add state x0 to the MDP
  - R(x,a) = Rmax for all x,a
  - P(x0|x,a) = 1 for all x,a
  - all states (except x0) are unknown
- Repeat:
  - obtain a policy for the current MDP and execute it
  - for any visited state-action pair, set the reward function to the observed value
  - if some state-action pair (x,a) has been visited enough times to estimate P(x'|x,a):
    - update the transition probabilities P(x'|x,a) for (x,a) using the MLE
    - recompute the policy
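
Below is a minimal, illustrative sketch of this loop in Python, reusing the environment interface assumed earlier. The "known" threshold m, the value-iteration planner, and all names are assumptions of this sketch; how many visits are actually enough is exactly the question the next heading raises.

import numpy as np

def value_iteration(P, R, gamma=0.95, iters=200):
    """Plan on a tabular MDP. P: [A, S, S] transition probs, R: [S, A] rewards."""
    V = np.zeros(P.shape[1])
    for _ in range(iters):
        Q = R + gamma * np.einsum('ask,k->sa', P, V)   # Q[s, a]
        V = Q.max(axis=1)
    return Q.argmax(axis=1)                            # greedy policy per state

def rmax(env, S, A, R_max, m=10, gamma=0.95, steps=10000):
    """Sketch of Rmax: optimistic model with a fictitious 'fairytale' state x0."""
    x0 = S                                    # extra state index
    P = np.zeros((A, S + 1, S + 1))
    P[:, :, x0] = 1.0                         # unknown (x,a): go to x0; x0 is absorbing
    R = np.full((S + 1, A), R_max)            # unknown rewards set to Rmax
    counts = np.zeros((S + 1, A, S + 1))
    visits = np.zeros((S + 1, A))
    policy = value_iteration(P, R, gamma)

    x = env.reset()
    for _ in range(steps):
        a = policy[x]
        x_next, r = env.step(a)               # execute the current policy in the world
        R[x, a] = r                           # visited (x,a): record the observed reward
        visits[x, a] += 1
        counts[x, a, x_next] += 1
        if visits[x, a] == m:                 # (x,a) has been visited enough times
            P[a, x, :] = counts[x, a, :] / m  # MLE transition estimate for (x,a)
            policy = value_iteration(P, R, gamma)   # recompute the policy
        x = x_next
    return policy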

Visit enough times to estimate P(x'|x,a)?