Reinforcement Learning
Machine Learning – 10701/15781
Carlos Guestrin
Carnegie Mellon University
May 3rd, 2006

Reading: Kaelbling et al. 1996 (see class website)

Outline:
- Reinforcement Learning
- Announcements
- Formalizing the (online) reinforcement learning problem
- The "Credit Assignment" Problem
- Exploration-Exploitation tradeoff
- Two main reinforcement learning approaches
- Rmax – A model-based approach
- Given a dataset – learn model
- Some challenges in model-based RL 1: Planning with insufficient information
- Some challenges in model-based RL 2: Exploration-Exploitation tradeoff
- A surprisingly simple approach for model-based RL – The Rmax algorithm [Brafman & Tennenholtz]
- Understanding Rmax
- Implicit Exploration-Exploitation Lemma
- The Rmax algorithm
- Visit enough times to estimate P(x'|x,a)?
- Putting it all together
- Problems with model-based approach
- TD-Learning and Q-learning – Model-free approaches
- Value of Policy
- A simple Monte-Carlo policy evaluation
- Problems with Monte-Carlo approach
- Reusing trajectories
- Simple fix: Temporal Difference (TD) Learning [Sutton '84]
- TD converges (can take a long time!!!)
- Using TD for Control
- Problems with TD
- Another model-free RL approach: Q-learning [Watkins & Dayan '92]
- Recall Value Iteration
- Q-learning
- Q-learning convergence
- The curse of dimensionality: A significant challenge in MDPs and RL
- Addressing the curse!
- What you need to know about RL
- Big Picture
- What you have learned this semester
- BIG PICTURE
- What next?

Announcements

- Project poster session: Friday May 5th, 2-5pm, NSH Atrium
  - please arrive a little early to set up
  - posterboards, easels, and pins provided
  - class divided into two shifts so you can see the other posters
- FCEs!!!! Please, please, please, please, please, please give us your feedback, it helps us improve the class! ☺
  http://www.cmu.edu/fce

Formalizing the (online) reinforcement learning problem

- Given a set of states X and actions A
  - in some versions of the problem, the sizes of X and A are unknown
- Interact with the world at each time step t:
  - world gives state x_t and reward r_t
  - you give next action a_t
- Goal: (quickly) learn a policy that (approximately) maximizes long-term expected discounted reward
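The slides leave this goal informal; written out in the standard notation from the MDP lectures (my addition, not from the slide), the objective is

```latex
\max_{\pi} \; \mathbb{E}\!\left[\, \sum_{t=0}^{\infty} \gamma^{t} r_t \;\middle|\; \pi \right],
\qquad 0 \le \gamma < 1,
```

where γ is the discount factor and the expectation is over trajectories generated by following policy π.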
The "Credit Assignment" Problem

A sample trajectory:
- I'm in state 43, reward = 0, action = 2
- I'm in state 39, reward = 0, action = 4
- I'm in state 22, reward = 0, action = 1
- I'm in state 21, reward = 0, action = 1
- I'm in state 21, reward = 0, action = 1
- I'm in state 13, reward = 0, action = 2
- I'm in state 54, reward = 0, action = 2
- I'm in state 26, reward = 100

Yippee! I got to a state with a big reward! But which of my actions along the way actually helped me get there?? This is the Credit Assignment problem.

Exploration-Exploitation tradeoff

- You have visited part of the state space and found a reward of 100
  - is this the best I can hope for???
- Exploitation: should I stick with what I know and find a good policy w.r.t. this knowledge?
  - at the risk of missing out on some large reward somewhere
- Exploration: should I look for a region with more reward?
  - at the risk of wasting my time or collecting a lot of negative reward

Two main reinforcement learning approaches

- Model-based approaches:
  - explore environment → learn model (P(x'|x,a) and R(x,a)) (almost) everywhere
  - use model to plan policy, MDP-style
  - approach leads to strongest theoretical results
  - works quite well in practice when state space is manageable
- Model-free approaches:
  - don't learn a model → learn value function or policy directly
  - leads to weaker theoretical results
  - often works well when state space is large

Rmax – A model-based approach

Reading: Brafman & Tennenholtz 2002 (see class website)

Given a dataset – learn model

- Given data, learn an (MDP) representation:
  - Dataset: observed experience tuples (x, a, r, x')
  - Learn reward function: R(x,a)
  - Learn transition model: P(x'|x,a)
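The estimation itself is simple counting. Here is a minimal sketch of the maximum-likelihood estimates, assuming the dataset is a list of (x, a, r, x_next) tuples; the function and variable names are mine, not from the lecture:

```python
from collections import defaultdict

def learn_model(dataset):
    """MLE estimates of R(x,a) and P(x'|x,a) from observed transitions."""
    trans_counts = defaultdict(lambda: defaultdict(int))  # (x,a) -> {x': count}
    reward_sum = defaultdict(float)                       # (x,a) -> sum of rewards
    visits = defaultdict(int)                             # (x,a) -> visit count

    for x, a, r, x_next in dataset:
        trans_counts[(x, a)][x_next] += 1
        reward_sum[(x, a)] += r
        visits[(x, a)] += 1

    # Reward estimate: empirical mean of rewards observed at (x,a).
    R_hat = {sa: reward_sum[sa] / n for sa, n in visits.items()}
    # Transition estimate: empirical frequency count(x,a,x') / count(x,a).
    P_hat = {sa: {xn: c / visits[sa] for xn, c in nxt.items()}
             for sa, nxt in trans_counts.items()}
    return R_hat, P_hat
```

These empirical frequencies are exactly what gets plugged into the model once a state-action pair has been visited often enough, as the Rmax slides below spell out.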
Some challenges in model-based RL 1: Planning with insufficient information

- Model-based approach:
  - estimate R(x,a) & P(x'|x,a)
  - obtain policy by value or policy iteration, or linear programming
  - no credit assignment problem → when learning a model, the planning algorithm takes care of "assigning" credit
- What do you plug in when you don't have enough information about a state?
  - don't know the reward at a particular state?
    - plug in the smallest reward (Rmin)?
    - plug in the largest reward (Rmax)?
  - don't know a particular transition probability?

Some challenges in model-based RL 2: Exploration-Exploitation tradeoff

- A state may be very hard to reach
  - waste a lot of time trying to learn rewards and transitions for this state
  - after much effort, the state may turn out to be useless
- A strong advantage of a model-based approach:
  - you know for which states the estimates of rewards and transitions are bad
  - can (try to) plan to reach these states
  - have a good estimate of how long it takes to get there

A surprisingly simple approach for model-based RL – The Rmax algorithm [Brafman & Tennenholtz]

- Optimism in the face of uncertainty!!!!
  - heuristic shown to be useful long before the theory was done (e.g., Kaelbling '90)
- If you don't know the reward for a particular state-action pair, set it to Rmax!!!
- If you don't know the transition probabilities P(x'|x,a) from some state-action pair x,a, assume you go to a magic, fairytale new state x0!!!
  - R(x0,a) = Rmax
  - P(x0|x0,a) = 1

Understanding Rmax

- With Rmax you either:
  - explore – visit a state-action pair you don't know much about, because it seems to have lots of potential
  - exploit – spend all your time on known states; even if unknown states were amazingly good, it's not worth it
- Note: you never know if you are exploring or exploiting!!!

Implicit Exploration-Exploitation Lemma

- Lemma: every T time steps, either:
  - Exploits: achieves near-optimal reward for these T steps, or
  - Explores: with high probability, the agent visits an unknown state-action pair
    - learns a little about an unknown state
- T is related to the mixing time of the Markov chain defined by the MDP
  - the time it takes to (approximately) forget where you started

The Rmax algorithm

- Initialization:
  - add state x0 to the MDP
  - R(x,a) = Rmax, ∀x,a
  - P(x0|x,a) = 1, ∀x,a
  - all states (except for x0) are unknown
- Repeat:
  - obtain policy for the current MDP and execute it
  - for any visited state-action pair, set the reward function to the appropriate value
  - if some state-action pair x,a has been visited enough times to estimate P(x'|x,a):
    - update transition probs. P(x'|x,a) for x,a using MLE
    - recompute policy
  (a code sketch of this loop follows the next slide)

Visit enough times to estimate P(x'|x,a)?
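A standard way to make "enough times" precise, offered here as an assumption rather than a reconstruction of the slide, is a Chernoff-Hoeffding bound. For a fixed triple (x, a, x') and m visits to (x, a),

```latex
\Pr\!\left( \left| \hat{P}(x' \mid x, a) - P(x' \mid x, a) \right| > \epsilon \right)
\;\le\; 2\,e^{-2 m \epsilon^{2}},
```

so a union bound over all |X|²|A| triples says that m = O((1/ε²) log(|X|²|A|/δ)) visits per state-action pair make every transition estimate ε-accurate with probability at least 1−δ.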
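Combining "The Rmax algorithm" slide with a concrete visit threshold m, here is a minimal runnable sketch. The environment interface `step(x, a) -> (r, x_next)`, the start state 0, and all names are illustrative assumptions on my part; the slides do not commit to a particular planner, so value iteration stands in for the "obtain policy" step.

```python
import numpy as np

def rmax_agent(step, n_states, n_actions,
               rmax=1.0, m=50, gamma=0.95, horizon=10_000):
    """Sketch of Rmax [Brafman & Tennenholtz]; assumed interface, not the
    authors' code. m is the "visited enough times" threshold."""
    x0 = n_states                       # the fictitious, optimistic fairytale state
    S = n_states + 1
    # Initialization from the slide: R(x,a) = Rmax and P(x0|x,a) = 1 for all x,a,
    # so every unknown pair looks like a gateway to eternal maximum reward.
    P = np.zeros((S, n_actions, S))
    P[:, :, x0] = 1.0
    R = np.full((S, n_actions), rmax)
    counts = np.zeros((n_states, n_actions, n_states))
    visits = np.zeros((n_states, n_actions), dtype=int)

    def plan():
        """Value iteration on the current optimistic MDP."""
        V = np.zeros(S)
        for _ in range(500):
            Q = R + gamma * (P @ V)     # (S, A) array of Q-values
            V = Q.max(axis=1)
        return Q.argmax(axis=1)         # greedy policy

    policy, x = plan(), 0               # assumption: we start in state 0
    for _ in range(horizon):
        a = policy[x]
        r, x_next = step(x, a)
        if visits[x, a] < m:            # still unknown: keep learning this pair
            counts[x, a, x_next] += 1
            visits[x, a] += 1
            if visits[x, a] == 1:
                R[x, a] = r             # visited once: set reward to observed value
            if visits[x, a] == m:       # pair becomes "known": plug in the MLE
                P[x, a, :] = 0.0
                P[x, a, :n_states] = counts[x, a] / m
            if visits[x, a] in (1, m):  # model changed: recompute policy
                policy = plan()
        x = x_next
    return policy
```

Because unknown pairs promise Rmax forever, the greedy policy is automatically drawn toward them until they become known, which is exactly the implicit exploration-exploitation behavior the lemma above describes.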