Reinforcement Learning
Machine Learning – 10701/15781
Carlos Guestrin
Carnegie Mellon University
May 3rd, 2006

Reading: Kaelbling et al. 1996 (see class website)

Outline:
- Reinforcement Learning
- Announcements
- Formalizing the (online) reinforcement learning problem
- The "Credit Assignment" Problem
- Exploration-Exploitation tradeoff
- Two main reinforcement learning approaches
- Rmax – A model-based approach
- Given a dataset – learn model
- Some challenges in model-based RL 1: Planning with insufficient information
- Some challenges in model-based RL 2: Exploration-Exploitation tradeoff
- A surprisingly simple approach for model-based RL – The Rmax algorithm [Brafman & Tennenholtz]
- Understanding Rmax
- Implicit Exploration-Exploitation Lemma
- The Rmax algorithm
- Visit enough times to estimate P(x'|x,a)?
- Putting it all together
- Problems with model-based approach
- TD-Learning and Q-learning – Model-free approaches
- Value of Policy
- A simple Monte-Carlo policy evaluation
- Problems with Monte-Carlo approach
- Reusing trajectories
- Simple fix: Temporal Difference (TD) Learning [Sutton '84]
- TD converges (can take a long time!!!)
- Using TD for Control
- Problems with TD
- Another model-free RL approach: Q-learning [Watkins & Dayan '92]
- Recall Value Iteration
- Q-learning
- Q-learning convergence
- The curse of dimensionality: A significant challenge in MDPs and RL
- Addressing the curse!
- What you need to know about RL
- Big Picture
- What you have learned this semester
- BIG PICTURE
- What next?

Announcements

- Project poster session: Friday May 5th, 2-5pm, NSH Atrium
  - please arrive a little early to set up
  - posterboards, easels, and pins provided
  - class divided into two shifts so you can see the other posters
- FCEs!!!! Please, please, please, please, please, please give us your feedback, it helps us improve the class! ☺
  http://www.cmu.edu/fce

Formalizing the (online) reinforcement learning problem

- Given a set of states X and actions A
  - in some versions of the problem, the sizes of X and A are unknown
- Interact with the world at each time step t:
  - world gives state x_t and reward r_t
  - you give next action a_t
- Goal: (quickly) learn a policy that (approximately) maximizes long-term expected discounted reward
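The slides leave this goal informal; written out in the standard notation from the MDP lectures (my addition, not from the slide), the objective is

```latex
\max_{\pi} \; \mathbb{E}\!\left[\, \sum_{t=0}^{\infty} \gamma^{t} r_t \;\middle|\; \pi \right],
\qquad 0 \le \gamma < 1,
```

where γ is the discount factor and the expectation is over trajectories generated by following policy π.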
The "Credit Assignment" Problem

A sample trajectory:
- I'm in state 43, reward = 0, action = 2
- I'm in state 39, reward = 0, action = 4
- I'm in state 22, reward = 0, action = 1
- I'm in state 21, reward = 0, action = 1
- I'm in state 21, reward = 0, action = 1
- I'm in state 13, reward = 0, action = 2
- I'm in state 54, reward = 0, action = 2
- I'm in state 26, reward = 100

Yippee! I got to a state with a big reward! But which of my actions along the way actually helped me get there?? This is the Credit Assignment problem.

Exploration-Exploitation tradeoff

- You have visited part of the state space and found a reward of 100
  - is this the best I can hope for???
- Exploitation: should I stick with what I know and find a good policy w.r.t. this knowledge?
  - at the risk of missing out on some large reward somewhere
- Exploration: should I look for a region with more reward?
  - at the risk of wasting my time or collecting a lot of negative reward

Two main reinforcement learning approaches

- Model-based approaches:
  - explore environment → learn model (P(x'|x,a) and R(x,a)) (almost) everywhere
  - use model to plan policy, MDP-style
  - approach leads to strongest theoretical results
  - works quite well in practice when state space is manageable
- Model-free approaches:
  - don't learn a model → learn value function or policy directly
  - leads to weaker theoretical results
  - often works well when state space is large

Rmax – A model-based approach

Reading: Brafman & Tennenholtz 2002 (see class website)

Given a dataset – learn model

- Given data, learn an (MDP) representation:
  - Dataset: observed experience tuples (x, a, r, x')
  - Learn reward function: R(x,a)
  - Learn transition model: P(x'|x,a)
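The estimation itself is simple counting. Here is a minimal sketch of the maximum-likelihood estimates, assuming the dataset is a list of (x, a, r, x_next) tuples; the function and variable names are mine, not from the lecture:

```python
from collections import defaultdict

def learn_model(dataset):
    """MLE estimates of R(x,a) and P(x'|x,a) from observed transitions."""
    trans_counts = defaultdict(lambda: defaultdict(int))  # (x,a) -> {x': count}
    reward_sum = defaultdict(float)                       # (x,a) -> sum of rewards
    visits = defaultdict(int)                             # (x,a) -> visit count

    for x, a, r, x_next in dataset:
        trans_counts[(x, a)][x_next] += 1
        reward_sum[(x, a)] += r
        visits[(x, a)] += 1

    # Reward estimate: empirical mean of rewards observed at (x,a).
    R_hat = {sa: reward_sum[sa] / n for sa, n in visits.items()}
    # Transition estimate: empirical frequency count(x,a,x') / count(x,a).
    P_hat = {sa: {xn: c / visits[sa] for xn, c in nxt.items()}
             for sa, nxt in trans_counts.items()}
    return R_hat, P_hat
```

These empirical frequencies are exactly what gets plugged into the model once a state-action pair has been visited often enough, as the Rmax slides below spell out.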
Some challenges in model-based RL 1: Planning with insufficient information

- Model-based approach:
  - estimate R(x,a) & P(x'|x,a)
  - obtain policy by value or policy iteration, or linear programming
  - no credit assignment problem → when learning a model, the planning algorithm takes care of "assigning" credit
- What do you plug in when you don't have enough information about a state?
  - don't know the reward at a particular state?
    - plug in the smallest reward (Rmin)?
    - plug in the largest reward (Rmax)?
  - don't know a particular transition probability?

Some challenges in model-based RL 2: Exploration-Exploitation tradeoff

- A state may be very hard to reach
  - waste a lot of time trying to learn rewards and transitions for this state
  - after much effort, the state may turn out to be useless
- A strong advantage of a model-based approach:
  - you know for which states the estimates of rewards and transitions are bad
  - can (try to) plan to reach these states
  - have a good estimate of how long it takes to get there

A surprisingly simple approach for model-based RL – The Rmax algorithm [Brafman & Tennenholtz]

- Optimism in the face of uncertainty!!!!
  - heuristic shown to be useful long before the theory was done (e.g., Kaelbling '90)
- If you don't know the reward for a particular state-action pair, set it to Rmax!!!
- If you don't know the transition probabilities P(x'|x,a) from some state-action pair x,a, assume you go to a magic, fairytale new state x0!!!
  - R(x0,a) = Rmax
  - P(x0|x0,a) = 1

Understanding Rmax

- With Rmax you either:
  - explore – visit a state-action pair you don't know much about, because it seems to have lots of potential
  - exploit – spend all your time on known states; even if unknown states were amazingly good, it's not worth it
- Note: you never know if you are exploring or exploiting!!!

Implicit Exploration-Exploitation Lemma

- Lemma: every T time steps, either:
  - Exploits: achieves near-optimal reward for these T steps, or
  - Explores: with high probability, the agent visits an unknown state-action pair
    - learns a little about an unknown state
- T is related to the mixing time of the Markov chain defined by the MDP
  - the time it takes to (approximately) forget where you started

The Rmax algorithm

- Initialization:
  - add state x0 to the MDP
  - R(x,a) = Rmax, ∀x,a
  - P(x0|x,a) = 1, ∀x,a
  - all states (except for x0) are unknown
- Repeat:
  - obtain policy for the current MDP and execute it
  - for any visited state-action pair, set the reward function to the appropriate value
  - if some state-action pair x,a has been visited enough times to estimate P(x'|x,a):
    - update transition probs. P(x'|x,a) for x,a using MLE
    - recompute policy
  (a code sketch of this loop follows the next slide)

Visit enough times to estimate P(x'|x,a)?
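A standard way to make "enough times" precise, offered here as an assumption rather than a reconstruction of the slide, is a Chernoff-Hoeffding bound. For a fixed triple (x, a, x') and m visits to (x, a),

```latex
\Pr\!\left( \left| \hat{P}(x' \mid x, a) - P(x' \mid x, a) \right| > \epsilon \right)
\;\le\; 2\,e^{-2 m \epsilon^{2}},
```

so a union bound over all |X|²|A| triples says that m = O((1/ε²) log(|X|²|A|/δ)) visits per state-action pair make every transition estimate ε-accurate with probability at least 1−δ.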
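Combining "The Rmax algorithm" slide with a concrete visit threshold m, here is a minimal runnable sketch. The environment interface `step(x, a) -> (r, x_next)`, the start state 0, and all names are illustrative assumptions on my part; the slides do not commit to a particular planner, so value iteration stands in for the "obtain policy" step.

```python
import numpy as np

def rmax_agent(step, n_states, n_actions,
               rmax=1.0, m=50, gamma=0.95, horizon=10_000):
    """Sketch of Rmax [Brafman & Tennenholtz]; assumed interface, not the
    authors' code. m is the "visited enough times" threshold."""
    x0 = n_states                       # the fictitious, optimistic fairytale state
    S = n_states + 1
    # Initialization from the slide: R(x,a) = Rmax and P(x0|x,a) = 1 for all x,a,
    # so every unknown pair looks like a gateway to eternal maximum reward.
    P = np.zeros((S, n_actions, S))
    P[:, :, x0] = 1.0
    R = np.full((S, n_actions), rmax)
    counts = np.zeros((n_states, n_actions, n_states))
    visits = np.zeros((n_states, n_actions), dtype=int)

    def plan():
        """Value iteration on the current optimistic MDP."""
        V = np.zeros(S)
        for _ in range(500):
            Q = R + gamma * (P @ V)     # (S, A) array of Q-values
            V = Q.max(axis=1)
        return Q.argmax(axis=1)         # greedy policy

    policy, x = plan(), 0               # assumption: we start in state 0
    for _ in range(horizon):
        a = policy[x]
        r, x_next = step(x, a)
        if visits[x, a] < m:            # still unknown: keep learning this pair
            counts[x, a, x_next] += 1
            visits[x, a] += 1
            if visits[x, a] == 1:
                R[x, a] = r             # visited once: set reward to observed value
            if visits[x, a] == m:       # pair becomes "known": plug in the MLE
                P[x, a, :] = 0.0
                P[x, a, :n_states] = counts[x, a] / m
            if visits[x, a] in (1, m):  # model changed: recompute policy
                policy = plan()
        x = x_next
    return policy
```

Because unknown pairs promise Rmax forever, the greedy policy is automatically drawn toward them until they become known, which is exactly the implicit exploration-exploitation behavior the lemma above describes.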