Inverse Reinforcement Learning
Pieter Abbeel, UC Berkeley EECS

High-level picture
- Dynamics model T: probability distribution over next states given the current state and action.
- Reward function R: describes the desirability of being in a state.
- Reinforcement learning / optimal control: given T and R, compute a controller/policy π* that prescribes the action to take in each state.
- Inverse RL: given π* and T, can we recover R? More generally, given execution traces, can we recover R?

Motivation for inverse RL
- Scientific inquiry: model animal and human behavior, e.g., bee foraging, songbird vocalization. [See the introduction of Ng and Russell, 2000 for a brief overview.]
- Apprenticeship learning / imitation learning through inverse RL. Presupposition: the reward function provides the most succinct and transferable definition of the task. This has enabled advancing the state of the art in various robotic domains.
- Modeling of other agents, both adversarial and cooperative.

Lecture outline
- Examples of apprenticeship learning via inverse RL
- Inverse RL vs. behavioral cloning
- Historical sketch of inverse RL
- Mathematical formulations for inverse RL
- Case studies
- Trajectory-based reward, with application to autonomous helicopter flight

Examples of apprenticeship learning via inverse RL
- Simulated highway driving: Abbeel and Ng, ICML 2004; Syed and Schapire, NIPS 2007
- Aerial imagery based navigation: Ratliff, Bagnell and Zinkevich, ICML 2006
- Parking lot navigation: Abbeel, Dolgov, Ng and Thrun, IROS 2008
- Urban navigation (what is the reward function for urban navigation? destination prediction): Ziebart, Maas, Bagnell and Dey, AAAI 2008
- Quadruped locomotion: Ratliff, Bradley, Bagnell and Chestnutt, NIPS 2007; Kolter, Abbeel and Ng, NIPS 2008

Problem setup
- Input: state space, action space; transition model P_sa(s_{t+1} | s_t, a_t); no reward function.
- Teacher's demonstration: s_0, a_0, s_1, a_1, s_2, a_2, ... (= trace of the teacher's policy π*).
- Inverse RL: can we recover R?
- Apprenticeship learning via inverse RL: can we then use this R to find a good policy?
- Behavioral cloning: can we directly learn the teacher's policy using supervised learning?

Behavioral cloning
- Formulate as a standard machine learning problem: fix a policy class (e.g., support vector machine, neural network, decision tree, deep belief net, ...) and estimate a policy (= mapping from states to actions) from the training examples (s_0, a_0), (s_1, a_1), (s_2, a_2), ...
- Two of the most notable success stories:
  - Pomerleau, NIPS 1989: ALVINN
  - Sammut et al., ICML 1992: learning to fly (flight simulator)

Inverse RL vs. behavioral cloning
- Which has the most succinct description: π* or R*?
- Especially in planning-oriented tasks, the reward function is often much more succinct than the optimal policy.
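To make the supervised-learning view concrete, here is a minimal behavioral-cloning sketch. The 1-D corridor task, the expert trace, and the nearest-neighbor policy class are all hypothetical choices made for illustration; they are not from the lecture.

```python
# Minimal behavioral-cloning sketch: learn a policy directly from the
# teacher's (state, action) pairs. The corridor task and nearest-neighbor
# policy class are illustrative assumptions, not from the lecture.

def fit_nearest_neighbor_policy(demos):
    """Supervised learning in its simplest form: memorize the demonstrated
    (state, action) pairs and, for a query state, copy the action of the
    closest demonstrated state."""
    def policy(s):
        _, action = min(demos, key=lambda pair: abs(pair[0] - s))
        return action
    return policy

# Expert trace on a 1-D corridor: move right until state 5, then stay.
expert_trace = [(0, "right"), (1, "right"), (2, "right"),
                (3, "right"), (4, "right"), (5, "stay")]

policy = fit_nearest_neighbor_policy(expert_trace)
print(policy(2))  # a demonstrated state -> right
print(policy(7))  # generalizes via nearest neighbor -> stay
```

Note that the learned policy only mimics state-to-action mappings; unlike the inverse RL route, it recovers no reward function that could transfer to a task with different dynamics.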
Inverse RL history
- 1964: Kalman posed the inverse optimal control problem and solved it in the 1D input case.
- 1994: Boyd et al. gave a linear matrix inequality (LMI) characterization for the general linear quadratic setting.
- 2000: Ng and Russell gave the first MDP formulation, pointed out the reward function ambiguity, and suggested a few solutions.
- 2004: Abbeel and Ng, inverse RL for apprenticeship learning (reward feature matching).
- 2006: Ratliff et al., max margin formulation.
- 2007: Ratliff et al., max margin with boosting, enabling a large vocabulary of reward features.
- 2007: Ramachandran and Amir, and Neu and Szepesvari, the reward function as a characterization of a policy class.
- 2008: Kolter, Abbeel and Ng, hierarchical max margin.
- 2008: Syed and Schapire, feature matching + game-theoretic formulation.
- 2008: Ziebart et al., feature matching + max entropy.
- 2008: Abbeel et al., feature matching, applied to learning parking lot navigation style.

Open questions: active inverse RL? Inverse RL w.r.t. minmax control, partial observability, the learning stage (rather than observing the optimal policy), ...?

Mathematical formulations for inverse RL
Three broad categories of formalizations, all aiming to find a reward function R* which explains the expert behaviour:
- Max margin
- Feature expectation matching
- Interpret the reward function as a parameterization of a policy class
Basic principle
Find R* such that

E[\sum_{t=0}^{\infty} \gamma^t R^*(s_t) \mid \pi^*] \geq E[\sum_{t=0}^{\infty} \gamma^t R^*(s_t) \mid \pi] \quad \forall \pi.

This is in fact a convex feasibility problem, but it poses many challenges:
- R = 0 is a solution; more generally, the reward function is ambiguous.
- We typically only observe expert traces rather than the entire expert policy π*, so how do we compute the left-hand side?
- It assumes the expert is indeed optimal; otherwise the problem is infeasible.
- Computationally, it assumes we can enumerate all policies.

Feature-based reward function
Let R(s) = w^\top \phi(s), where w \in \mathbb{R}^n and \phi : S \to \mathbb{R}^n. Then

E[\sum_{t=0}^{\infty} \gamma^t R(s_t) \mid \pi] = E[\sum_{t=0}^{\infty} \gamma^t w^\top \phi(s_t) \mid \pi] = w^\top E[\sum_{t=0}^{\infty} \gamma^t \phi(s_t) \mid \pi] = w^\top \mu(\pi),

where \mu(\pi) = E[\sum_{t=0}^{\infty} \gamma^t \phi(s_t) \mid \pi] is the expected cumulative discounted sum of feature values, or "feature expectations."

Substituting into the basic principle gives: find w^* such that w^{*\top} \mu(\pi^*) \geq w^{*\top} \mu(\pi) \ \forall \pi.

- Feature expectations can be readily estimated from sample trajectories.
- The number of expert demonstrations
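Since feature expectations can be estimated from sample trajectories, a short sketch may help: a Monte Carlo estimate of μ(π) as the average discounted feature sum over demonstrated state trajectories. The two-feature φ and the toy trajectories below are made-up assumptions for illustration only.

```python
# Sketch: Monte Carlo estimate of feature expectations,
#   mu(pi) ~ (1/m) * sum_i sum_t gamma^t * phi(s_t^(i)),
# from m sample trajectories. phi and the trajectories are illustrative.

def feature_expectations(trajectories, phi, gamma=0.9):
    """Average the discounted feature sums over the given state trajectories."""
    n = len(phi(trajectories[0][0]))
    mu = [0.0] * n
    for traj in trajectories:
        for t, s in enumerate(traj):
            for k, f in enumerate(phi(s)):
                mu[k] += (gamma ** t) * f
    return [v / len(trajectories) for v in mu]

# Hypothetical binary features: phi(s) = [reached goal?, hit hazard?]
def phi(s):
    return [1.0 if s == "goal" else 0.0, 1.0 if s == "hazard" else 0.0]

expert_trajs = [["start", "road", "goal"],   # reaches goal at t = 2
                ["start", "goal"]]           # reaches goal at t = 1
mu_expert = feature_expectations(expert_trajs, phi)
# (0.9**2 + 0.9**1) / 2, so mu_expert is approximately [0.855, 0.0]

# With R(s) = w^T phi(s), the expert's value is simply w^T mu(pi*):
w = [1.0, -1.0]
value_expert = sum(wk * mk for wk, mk in zip(w, mu_expert))
```

This is why the left-hand side of the basic principle becomes tractable: we never need the full policy π*, only enough demonstrated trajectories to estimate the n-dimensional vector μ(π*) accurately.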