Berkeley COMPSCI 287 - Inverse Reinforcement Learning

Inverse Reinforcement Learning
Pieter Abbeel, UC Berkeley EECS

High-level picture
- Dynamics model T: probability distribution over next states given the current state and action.
- Reward function R: describes the desirability of being in a state.
- Reinforcement learning / optimal control: given T and R, compute a controller/policy π* that prescribes the action to take in each state.
- Inverse RL: given π* and T, can we recover R? More generally, given execution traces, can we recover R?

Motivation for inverse RL
- Scientific inquiry: model animal and human behavior, e.g., bee foraging, songbird vocalization. [See the intro of Ng and Russell, 2000 for a brief overview.]
- Apprenticeship learning / imitation learning through inverse RL. Presupposition: the reward function provides the most succinct and transferable definition of the task. This has enabled advancing the state of the art in various robotic domains.
- Modeling of other agents, both adversarial and cooperative.

Lecture outline
- Examples of apprenticeship learning via inverse RL
- Inverse RL vs. behavioral cloning
- Historical sketch of inverse RL
- Mathematical formulations for inverse RL
- Case studies: trajectory-based reward, with application to autonomous helicopter flight

Examples of apprenticeship learning via inverse RL
- Simulated highway driving: Abbeel and Ng, ICML 2004; Syed and Schapire, NIPS 2007
- Aerial imagery based navigation: Ratliff, Bagnell and Zinkevich, ICML 2006
- Parking lot navigation: Abbeel, Dolgov, Ng and Thrun, IROS 2008
- Urban navigation: Ziebart, Maas, Bagnell and Dey, AAAI 2008
- Quadruped locomotion: Ratliff, Bradley, Bagnell and Chestnutt, NIPS 2007; Kolter, Abbeel and Ng, NIPS 2008

Urban navigation
- What is the reward function for urban navigation? Ziebart, Maas, Bagnell and Dey, AAAI 2008; also applied to destination prediction.

Problem setup
- Input: state space and action space; transition model P_sa(s_{t+1} | s_t, a_t); no reward function; the teacher's demonstration s_0, a_0, s_1, a_1, s_2, a_2, ... (= a trace of the teacher's policy π*).
- Inverse RL: can we recover R?
- Apprenticeship learning via inverse RL: can we then use this R to find a good policy?
- Behavioral cloning: can we directly learn the teacher's policy using supervised learning?

Behavioral cloning
- Formulate as a standard machine learning problem: fix a policy class (e.g., support vector machine, neural network, decision tree, deep belief net, ...) and estimate a policy (= a mapping from states to actions) from the training examples (s_0, a_0), (s_1, a_1), (s_2, a_2), ... (a minimal sketch follows below).
- Two of the most notable success stories: Pomerleau, NIPS 1989: ALVINN; Sammut et al., ICML 1992: learning to fly (flight simulator).
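Viewed this way, behavioral cloning reduces to ordinary supervised learning on the demonstration pairs. The sketch below is an illustration and not part of the original slides: it assumes demonstrations arrive as an array of state features plus an array of discrete action labels, and it uses scikit-learn's logistic regression purely as a stand-in for whichever policy class (SVM, neural network, decision tree, ...) one actually chooses.

    from sklearn.linear_model import LogisticRegression

    def behavioral_cloning(demo_states, demo_actions):
        """Fit a policy pi: state -> action by supervised learning on teacher demonstrations.

        demo_states  : (N, d) array of state features for s_0, s_1, s_2, ...
        demo_actions : (N,)   array of discrete action labels a_0, a_1, a_2, ...
        """
        # Fix a policy class (here a logistic-regression classifier) and fit it to the (s, a) pairs.
        policy = LogisticRegression(max_iter=1000)
        policy.fit(demo_states, demo_actions)
        return policy

    # Hypothetical usage, acting with the cloned policy in a new state s (a length-d vector):
    #   pi = behavioral_cloning(states, actions)
    #   action = pi.predict(s.reshape(1, -1))[0]

Any estimator with fit/predict would slot in the same way; the particular policy class here is illustrative only.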
Inverse RL vs. behavioral cloning
- Which has the more succinct description: π* or R*? Especially in planning-oriented tasks, the reward function is often much more succinct than the optimal policy.

Inverse RL history
- 1964, Kalman: posed the inverse optimal control problem and solved it in the 1D input case.
- 1994, Boyd et al.: a linear matrix inequality (LMI) characterization for the general linear quadratic setting.
- 2000, Ng and Russell: first MDP formulation; reward function ambiguity pointed out and a few solutions suggested.
- 2004, Abbeel and Ng: inverse RL for apprenticeship learning via reward feature matching.
- 2006, Ratliff et al.: max margin formulation.
- 2007, Ratliff et al.: max margin with boosting, enabling a large vocabulary of reward features.
- 2007, Ramachandran and Amir; Neu and Szepesvari: the reward function as a characterization of a policy class.
- 2008, Kolter, Abbeel and Ng: hierarchical max margin.
- 2008, Syed and Schapire: feature matching + game-theoretic formulation.
- 2008, Ziebart et al.: feature matching + max entropy.
- 2008, Abbeel et al.: feature matching, applied to learning parking lot navigation style.
- Open questions: active inverse RL? Inverse RL w.r.t. minmax control, partial observability, an expert still in the learning stage (rather than observing an optimal policy), ... ?

Mathematical formulations for inverse RL
Three broad categories of formalizations:
- Max margin
- Feature expectation matching
- Interpreting the reward function as a parameterization of a policy class

Basic principle
- Find a reward function R* which explains the expert behaviour, i.e., find R* such that

  E\left[\sum_{t=0}^{\infty} \gamma^t R^*(s_t) \mid \pi^*\right] \ge E\left[\sum_{t=0}^{\infty} \gamma^t R^*(s_t) \mid \pi\right] \quad \forall \pi

- This is in fact a convex feasibility problem, but it poses many challenges:
  - R = 0 is a solution; more generally, the reward function is ambiguous.
  - We typically only observe expert traces rather than the entire expert policy π*, so how do we compute the left-hand side?
  - It assumes the expert is indeed optimal; otherwise the problem is infeasible.
  - Computationally, it assumes we can enumerate all policies.

Feature based reward function
- Let R(s) = w⊤φ(s), where w ∈ ℝ^n and φ : S → ℝ^n. Then

  E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t) \mid \pi\right] = E\left[\sum_{t=0}^{\infty} \gamma^t w^\top \phi(s_t) \mid \pi\right] = w^\top E\left[\sum_{t=0}^{\infty} \gamma^t \phi(s_t) \mid \pi\right] = w^\top \mu(\pi),

  where μ(π), the expected cumulative discounted sum of feature values, is called the feature expectations of π.
- Substituting into the inequality above gives: find w* such that

  w^{*\top} \mu(\pi^*) \ge w^{*\top} \mu(\pi) \quad \forall \pi

- Feature expectations can be readily estimated from sample trajectories (see the sketch below). The number of expert demonstrations …
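Since the expected discounted return under a linear reward is just w⊤μ(π), the formulations above only ever need feature-expectation vectors, and μ(π*) can be estimated by averaging discounted feature sums over the expert's sample trajectories. The snippet below is an illustrative Monte Carlo estimator under assumed conventions (a list of state sequences and a feature map phi); it is not code from the lecture.

    import numpy as np

    def estimate_feature_expectations(trajectories, phi, gamma=0.99):
        """Monte Carlo estimate of mu(pi) = E[ sum_t gamma^t phi(s_t) | pi ].

        trajectories : list of state sequences [s_0, s_1, ...] obtained by executing pi
                       (for mu(pi*), these are the expert's demonstration traces)
        phi          : feature map taking a state to a length-n vector
        gamma        : discount factor in [0, 1)
        """
        mu = None
        for traj in trajectories:
            # Discounted sum of feature vectors along one sampled trajectory.
            disc_sum = np.sum([gamma**t * np.asarray(phi(s)) for t, s in enumerate(traj)],
                              axis=0)
            mu = disc_sum if mu is None else mu + disc_sum
        return mu / len(trajectories)

    # Under R(s) = w^T phi(s), the expected discounted return of a policy is then simply
    # w @ estimate_feature_expectations(trajectories, phi), so evaluating a candidate w
    # reduces to comparing w^T mu(pi*) against w^T mu(pi).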

