CS 287: Advanced Robotics, Fall 2009
Lecture 16: Imitation Learning
Pieter Abbeel, UC Berkeley EECS

Behavioral cloning example: Tetris
- State: board configuration + shape of the falling piece (~2^200 states!)
- Action: rotation and translation applied to the falling piece
- 22 features, aka basis functions, φ_i:
  - Ten basis functions, 0, ..., 9, mapping the state to the height h[k] of each of the ten columns.
  - Nine basis functions, 10, ..., 18, each mapping the state to the absolute difference between heights of successive columns: |h[k+1] − h[k]|, k = 1, ..., 9.
  - One basis function, 19, mapping the state to the maximum column height: max_k h[k].
  - One basis function, 20, mapping the state to the number of "holes" in the board.
  - One basis function, 21, equal to 1 in every state.
- Value function: V(s) = Σ_{i=1}^{22} θ_i φ_i(s)
  [Bertsekas & Ioffe, 1996 (TD); Bertsekas & Tsitsiklis, 1996 (TD); Kakade, 2002 (policy gradient); Farias & Van Roy, 2006 (approximate LP)]

Training data
- Example choices of next states chosen by the demonstrator: s^(i)_+
- Alternative choices of next states that were available: s^(i)_{j−}

Max-margin formulation
    min_{θ, ξ≥0}  θ⊤θ + C Σ_{i,j} ξ_{ij}
    subject to  ∀i, ∀j:  θ⊤φ(s^(i)_+) ≥ θ⊤φ(s^(i)_{j−}) + 1 − ξ_{ij}

Probabilistic/logistic formulation
Assumes the expert chooses next state s^(i)_+ with probability
    exp(θ⊤φ(s^(i)_+)) / ( exp(θ⊤φ(s^(i)_+)) + Σ_j exp(θ⊤φ(s^(i)_{j−})) ).
Hence the maximum likelihood estimate is given by:
    max_θ  Σ_i log [ exp(θ⊤φ(s^(i)_+)) / ( exp(θ⊤φ(s^(i)_+)) + Σ_j exp(θ⊤φ(s^(i)_{j−})) ) ] − C θ⊤θ

One motivation for learning reward functions: scientific inquiry --- modeling animal and human behavior, e.g., bee foraging and songbird vocalization. [See the intro of Ng and Russell, 2000 for a brief overview.]
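The 22 Tetris basis functions listed above are straightforward to compute from a board occupancy grid. Below is a minimal sketch, assuming the board is a binary NumPy array with row 0 at the top; the function name and board encoding are illustrative choices, not from the lecture:

```python
import numpy as np

def tetris_features(board):
    """Compute the 22-dimensional feature vector phi(s) for a Tetris board.

    board: 2D 0/1 array of shape (rows, 10); row 0 is the top of the board.
    """
    rows, cols = board.shape
    heights = np.zeros(cols, dtype=int)
    holes = 0
    for k in range(cols):
        col = board[:, k]
        filled = np.nonzero(col)[0]
        if filled.size:
            top = filled[0]                      # topmost filled cell
            heights[k] = rows - top              # column height h[k]
            holes += int(np.sum(col[top:] == 0)) # empty cells below the top
    return np.concatenate([
        heights,                       # features 0..9:  column heights
        np.abs(np.diff(heights)),      # features 10..18: |h[k+1] - h[k]|
        [heights.max(), holes, 1.0],   # 19: max height, 20: holes, 21: bias
    ])
```

For example, a board whose first column has the bottom two cells filled and whose second column has height 3 with one buried gap yields phi[0] = 2, phi[1] = 3, and phi[20] = 1.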
Motivation for inverse RL
- Apprenticeship learning / imitation learning through inverse RL. Presupposition: the reward function provides the most succinct and transferable definition of the task. This has enabled advancing the state of the art in various robotic domains.
- Modeling of other agents, both adversarial and cooperative.

Problem setup
- Input:
  - State space, action space
  - Transition model P_sa(s_{t+1} | s_t, a_t)
  - No reward function
  - Teacher's demonstration: s_0, a_0, s_1, a_1, s_2, a_2, ... (= trace of the teacher's policy π*)
- Inverse RL: can we recover R?
- Apprenticeship learning via inverse RL: can we then use this R to find a good policy?
- Vs. behavioral cloning (which directly learns the teacher's policy using supervised learning):
  - Inverse RL leverages the compactness of the reward function.
  - Behavioral cloning leverages the compactness of the policy class considered, and does not require a dynamics model.

Lecture outline
- Inverse RL intro
- Mathematical formulations for inverse RL
- Case studies

Three broad categories of formalizations:
- Max margin
- Feature expectation matching
- Interpret the reward function as a parameterization of a policy class

Basic principle
Find a reward function R* which explains the expert behavior: an R* under which the expert's policy obtains at least as much expected discounted reward as every other policy. This is a convex feasibility problem in R*, but many challenges remain: R = 0 is a solution (more generally, the reward function is ambiguous), and we typically only observe expert traces rather than the entire expert policy π*, so the left-hand side of the optimality condition cannot be computed directly.
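The convex feasibility problem can be made concrete on a tiny MDP. The sketch below builds the expected discounted state visitation counts of each policy in closed form and then asks a linear program for a reward vector; the max-margin normalization (|R| ≤ 1, maximize the worst-case margin) is an illustrative device to rule out the trivial R = 0 solution, and the three-state chain MDP, function names, and constants are all assumptions for this sketch, not from the lecture:

```python
import numpy as np
from scipy.optimize import linprog

def visitation(P, mu0, gamma):
    """Expected discounted state visitation counts of the Markov chain a
    policy induces: nu = sum_t gamma^t Pr(s_t = .) = (I - gamma P^T)^{-1} mu0."""
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P.T, mu0)

def irl_feasibility(P_star, P_others, mu0, gamma=0.9):
    """Find R with |R(s)| <= 1 maximizing t subject to
    (nu_star - nu_pi) . R >= t for every candidate policy pi.
    A margin t > 0 means some nontrivial reward rationalizes the expert."""
    nu_star = visitation(P_star, mu0, gamma)
    n = len(mu0)
    A_ub, b_ub = [], []
    for P in P_others:
        d = nu_star - visitation(P, mu0, gamma)
        A_ub.append(np.append(-d, 1.0))   # encodes -d.R + t <= 0
        b_ub.append(0.0)
    c = np.zeros(n + 1)
    c[-1] = -1.0                          # linprog minimizes, so maximize t
    bounds = [(-1, 1)] * n + [(None, None)]
    res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub, bounds=bounds)
    return res.x[:n], res.x[-1]
```

On a three-state chain where the expert walks to an absorbing goal state while the alternative policy loiters at the start, the LP returns a positive margin, recovering a reward that ranks the goal highest.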
Two further challenges: the formulation assumes the expert is indeed optimal (otherwise the problem is infeasible), and, computationally, it assumes we can enumerate all policies. Formally: find R* such that
    E[Σ_{t=0}^∞ γ^t R*(s_t) | π*] ≥ E[Σ_{t=0}^∞ γ^t R*(s_t) | π]  ∀π,
or equivalently,
    Σ_{s∈S} R*(s) E[Σ_{t=0}^∞ γ^t 1{s_t = s} | π*] ≥ Σ_{s∈S} R*(s) E[Σ_{t=0}^∞ γ^t 1{s_t = s} | π]  ∀π.

Feature-based reward function
Let R(s) = w⊤φ(s), where w ∈ ℜ^n and φ : S → ℜ^n. Then
    E[Σ_{t=0}^∞ γ^t R(s_t) | π] = E[Σ_{t=0}^∞ γ^t w⊤φ(s_t) | π]
                                = w⊤ E[Σ_{t=0}^∞ γ^t φ(s_t) | π]
                                = w⊤ μ(π),
where μ(π) is the expected cumulative discounted sum of feature values, or "feature expectations," of policy π. Substituting into E[Σ_t γ^t R*(s_t) | π*] ≥ E[Σ_t γ^t R*(s_t) | π] ∀π gives:
    Find w* such that w*⊤μ(π*) ≥ w*⊤μ(π)  ∀π.
- Feature expectations can be readily estimated from sample trajectories.
- The number of expert demonstrations required scales with the number of features in the reward function.
- The number of expert demonstrations required does not depend on:
  - the complexity of the expert's optimal policy π*;
  - the size of the state space.

Recap of challenges
- Assumes we know the entire expert policy π* --- now relaxed to: assumes we can estimate the expert's feature expectations.
- R = 0 is a solution (now: w = 0); more generally: reward function ambiguity.
- Assumes the expert is indeed optimal --- this became even more of an issue with the more limited reward function expressiveness!
- Computationally: assumes we can enumerate all policies.

Ambiguity
We currently have: find w* such that w*⊤μ(π*) ≥ w*⊤μ(π) ∀π.
- Standard max margin:
    min_w ‖w‖₂²  s.t.  w⊤μ(π*) ≥ w⊤μ(π) + 1  ∀π
- "Structured prediction" max margin. Justification: the margin should be larger for policies that are very different from π*. Example: m(π, π*) = number of states in which π* was observed and in which π and π* disagree.
    min_w ‖w‖₂²  s.t.
    w⊤μ(π*) ≥ w⊤μ(π) + m(π*, π)  ∀π

Expert suboptimality
- Structured prediction max margin with slack variables:
    min_{w,ξ} ‖w‖₂² + Cξ  s.t.  w⊤μ(π*) ≥ w⊤μ(π) + m(π*, π) − ξ  ∀π
- Can be generalized to multiple MDPs (could also be the same MDP with different initial states):
    min_{w,ξ^(i)} ‖w‖₂² + C Σ_i ξ^(i)  s.t.  w⊤μ(π^(i)*) ≥ w⊤μ(π^(i)) + m(π^(i)*, π^(i)) − ξ^(i)  ∀i, ∀π^(i)

Resolved so far: access to π*, ambiguity, expert suboptimality. One challenge remains: a very large number of constraints. Ratliff et al. use subgradient methods; in this lecture we use constraint generation.
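The constraint-generation idea can be sketched as an outer loop that repeatedly adds the most violated policy constraint to a restricted problem. In the sketch below the "oracle" simply enumerates a small candidate set of feature expectations (in a real MDP it would solve the planning problem for the current reward w⊤φ), the restricted problem is solved with a plain subgradient method standing in for a QP solver, and the single-slack objective ‖w‖² + Cξ is rewritten equivalently as ‖w‖² plus C times the maximum hinge violation. Function names, step sizes, and constants are illustrative assumptions:

```python
import numpy as np

def solve_restricted(mu_star, mus, margins, C=10.0, steps=2000, lr=0.02):
    """Subgradient descent on the slack-relaxed restricted problem:
        min_w ||w||^2 + C * max_j max(0, w.mu_j + m_j - w.mu_star),
    which is equivalent to the single-slack form min ||w||^2 + C*xi."""
    w = np.zeros_like(mu_star)
    for t in range(steps):
        viol = [w @ mu + m - w @ mu_star for mu, m in zip(mus, margins)]
        j = int(np.argmax(viol))
        g = 2.0 * w                          # gradient of ||w||^2
        if viol[j] > 0:
            g = g + C * (mus[j] - mu_star)   # subgradient of the hinge term
        w = w - (lr / np.sqrt(t + 1)) * g    # decaying step size
    return w

def constraint_generation(mu_star, all_mus, all_margins, tol=1e-3, max_iters=20):
    """Outer loop: find the most violated policy constraint, add it to the
    restricted set, re-solve; stop when no new constraint is violated."""
    active = []                              # indices of added constraints
    w = np.zeros_like(mu_star)
    for _ in range(max_iters):
        viol = [w @ mu + m - w @ mu_star
                for mu, m in zip(all_mus, all_margins)]
        j = int(np.argmax(viol))
        if viol[j] <= tol or j in active:
            break                            # no (new) violated constraint
        active.append(j)
        w = solve_restricted(mu_star,
                             [all_mus[k] for k in active],
                             [all_margins[k] for k in active])
    return w
```

On a toy instance with two alternative policies, the loop adds both constraints and returns a weight vector that ranks the expert's feature expectations strictly above the alternatives.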