CS 287: Advanced Robotics, Fall 2009
Lecture 16: Imitation Learning
Pieter Abbeel, UC Berkeley EECS

Behavioral cloning example: Tetris
- State: board configuration + shape of the falling piece (~2^200 states!)
- Action: rotation and translation applied to the falling piece
- 22 features, aka basis functions, φ_i:
  - Ten basis functions, 0, ..., 9, mapping the state to the height h[k] of each of the ten columns.
  - Nine basis functions, 10, ..., 18, each mapping the state to the absolute difference between heights of successive columns: |h[k+1] − h[k]|, k = 1, ..., 9.
  - One basis function, 19, mapping the state to the maximum column height: max_k h[k].
  - One basis function, 20, mapping the state to the number of "holes" in the board.
  - One basis function, 21, equal to 1 in every state.
- Value function: V(s) = Σ_{i=1}^{22} θ_i φ_i(s)
  [Bertsekas & Ioffe, 1996 (TD); Bertsekas & Tsitsiklis, 1996 (TD); Kakade, 2002 (policy gradient); Farias & Van Roy, 2006 (approximate LP)]

Training data
- Example choices of next states chosen by the demonstrator: s^(i)_+
- Alternative choices of next states that were available: s^(i)_{j−}

Max-margin formulation
    min_{θ, ξ≥0}  θ⊤θ + C Σ_{i,j} ξ_{ij}
    subject to  ∀i, ∀j:  θ⊤φ(s^(i)_+) ≥ θ⊤φ(s^(i)_{j−}) + 1 − ξ_{ij}

Probabilistic/logistic formulation
Assumes the expert chooses next state s^(i)_+ with probability
    exp(θ⊤φ(s^(i)_+)) / ( exp(θ⊤φ(s^(i)_+)) + Σ_j exp(θ⊤φ(s^(i)_{j−})) ).
Hence the maximum likelihood estimate is given by:
    max_θ  Σ_i log [ exp(θ⊤φ(s^(i)_+)) / ( exp(θ⊤φ(s^(i)_+)) + Σ_j exp(θ⊤φ(s^(i)_{j−})) ) ] − C θ⊤θ

One motivation for learning reward functions: scientific inquiry --- modeling animal and human behavior, e.g., bee foraging and songbird vocalization. [See the intro of Ng and Russell, 2000 for a brief overview.]
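The 22 Tetris basis functions listed above are straightforward to compute from a board occupancy grid. Below is a minimal sketch, assuming the board is a binary NumPy array with row 0 at the top; the function name and board encoding are illustrative choices, not from the lecture:

```python
import numpy as np

def tetris_features(board):
    """Compute the 22-dimensional feature vector phi(s) for a Tetris board.

    board: 2D 0/1 array of shape (rows, 10); row 0 is the top of the board.
    """
    rows, cols = board.shape
    heights = np.zeros(cols, dtype=int)
    holes = 0
    for k in range(cols):
        col = board[:, k]
        filled = np.nonzero(col)[0]
        if filled.size:
            top = filled[0]                      # topmost filled cell
            heights[k] = rows - top              # column height h[k]
            holes += int(np.sum(col[top:] == 0)) # empty cells below the top
    return np.concatenate([
        heights,                       # features 0..9:  column heights
        np.abs(np.diff(heights)),      # features 10..18: |h[k+1] - h[k]|
        [heights.max(), holes, 1.0],   # 19: max height, 20: holes, 21: bias
    ])
```

For example, a board whose first column has the bottom two cells filled and whose second column has height 3 with one buried gap yields phi[0] = 2, phi[1] = 3, and phi[20] = 1.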
Motivation for inverse RL
- Apprenticeship learning / imitation learning through inverse RL. Presupposition: the reward function provides the most succinct and transferable definition of the task. This has enabled advancing the state of the art in various robotic domains.
- Modeling of other agents, both adversarial and cooperative.

Problem setup
- Input:
  - State space, action space
  - Transition model P_sa(s_{t+1} | s_t, a_t)
  - No reward function
  - Teacher's demonstration: s_0, a_0, s_1, a_1, s_2, a_2, ... (= trace of the teacher's policy π*)
- Inverse RL: can we recover R?
- Apprenticeship learning via inverse RL: can we then use this R to find a good policy?
- Vs. behavioral cloning (which directly learns the teacher's policy using supervised learning):
  - Inverse RL leverages the compactness of the reward function.
  - Behavioral cloning leverages the compactness of the policy class considered, and does not require a dynamics model.

Lecture outline
- Inverse RL intro
- Mathematical formulations for inverse RL
- Case studies

Three broad categories of formalizations:
- Max margin
- Feature expectation matching
- Interpret the reward function as a parameterization of a policy class

Basic principle
Find a reward function R* which explains the expert behavior: an R* under which the expert's policy obtains at least as much expected discounted reward as every other policy. This is a convex feasibility problem in R*, but many challenges remain: R = 0 is a solution (more generally, the reward function is ambiguous), and we typically only observe expert traces rather than the entire expert policy π*, so the left-hand side of the optimality condition cannot be computed directly.
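The convex feasibility problem can be made concrete on a tiny MDP. The sketch below builds the expected discounted state visitation counts of each policy in closed form and then asks a linear program for a reward vector; the max-margin normalization (|R| ≤ 1, maximize the worst-case margin) is an illustrative device to rule out the trivial R = 0 solution, and the three-state chain MDP, function names, and constants are all assumptions for this sketch, not from the lecture:

```python
import numpy as np
from scipy.optimize import linprog

def visitation(P, mu0, gamma):
    """Expected discounted state visitation counts of the Markov chain a
    policy induces: nu = sum_t gamma^t Pr(s_t = .) = (I - gamma P^T)^{-1} mu0."""
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P.T, mu0)

def irl_feasibility(P_star, P_others, mu0, gamma=0.9):
    """Find R with |R(s)| <= 1 maximizing t subject to
    (nu_star - nu_pi) . R >= t for every candidate policy pi.
    A margin t > 0 means some nontrivial reward rationalizes the expert."""
    nu_star = visitation(P_star, mu0, gamma)
    n = len(mu0)
    A_ub, b_ub = [], []
    for P in P_others:
        d = nu_star - visitation(P, mu0, gamma)
        A_ub.append(np.append(-d, 1.0))   # encodes -d.R + t <= 0
        b_ub.append(0.0)
    c = np.zeros(n + 1)
    c[-1] = -1.0                          # linprog minimizes, so maximize t
    bounds = [(-1, 1)] * n + [(None, None)]
    res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub, bounds=bounds)
    return res.x[:n], res.x[-1]
```

On a three-state chain where the expert walks to an absorbing goal state while the alternative policy loiters at the start, the LP returns a positive margin, recovering a reward that ranks the goal highest.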
Two further challenges: the formulation assumes the expert is indeed optimal (otherwise the problem is infeasible), and, computationally, it assumes we can enumerate all policies. Formally: find R* such that
    E[Σ_{t=0}^∞ γ^t R*(s_t) | π*] ≥ E[Σ_{t=0}^∞ γ^t R*(s_t) | π]  ∀π,
or equivalently,
    Σ_{s∈S} R*(s) E[Σ_{t=0}^∞ γ^t 1{s_t = s} | π*] ≥ Σ_{s∈S} R*(s) E[Σ_{t=0}^∞ γ^t 1{s_t = s} | π]  ∀π.

Feature-based reward function
Let R(s) = w⊤φ(s), where w ∈ ℜ^n and φ : S → ℜ^n. Then
    E[Σ_{t=0}^∞ γ^t R(s_t) | π] = E[Σ_{t=0}^∞ γ^t w⊤φ(s_t) | π]
                                = w⊤ E[Σ_{t=0}^∞ γ^t φ(s_t) | π]
                                = w⊤ μ(π),
where μ(π) is the expected cumulative discounted sum of feature values, or "feature expectations," of policy π. Substituting into E[Σ_t γ^t R*(s_t) | π*] ≥ E[Σ_t γ^t R*(s_t) | π] ∀π gives:
    Find w* such that w*⊤μ(π*) ≥ w*⊤μ(π)  ∀π.
- Feature expectations can be readily estimated from sample trajectories.
- The number of expert demonstrations required scales with the number of features in the reward function.
- The number of expert demonstrations required does not depend on:
  - the complexity of the expert's optimal policy π*;
  - the size of the state space.

Recap of challenges
- Assumes we know the entire expert policy π* --- now relaxed to: assumes we can estimate the expert's feature expectations.
- R = 0 is a solution (now: w = 0); more generally: reward function ambiguity.
- Assumes the expert is indeed optimal --- this became even more of an issue with the more limited reward function expressiveness!
- Computationally: assumes we can enumerate all policies.

Ambiguity
We currently have: find w* such that w*⊤μ(π*) ≥ w*⊤μ(π) ∀π.
- Standard max margin:
    min_w ‖w‖₂²  s.t.  w⊤μ(π*) ≥ w⊤μ(π) + 1  ∀π
- "Structured prediction" max margin. Justification: the margin should be larger for policies that are very different from π*. Example: m(π, π*) = number of states in which π* was observed and in which π and π* disagree.
    min_w ‖w‖₂²  s.t.
    w⊤μ(π*) ≥ w⊤μ(π) + m(π*, π)  ∀π

Expert suboptimality
- Structured prediction max margin with slack variables:
    min_{w,ξ} ‖w‖₂² + Cξ  s.t.  w⊤μ(π*) ≥ w⊤μ(π) + m(π*, π) − ξ  ∀π
- Can be generalized to multiple MDPs (could also be the same MDP with different initial states):
    min_{w,ξ^(i)} ‖w‖₂² + C Σ_i ξ^(i)  s.t.  w⊤μ(π^(i)*) ≥ w⊤μ(π^(i)) + m(π^(i)*, π^(i)) − ξ^(i)  ∀i, ∀π^(i)

Resolved so far: access to π*, ambiguity, expert suboptimality. One challenge remains: a very large number of constraints. Ratliff et al. use subgradient methods; in this lecture we use constraint generation.
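The constraint-generation idea can be sketched as an outer loop that repeatedly adds the most violated policy constraint to a restricted problem. In the sketch below the "oracle" simply enumerates a small candidate set of feature expectations (in a real MDP it would solve the planning problem for the current reward w⊤φ), the restricted problem is solved with a plain subgradient method standing in for a QP solver, and the single-slack objective ‖w‖² + Cξ is rewritten equivalently as ‖w‖² plus C times the maximum hinge violation. Function names, step sizes, and constants are illustrative assumptions:

```python
import numpy as np

def solve_restricted(mu_star, mus, margins, C=10.0, steps=2000, lr=0.02):
    """Subgradient descent on the slack-relaxed restricted problem:
        min_w ||w||^2 + C * max_j max(0, w.mu_j + m_j - w.mu_star),
    which is equivalent to the single-slack form min ||w||^2 + C*xi."""
    w = np.zeros_like(mu_star)
    for t in range(steps):
        viol = [w @ mu + m - w @ mu_star for mu, m in zip(mus, margins)]
        j = int(np.argmax(viol))
        g = 2.0 * w                          # gradient of ||w||^2
        if viol[j] > 0:
            g = g + C * (mus[j] - mu_star)   # subgradient of the hinge term
        w = w - (lr / np.sqrt(t + 1)) * g    # decaying step size
    return w

def constraint_generation(mu_star, all_mus, all_margins, tol=1e-3, max_iters=20):
    """Outer loop: find the most violated policy constraint, add it to the
    restricted set, re-solve; stop when no new constraint is violated."""
    active = []                              # indices of added constraints
    w = np.zeros_like(mu_star)
    for _ in range(max_iters):
        viol = [w @ mu + m - w @ mu_star
                for mu, m in zip(all_mus, all_margins)]
        j = int(np.argmax(viol))
        if viol[j] <= tol or j in active:
            break                            # no (new) violated constraint
        active.append(j)
        w = solve_restricted(mu_star,
                             [all_mus[k] for k in active],
                             [all_margins[k] for k in active])
    return w
```

On a toy instance with two alternative policies, the loop adds both constraints and returns a weight vector that ranks the expert's feature expectations strictly above the alternatives.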