Apprenticeship Learning for Robotic Control, with Applications to Quadruped Locomotion and Autonomous Helicopter Flight

Pieter Abbeel, UC Berkeley EECS
In collaboration with: Andrew Y. Ng, Adam Coates, J. Zico Kolter, Morgan Quigley, Dmitri Dolgov, Sebastian Thrun.

Overview
- Key idea: learning from demonstrations; concretely, inverse reinforcement learning.
- This has enabled advancing the state of the art in various robotic domains.

Reinforcement learning / Optimal control
- System dynamics: starting from state s_0, each action a_t moves the system to the next state according to the transition model P_sa:
    s_0 --a_0--> s_1 --a_1--> s_2 --> ... --a_{T-1}--> s_T
- Accumulated reward: R(s_0) + R(s_1) + R(s_2) + ... + R(s_{T-1}) + R(s_T).
- Goal: pick actions over time so as to maximize the expected score E[R(s_0) + R(s_1) + ... + R(s_T)].
- Solution: a controller (policy) pi, which specifies an action for each possible state, for all times t = 0, 1, ..., T-1.
- Examples: car driving, helicopter flight, legged locomotion; load balancing, pricing, ad placement, ...

Example task: driving

Problem setup
- Input:
  - state space and action space;
  - transition model P_sa(s_{t+1} | s_t, a_t);
  - NO reward function;
  - teacher's demonstration: s_0, a_0, s_1, a_1, s_2, a_2, ... (= a trace of the teacher's policy pi*).
- Inverse reinforcement learning: can we recover R from the teacher's demonstration?

Applications
- Alleviates the need to specify a reward function, which can be hard in practice (several example applications in this lecture).
- Modeling and understanding of behaviour:
  - biological behaviour;
  - multi-agent systems: understand (and perhaps exploit)
the other agents.

Inverse reinforcement learning
- Condition for the reward function R to be consistent with the teacher's policy pi*:
    E[ sum_{t=0}^{T} R(s_t) | pi* ]  >=  E[ sum_{t=0}^{T} R(s_t) | pi ]   for all policies pi.
- Find the reward function that maximizes the margin by which the teacher outperforms a set of other policies.
- Two technical aspects are left unaddressed in this lecture:
  - how to generate a good set of alternative policies;
  - how to compute the expected sum of rewards for the teacher's policy (we only have a trace).

Related work to Abbeel and Ng, 2004
- Prior work: behavioral cloning; utility elicitation / inverse reinforcement learning (Ng & Russell, 2000).
- Closely related later work: Ratliff et al., 2006, 2007; Neu & Szepesvari, 2007; Ramachandran & Amir, 2007; Syed & Schapire, 2008; Ziebart et al., 2008; ...
- Work on a specialized reward function, trajectories: e.g., Atkeson & Schaal, 1997.

Highway driving
- Videos: teacher in the training world; learned policy in the testing world.
- Input:
  - dynamics model / simulator P_sa(s_{t+1} | s_t, a_t);
  - teacher's demonstration: 1 minute in the "training world" (note: R* is unknown);
  - reward features: 5 features corresponding to lanes/shoulders; 10 features corresponding to the presence of another car in the current lane at different distances.

More driving examples
In each video, the left sub-panel shows a demonstration of a different driving "style", and the right sub-panel shows the behavior learned from watching the demonstration.

Parking lot navigation [Abbeel et al., IROS 08]
- Reward function trades off: staying "on-road"; forward vs. reverse driving; amount of switching between forward and reverse; lane keeping; on-road vs. off-road; curvature of paths.

Experimental setup
- Demonstrate parking lot navigation on "training parking lots."
- Run our apprenticeship learning algorithm to find the reward function.
- Receive a "test parking lot" map plus a starting point and destination.
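The margin condition above, together with the linear reward features used in the driving experiments, can be sketched concretely. Following Abbeel & Ng (2004), write the reward as R(s) = w . phi(s), so a policy's expected sum of rewards is w . mu, where mu is that policy's expected feature counts. The sketch below looks for unit-norm weights w under which the teacher's feature counts beat every alternative policy's counts by the largest margin. All feature values are invented toy numbers, and the projected-subgradient solver is an illustrative choice, not the algorithm from the lecture:

```python
# Minimal sketch of max-margin inverse RL with a linear reward R(s) = w . phi(s).
# mu_E and mu_alts are HYPOTHETICAL expected feature counts, not real data.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(w):
    n = sum(x * x for x in w) ** 0.5
    return [x / n for x in w] if n > 0 else w

def max_margin_weights(mu_teacher, mu_alternatives, steps=2000, lr=0.05):
    """Projected subgradient ascent on  min_i  w . (mu_E - mu_i),  ||w|| = 1."""
    gaps = [[e - a for e, a in zip(mu_teacher, mu)] for mu in mu_alternatives]
    w = normalize([1.0] * len(mu_teacher))
    for _ in range(steps):
        worst = min(gaps, key=lambda g: dot(w, g))  # tightest constraint
        w = normalize([wi + lr * gi for wi, gi in zip(w, worst)])
    return w

# Toy example with 3 reward features: the teacher stays "on-road" (feature 0)
# and avoids nearby cars (feature 2) more than the two alternative policies.
mu_E = [0.9, 0.5, 0.0]
mu_alts = [[0.4, 0.5, 0.3], [0.6, 0.9, 0.2]]
w = max_margin_weights(mu_E, mu_alts)
margin = min(dot(w, [e - a for e, a in zip(mu_E, mu)]) for mu in mu_alts)
assert margin > 0  # teacher outperforms every alternative under the learned w
```

The two unaddressed aspects show up directly here: the quality of w depends on how the alternative policies (the mu_alts) are generated, and mu_E must in practice be estimated from a finite demonstration trace.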
Find the trajectory that maximizes the learned reward function for navigating the test parking lot.

Videos: nice driving style; sloppy driving style; "don't mind reverse" driving style.

Quadruped [Kolter, Abbeel & Ng, 2008]
- Reward function trades off 25 features.

Experimental setup
- Demonstrate a path across the "training terrain."
- Run our apprenticeship learning algorithm to find the reward function.
- Receive the "testing terrain"---a height map.
- Find the optimal policy with respect to the learned reward function for crossing the testing terrain.

Videos: without learning; with the learned reward function.

Recap
- Key idea: learning from demonstrations; concretely, inverse reinforcement learning.
- This has enabled advancing the state of the art in various robotic domains.

Remainder of lecture: application to extreme helicopter flight
- How helicopter dynamics works.
- Autonomous helicopter setup.
- Application of inverse RL to autonomous helicopter flight.

Helicopter dynamics
- 4 control inputs:
  - main rotor collective pitch;
  - main rotor cyclic pitch (roll and pitch);
  - tail rotor collective pitch.

Autonomous helicopter setup
- On-board inertial measurement unit (IMU) data and position data feed into:
  1. a Kalman filter (state estimation), then
  2. a feedback controller, which sends controls out to the helicopter.

Related work
- Bagnell & Schneider, 2001; LaCivita, Papageorgiou, Messner & Kanade, 2002; Ng, Kim, Jordan & Sastry, 2004a (2001); Roberts, Corke & Buskey, 2003; Saripalli, Montgomery & Sukhatme, 2003; Shim, Chung, Kim & Sastry, 2003; Doherty et al., 2004.
- Gavrilets, Martinos, Mettler & Feron, 2002; Ng et al., 2004b.
- The maneuvers presented here are significantly more challenging and more diverse than those performed by any other autonomous helicopter.

Experimental setup for helicopter (video: demonstrations)
1. Our expert pilot demonstrates the airshow several times.
2. Learn a reward function---a trajectory.
3.
Learn a dynamics model.

Video: learned reward (trajectory).

The complete procedure:
1. Our expert pilot demonstrates the airshow several times.
2. Learn:
   A. a reward function;
   B. a dynamics model.
3. Find the optimal control policy for the learned reward and dynamics model.
4. Autonomously fly the airshow.
5. Learn an improved dynamics model. Go back to step 4.

Thank you.
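The iterate-between-flying-and-model-learning loop in the helicopter setup (plan against the current dynamics model, fly, refit the model from the new data, repeat) can be illustrated with a toy scalar system. Everything below is invented for the sketch: the "true" dynamics, the noise levels, and the one-step greedy planner standing in for the real optimal-control step:

```python
# Toy sketch of the iterative scheme: plan with the learned model, "fly" to
# collect real transitions, refit the model, and repeat. All constants and
# the scalar dynamics x' = a*x + b*u are HYPOTHETICAL stand-ins.
import random

random.seed(1)

A_TRUE, B_TRUE = 0.9, 0.5        # true dynamics, unknown to the learner
def fly(x, u):                   # one real (noisy) transition
    return A_TRUE * x + B_TRUE * u + random.gauss(0, 0.01)

def fit_model(data):
    """Least-squares fit of x' = a*x + b*u from (x, u, x') triples."""
    sxx = sum(x * x for x, u, y in data)
    sxu = sum(x * u for x, u, y in data)
    suu = sum(u * u for x, u, y in data)
    sxy = sum(x * y for x, u, y in data)
    suy = sum(u * y for x, u, y in data)
    det = sxx * suu - sxu * sxu          # 2x2 normal equations
    a = (sxy * suu - suy * sxu) / det
    b = (suy * sxx - sxy * sxu) / det
    return a, b

# Initial data (stand-in for demonstrations): random exploratory transitions.
data = []
for _ in range(20):
    x0, u0 = random.uniform(-1, 1), random.uniform(-1, 1)
    data.append((x0, u0, fly(x0, u0)))

target = 1.0                              # stand-in for the reward trajectory
for _ in range(5):                        # outer loop: steps 3-5
    a, b = fit_model(data)                # refit dynamics from all data so far
    x = 0.0
    for _ in range(30):                   # "fly" with a greedy one-step planner
        u = (target - a * x) / b          # model-predicted move onto the target
        x_next = fly(x, u)
        data.append((x, u, x_next))       # keep the new transitions for refits
        x = x_next

assert abs(x - target) < 0.1  # final flight tracks the target closely
```

The point of the loop is the same as in the lecture: the model is refined exactly in the regime the controller actually visits, so tracking improves with each iteration.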