Learning the Attack Policy of the Apteronotus albifrons

Benjamin Stephens
Robotics Institute
Carnegie Mellon University
Pittsburgh, PA 15213
bstephens@cmu.edu

1 Introduction

The Apteronotus albifrons is a South American weakly electric knifefish that is being studied by Malcolm MacIver at the MacIver Lab in the Biomedical Engineering department at Northwestern University. It produces a weak electric field and uses it to sense its surroundings, and because these fish are nocturnal, it is believed that they use this sense instead of visual cues to track their prey. During the study of the fish, motion capture data was taken of the fish attacking a small prey, Daphnia magna, consisting of 116 trials with images taken at 60 Hz. The motion capture data consists of the 8 DOF of the fish along with 3 DOF of the prey, and was analyzed with the AnimalLab MATLAB software package [1], which is freely available at www.neuromech.northwestern.edu/AnimalLab. Initial analysis of the data is purely statistical and reveals small patterns in the kinematics of the fish motion [2].

There are two problems of interest in the study of this fish to which machine learning can be applied. While my project focuses on one of these problems, they are related, and I will discuss both here. The first problem is that of system identification: the goal is to learn an optimal control policy based on a library of observed trajectories, which include positions of the fish and prey along with sensor estimates obtained by the fish. The second problem, the one I wish to address, involves inverse reinforcement learning [3]. Here we would like to learn the reward function that governs the fish's optimal control. Certainly the reward function is some mixture of optimizations on control effort, time to capture, and sensing ability. In other words, the fish would like to catch the prey as fast as possible, but will likely adjust its effort based on how certain it is that it can catch the prey.

2 Reward Function Learning Using IRL

The idea of inverse reinforcement learning (IRL) is new and has been used for such applications as apprenticeship learning in driving [4] and helicopter flying [5]. IRL is used when the reward function, or optimality criterion, is unknown. Instead, the reward function is constructed from observed data that is assumed to come from an expert. For a general state space S and action space A, the reward function R(s, a) defines a mapping R : S x A -> R, which is the reward for taking action a at state s. R(s, a) is given by the linear approximation

    R(s,a) = \alpha_1 \phi_1(s,a) + \alpha_2 \phi_2(s,a) + \cdots + \alpha_{N_r} \phi_{N_r}(s,a),

where the \phi_i(s,a) are N_r known basis functions. Then the value function V^\pi for the policy \pi and R = \alpha^T \phi(s,a) is V^\pi = \alpha_1 V_1^\pi + \cdots + \alpha_{N_r} V_{N_r}^\pi.

The problem of finding the optimal set of parameters of the reward function is a linear programming problem that maximizes the difference between the look-ahead value of the optimal action and the next best action. In the situation where we can only observe the control policy through a set of example trajectories, the optimization looks like

    maximize    \sum_{i=1}^{k} \left( V^{\pi^*}(s_0) - V^{\pi_i}(s_0) \right)        (1)
    subject to  |\alpha_i| \le 1,  i = 1, \ldots, N_r.

The algorithm for determining the reward function R(s, a) is given as follows:

1. Calculate each \hat{V}_i for the observed data.
2. Generate a random base policy \pi_1.
3. Fit the reward using Eq. 1.
4. Generate a new policy \pi_{k+1}.
5. Repeat from step 3, using Eq. 1.

The learned reward function will be a linear combination of the N_r basis functions, each of which is a 9-dimensional Gaussian over the current state and action.

2.1 Calculating V^\pi

The value function of the optimal control is a linear combination of the value functions associated with the basis reward functions. So all we need to do is calculate the value function for each of the N_r basis reward functions and save the value in an array. The value for a given basis reward function i is estimated along a trajectory as

    \hat{V}_i(s_0) = \phi_i(s_0, a_0) + \gamma \, \phi_i(s_1, a_1) + \gamma^2 \, \phi_i(s_2, a_2) + \cdots,

where s_0, s_1, ... are the states visited in the trajectory. This value is averaged over either the 116 observed trajectories or a set of simulated trajectories.

2.2 Generating a policy

The random base policy is created simply by assigning a random control at each state. Using this random policy, we simulate several trajectories and again calculate the value functions for each of the N_r reward functions. The averages of these values over the simulated trials are stored in an array for later use. Once a guess for the global reward function has been calculated, we can generate a new optimal control policy using value iteration. For this we use the Bellman equation,

    V(s) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V(s'),        (2)

where P(s' | s, a) is the state transition probability, i.e., the probability of reaching state s' by starting in state s and taking action a. The motion uncertainty is modeled as a Gaussian, and this calculation is similar to calculating one step of an Extended Kalman Filter [6]. The policy is then given by

    \pi(s) = \arg\max_a Q(s, a).        (3)

This calculation was performed in MATLAB using the MDP Toolbox found at http://www.inra.fr/bia/T/MDPtoolbox/index.html. The value function for this new policy is calculated as before, by simulating a number of trajectories and averaging the value over each reward function.

2.3 Fitting the reward function

After each iteration we try to fit the linear combination of the basis reward functions to the true reward function. Because the value functions for each policy are just linear combinations of the value functions of the basis reward functions, the maximization problem in Eq. 1 becomes

    maximize    \sum_{k=1}^{K} \sum_{i=1}^{N_r} \alpha_i \left( \hat{V}_i^{\pi^*}(s_0) - \hat{V}_i^{\pi_k}(s_0) \right)        (4)
    subject to  |\alpha_i| \le 1,  i = 1, \ldots, N_r.

This problem is solved using the MATLAB fmincon function.
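To make the steps above concrete, the following sketches outline how the three main computations of Section 2 might be implemented in MATLAB, the language used for the rest of the analysis. They are illustrative only: the function names, variable names, and the discretized data layout are assumptions, not the code actually used in the study. The first sketch estimates the discounted value of each basis reward function along a set of trajectories, as in Section 2.1; it assumes trajs{m} holds the [state, action] index pairs for trial m and phi is an Ns-by-Na-by-Nr array of basis reward values.

% Estimate the discounted value of each basis reward along a set of
% trajectories (Section 2.1).  Assumed layout: trajs{m} is a T-by-2 matrix of
% [state, action] indices for trial m; phi(s, a, i) is the value of basis
% function i at state s and action a.
function Vhat = basis_values(trajs, phi, gamma)
    Nr   = size(phi, 3);
    Vhat = zeros(Nr, 1);
    for m = 1:numel(trajs)
        sa   = trajs{m};                       % states and actions visited
        T    = size(sa, 1);
        disc = gamma .^ (0:T-1)';              % discount factors gamma^t
        for i = 1:Nr
            idx = sub2ind(size(phi), sa(:,1), sa(:,2), repmat(i, T, 1));
            Vhat(i) = Vhat(i) + sum(disc .* phi(idx));
        end
    end
    Vhat = Vhat / numel(trajs);                % average over the trials
end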
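The next sketch corresponds to the policy-generation step of Section 2.2: value iteration with the Bellman backup of Eq. 2 followed by the greedy policy of Eq. 3. The MDP Toolbox cited above provides equivalent routines; this version simply writes the backup out directly over an assumed discretized model with transition matrices P{a} and reward matrix R.

% Value iteration and greedy policy extraction (Section 2.2, Eqs. 2-3).
% Assumed layout: P{a} is the Ns-by-Ns transition matrix for action a, with
% P{a}(s, s') = P(s' | s, a), and R is Ns-by-Na.
function [policy, V] = greedy_policy(P, R, gamma, n_iter)
    [Ns, Na] = size(R);
    V = zeros(Ns, 1);
    Q = zeros(Ns, Na);
    for it = 1:n_iter
        for a = 1:Na
            Q(:, a) = R(:, a) + gamma * P{a} * V;   % Bellman backup (Eq. 2)
        end
        V = max(Q, [], 2);                          % best achievable value
    end
    [~, policy] = max(Q, [], 2);                    % pi(s) = argmax_a Q(s,a) (Eq. 3)
end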
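Finally, the reward fit of Eq. 4 is linear in the weights alpha, so it reduces to maximizing a dot product subject to box constraints. The sketch below does this with fmincon, as mentioned in Section 2.3; Vexp and Vpol are placeholder names for the averaged basis values of the observed (expert) trajectories and of the K candidate policies. Since the objective is linear, linprog would also work; fmincon is shown only because it is what the text cites.

% Fit the reward weights alpha by solving Eq. 4 (Section 2.3).  Vexp is
% Nr-by-1 (basis values averaged over the observed trials); Vpol is Nr-by-K,
% one column per candidate policy pi_k; |alpha_i| <= 1 is the box constraint.
function alpha = fit_reward(Vexp, Vpol)
    Nr  = numel(Vexp);
    d   = sum(bsxfun(@minus, Vexp, Vpol), 2);   % sum_k (Vhat^{pi*} - Vhat^{pi_k})
    obj = @(a) -(d' * a);                       % maximize  ->  minimize the negative
    alpha = fmincon(obj, zeros(Nr, 1), [], [], [], [], ...
                    -ones(Nr, 1), ones(Nr, 1));
end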
3 Experiments

The state of the fish in the motion capture data is described by 6 degrees of freedom corresponding to 6 rigid-body coordinates centered at the snout. The state of the prey is given by the 3 translation coordinates at the center of its body, which is very small. We will assume a simple sensing model based on the position of the prey with respect to the fish. The control actions of the fish are the observed velocities along the rigid-body coordinate directions. The control policy can be thought of as a regulator with the state given by the relative position between the fish and the prey. The desired trajectory will regulate the error between the fish and the prey to zero and will have imposed constraints for proper orientation. A good choice for the task space is the spherical coordinate representation of the error between the fish and the prey. If the position of the fish is given by p_fish and the position of the prey is given by p_prey …