Learning the Attack Policy of the Apteronotus albifrons

Benjamin Stephens
Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213
bstephens@cmu.edu

1 Introduction

The Apteronotus albifrons is a South American weakly electric knifefish that is being studied by Malcolm MacIver at the MacIver Lab in the Biomedical Engineering department at Northwestern University. The fish produces a weak electric field and uses it to sense its surroundings; because the fish are nocturnal, it is believed that they use this sense instead of visual cues to track their prey. During the study of the fish, motion capture data was taken of the fish attacking a small prey (Daphnia magna), consisting of 116 trials with images taken at 60 Hz. The motion capture data consists of the 8 DOF of the fish along with the 3 DOF of the prey, and was analyzed with the AnimalLab MATLAB software package [1], which is freely available at www.neuromech.northwestern.edu/AnimalLab. Initial analysis of the data is purely statistical and reveals small patterns in the kinematics of the fish motion [2].

There are two problems of interest in the study of this fish to which machine learning can be applied. While my project focuses on one of these problems, they are related, so I discuss both here. The first problem is system identification: learning an optimal control policy from a library of observed trajectories, which include the positions of the fish and prey along with the sensor estimates obtained by the fish. The second problem, the one I wish to address, involves inverse reinforcement learning [3]. Here we would like to learn the reward function that governs the fish's optimal control. Certainly the reward function is some mixture of control effort, time to capture, and sensing ability; in other words, the fish would like to catch the prey as fast as possible, but will likely adjust its effort based on how certain it is that it can catch the prey.

2 Reward Function Learning Using IRL

The idea of inverse reinforcement learning (IRL) is new and has been used for such applications as apprenticeship learning in driving [4] and helicopter flying [5]. IRL is used when the reward function, or optimality criterion, is unknown. Instead, the reward function is constructed from observed data that is assumed to come from an expert. For a general state space S and action space A, the reward function R(s,a) defines a mapping R : S \times A \to \mathbb{R}, which is the reward for taking action a at state s. R(s,a) is given by the linear approximation

    R(s,a) = \alpha_1 \phi_1(s,a) + \alpha_2 \phi_2(s,a) + \cdots + \alpha_{N_r} \phi_{N_r}(s,a),

where the \phi_i(s,a) are N_r known basis functions. The value function V^\pi of a policy \pi under R = \alpha^T \phi(s,a) is then the same linear combination of the basis value functions,

    V^\pi = \alpha_1 V_1^\pi + \alpha_2 V_2^\pi + \cdots + \alpha_{N_r} V_{N_r}^\pi.

The problem of finding the optimal set of parameters of the reward function is a linear programming problem that maximizes the difference between the look-ahead value of the optimal action and that of the next-best action. In the situation where we can only observe the control policy through a set of example trajectories, the optimization becomes

    maximize  \sum_{k=1}^{K} ( \hat{V}^{\pi^*}(s_0) - \hat{V}^{\pi_k}(s_0) )
    s.t.      |\alpha_i| \le 1,  i = 1, \ldots, N_r.                       (1)

The algorithm for determining the reward function R(s,a) is as follows:

1. Calculate each \hat{V}_i^{\pi^*} from the data.
2. Generate a random base policy \pi_1.
3. Fit the reward using Eq. (1).
4. Generate a new policy \pi_{k+1}.
5. Repeat, again fitting the reward using Eq. (1).

The learned reward function will be a linear combination of the N_r basis functions, each of which is a 9-dimensional Gaussian over the current state and action.

2.1 Calculating V^\pi

The value function of the optimal control is a linear combination of the value functions associated with the basis reward functions, so all we need to do is calculate the value function for each of the N_r basis reward functions and save the values in an array. The value for a given basis reward function \phi_i along a trajectory is

    \hat{V}_i(s_0) = \phi_i(s_0) + \gamma \phi_i(s_1) + \gamma^2 \phi_i(s_2) + \cdots,

where s_0, s_1, \ldots are the states visited in the trajectory and \gamma is the discount factor. This value is averaged over either the 116 observed trajectories or a set of simulated trajectories.
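
As a rough illustration of this averaging step, the MATLAB sketch below estimates \hat{V}_i(s_0) for every basis function from a set of trajectories. The trajectory layout (a cell array of T-by-9 state-action samples), the handle-based basis functions phi, and the discount gamma are assumptions made here for illustration; they are not the data structures used in the project.

% Estimate the empirical value Vhat(i) of each basis reward function by
% averaging the discounted sum of phi{i} along a set of trajectories.
% trajs  - cell array; trajs{m}(t,:) is the 9-dimensional state-action
%          sample at time step t of trajectory m (assumed layout)
% phi    - cell array of Nr function handles, phi{i}(x) -> scalar reward
% gamma  - discount factor in (0,1)
function Vhat = estimate_basis_values(trajs, phi, gamma)
    Nr   = numel(phi);
    Vhat = zeros(Nr, 1);
    for i = 1:Nr
        total = 0;
        for m = 1:numel(trajs)
            X = trajs{m};                      % T-by-9 samples of one trial
            T = size(X, 1);
            d = gamma .^ (0:T-1)';             % discounts 1, gamma, gamma^2, ...
            r = zeros(T, 1);
            for t = 1:T
                r(t) = phi{i}(X(t,:));         % basis reward at step t
            end
            total = total + d' * r;            % discounted return of this trial
        end
        Vhat(i) = total / numel(trajs);        % average over the trials
    end
end

Run on the 116 observed trials, this gives the expert-side values \hat{V}_i^{\pi^*}(s_0); run on trajectories simulated under a candidate policy \pi_k, it gives \hat{V}_i^{\pi_k}(s_0) for use in the fitting step below.
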
2.2 Generating a policy

The random base policy is created simply by assigning a random control at each state. Using this random policy, we simulate several trajectories and again calculate the value functions for each of the N_r basis reward functions. The averages of these values over the simulated trials are stored in an array for later use. Once a guess for the global reward function has been calculated, we can generate a new optimal control policy using value iteration. For this we use the Bellman equation

    V(s) = \max_a [ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V(s') ],            (2)

where P(s'|s,a) is the state transition probability, i.e., the probability of reaching state s' by starting in state s and taking action a. The motion uncertainty is modeled as a Gaussian, and this calculation is similar to calculating one step of an Extended Kalman Filter [6]. The policy is then given by

    \pi(s) = \arg\max_a Q(s,a).                                             (3)

This calculation was performed in MATLAB using the MDP Toolbox found at http://www.inra.fr/bia/T/MDPtoolbox/index.html. The value function for this new policy is calculated as before, by simulating a number of trajectories and averaging the value over each basis reward function.

2.3 Fitting the reward function

After each iteration we try to fit the linear combination of the basis reward functions to the true reward function. Because the value function for each policy is just a linear combination of the value functions for the basis reward functions, the maximization problem in Eq. (1) becomes

    maximize  \sum_{k=1}^{K} \sum_{i=1}^{N_r} \alpha_i ( \hat{V}_i^{\pi^*}(s_0) - \hat{V}_i^{\pi_k}(s_0) )
    s.t.      |\alpha_i| \le 1,  i = 1, \ldots, N_r.                        (4)

This problem is solved using the MATLAB fmincon function.

3 Experiments

The state of the fish in the motion capture data is described by 6 degrees of freedom, corresponding to the 6 rigid body coordinates centered at the snout. The state of the prey is given by the 3 translational coordinates of the center of its body, which is very small. We assume a simple sensing model based on the position of the prey with respect to the fish. The control actions of the fish are taken from the observed data and are equal to the velocities along the rigid body coordinate directions. The control policy can be thought of as a regulator, with the state given by the relative position between the fish and the prey. The desired trajectory regulates the error between the fish and the prey to zero and has imposed constraints for proper orientation. A good choice for the task space is the spherical coordinate representation of the error between the fish and the prey. If the position of the fish is given by p_fish and the position of the prey is given by p_prey …
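
To make the task-space choice concrete, the MATLAB sketch below expresses the fish-to-prey error vector in spherical coordinates (range, azimuth, and elevation). The argument names p_fish and p_prey and the assumption that both positions are expressed in a common frame are illustrative choices, not the paper's own definitions.

% Spherical-coordinate representation of the fish-to-prey error.
% p_fish, p_prey - 3-by-1 positions expressed in a common frame
%                  (an assumed convention for this illustration)
function [rho, az, el] = prey_error_spherical(p_fish, p_prey)
    e   = p_prey - p_fish;            % error vector from fish to prey
    rho = norm(e);                    % range to the prey
    az  = atan2(e(2), e(1));          % azimuth in the horizontal plane
    el  = atan2(e(3), norm(e(1:2)));  % elevation above that plane
end

Driving the range rho to zero corresponds to capturing the prey, while the azimuth and elevation components can carry the orientation constraints mentioned above.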

