Learning the Attack Policy of the Apteronotus albifrons

Benjamin Stephens
Robotics Institute
Carnegie Mellon University
Pittsburgh, PA
[email protected]

1 Introduction

The Apteronotus albifrons is a South American weakly electric knifefish studied by Malcolm MacIver at the MacIver Lab in the Biomedical Engineering department at Northwestern University. The fish produces a weak electric field that it uses to sense its surroundings, and because it is nocturnal it is believed to rely on this electric sense rather than visual cues to track its prey. During the study, motion capture data was recorded of the fish attacking a small prey, Daphnia magna; the data set consists of 116 trials with images taken at 60 Hz. The motion capture data consists of the 8 DOF of the fish along with 3 DOF of the prey and was analyzed with the AnimalLab MATLAB software package [1], which is freely available at www.neuromech.northwestern.edu/AnimalLab. Initial analysis of the data is purely statistical and reveals small patterns in the kinematics of the fish motion [2].

There are two problems of interest in the study of this fish to which machine learning can be applied. My project focuses on one of them, but because they are related I discuss both here. The first problem is system identification: learning an optimal control policy from a library of observed trajectories, which include the positions of the fish and prey along with the sensor estimates obtained by the fish. The second problem, the one I wish to address, involves inverse reinforcement learning [3]. Here we would like to learn the reward function that governs the fish's optimal control. The reward function is presumably some mixture of control effort, time to capture, and sensing ability: the fish would like to catch the prey as fast as possible, but will likely adjust its effort based on how certain it is that it can catch the prey.

2 Reward Function Learning Using IRL

Inverse reinforcement learning (IRL) is a relatively new idea that has been used for applications such as apprenticeship learning for driving [4] and helicopter flight [5]. IRL is used when the reward function, or optimality criterion, is unknown; instead, the reward function is constructed from observed data that is assumed to come from an expert. For a general state space S and action space A, the reward function R(s, a) defines a mapping R : S x A -> R, the reward for taking action a in state s. R(s, a) is approximated linearly as

    R(s, a) = \alpha_1 \phi_1(s, a) + \alpha_2 \phi_2(s, a) + \cdots + \alpha_{N_r} \phi_{N_r}(s, a),

where the φ_i(s, a) are N_r known basis functions. The value function V^π for a policy π under R = α^T φ(s, a) is then

    V^\pi = \alpha_1 V^{\pi}_1 + \cdots + \alpha_{N_r} V^{\pi}_{N_r}.

Finding the optimal set of reward parameters α is a linear programming problem that maximizes the difference between the look-ahead value of the optimal policy, π*, and the next best alternative. When the control policy can only be observed through a set of example trajectories, the optimization becomes

    maximize    \sum_{i=1}^{k} \hat{V}^{\pi^*}(s_0) - \hat{V}^{\pi_i}(s_0)
    s.t.        |\alpha_i| \le 1,   i = 1, \ldots, N_r.                     (1)

The algorithm for determining the reward function R(s, a) is as follows:

1. Calculate each V̂_i^{π*} from the data.
2. Generate a random base policy, π_1.
3. Fit the reward using Eq. (1).
4. Generate a new policy, π_{k+1}.
5. Repeat from step 3 using Eq. (1).

The learned reward function will be a linear combination of the N_r basis functions, each of which is a 9-dimensional Gaussian over the current state and action.

2.1 Calculating V^π

The value function of the optimal control is a linear combination of the value functions associated with the basis reward functions, so all we need to do is calculate the value function for each of the N_r basis reward functions and save the values in an array. The value for a given basis function φ_i is

    \hat{V}^{\pi^*}_i(s_0) = \phi_i(s_0) + \gamma \phi_i(s_1) + \gamma^2 \phi_i(s_2) + \cdots,

where {s_0, s_1, ...} are the states visited in a trajectory. This value is averaged over either the 116 observed trajectories or a set of simulated trajectories.
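As a concrete illustration of this averaging step, the following MATLAB sketch accumulates the discounted basis values along a set of recorded trajectories and averages them over trials. The function name, the cell-array trajectory format, and the use of function handles for the basis functions are assumptions made for illustration; this is a sketch, not the code used in the project.

% Illustration of Section 2.1 (hypothetical names, not the project code):
% average the discounted sum of each basis reward over a set of trajectories.
%   trajectories : cell array, trajectories{k} is a T-by-9 matrix whose rows
%                  are the (state, action) samples of trial k
%   phi          : cell array of N_r basis-function handles, phi{i}(x) -> scalar
%   gamma        : discount factor, 0 < gamma < 1
function Vhat = basis_values(trajectories, phi, gamma)
Nr   = numel(phi);
Vhat = zeros(Nr, 1);
for k = 1:numel(trajectories)
    traj     = trajectories{k};
    discount = 1;
    for t = 1:size(traj, 1)
        for i = 1:Nr
            Vhat(i) = Vhat(i) + discount * phi{i}(traj(t, :));
        end
        discount = discount * gamma;
    end
end
Vhat = Vhat / numel(trajectories);   % average over the trials
end

Run over the 116 recorded trials, this would give the V̂_i^{π*} terms; run over simulated trials of a candidate policy, it gives the corresponding V̂_i^{π_k} terms used in the fitting step below.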
2.2 Generating a policy

The random base policy is created simply by assigning a random control to each state. Using this random policy, we simulate several trajectories and again calculate the value functions for each of the N_r basis reward functions. The averages of these values over the simulated trials are stored in an array for later use.

Once a guess for the global reward function has been computed, we can generate a new optimal control policy using value iteration. For this we use the Bellman equation

    V^\pi(s) = R(s, a) + \gamma \sum_{s'} P(s, s', a) \, V^\pi(s'),          (2)

where P(s, s', a) is a state transition probability: the probability of reaching state s' by starting in state s and taking action a. The motion uncertainty is modeled as a Gaussian, and this calculation is similar to calculating one step of an Extended Kalman Filter [6]. The policy is then

    \pi(s) = \arg\max_a Q^\pi(s, a).                                          (3)

This calculation was performed in MATLAB using the MDP Toolbox found at http://www.inra.fr/bia/T/MDPtoolbox/index.html. The value function of the new policy is computed as before, by simulating a number of trajectories and averaging the value over each basis reward function.
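For reference, a minimal value-iteration loop implementing Eqs. (2) and (3) on a discretized state and action grid might look like the sketch below. The project uses the INRA MDP Toolbox for this step, so the hand-rolled loop, the variable names, and the tolerance argument here are illustrative assumptions only.

% Sketch of value iteration for Eqs. (2)-(3) on a discretized problem
% (illustrative only; the project uses the INRA MDP Toolbox for this step).
%   P     : nS-by-nS-by-nA array, P(s, s2, a) = transition probability
%   R     : nS-by-nA array of fitted rewards R(s, a)
%   gamma : discount factor
%   tol   : convergence tolerance on the value function
function [V, policy] = value_iteration(P, R, gamma, tol)
[nS, ~, nA] = size(P);
V = zeros(nS, 1);
while true
    Q = zeros(nS, nA);
    for a = 1:nA
        Q(:, a) = R(:, a) + gamma * P(:, :, a) * V;   % Bellman backup, Eq. (2)
    end
    [Vnew, policy] = max(Q, [], 2);                   % greedy policy, Eq. (3)
    if max(abs(Vnew - V)) < tol
        V = Vnew;
        break;
    end
    V = Vnew;
end
end

The resulting policy is then simulated to produce the trajectories whose averaged basis values enter the next fitting step.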


2.3 Fitting the reward function

After each iteration, we try to fit the linear combination of the basis reward functions to the true reward function. Because the value function of each policy is just a linear combination of the value functions of the basis reward functions, the maximization problem in Eq. (1) becomes

    maximize    \sum_{k=1}^{K} \sum_{i=1}^{N_r} \left( \hat{V}^{\pi^*}_i(s_0) - \hat{V}^{\pi_k}_i(s_0) \right) \alpha_i
    s.t.        |\alpha_i| \le 1,   i = 1, \ldots, N_r.                       (4)

This problem is solved using the MATLAB fmincon function.
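The following sketch shows one way the fitting problem in Eq. (4) could be posed for fmincon, the solver named above. Because the objective is linear in α with simple bound constraints, its coefficients can be precomputed from the stored basis values; the variable names and data layout are assumptions for illustration.

% Sketch of the fitting step in Eq. (4) (hypothetical names and data layout).
%   Vstar : N_r-by-1 vector of averaged basis values for the observed policy
%   Vpol  : N_r-by-K matrix of averaged basis values for the K candidate policies
function alpha = fit_reward(Vstar, Vpol)
[Nr, K] = size(Vpol);
w  = sum(repmat(Vstar, 1, K) - Vpol, 2);    % coefficient of alpha_i in Eq. (4)
lb = -ones(Nr, 1);                          % enforce |alpha_i| <= 1
ub =  ones(Nr, 1);
alpha0    = zeros(Nr, 1);
objective = @(a) -(w' * a);                 % fmincon minimizes, so negate
alpha     = fmincon(objective, alpha0, [], [], [], [], lb, ub);
end

Since the objective is linear, linprog would solve the same problem directly; fmincon is shown here only to match the tool named in the text.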

3 Experiments

The state of the fish in the motion capture data is described by 6 degrees of freedom, corresponding to 6 rigid body coordinates centered at the snout. The state of the prey, which is very small, is given by the 3 translation coordinates of the center of its body. We assume a simple sensing model based on the position of the prey with respect to the fish.

The control actions of the fish are observed data, equal to the velocities in the rigid body coordinate directions. The control policy can be thought of as a regulator, with the state given by the relative position between the fish and the prey. The desired trajectory will regulate the error between the fish
