Efficient computation of optimal actions Emanuel Todorov1 Departments of Applied Mathematics and Computer Science Engineering University of Washington Box 352420 Seattle WA 98195 Edited by James L McClelland Stanford University Stanford CA and approved April 28 2009 received for review November 16 2007 Optimal choice of actions is a fundamental problem relevant to fields as diverse as neuroscience psychology economics computer science and control engineering Despite this broad relevance the abstract setting is similar we have an agent choosing actions over time an uncertain dynamical system whose state is affected by those actions and a performance criterion that the agent seeks to optimize Solving problems of this kind remains hard in part because of overly generic formulations Here we propose a more structured formulation that greatly simplifies the construction of optimal control laws in both discrete and continuous domains An exhaustive search over actions is avoided and the problem becomes linear This yields algorithms that outperform Dynamic Programming and Reinforcement Learning and thereby solve traditional problems more efficiently Our framework also enables computations that were not possible before composing optimal control laws by mixing primitives applying deterministic methods to stochastic systems quantifying the benefits of error tolerance and inferring goals from behavioral data via convex optimization Development of a general class of easily solvable problems tends to accelerate progress as linear systems theory has done for example Our framework may have similar impact in fields where optimal choice of actions is relevant action selection cost function linear Bellman equation stochastic optimal control I f you are going to act you might as well act in the best way possible But which way is best This is the general problem we consider here Examples include a nervous system generating muscle activations to maximize movement performance 1 a foraging animal deciding which way to turn to maximize food 2 an internet router directing packets to minimize delays 3 an onboard computer controlling a jet engine to minimize fuel consumption 4 and an investor choosing transactions to maximize wealth 5 Such problems are often formalized as Markov decision processes MDPs with stochastic dynamics p x x u specifying the transition probability from state x to state x under action u and immediate cost x u for being in state x and choosing action u The performance criterion that the agent seeks to optimize is some cumulative cost that can be formulated in multiple ways Throughout the article we focus on one formulation total cost with terminal goal states and summarize results for other formulations Optimal actions cannot be found by greedy optimization of the immediate cost but instead must take into account all future costs This is a daunting task because the number of possible futures grows exponentially with time What makes the task doable is the optimal cost to go function v x defined as the expected cumulative cost for starting at state x and acting optimally thereafter It compresses all relevant information about the future and thus enables greedy computation of optimal actions v x equals the minimum over actions u of the immediate cost x u plus the expected cost to go E v x at the next state x v x min x u Ex p x u v x u 1 PNAS July 14 2009 Results Reducing Optimal Control to a Linear Problem We aim to construct a general class of MDPs where the exhaustive search over actions is replaced with an analytical solution Discrete optimization problems rarely have analytical solutions thus our agenda calls for continuous actions This may seem counterintuitive if one thinks of actions as symbols go left go right However what gives meaning to such symbols are the underlying transition probabilities which are continuous The latter observation is key to the framework developed here Instead of asking the agent to specify symbolic actions which are then replaced with transition probabilities we allow the agent to specify transition probabilities u x x directly Formally we have p x x u u x x Thus our agent has the power to reshape the dynamics in any way it wishes However it pays a price for too much reshaping as follows Let p x x denote the passive dynamics characterizing the behavior of the system in the absence of controls The latter will usually be defined as a random walk in discrete domains and as a diffusion process in continuous domains Note that the notion of passive dynamics is common in continuous domains but is rarely used in discrete domains We can now quantify how large an action is by measuring the difference between u x and p x Differences between probability distributions are usually measured via Kullback Leibler KL divergence suggesting an immediate cost of the form u x x x u q x KL u x p x q x Ex u x log p x x 2 The state cost q x can be an arbitrary function encoding how un desirable different states are The passive dynamics p x x and controlled dynamics u x x can also be arbitrary except that Author contributions E T designed research performed research analyzed data and wrote the paper The author declares no conflict of interest This article is a PNAS Direct Submission Freely available online through the PNAS open access option The subscript indicates that the expectation is taken with respect to the transition probability distribution p x u induced by action u Eq 1 is fundamental to optimal control theory and is called the Bellman equation It gives rise to Dynamic Programming 3 and 11478 11483 Reinforcement Learning 2 methods that are very general but can be inefficient Indeed Eq 1 characterizes v x only implicitly as the solution to an unsolved optimization problem impeding both analytical and numerical approaches Here we show how the Bellman equation can be greatly simplified We find an analytical solution for the optimal u given v and then transform Eq 1 into a linear equation Short of solving the entire problem analytically reducing optimal control to a linear equation is the best one can hope for This simplification comes at a modest price although we impose certain structure on the problem formulation most control problems of practical interest can still be handled In discrete domains our work has no precursors In continuous domains there exists related prior work 6 8 that we build on here Additional results can be found in our recent conference articles 9
View Full Document
Unlocking...