New version page

WSU CSE 6363 - Coordinated Reinforcement Learning

Upgrade to remove ads
Upgrade to remove ads
Unformatted text preview:

Coordinated Reinforcement LearningCarlos Guestrin [email protected] Science Department, Stanford University, Stanford, CA 94305Michail Lagoudakis [email protected] Parr [email protected] of Computer Science, Duke University, Durham, NC 27708AbstractWe present several new algorithms for multiagentreinforcementlearning. A common feature of thesealgorithms is a parameterized, structured represen-tation of a policy or value function. This structureis leveraged in an approach we call coordinated re-inforcement learning, by which agents coordinateboth their action selection activities and their pa-rameter updates. Within the limits of our para-metric representations, the agents will determinea jointly optimal action without explicitly consid-ering every possible action in their exponentiallylarge joint action space. Our methods differ frommany previous reinforcement learning approachesto multiagent coordination in that structured com-munication and coordination between agents ap-pears at the core of both the learning algorithm andthe execution architecture. Our experimental re-sults, comparing our approach to other RL meth-ods, illustrate both the quality of the policies ob-tained and the additional benefits of coordination.1. IntroductionConsider a system where multiple agents, each with its ownset of possible actions and its own observations, must coordi-nate in order to achieve a common goal. We want to finda mechanism for coordinating the agents’ actions so as tomaximize their joint utility. One obvious approach to thisproblem is to represent the system as a Markov Decision Pro-cess (MDP), where the “action” is a joint action for all of theagents and the reward is the total reward for all of the agents.The immediate difficulty with this approach is that the actionspace is quite large: If there are g agents, each of which cantake a actions, then the action space is ag.One natural approach to reducing the complexity of this prob-lem is to restrict the amount of information that is availableto each agent and hope to maximize global welfare by solv-ing local optimization problems for each agent [13]. In somecases, it is possible to manipulatethe presentation of informa-tion to the agents in a manner that forces local optimizationsto imply global optimizations [16]. In general, however, theproblem of findinga globally optimal solution for agents withpartial information is known to be intractable [2].Following [8], we present an approach that combines valuefunction approximation with a message passing scheme bywhich the agents efficiently determine the jointly optimal ac-tion with respect to an approximate value function. Our ap-proach is based on approximating the joint value function asa linear combination of local value functions, each of whichrelates only to the parts of the system controlled by a smallnumber of agents. We show how such factored value func-tions allow the agents to find a globally optimal joint actionusing a very natural message passing scheme. This schemecan be implemented as a negotiation procedure for selectingactions at run time. Alternatively, if the agents share a com-mon observation vector, each agent can efficiently determinethe actionsthat will be taken by all of the collaborating agentswithout any additional communication.Given an action selection mechanism, the remaining task is todevelop a reinforcement learning algorithm that is capable ofproducing value functions of the appropriate form. An algo-rithm for computing such value functions is presented in [8]for the case where the model is known and represented as afactored MDP. This is the first application of these techniquesin the context of reinforcement learning, where we no longerrequire a factored model or even a discrete state space.We begin by presenting two methods of computing an appro-priate value function through reinforcement learning: a vari-ant of Q-learning and a variant of Least Squares Policy Iter-ation (LSPI) [11]. We also demonstrate how parameterizedvalue functions of the form acquired by our reinforcementlearning variants can be combined in a very natural way withdirect policy search methods such as [12, 1, 14, 9]. The samecommunication and coordination structures used in the valuefunction approximation phase are used in the policy searchphase to sample from and update a factored stochastic policyfunction.We call our approach Coordinated Reinforcement Learning,because structured coordination between agents is used in thecore of our learning algorithms and in our execution architec-tures. Our initial experimental results with LSPI indicate thatthe message passing action selection mechanism and valuefunction approximation can be combined to produce effectivepolicies and that additional benefits are obtained with agentcoordination.2. Cooperative Action SelectionWe begin by considering the simpler problem of having agroup of agents select a globally optimal joint action to maxi-mize the sum of their individual utility functions. Supposewehave a collection of agents, where each agent j must choosean action ajfrom a finite set of possible actions Dom(Aj).We use A to denote {A1, . . . , Ag}. The agents are acting in aspace described by a set of state variables, X = {X1. . . Xn}.A state x defines a setting xjfor each variable Xjand an ac-tion a defines an action aj∈ Dom(Aj) for each agent. Theagents must choose the joint action a that maximizes the totalutility.In general, the total utility Q will depend on all state variablesX and on the actions of all agents A. However, in many prac-tical problems, it is possible to approximate the total utility Qby the sum of local sub-utilities Qj, one for each agent. Now,the total utility becomes Q =PjQj. For example, considerthe decision process of a section manager in a warehouse. Herlocal utility Qjmay depend on the state of the inventory ofher section, on her decision of which products to stock up andon the decision of the sales manager over pricing and specialoffers. On the other hand, it may not depend directly on theactions of the customer support team. However, the decisionsof the customer support team will be indirectly relevant, asthey may affect the actions of the sales manager.Computing the action that maximizes Q =PjQjseems in-tractable a priori, as it would require the enumeration of thejoint action space of all agents. Fortunately, by exploitingthe local structure in the Qj-functions through a

View Full Document
Download Coordinated Reinforcement Learning
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...

Join to view Coordinated Reinforcement Learning and access 3M+ class-specific study document.

We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Coordinated Reinforcement Learning 2 2 and access 3M+ class-specific study document.


By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?