A Compact, Hierarchically Optimal Q-function Decomposition

Bhaskara Marthi, Stuart Russell
Department of Computer Science
University of California, Berkeley
Berkeley, CA 94720

David Andre
BodyMedia Inc.
Pittsburgh, PA 15222

Abstract

Previous work in hierarchical reinforcement learning has faced a dilemma: either ignore the values of different possible exit states from a subroutine, thereby risking suboptimal behavior, or represent those values explicitly, thereby incurring a possibly large representation cost because exit values refer to nonlocal aspects of the world (i.e., all subsequent rewards). This paper shows that, in many cases, one can avoid both of these problems. The solution is based on recursively decomposing the exit value function in terms of Q-functions at higher levels of the hierarchy. This leads to an intuitively appealing runtime architecture in which a parent subroutine passes to its child a value function on the exit states and the child reasons about how its choices affect the exit value. We also identify structural conditions on the value function and transition distributions that allow much more concise representations of exit state distributions, leading to further state abstraction. In essence, the only variables whose exit values need be considered are those that the parent cares about and the child affects. We demonstrate the utility of our algorithms on a series of increasingly complex environments.

1 Introduction

Hierarchical reinforcement learning (HRL) aims to speed up RL by providing prior knowledge in the form of hierarchical constraints on policies [Parr and Russell, 1997; Dietterich, 2000]. In this paper, we consider policies constrained to follow a partial program [Andre and Russell, 2002] whose choice points designate places where the policy is unspecified. A learning agent equipped with such a partial program aims to find the best possible completion, given a set of experiences running the program in an MDP.

The fundamental result of HRL is that the combination of a partial program and an MDP yields a new semi-Markov decision process (SMDP) whose states ω are joint states: pairs of environment states and program states. HRL algorithms solve this SMDP; here, we consider algorithms that learn the function Q(ω, u), i.e., the expected sum of rewards when taking action u in joint state ω and acting optimally thereafter. (Note that u may be a temporally extended subroutine invocation as well as a primitive action.)

A principal advantage of HRL over "flat" methods is that imposing a hierarchical structure on behavior exposes structure in the value function of the underlying MDP. Roughly speaking, the complete reward sequence for any run of a policy in the MDP can be divided into subsequences, each associated with execution of a particular subroutine; this allows the global value function for the hierarchical policy to be represented as a sum of value function components, each (in the ideal case) defined over only a small set of state variables.

Two standard forms of Q-decomposition are known. MAXQ [Dietterich, 2000] approximates Q(ω, u) as a sum of Q_r(ω, u), the expected sum of rewards obtained while u is executing, and Q_c(ω, u), the expected sum of rewards after u until the current subroutine completes.[1] The ALisp decomposition [Andre and Russell, 2002] includes a third term, Q_e(ω, u), to cover rewards obtained after the subroutine exits. These two decompositions lead to two different notions of optimal completion. MAXQ yields recursive optimality, i.e., the policy within each subroutine is optimized ignoring the calling context. ALisp yields hierarchical optimality, i.e., optimality among all policies consistent with the partial program.
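The preview states the two decompositions only in prose. As a compact reminder, they can be written as the following display; this is a reconstruction from the verbal definitions above, with notation assumed rather than quoted from the paper.

```latex
% Requires amsmath. Reconstructed from the verbal definitions above:
%   Q_r: expected reward while u itself is executing
%   Q_c: expected reward after u, until the current subroutine completes
%   Q_e: expected reward after the current subroutine exits
\begin{align*}
  Q(\omega, u) &\approx Q_r(\omega, u) + Q_c(\omega, u)
    && \text{(MAXQ, recursively optimal)}\\
  Q(\omega, u) &= Q_r(\omega, u) + Q_c(\omega, u) + Q_e(\omega, u)
    && \text{(ALisp, hierarchically optimal)}
\end{align*}
```

The dilemma discussed next concerns precisely the third term, Q_e.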
Recursively optimal policies may be worse than hierarchically optimal ones if context is relevant, because they ignore the exit values Q_e. On the other hand, ALisp may need more samples to learn Q_e accurately. It can also be argued that, whereas Q_r and Q_c represent sums of rewards that are local to a given subroutine, and hence likely to be well approximated by a low-dimensional function, Q_e represents a nonlocal sum of rewards and may, therefore, be very difficult to learn. Thus, we seem to have a dilemma: ignore Q_e and risk seriously suboptimal behavior, or include Q_e and incur a potentially high learning cost.

This paper shows that, in many cases, one can avoid both of these problems. The solution, described in Section 4, is based on recursively decomposing the exit value function in terms of Q-functions at higher levels of the hierarchy. New algorithms for learning and execution based on this decomposition are presented. In Section 5, we show how the idea of additive irrelevance can be used to dramatically simplify the representation further, based on conditional independencies that may hold in the domain. Finally, in Section 6, we describe a more general simplification that depends on the size of the "interface" between a subroutine and its context. The utility of each successive decomposition is demonstrated experimentally.

[1] MAXQ also allows a "pseudoreward" function for exit values, but this must be specified in advance by the programmer.

2 Main example

In this section, we describe our main illustrative example. We begin with a very simple version of the environment, which will be made more complex over the course of the paper as we introduce the various components of our approach. We use Markov decision processes as our modelling framework, and further assume that the state is represented as an instantiation of a set V of state variables. The example is based on one from [Dietterich, 2000].

Example 1. The taxi MDP models a taxi moving around and serving passengers on a grid. States in the MDP consist of the taxi's position and a set of passengers, each having a source, destination, generosity level, and status (at-source or in-taxi). The actions available to the taxi are: moves in the four cardinal directions, which may fail with some probability; pickup, which picks up the passenger at the taxi's current position, assuming the taxi doesn't already have a passenger; and dropoff, which drops off the passenger who is currently in the taxi, assuming the taxi is at her destination. Upon a successful dropoff action, with some probability the episode terminates, and otherwise the set of passengers is reset using some distribution over each passenger's source, destination, and generosity (but the taxi stays at its current position). A fixed ...
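To make the state space and actions of Example 1 concrete, here is a minimal, self-contained Python sketch of the taxi MDP written only from the prose above. The grid size, reward magnitudes, failure and termination probabilities, and all identifiers (Passenger, TaxiState, step, reset_passenger) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the taxi MDP in Example 1, written from the prose
# description only. Rewards and probabilities are illustrative assumptions.
import random
from dataclasses import dataclass, replace
from typing import Tuple

@dataclass(frozen=True)
class Passenger:
    source: Tuple[int, int]
    destination: Tuple[int, int]
    generosity: float        # scales the dropoff reward (assumption)
    status: str              # "at-source" or "in-taxi"

@dataclass(frozen=True)
class TaxiState:
    taxi_pos: Tuple[int, int]
    passengers: Tuple[Passenger, ...]

MOVES = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}

def reset_passenger(size: int, rng: random.Random) -> Passenger:
    """Draw a fresh passenger from some distribution (assumed uniform here)."""
    cell = lambda: (rng.randrange(size), rng.randrange(size))
    return Passenger(cell(), cell(), rng.choice([0.5, 1.0, 2.0]), "at-source")

def step(state: TaxiState, action: str, size: int = 5,
         move_fail_prob: float = 0.1, terminate_prob: float = 0.3,
         rng: random.Random = random.Random(0)) -> Tuple[TaxiState, float, bool]:
    """One transition of the sketched MDP; returns (next_state, reward, done)."""
    x, y = state.taxi_pos
    if action in MOVES:
        # Moves in the four cardinal directions may fail with some probability.
        if rng.random() >= move_fail_prob:
            dx, dy = MOVES[action]
            x = min(max(x + dx, 0), size - 1)
            y = min(max(y + dy, 0), size - 1)
        return TaxiState((x, y), state.passengers), -1.0, False
    if action == "pickup":
        # Picks up the passenger at the taxi's position, provided the taxi
        # does not already have a passenger in it.
        if not any(p.status == "in-taxi" for p in state.passengers):
            for i, p in enumerate(state.passengers):
                if p.status == "at-source" and p.source == (x, y):
                    ps = (state.passengers[:i]
                          + (replace(p, status="in-taxi"),)
                          + state.passengers[i + 1:])
                    return TaxiState((x, y), ps), -1.0, False
        return state, -1.0, False
    if action == "dropoff":
        # Drops off the in-taxi passenger, provided the taxi is at her destination.
        for p in state.passengers:
            if p.status == "in-taxi" and p.destination == (x, y):
                if rng.random() < terminate_prob:
                    return state, 10.0 * p.generosity, True   # episode ends
                # Otherwise all passengers are reset; the taxi keeps its position.
                ps = tuple(reset_passenger(size, rng) for _ in state.passengers)
                return TaxiState((x, y), ps), 10.0 * p.generosity, False
        return state, -1.0, False
    raise ValueError(f"unknown action {action!r}")
```

The frozen dataclasses make states hashable, which is convenient if one wants to tabulate Q(ω, u) over joint (environment state, program state) pairs as in the SMDP view of Section 1.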

