Berkeley COMPSCI 294 - Discovering Hierarchy in Reinforcement Learning with HEXQ

Discovering Hierarchy in Reinforcement Learning with HEXQ

Bernhard Hengst    [email protected]
Computer Science and Engineering, University of New South Wales, UNSW Sydney 2052, Australia

Abstract

An open problem in reinforcement learning is discovering hierarchical structure. HEXQ, an algorithm which automatically attempts to decompose and solve a model-free factored MDP hierarchically, is described. By searching for aliased Markov sub-space regions based on the state variables, the algorithm uses temporal and state abstraction to construct a hierarchy of interlinked smaller MDPs.

1. Introduction

Bellman (1961) stated that sheer enumeration would not solve problems of any significance. In reinforcement learning the size of the state space scales exponentially with the number of variables. Designers try to manually decompose more complex problems to make them tractable. Finding good decompositions is usually an art-form. Many researchers have either ignored where decompositions come from or pointed to the desirability of automating this task (Boutilier et al., 1999; Hauskrecht et al., 1998; Dean & Lin, 1995). More recently, Dietterich (2000b) concluded that the biggest open problem in reinforcement learning is to discover hierarchical structure.

It was recognised by Ashby (1956) that learning is worthwhile only when the environment shows constraint. One type of constraint present in many environments is the repetition of sub-structures. Ashby stated that repetition is of considerable practical importance in the regulation of very large systems. Repetitions are commonplace. They are evident, for example, at the molecular level, in daily routines, in office layouts or even in just walking. One reason that reinforcement learning scales poorly is that sub-policies, such as walking, need to be relearnt in every context. It makes more sense to learn how to walk only once and then reuse this skill wherever it is required. A reinforcement learning agent that can find and learn reusable sub-tasks, and in turn employ them to learn higher-level skills, would be more efficient.

In the rest of this paper we describe the operation of a hierarchical reinforcement learning algorithm, HEXQ, which attempts to solve model-free MDPs more efficiently by finding and exploiting repeatable sub-structures in the environment. The algorithm is designed to automatically discover state and temporal abstractions, find appropriate sub-goals and construct a hierarchical representation to solve the overall MDP. As a running example we use the taxi task (Dietterich, 2000a) to illustrate how the algorithm works. We also show results for a noisy Tower of Hanoi puzzle.

2. Representation and Assumptions

We start with the usual formulation of a finite MDP with discrete time steps, states and actions (Sutton & Barto, 1998). The objective is to find an optimal policy by maximising the expected value of future discounted rewards, represented by the action-value function Q (Watkins & Dayan, 1992). We also employ semi-MDP theory (Puterman, 1994), which generalizes MDPs to models with variable time between decisions.

We assume that the state is defined by a vector of d state variables, x = (x_1, x_2, ..., x_d). Large MDPs are naturally described in this factored form. In this paper we consider only negative-reward, non-discounted, finite-horizon MDPs (stochastic shortest path problems), but the algorithm has been extended to handle general finite MDPs. The issue of solving MDPs efficiently is largely orthogonal and complementary to the decomposition techniques discussed here. We have used simple one-step backup Q-learning throughout.
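Since the paper uses simple one-step backup Q-learning throughout, a minimal tabular sketch of that update may be a useful reference. The environment interface (`actions`, `reset`, `step`), the learning rate `alpha` and the epsilon-greedy exploration below are illustrative assumptions, not details from the paper.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=1.0, epsilon=0.1):
    """One-step backup Q-learning sketch.

    gamma defaults to 1.0 because the paper considers non-discounted
    stochastic shortest path problems. `env` is assumed to expose
    `actions`, `reset()` and `step(action) -> (next_state, reward, done)`.
    """
    q = defaultdict(float)  # Q[(state, action)] -> estimated return

    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Epsilon-greedy action selection over the current Q estimates.
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: q[(state, a)])

            next_state, reward, done = env.step(action)

            # One-step backup: move Q(s, a) towards r + gamma * max_a' Q(s', a').
            target = reward if done else reward + gamma * max(
                q[(next_state, a)] for a in env.actions)
            q[(state, action)] += alpha * (target - q[(state, action)])
            state = next_state
    return q
```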
HEXQ attempts to decompose an MDP by dividing the state space into nested sub-MDP regions. Decomposition is possible when (1) some of the variables in the state vector represent features in the environment that change at less frequent time intervals, (2) variables that change value more frequently retain their transition properties in the context of the more persistent variables, and (3) the interface between regions can be controlled. For example, if a robot navigates around four equally sized rooms with interconnecting doorways (Parr, 1998), the state space can be represented by the two variables room-identifier and position-in-room. The room changes less frequently than the position. The position in each room needs to be represented consistently to allow generalisation across rooms, for example by numbering cells from top to bottom, left to right in each room. Most representations naturally label repeated sub-structures in this way. Finally, we need to be able to find sub-policies to exit through each doorway with certainty. This will become clearer in the next sections. In the absence of these conditions, or when they are only partially present, HEXQ will nevertheless solve the MDP, discovering abstractions where it can. In the worst case it has to solve the 'flat' problem.

3. The Taxi Domain

Dietterich (2000a) created the taxi task (Figure 1) to demonstrate MAXQ hierarchical reinforcement learning. For MAXQ the structure of the hierarchy is specified by the user. We will use the same domain to illustrate how hierarchical decomposition can be automated. We will keep our description of HEXQ general, but use the taxi domain to illustrate the basic concepts. We start by reviewing the taxi problem.

In the taxi domain, a taxi, started at a random location, navigates around a 5-by-5 grid world to pick up and then put down a passenger. There are four possible source and destination locations, designated R, G, Y and B. We encode these as 1, 2, 3 and 4 respectively. They are called taxi ranks. The objective of the taxi agent is to go to the source rank, pick up the passenger, then navigate with the passenger in the taxi to the destination rank and put down the passenger. The source and destination ranks are also chosen at random for each new trial. At each step the taxi can perform one of six primitive actions: move one square to the north, south, east or west, pick up the passenger, or put down the passenger. A move into a wall or barrier leaves the taxi location unchanged. For a successful passenger delivery the reward is 20. If the taxi executes a pickup action at a location without the passenger, or a putdown action at the wrong destination, it receives a reward of -10. For all other steps the reward is -1. The trial terminates following a successful delivery.

The taxi problem can be formulated as an episodic MDP with three state variables: the location of the taxi (the 25 grid squares), the location of the passenger (at one of the four ranks or in the taxi) and the destination rank.
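To make the factored representation concrete, the sketch below encodes the taxi task's state vector (taxi position, passenger location, destination rank) and the reward scheme described above. The grid coordinates chosen for the ranks, the encoding of "in the taxi" as 0 and the omission of the internal barriers are simplifying assumptions for illustration, not a faithful reimplementation of the domain.

```python
import random

RANKS = {1: (0, 0), 2: (0, 4), 3: (4, 0), 4: (4, 3)}  # R, G, Y, B (assumed coordinates)
IN_TAXI = 0  # assumed encoding for "passenger is in the taxi"

class TaxiSketch:
    """Simplified taxi task: 5-by-5 grid, no internal barriers."""

    actions = ["north", "south", "east", "west", "pickup", "putdown"]

    def reset(self):
        # Factored state vector: (taxi position, passenger location, destination rank).
        self.taxi = (random.randrange(5), random.randrange(5))
        self.passenger = random.choice(list(RANKS))    # source rank 1-4
        self.destination = random.choice(list(RANKS))  # destination rank 1-4
        return (self.taxi, self.passenger, self.destination)

    def step(self, action):
        reward, done = -1, False
        row, col = self.taxi
        if action == "north":
            self.taxi = (max(row - 1, 0), col)
        elif action == "south":
            self.taxi = (min(row + 1, 4), col)
        elif action == "east":
            self.taxi = (row, min(col + 1, 4))
        elif action == "west":
            self.taxi = (row, max(col - 1, 0))
        elif action == "pickup":
            if self.passenger != IN_TAXI and self.taxi == RANKS[self.passenger]:
                self.passenger = IN_TAXI
            else:
                reward = -10  # pickup at a location without the passenger
        elif action == "putdown":
            if self.passenger == IN_TAXI and self.taxi == RANKS[self.destination]:
                reward, done = 20, True  # successful delivery ends the trial
            else:
                reward = -10  # putdown away from the destination
        return (self.taxi, self.passenger, self.destination), reward, done
```

An instance exposes the `actions`/`reset`/`step` interface assumed by the Q-learning sketch above, so the flat problem can be learned directly. Note that the taxi position changes on almost every step while the passenger and destination variables change rarely; this is the kind of frequency structure that condition (1) above refers to.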

