International Joint Conference on Artificial Intelligence (IJCAI-01), vol. 1, pp. 673-680, Seattle, Washington, August 2001.

Max-norm Projections for Factored MDPs

Carlos Guestrin, Computer Science Dept., Stanford University, [email protected]
Daphne Koller, Computer Science Dept., Stanford University, [email protected]
Ronald Parr, Computer Science Dept., Duke University, [email protected]

Abstract

Markov Decision Processes (MDPs) provide a coherent mathematical framework for planning under uncertainty. However, exact MDP solution algorithms require the manipulation of a value function, which specifies a value for each state in the system. Most real-world MDPs are too large for such a representation to be feasible, preventing the use of exact MDP algorithms. Various approximate solution algorithms have been proposed, many of which use a linear combination of basis functions as a compact approximation to the value function. Almost all of these algorithms use an approximation based on the (weighted) $\mathcal{L}_2$-norm (Euclidean distance); this approach prevents the application of standard convergence results for MDP algorithms, all of which are based on max-norm. This paper makes two contributions. First, it presents the first approximate MDP solution algorithms — both value and policy iteration — that use max-norm projection, thereby directly optimizing the quantity required to obtain the best error bounds. Second, it shows how these algorithms can be applied efficiently in the context of factored MDPs, where the transition model is specified using a dynamic Bayesian network.

1 Introduction

Over the last few years, Markov Decision Processes (MDPs) have been used as the basic semantics for optimal planning for decision-theoretic agents in stochastic environments. In the MDP framework, the system is modeled via a set of states which evolve stochastically. The key problem with this representation is that, in virtually any real-life domain, the state space is quite large. However, many large MDPs have significant internal structure, and can be modeled compactly if the structure is exploited in the representation.

Factored MDPs [Boutilier et al., 1999] are one approach to representing large, structured MDPs compactly. In this framework, a state is implicitly described by an assignment to some set of state variables. A dynamic Bayesian network (DBN) [Dean and Kanazawa, 1989] can then allow a compact representation of the transition model, by exploiting the fact that the transition of a variable often depends only on a small number of other variables. Furthermore, the momentary rewards can often also be decomposed as a sum of rewards related to individual variables or small clusters of variables.

Even when a large MDP can be represented compactly, e.g., in a factored way, solving it exactly is still intractable: exact MDP solution algorithms require the manipulation of a value function, whose representation is linear in the number of states, which is exponential in the number of state variables. One approach is to approximate the solution using an approximate value function with a compact representation. A common choice is the use of linear value functions as an approximation — value functions that are a linear combination of basis functions.
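As a minimal illustration of this representation (the class, variable names, and toy basis functions below are our own assumptions for exposition, not code or notation from the paper), a linear value function $V(x) = \sum_i w_i h_i(x)$ can be stored by keeping only the weights and, for each basis function, the small set of state variables it depends on:

    # Illustrative sketch only: a linear value function over basis functions,
    # each restricted to a small subset of named state variables.
    from typing import Callable, Dict, List, Sequence, Tuple

    State = Dict[str, int]  # an assignment to named state variables

    class FactoredLinearValueFunction:
        def __init__(self,
                     basis: Sequence[Tuple[List[str], Callable[[State], float]]],
                     weights: Sequence[float]):
            # Each basis function is a pair (scope, h): `scope` lists the few
            # variables that h actually looks at.
            self.basis = list(basis)
            self.weights = list(weights)

        def value(self, state: State) -> float:
            total = 0.0
            for (scope, h), w in zip(self.basis, self.weights):
                restricted = {var: state[var] for var in scope}  # h sees only its scope
                total += w * h(restricted)
            return total

    # Example: two basis functions over a 3-variable state, each touching one variable.
    h1 = (["x1"], lambda s: 1.0 if s["x1"] == 1 else 0.0)
    h2 = (["x2"], lambda s: float(s["x2"]))
    V = FactoredLinearValueFunction([h1, h2], weights=[0.7, 0.3])
    print(V.value({"x1": 1, "x2": 0, "x3": 1}))  # prints 0.7

The point of the restriction to small scopes is that each basis function can be evaluated and manipulated without ever enumerating the exponentially large joint state space.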
This paper makes a twofold contribution. First, we provide a new approach for approximately solving MDPs using a linear value function. Previous approaches to linear function approximation have typically utilized a least squares ($\mathcal{L}_2$-norm) approximation to the value function. Least squares approximations are incompatible with most convergence analyses for MDPs, which are based on max-norm. We provide the first MDP solution algorithms — both value iteration and policy iteration — that use a linear max-norm projection to approximate the value function, thereby directly optimizing the quantity required to obtain the best error bounds.

Second, we show how to exploit the structure of the problem in order to apply this technique to factored MDPs. Our work builds on the ideas of Koller and Parr [1999; 2000], by using factored (linear) value functions, where each basis function is restricted to some small subset of the domain variables. We show that, for a factored MDP and factored value functions, various key operations can be implemented in closed form without enumerating the entire state space. Thus, our max-norm algorithms can be implemented efficiently, even though the size of the state space grows exponentially in the number of variables.

2 Markov decision processes

A Markov Decision Process (MDP) is defined as a 4-tuple $(S, A, R, P)$ where: $S$ is a finite set of states; $A$ is a set of actions; $R$ is a reward function $R : S \times A \mapsto \mathbb{R}$, such that $R(s, a)$ represents the reward obtained by the agent in state $s$ after taking action $a$; and $P$ is a Markovian transition model where $P(s' \mid s, a)$ represents the probability of going from state $s$ to state $s'$ with action $a$.

We will be assuming that the MDP has an infinite horizon and that future rewards are discounted exponentially with a discount factor $\gamma \in [0, 1)$. A stationary policy $\pi$ for an MDP is a mapping $\pi : S \mapsto A$, where $\pi(s)$ is the action the agent takes at state $s$. The policy is associated with a value function $V_\pi \in \mathbb{R}^{|S|}$, where $V_\pi(s)$ is the discounted cumulative value that the agent gets if it starts at state $s$. The value function for a fixed policy is the fixed point of a set of equations that define the value of a state in terms of the value of its possible successor states. More formally, we define:

Definition 2.1 The DP operator, $T_\pi$, for a fixed stationary policy $\pi$ is:
$$T_\pi V(s) = R(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s)) \, V(s').$$
$V_\pi$ is the fixed point of $T_\pi$: $T_\pi V_\pi = V_\pi$.

The optimal value function $V^*$ is also defined by a set of equations. In this case, the value of a state must be the maximal value achievable by any action at that state. More precisely, we define:

Definition 2.2 The Bellman operator, $T^*$, is:
$$T^* V(s) = \max_a \Big[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V(s') \Big].$$
$V^*$ is the fixed point of $T^*$: $T^* V^* = V^*$.

For any value function $V$, we can define the policy obtained by acting greedily relative to $V$. In other words, at each state, we take the action that maximizes the one-step utility, assuming that $V$ represents our long-term utility achieved at the next state. More precisely, we define
$$\mathrm{Greedy}(V)(s) = \arg\max_a \Big[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V(s') \Big].$$
The greedy policy relative to the optimal value function $V^*$ is the optimal policy $\pi^* = \mathrm{Greedy}(V^*)$. There are several algorithms to compute the optimal policy; we will focus on the two most used: value iteration and policy iteration.

Value iteration relies on the fact that the Bellman operator $T^*$ is a contraction — it is guaranteed to reduce the max-norm ($\mathcal{L}_\infty$) distance between any pair of value functions by a factor of at least $\gamma$. This property guarantees that the Bellman operator has a unique fixed point, $V^*$.
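Written as a formula (a standard statement of the contraction property, displayed here for clarity rather than quoted from the preview):
$$\| T^* V_1 - T^* V_2 \|_\infty \;\le\; \gamma \, \| V_1 - V_2 \|_\infty \qquad \text{for all value functions } V_1, V_2.$$
By the Banach fixed-point theorem, iterating $T^*$ from any initial estimate therefore converges to $V^*$, with the max-norm error shrinking geometrically.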


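The following minimal sketch illustrates tabular (non-factored) value iteration with the max-norm stopping criterion described above; it is our own illustration under assumed array layouts, not the paper's factored algorithm, and the function and variable names are hypothetical.

    # Illustrative sketch: tabular value iteration on a small explicit MDP,
    # stopping when the max-norm change ||T*V - V||_inf falls below `tol`.
    import numpy as np

    def value_iteration(R, P, gamma=0.9, tol=1e-6):
        """R: rewards of shape (S, A); P: transitions of shape (A, S, S) with
        P[a, s, s'] the probability of moving from s to s' under action a.
        Returns an estimate of V* and a greedy policy relative to it."""
        num_states, num_actions = R.shape
        V = np.zeros(num_states)
        while True:
            # Bellman backup: Q[s, a] = R(s, a) + gamma * sum_s' P(s'|s,a) V(s')
            Q = R + gamma * np.einsum("ast,t->sa", P, V)
            V_new = Q.max(axis=1)
            if np.max(np.abs(V_new - V)) < tol:  # max-norm stopping criterion
                break
            V = V_new
        return V_new, Q.argmax(axis=1)  # greedy policy relative to the estimate

    # Tiny 2-state, 2-action example.
    R = np.array([[0.0, 1.0], [0.0, 2.0]])
    P = np.array([[[1.0, 0.0], [0.0, 1.0]],   # action 0: stay put
                  [[0.5, 0.5], [0.5, 0.5]]])  # action 1: move uniformly
    V_star, policy = value_iteration(R, P)
    print(V_star, policy)

The explicit (S, A)-sized arrays here are exactly what becomes infeasible for factored MDPs, which is why the paper's algorithms instead work with factored value functions and max-norm projections computed in closed form.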