AN ALTERNATING MINIMIZATION ALGORITHM FOR NON-NEGATIVE MATRIX APPROXIMATION

JOEL A. TROPP

Abstract. Matrix approximation problems with non-negativity constraints arise during the analysis of high-dimensional non-negative data. Applications include dimensionality reduction, feature extraction, and statistical factor analysis. This paper discusses a general algorithm for solving these problems with respect to a large class of error measures. The primary goal of this article is to develop a rigorous convergence theory for the algorithm. It also touches on experimental results of S. Sra [1], which indicate the superiority of this method over some previously proposed techniques.

Date: 8 May 2003.
J. Tropp is with The University of Texas, Austin, TX 78712 USA, [email protected].

1. Introduction

Suppose that A is a d × N matrix whose columns are vectors of non-negative data. The problem of extracting r linear features from A can be stated as solving the approximation problem A ≈ VH, where V is a d × r matrix of feature vectors and H is an r × N matrix of coefficients. We quantify the approximation error using a measure that reflects the provenance of the data. For example, if we are working with probability distributions, it makes sense to use the variational distance or the Kullback-Leibler divergence.

A common way to find features is to compute a partial singular value decomposition (SVD) of the data matrix [2]. Although the partial SVD provides the best approximation to A in the sense of total least squares, it yields feature vectors that have both positive and negative components. Such features have no interpretation if the data are non-negative. If the data consist of pixel intensities from images, what does a negative intensity signify? If the data consist of word counts from documents, what does a negative word count mean?

To address this metaphysical complaint, several groups of researchers have proposed adding non-negativity constraints to the approximation problem. Paatero and Tapper advocated solving

    min_{V, H ≥ 0} ||VH − A||_F

for statistical factor analysis [3]. In their first paper on the subject [3], they suggested an alternating least-squares algorithm for solving the problem. Later, Paatero proposed instead a modified gradient descent [4] and a conjugate gradient method [5]. Independently, Lee and Seung considered the same optimization problem for some applications in machine learning. They originally solved it using an alternating projected gradient technique [6]. More recently, they have provided two optimally scaled gradient descent algorithms which respectively minimize the Frobenius norm and the Kullback-Leibler divergence of the approximation error [7]. None of these algorithms has been compared in the literature, and proper convergence proofs are scarce.

We shall call the problem of approximating A ≈ VH subject to the constraints V, H ≥ 0 non-negative matrix approximation (NNMA). The present work discusses an alternating minimization algorithm for solving NNMA problems where the approximation error is measured using a very general type of function. The primary goal is to provide a rigorous account of the convergence of this algorithm. I shall also touch on the work of my colleague Suvrit Sra, who has performed some preliminary experiments to compare the actual behavior of this algorithm with some of the other proposed methods [1].
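To fix ideas before the formal development, the alternating least-squares idea mentioned above is easy to sketch in code. The following Python fragment is a minimal illustration, not an algorithm from the paper: it assumes the Frobenius-norm objective, solves each block subproblem exactly with SciPy's non-negative least-squares routine, and the function name nnma_als together with all parameter choices (rank, iteration count, random initialization) are assumptions made for the example.

# Illustrative sketch only (not the paper's algorithm): alternating
# minimization for  min_{V, H >= 0} ||VH - A||_F,  solved by exact
# non-negative least-squares updates of H and V in turn.
# Shapes follow the text: A is d x N, V is d x r, H is r x N.
import numpy as np
from scipy.optimize import nnls

def nnma_als(A, r, iters=50, seed=0):
    d, N = A.shape
    rng = np.random.default_rng(seed)
    V = rng.random((d, r))      # random non-negative initial guesses
    H = rng.random((r, N))
    for _ in range(iters):
        # H-update: for each column j, min_{h >= 0} ||V h - A[:, j]||_2
        for j in range(N):
            H[:, j], _ = nnls(V, A[:, j])
        # V-update: for each row i, min_{v >= 0} ||H^T v - A[i, :]^T||_2
        for i in range(d):
            V[i, :], _ = nnls(H.T, A[i, :])
    return V, H

# Usage: approximate a small random non-negative matrix with r = 2 features.
A = np.random.default_rng(1).random((5, 8))
V, H = nnma_als(A, r=2)
print(np.linalg.norm(V @ H - A, "fro"))

Because each sweep solves its block subproblem exactly, the objective ||VH − A||_F never increases from one iteration to the next. This monotonicity is precisely the property that the convergence theory of Section 2 formalizes.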
2. Mathematical Background

2.1. Point-to-Set Maps. To understand the convergence of the algorithm, we need some elementary concepts from the theory of point-to-set maps. The power set P(Y) of a set Y is the collection of all subsets of Y. A point-to-set map Ω from the set X to the set Y is a function Ω : X → P(Y). The composition of two point-to-set maps Ω_1 : X → P(Y) and Ω_2 : Y → P(Z) is defined by (Ω_2 ∘ Ω_1)(x) = ∪_{y ∈ Ω_1(x)} Ω_2(y) [8].

Suppose that X and Y are endowed with topologies so that we may speak of convergence. A point-to-set map Ω is closed at x̂ ∈ X if {x_k} ⊂ X, x_k → x̂, y_k ∈ Ω(x_k) and y_k → ŷ together imply that ŷ ∈ Ω(x̂). In words, every convergent sequence whose elements lie in the sets {Ω(x_k)} must have its limit in the set Ω(x̂). A map is closed on X if it is closed at each point of X. The composition of two closed maps is always closed [8].

We also define two different types of stationary points. A fixed point of the map Ω : X → P(X) is a point x for which {x} = Ω(x). Meanwhile, a generalized fixed point of Ω is a point x for which x ∈ Ω(x) [9].

2.2. Iterative Algorithms. Many procedures in mathematical programming can be described using the language of point-to-set maps. An algorithm is a point-to-set map Ω : X → P(X). Given an initial point x_0, an algorithm generates a sequence of points via the rule x_{k+1} ∈ Ω(x_k). Now suppose that J : X → R_+ is a continuous, non-negative function. An algorithm Ω is monotonic with respect to J whenever y ∈ Ω(x) implies that J(y) ≤ J(x). If, in addition, y ∈ Ω(x) and J(y) = J(x) imply that y = x, then we say that the algorithm is strictly monotonic.

Theorem 2.1 (Zangwill [10]). Let Ω be an algorithm that is monotonic with respect to J. Given an initial point x_0, suppose that the algorithm generates a sequence {x_k} that lies in a compact set. Then the sequence has at least one accumulation point x̂, and J(x̂) = lim J(x_k). Moreover, if Ω is closed at x̂, then x̂ is a generalized fixed point of the algorithm.

Theorem 2.2 (Meyer [9]). Assume that the algorithm Ω is strictly monotonic with respect to J and that it generates a sequence {x_k} which lies in a compact set. If Ω is closed at an accumulation point x̂ of {x_k}, then x̂ is a (strong) fixed point of Ω. Moreover, if X is normed, ||x_{k+1} − x_k|| → 0. It follows that {x_k} converges in norm to x̂ or that the accumulation points of {x_k} form a continuum.

2.3. Infimal Maps. Suppose that f : X × Y → R_+. Then we may define the infimal map M_y : X → P(Y) by M_y(x) = arg min_{y ∈ Y} f(x, y). The infimal map M_x : Y → P(X), given by M_x(y) = arg min_{x ∈ X} f(x, y), is defined similarly.

Theorem 2.3 (Dantzig-Folkman-Shapiro [11]). If f(x̂, ·) is continuous on Y, then M_y is closed at x̂.

Theorem 2.4 (Fiorot-Huard [12]). If the infimal maps M_x and M_y are both single-valued, then the algorithm Ω def= M_x ∘ M_y is strictly monotonic with respect to f.

In Theorem 2.4, the definition of Ω would more properly read Ω def=
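As a concrete toy instance of Theorem 2.4 (an assumed example of my own, not one from the paper), take f(x, y) = (x − y)² + x² + y² on X = Y = R. Both infimal maps are single-valued with closed forms, so the composite algorithm Ω = M_x ∘ M_y is strictly monotonic with respect to f, and its iterates contract toward the unique fixed point x = 0. A minimal sketch:

# Toy illustration of Theorem 2.4 (an assumed example, not from the paper).
# For f(x, y) = (x - y)^2 + x^2 + y^2 the infimal maps have closed forms,
# and the composite algorithm Omega = M_x o M_y decreases f at every sweep.

def f(x, y):
    return (x - y) ** 2 + x ** 2 + y ** 2

def M_y(x):        # arg min over y of f(x, .)  ->  y = x / 2
    return x / 2.0

def M_x(y):        # arg min over x of f(., y)  ->  x = y / 2
    return y / 2.0

x = 1.0
for k in range(5):
    y = M_y(x)                          # minimize over the second block
    x_new = M_x(y)                      # minimize over the first block
    print(k, f(x, y), f(x_new, y))      # f(x_new, y) <= f(x, y) at each step
    x = x_new                           # iterates x_k = 4^(-k) shrink to 0

Since each half-step is an exact minimization, the printed values of f decrease monotonically, consistent with the strict monotonicity asserted by Theorem 2.4.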