AN ALTERNATING MINIMIZATION ALGORITHM FOR NON-NEGATIVE MATRIX APPROXIMATION

JOEL A. TROPP

Abstract. Matrix approximation problems with non-negativity constraints arise during the analysis of high-dimensional non-negative data. Applications include dimensionality reduction, feature extraction and statistical factor analysis. This paper discusses a general algorithm for solving these problems with respect to a large class of error measures. The primary goal of this article is to develop a rigorous convergence theory for the algorithm. It also touches on some experimental results of S. Sra [1] which indicate the superiority of this method to some previously proposed techniques.

Date: 8 May 2003.
J. Tropp is with The University of Texas, Austin, TX 78712 USA, [email protected].

1. Introduction

Suppose that A is a d × N matrix whose columns are vectors of non-negative data. The problem of extracting r linear features from A can be stated as solving the approximation problem A ≈ VH, where V is a d × r matrix of feature vectors and H is an r × N matrix of coefficients. We quantify the approximation error using a measure that reflects the provenance of the data. For example, if we are working with probability distributions, it makes sense to use the variational distance or the Kullback-Leibler divergence.

A common way to find features is to compute a partial singular value decomposition (SVD) of the data matrix [2]. Although partial SVD provides the best approximation to A in the sense of total least squares, it yields feature vectors that have both positive and negative components. Such features have no interpretation if the data are non-negative. If the data consist of pixel intensities from images, what does a negative intensity signify? If the data consist of word counts from documents, what does a negative word count mean?

To address this metaphysical complaint, several groups of researchers have proposed adding non-negativity constraints to the approximation problem. Paatero and Tapper advocated solving min_{V,H ≥ 0} ‖VH − A‖_F for statistical factor analysis [3]. In their first paper on the subject [3], they suggested using an alternating least-squares algorithm to solve the problem. Later Paatero proposed instead a modified gradient descent [4] and a conjugate gradient method [5]. Independently, Lee and Seung considered the same optimization problem for some applications in machine learning. They originally solved it using an alternating projected gradient technique [6]. More recently, they have provided two optimally scaled gradient descent algorithms which respectively minimize the Frobenius norm and the Kullback-Leibler divergence of the approximation error [7]. None of these algorithms has been compared in the literature, and proper convergence proofs are scarce.

We shall call the problem of approximating A ≈ VH subject to the constraints V, H ≥ 0 non-negative matrix approximation (NNMA). The present work discusses an alternating minimization algorithm for solving NNMA problems where the approximation error is measured using a very general type of function. The primary goal is to provide a rigorous account of the convergence of this algorithm. I shall also touch on the work of my colleague Suvrit Sra, who has performed some preliminary experiments to compare the actual behavior of this algorithm with some of the other proposed methods [1].
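For concreteness, here is a minimal sketch (an editor's illustration, not the algorithm analyzed in this paper) of one way to read the alternating least-squares idea attributed above to Paatero and Tapper: the Frobenius-norm problem min_{V,H ≥ 0} ‖VH − A‖_F is attacked by alternately solving non-negative least-squares subproblems for H and for V. The function name nnma_als and the column-by-column use of scipy.optimize.nnls are assumptions made purely for this example.

    # Illustrative sketch: alternating non-negative least squares for the
    # Frobenius-norm NNMA problem  min_{V,H >= 0} ||VH - A||_F.
    # Not the paper's general algorithm; the update strategy and the name
    # `nnma_als` are assumptions made for this example.
    import numpy as np
    from scipy.optimize import nnls

    def nnma_als(A, r, iters=100, seed=0):
        """Approximate a non-negative d x N matrix A by V (d x r) times H (r x N)."""
        rng = np.random.default_rng(seed)
        d, N = A.shape
        V = rng.random((d, r))
        H = rng.random((r, N))
        for _ in range(iters):
            # With V fixed, each column of H solves a convex NNLS problem.
            for j in range(N):
                H[:, j], _ = nnls(V, A[:, j])
            # With H fixed, each row of V solves the transposed problem A^T ~ H^T V^T.
            for i in range(d):
                V[i, :], _ = nnls(H.T, A[i, :])
        return V, H

Because each half-step exactly minimizes a convex subproblem, the Frobenius error cannot increase from one sweep to the next; it is exactly this monotone-descent structure that the machinery of the next section is designed to analyze.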
2. Mathematical Background

2.1. Point-to-Set Maps. To understand the convergence of the algorithm, we need some elementary concepts from the theory of point-to-set maps. The power set P(Y) of a set Y is the collection of all subsets of Y. A point-to-set map Ω from the set X to the set Y is a function Ω : X → P(Y). The composition of two point-to-set maps Ω_1 : X → P(Y) and Ω_2 : Y → P(Z) is defined by (Ω_2 ∘ Ω_1)(x) = ∪_{y ∈ Ω_1(x)} Ω_2(y) [8].

Suppose that X and Y are endowed with topologies so that we may speak of convergence. A point-to-set map Ω is closed at x̂ ∈ X if {x_k} ⊂ X, x_k → x̂, y_k ∈ Ω(x_k) and y_k → ŷ together imply that ŷ ∈ Ω(x̂). In words, every convergent sequence whose elements lie in the sets {Ω(x_k)} must have its limit in the set Ω(x̂). A map is closed on X if it is closed at each point of X. The composition of two closed maps is always closed [8].

We also define two different types of stationary points. A fixed point of the map Ω : X → P(X) is a point x for which {x} = Ω(x). Meanwhile, a generalized fixed point of Ω is a point x for which x ∈ Ω(x) [9].

2.2. Iterative Algorithms. Many procedures in mathematical programming can be described using the language of point-to-set maps. An algorithm is a point-to-set map Ω : X → P(X). Given an initial point x_0, an algorithm generates a sequence of points via the rule x_{k+1} ∈ Ω(x_k). Now, suppose that J : X → R_+ is a continuous, non-negative function. An algorithm Ω is monotonic with respect to J whenever y ∈ Ω(x) implies that J(y) ≤ J(x). If, in addition, y ∈ Ω(x) and J(y) = J(x) imply that y = x, then we say that the algorithm is strictly monotonic.

Theorem 2.1 (Zangwill [10]). Let Ω be an algorithm that is monotonic with respect to J. Given an initial point x_0, suppose that the algorithm generates a sequence {x_k} that lies in a compact set. Then the sequence has at least one accumulation point x̂, and J(x̂) = lim J(x_k). Moreover, if Ω is closed at x̂, then x̂ is a generalized fixed point of the algorithm.

Theorem 2.2 (Meyer [9]). Assume that the algorithm Ω is strictly monotonic with respect to J and that it generates a sequence {x_k} which lies in a compact set. If Ω is closed at an accumulation point x̂ of {x_k}, then x̂ is a (strong) fixed point of Ω. Moreover, if X is normed, then ‖x_{k+1} − x_k‖ → 0. It follows that {x_k} converges in norm to x̂ or that the accumulation points of {x_k} form a continuum.

2.3. Infimal Maps. Suppose that f : X × Y → R_+. Then we may define the infimal map M_y : X → P(Y) by M_y(x) = arg min_{y ∈ Y} f(x, y). Define the map y ↦ M_x(y) similarly.

Theorem 2.3 (Dantzig-Folkman-Shapiro [11]). If f(x̂, ·) is continuous on Y, then M_y is closed at x̂.

Theorem 2.4 (Fiorot-Huard [12]). If the infimal maps M_x and M_y are both single-valued, then the algorithm Ω := M_x ∘ M_y is strictly monotonic with respect to f.

In Theorem 2.4, the definition of Ω would more properly read Ω :=
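In the language of Section 2.3, the two half-steps of the alternating sketch given after the Introduction are infimal maps of the Frobenius error f(V, H) = ‖VH − A‖_F, one minimizing over H with V fixed and one minimizing over V with H fixed, and a full sweep is their composition as in Theorem 2.4. The following editor's illustration (again assuming the hypothetical NNLS-based updates) records the error after each sweep and checks the monotone decrease that Theorems 2.1 and 2.2 take as their hypothesis; it is a numerical sanity check, not part of the convergence argument.

    # Numerical illustration (not a proof): the composition of the two infimal
    # maps should yield a non-increasing error sequence J(x_k), which is the
    # monotonicity hypothesis in Theorems 2.1 and 2.2.
    import numpy as np
    from scipy.optimize import nnls

    def nnma_als_trace(A, r, iters=30, seed=0):
        """Alternating NNLS for ||VH - A||_F, returning the error after each sweep."""
        rng = np.random.default_rng(seed)
        d, N = A.shape
        V, H = rng.random((d, r)), rng.random((r, N))
        errors = []
        for _ in range(iters):
            for j in range(N):   # infimal map over H (V held fixed)
                H[:, j], _ = nnls(V, A[:, j])
            for i in range(d):   # infimal map over V (H held fixed)
                V[i, :], _ = nnls(H.T, A[i, :])
            errors.append(np.linalg.norm(V @ H - A, "fro"))
        return errors

    if __name__ == "__main__":
        A = np.random.default_rng(1).random((20, 40))  # non-negative test data
        errs = nnma_als_trace(A, r=5)
        # J(x_{k+1}) <= J(x_k) for every k, up to floating-point rounding.
        assert all(b <= a + 1e-10 for a, b in zip(errs, errs[1:]))
        print("error per sweep:", [round(e, 4) for e in errs[:5]], "...")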

