UCD MAT 280 - High Dimensional Statistical Inference


High Dimensional Statistical Inference and Random Matrices

Iain M. Johnstone*

arXiv:math.ST/0611589 v1, 19 Nov 2006

* The author is grateful to Persi Diaconis, Noureddine El Karoui, Peter Forrester, Matthew Harding, Plamen Koev, Debashis Paul, Donald Richards and Craig Tracy for advice and comments during the writing of this paper, to the Australian National University for hospitality, and to NSF DMS 0505303 and NIH R01 EB001988 for financial support.

Abstract. Multivariate statistical analysis is concerned with observations on several variables which are thought to possess some degree of inter-dependence. Driven by problems in genetics and the social sciences, it first flowered in the earlier half of the last century. Subsequently, random matrix theory (RMT) developed, initially within physics, and more recently widely in mathematics. While some of the central objects of study in RMT are identical to those of multivariate statistics, statistical theory was slow to exploit the connection. However, with vast data collection ever more common, data sets now often have as many or more variables than the number of individuals observed. In such contexts, the techniques and results of RMT have much to offer multivariate statistics. The paper reviews some of the progress to date.

Mathematics Subject Classification (2000). Primary 62H10; 62H25; 62H20; Secondary 15A52.

Keywords. Canonical correlations; eigenvector estimation; largest eigenvalue; principal components analysis; random matrix theory; Wishart distribution; Tracy-Widom distribution.

1. Introduction

Much current research in statistics, both in statistical theory and in many areas of application such as genomics, climatology or astronomy, focuses on the problems and opportunities posed by the availability of large amounts of data. (More detail may be found, for example, in the paper by Fan and Li [40] in these proceedings.) There might be many variables and/or many observations on each variable. Loosely, one can think of each variable as an additional dimension, so that many variables correspond to data sitting in a high dimensional space. Among several mathematical themes one could follow – Banach space theory, convex geometry, even topology – this paper focuses on Random Matrix Theory and some of its interactions with important areas of what in statistics is called "Multivariate Analysis."

Multivariate analysis deals with observations on more than one variable when there is or may be some dependence between the variables. The most basic phenomenon is that of correlation – the tendency of quantities to vary together: tall parents tend to have tall children. From the beginning, there has also been a focus on summarizing and interpreting data by reducing dimension, for example by methods such as Principal Components Analysis (PCA). While there are many methods and corresponding problems of mathematical interest, this paper concentrates largely on PCA as a leading example, together with a few remarks on related problems. Other overviews with substantial statistical content include [5], [30] and [36].

In an effort to define terms and give an example, the earlier sections cover introductory material, to set the stage. The more recent work, in the later sections, concentrates on results and phenomena which appear in an asymptotic regime in which p, the number of variables, increases to infinity in proportion to the sample size n.

2. Background

2.1. Principal Components Analysis.
Principal Components Analysis (PCA) is a standard technique of multivariate statistics, going back to Karl Pearson in 1901 [75] and Harold Hotelling in 1933 [51]. There is a huge literature [63], and interesting modern variants continue to appear [80, 87]. A brief description of the classical method, an example and references are included here for convenience.

PCA is usually described first for abstract random variables, and then later as an algorithm for observed data. So first suppose we have p variables $X_1, \ldots, X_p$. We think of these as random variables though, initially, little more is assumed than the existence of a covariance matrix $\Sigma = (\sigma_{kk'})$, composed of the mean-corrected second moments

    \sigma_{kk'} = \mathrm{Cov}(X_k, X_{k'}) = E(X_k - \mu_k)(X_{k'} - \mu_{k'}).

The goal is to reduce dimensionality by constructing a smaller number of "derived" variables $W = \sum_k v_k X_k$, having variance

    \mathrm{Var}(W) = \sum_{k,k'} v_k \sigma_{kk'} v_{k'} = v^T \Sigma v.

To concentrate the variation in as few derived variables as possible, one looks for vectors that maximize Var(W). Successive linear combinations are sought that are orthogonal to those previously chosen. The principal component eigenvalues $\ell_j$ and principal component eigenvectors $v_j$ are thus obtained from

    \ell_j = \max \{ v^T \Sigma v : v^T v_{j'} = 0, \; j' < j, \; |v| = 1 \}.    (1)

In statistics, it is common to assume a stochastic model in terms of random variables whose distributions contain unknown parameters, which in the present case would be the covariance matrix and its resulting principal components.

[Figure 1. The n data observations are viewed as n points in p-dimensional space, the p dimensions corresponding to the variables. The sample PC eigenvectors $\hat v_j$ create a rotation of the variables into the new derived variables, with most of the variation on the low dimension numbers. In this two-dimensional picture, we might keep the first dimension and discard the second.]

To estimate the unknown parameters of this model we have observed data, assumed to be n observations on each of the p variables. The observed data on variable $X_k$ is viewed as a vector $x_k \in \mathbb{R}^n$. The vectors of observations on each variable are collected as rows into a $p \times n$ data matrix

    X = (x_{ki}) = [x_1 \; \ldots \; x_p]^T.

A standard pre-processing step is to center each variable by subtracting the sample mean $\bar x_k = n^{-1} \sum_i x_{ki}$, so that $x_{ki} \leftarrow x_{ki} - \bar x_k$. After this centering, define the $p \times p$ sample covariance matrix $S = (s_{kk'})$ by

    S = (s_{kk'}) = n^{-1} X X^T,    s_{kk'} = n^{-1} \sum_i x_{ki} x_{k'i}.

The derived variables in the sample, $w = X^T v = \sum_k v_k x_k$, have sample variance $\widehat{\mathrm{Var}}(w) = v^T S v$. Maximising this quadratic form leads to successive sample principal components $\hat\ell_j$ and $\hat v_j$ from the sample analog of (1):

    \hat\ell_j = \max \{ v^T S v : v^T \hat v_{j'} = 0, \; j' < j, \; |v| = 1 \}.

Equivalently, we obtain, for $j = 1, \ldots, p$,

    S \hat v_j = \hat\ell_j \hat v_j,    \hat w_j = X^T \hat v_j.

Note the statistical convention: estimators derived ...
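As a concrete illustration of the sample computation just described, the following minimal NumPy sketch (not part of the paper) forms a centered p x n data matrix, the sample covariance matrix S = n^{-1} X X^T, and the sample principal components; the dimensions p, n and the simulated Gaussian data are purely illustrative assumptions.

import numpy as np

# A minimal sketch, not from the paper: simulate a p x n data matrix whose
# rows are the p variables, as in the text. p, n and the data are illustrative.
rng = np.random.default_rng(0)
p, n = 5, 200
X = rng.standard_normal((p, n))

# Center each variable: x_ki <- x_ki - xbar_k.
X = X - X.mean(axis=1, keepdims=True)

# Sample covariance matrix S = n^{-1} X X^T (divisor n, as in the text).
S = (X @ X.T) / n

# Eigen-decomposition S vhat_j = lhat_j vhat_j, sorted so lhat_1 >= ... >= lhat_p.
evals, evecs = np.linalg.eigh(S)
order = np.argsort(evals)[::-1]
lhat = evals[order]          # sample PC eigenvalues
vhat = evecs[:, order]       # columns are the sample PC eigenvectors

# Derived variables (sample PC scores) what_j = X^T vhat_j, one n-vector per j.
what = X.T @ vhat

# Check: the sample variance of each what_j equals the corresponding lhat_j.
print(np.allclose(what.var(axis=0), lhat))

The final check confirms numerically that the sample variance of each derived variable $\hat w_j$ equals the corresponding eigenvalue $\hat\ell_j$, which is exactly what the sample analog of (1) asserts.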

