Gaussian Distribution
Sargur N. Srihari, Machine Learning

The Gaussian Distribution
• For a single real-valued variable x:
  N(x \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ -\frac{1}{2\sigma^2}(x-\mu)^2 \right\}
  – Parameters: mean µ and variance σ²
  – Standard deviation σ, precision β = 1/σ²
  – E[x] = µ, var[x] = σ²
• For a D-dimensional vector x, the multivariate Gaussian is
  N(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2}(x-\mu)^{T} \Sigma^{-1} (x-\mu) \right\}
  – µ is the mean vector, Σ is a D x D covariance matrix, |Σ| is the determinant of Σ
  – Σ⁻¹ is also referred to as the precision matrix
• Named after Carl Friedrich Gauss (1777-1855)

Covariance Matrix
• Gives a measure of the dispersion of the data
• It is a D x D matrix
  – The element in position i,j is the covariance between the ith and jth variables
• The covariance between two variables x_i and x_j is defined as E[(x_i - µ_i)(x_j - µ_j)]
• It can be positive or negative
  – If the variables are independent, the covariance is zero
  – Then all off-diagonal elements are zero and the diagonal elements are the variances

Importance of the Gaussian
• The Gaussian arises in many different contexts, e.g.,
  – For a single variable, the Gaussian maximizes the entropy (for a given mean and variance)
  – The sum of a set of random variables becomes increasingly Gaussian
• [Figure: histogram of one variable uniform over [0,1]; histogram of the mean of two such variables; histogram of the mean of ten such variables. Intuition: two values such as 0.8 and 0.2 average to 0.5, and there are more ways of obtaining a mean of 0.5 than of, say, 0.1.]

Geometry of the Gaussian
• The functional dependence of the Gaussian on x is through
  \Delta^2 = (x - \mu)^{T} \Sigma^{-1} (x - \mu)
  – Δ is called the Mahalanobis distance
  – It reduces to the Euclidean distance when Σ is the identity matrix
• The matrix Σ is symmetric and has the eigenvector equation Σu_i = λ_i u_i
  – u_i are the eigenvectors and λ_i the eigenvalues
• [Figure: two-dimensional Gaussian over x = (x_1, x_2); red elliptical contour of constant density whose major axes lie along the eigenvectors u_i.]

Contours of Constant Density
• Determined by the covariance matrix
  – Covariances represent how features vary together
• (a) General form (b) Diagonal matrix (contours aligned with the coordinate axes) (c) Proportional to the identity matrix (concentric circles)

Joint, Marginal and Conditional with Gaussian
• If two sets of variables x_a, x_b are jointly Gaussian, then the two conditional densities and the two marginals are also Gaussian
• Given the joint Gaussian N(x | µ, Σ) with precision Λ = Σ⁻¹ and x = [x_a, x_b]^T, where x_a are the first M components of x and x_b the remaining D-M components:
• Conditional:
  p(x_a \mid x_b) = N(x_a \mid \mu_{a|b}, \Lambda_{aa}^{-1}), \quad \text{where } \mu_{a|b} = \mu_a - \Lambda_{aa}^{-1} \Lambda_{ab} (x_b - \mu_b)
• Marginal:
  p(x_a) = N(x_a \mid \mu_a, \Sigma_{aa}), \quad \text{where } \Sigma = \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}
• [Figure: joint p(x_a, x_b), together with the marginal p(x_a) and the conditional p(x_a | x_b).]

Maximum Likelihood for the Gaussian
• Given a data set X = (x_1, .., x_N)^T in which the observations {x_n} are drawn independently
• The log-likelihood function is
  \ln p(X \mid \mu, \Sigma) = -\frac{ND}{2} \ln(2\pi) - \frac{N}{2} \ln|\Sigma| - \frac{1}{2} \sum_{n=1}^{N} (x_n - \mu)^{T} \Sigma^{-1} (x_n - \mu)
• Its derivative with respect to µ is
  \frac{\partial}{\partial \mu} \ln p(X \mid \mu, \Sigma) = \sum_{n=1}^{N} \Sigma^{-1} (x_n - \mu)
• whose solution is
  \mu_{ML} = \frac{1}{N} \sum_{n=1}^{N} x_n
• Maximization with respect to Σ is more involved and yields
  \Sigma_{ML} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{ML})(x_n - \mu_{ML})^{T}

Bias of the M.L. Estimate of the Covariance Matrix
• For N(µ, Σ), the m.l.e. of Σ for samples x_1, .., x_N is
  \Sigma_{ML} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{ML})(x_n - \mu_{ML})^{T}
  – the arithmetic average of the N matrices (x_n - µ_ML)(x_n - µ_ML)^T
• Since
  E[\Sigma_{ML}] = \frac{N-1}{N} \Sigma
  – the m.l.e. is smaller than the true value of Σ, so the m.l.e. is biased
  – irrespective of the number of samples, its expectation does not give the exact value
  – for large N this is inconsequential
• Rule of thumb: use 1/N for a known mean and 1/(N-1) for an estimated mean
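The following is a minimal NumPy sketch (not part of the slides) illustrating the formulas above: the ML estimates µ_ML and Σ_ML, the 1/(N-1) correction from the rule of thumb, and the squared Mahalanobis distance. The function names and the synthetic 2-D data are illustrative assumptions, not from the source.

```python
import numpy as np

def gaussian_mle(X):
    """Maximum likelihood estimates for a multivariate Gaussian.

    X : (N, D) array of observations drawn independently from N(mu, Sigma).
    Returns (mu_ML, Sigma_ML). Sigma_ML uses the biased 1/N normalization,
    for which E[Sigma_ML] = (N-1)/N * Sigma.
    """
    N, D = X.shape
    mu_ml = X.mean(axis=0)            # mu_ML = (1/N) sum_n x_n
    diff = X - mu_ml                  # rows are (x_n - mu_ML)
    sigma_ml = diff.T @ diff / N      # (1/N) sum_n (x_n - mu)(x_n - mu)^T
    return mu_ml, sigma_ml

def mahalanobis_sq(x, mu, Sigma):
    """Squared Mahalanobis distance Delta^2 = (x-mu)^T Sigma^{-1} (x-mu)."""
    d = x - mu
    return float(d @ np.linalg.solve(Sigma, d))

# Example: estimate parameters from synthetic 2-D data (illustrative values).
rng = np.random.default_rng(0)
true_mu = np.array([1.0, -2.0])
true_Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=500)

mu_ml, Sigma_ml = gaussian_mle(X)
Sigma_unbiased = Sigma_ml * len(X) / (len(X) - 1)   # 1/(N-1) rule of thumb
print(mu_ml, Sigma_ml, Sigma_unbiased, sep="\n")
print("Delta^2 at true mean:", mahalanobis_sq(true_mu, mu_ml, Sigma_ml))
```

With 500 samples the 1/N and 1/(N-1) estimates differ only slightly, which matches the remark that the bias is inconsequential for large N.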
• The bias does not arise in the Bayesian solution
• The corrected (unbiased) estimate uses the 1/(N-1) normalization:
  \tilde{\Sigma} = \frac{1}{N-1} \sum_{n=1}^{N} (x_n - \mu_{ML})(x_n - \mu_{ML})^{T}

Sequential Estimation
• In on-line applications and with large data sets, batch processing of all data points is infeasible
  – e.g., a real-time learning scenario in which a steady stream of data is arriving and predictions must be made before all of the data is seen
• Sequential methods allow data points to be processed one at a time and then discarded
  – Sequential learning also arises naturally from the Bayesian viewpoint
• The M.L.E. for the parameters of the Gaussian gives a convenient opportunity for a more general discussion of sequential estimation for maximum likelihood

Sequential Estimation of the Gaussian Mean
• By dissecting out the contribution of the final data point:
  \mu_{ML}^{(N)} = \frac{1}{N} \sum_{n=1}^{N} x_n = \mu_{ML}^{(N-1)} + \frac{1}{N}\left(x_N - \mu_{ML}^{(N-1)}\right)
• Same result as the earlier batch formula
• Nice interpretation:
  – After observing N-1 data points we have estimated µ by µ_ML^(N-1)
  – We now observe data point x_N and obtain a revised estimate by moving the old estimate a small amount in its direction
  – As N increases, the contribution from successive data points gets smaller

General Sequential Estimation
• The contribution of the final data point cannot always be factored out in this way
• Robbins and Monro (1951) gave a general solution
• Consider a pair of random variables θ and z with joint distribution p(z, θ)
• The conditional expectation of z given θ,
  f(\theta) = E[z \mid \theta] = \int z \, p(z \mid \theta) \, dz,
  is called a regression function
  – the same function that minimizes the expected squared loss seen earlier
• It can be shown that finding the maximum likelihood solution is equivalent to finding the root of the regression function
  – The goal is to find θ* at which f(θ*) = 0

Robbins-Monro Algorithm
• Defines a sequence of successive estimates of the root θ* as follows:
  \theta^{(N)} = \theta^{(N-1)} + a_{N-1} \, z\!\left(\theta^{(N-1)}\right)
  – where z(θ^(N)) is the observed value of z when θ takes the value θ^(N)
• The coefficients {a_N} satisfy the conditions
  \lim_{N \to \infty} a_N = 0, \qquad \sum_{N=1}^{\infty} a_N = \infty, \qquad \sum_{N=1}^{\infty} a_N^2 < \infty
• For maximum likelihood the solution has a form in which z involves a derivative of p(x|θ) with respect to θ
• The sequential update of the Gaussian mean is a special case of Robbins-Monro

Bayesian Inference for the Gaussian
• The MLE framework gives point estimates for the parameters
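Below is a minimal sketch (not from the slides) of the sequential estimate of a Gaussian mean. It implements the update µ^(N) = µ^(N-1) + (1/N)(x_N - µ^(N-1)) given above; the correspondence to Robbins-Monro, with z = (x_N - µ)/σ² and coefficients a_{N-1} = σ²/N, follows the slides' special-case remark. The data stream and function name are illustrative assumptions.

```python
import numpy as np

def sequential_mean(stream):
    """Sequential ML estimate of a Gaussian mean, one data point at a time.

    Implements mu^(N) = mu^(N-1) + (1/N) * (x_N - mu^(N-1)); each point is
    processed and then discarded, so the full data set is never stored.
    """
    mu = 0.0
    for n, x in enumerate(stream, start=1):
        mu = mu + (x - mu) / n   # move the old estimate toward x_N by 1/N
    return mu

# Example: a stream of 10,000 draws from N(3.0, 1.5^2) (illustrative values).
rng = np.random.default_rng(1)
data = rng.normal(loc=3.0, scale=1.5, size=10_000)
print(sequential_mean(data))     # approaches the true mean 3.0
print(data.mean())               # identical to the batch ML estimate
```

The shrinking step size 1/N plays the role of the Robbins-Monro coefficients: it goes to zero so the estimate settles, while its sum diverges so the estimate can still reach the root from any starting point.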