CS 59000 Statistical Machine Learning
Lecture 6
Alan Qi

Outline
• ML and Bayesian estimation of Gaussian distributions
• t-distributions and mixtures of Gaussians
• Exponential family

Bayes' Theorem for Gaussian Variables
Given
  p(x) = N(x | μ, Λ⁻¹)
  p(y|x) = N(y | Ax + b, L⁻¹)
we have
  p(y) = N(y | Aμ + b, L⁻¹ + A Λ⁻¹ Aᵀ)
  p(x|y) = N(x | Σ{Aᵀ L (y − b) + Λμ}, Σ)
where
  Σ = (Λ + Aᵀ L A)⁻¹

Maximum Likelihood for the Gaussian (1)
Given i.i.d. data X = (x₁, …, x_N)ᵀ, the log likelihood function is given by
  ln p(X | μ, Σ) = −(ND/2) ln(2π) − (N/2) ln|Σ| − (1/2) Σₙ (xₙ − μ)ᵀ Σ⁻¹ (xₙ − μ)
Sufficient statistics: Σₙ xₙ and Σₙ xₙ xₙᵀ.

Maximum Likelihood for the Gaussian (2)
Set the derivative of the log likelihood function to zero,
  ∂/∂μ ln p(X | μ, Σ) = Σₙ Σ⁻¹ (xₙ − μ) = 0
and solve to obtain
  μ_ML = (1/N) Σₙ xₙ
Similarly,
  Σ_ML = (1/N) Σₙ (xₙ − μ_ML)(xₙ − μ_ML)ᵀ

Maximum Likelihood for the Gaussian (3)
Under the true distribution,
  E[μ_ML] = μ,   E[Σ_ML] = ((N − 1)/N) Σ
so Σ_ML is biased. Hence define
  Σ̃ = (1/(N − 1)) Σₙ (xₙ − μ_ML)(xₙ − μ_ML)ᵀ
Is Σ̃ biased? No: E[Σ̃] = Σ.

Sequential Estimation
Contribution of the Nth data point, x_N:
  μ_ML^(N) = (1/N) Σₙ xₙ = μ_ML^(N−1) + (1/N)(x_N − μ_ML^(N−1))
i.e. old estimate + correction weight (1/N) × correction given x_N.

Bayesian Inference for the Gaussian (1)
Assume σ² is known. Given i.i.d. data X = {x₁, …, x_N}, the likelihood function for μ is given by
  p(X|μ) = Πₙ p(xₙ|μ) = (2πσ²)^(−N/2) exp{ −(1/2σ²) Σₙ (xₙ − μ)² }
This has a Gaussian shape as a function of μ (but it is not a distribution over μ).

Bayesian Inference for the Gaussian (2)
Combined with a Gaussian prior over μ,
  p(μ) = N(μ | μ₀, σ₀²)
this gives the posterior
  p(μ|X) ∝ p(X|μ) p(μ)
Completing the square over μ, we see that
  p(μ|X) = N(μ | μ_N, σ_N²)

Bayesian Inference for the Gaussian (3)
… where
  μ_N = (σ² / (N σ₀² + σ²)) μ₀ + (N σ₀² / (N σ₀² + σ²)) μ_ML
  1/σ_N² = 1/σ₀² + N/σ²
Note: as N → ∞, μ_N → μ_ML and σ_N² → 0.

Bayesian Inference for the Gaussian (4)
Example: posteriors for N = 0, 1, 2 and 10. Data points are sampled from a Gaussian of mean 0.8 and variance 0.1. (figure)

Bayesian Inference for the Gaussian (5)
Sequential estimation: the posterior obtained after observing N − 1 data points becomes the prior when we observe the Nth data point.

Bayesian Inference for the Gaussian (6)
Now assume μ is known.
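To make the known-variance updates above concrete before turning to the known-mean case, here is a minimal numpy sketch; the function name and the specific prior values (μ₀ = 0, σ₀² = 0.1) are illustrative choices, not part of the slides:

```python
import numpy as np

def posterior_mean_params(x, sigma2, mu0, sigma02):
    """Posterior N(mu | mu_N, sigma_N^2) over the Gaussian mean mu,
    given known variance sigma2 and a N(mu0, sigma02) prior.
    Uses 1/sigma_N^2 = 1/sigma0^2 + N/sigma^2 and the
    precision-weighted combination of mu0 and mu_ML."""
    x = np.asarray(x)
    N = len(x)
    mu_ml = x.mean() if N > 0 else 0.0       # unused when N = 0
    sigma_N2 = 1.0 / (1.0 / sigma02 + N / sigma2)
    mu_N = sigma_N2 * (mu0 / sigma02 + N * mu_ml / sigma2)
    return mu_N, sigma_N2

# Data drawn from a Gaussian with mean 0.8 and variance 0.1, as in the example.
rng = np.random.default_rng(0)
data = rng.normal(0.8, np.sqrt(0.1), size=10)
for n in (0, 1, 2, 10):
    mu_N, s_N2 = posterior_mean_params(data[:n], sigma2=0.1, mu0=0.0, sigma02=0.1)
    print(f"N={n:2d}  mu_N={mu_N:+.3f}  sigma_N^2={s_N2:.4f}")
```

As N grows, the posterior mean moves from μ₀ toward μ_ML and the posterior variance shrinks, mirroring the N = 0, 1, 2, 10 example.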
The likelihood function for λ = 1/σ² is given by
  p(X|λ) = Πₙ N(xₙ | μ, λ⁻¹) ∝ λ^(N/2) exp{ −(λ/2) Σₙ (xₙ − μ)² }
This has a Gamma shape as a function of λ.

Bayesian Inference for the Gaussian (7)
The Gamma distribution:
  Gam(λ | a, b) = (1/Γ(a)) bᵃ λ^(a−1) exp(−bλ)
  E[λ] = a/b,   var[λ] = a/b²

Bayesian Inference for the Gaussian (8)
Now we combine a Gamma prior, Gam(λ | a₀, b₀), with the likelihood function for λ to obtain
  p(λ|X) ∝ λ^(a₀−1) λ^(N/2) exp{ −b₀λ − (λ/2) Σₙ (xₙ − μ)² }
which we recognize as Gam(λ | a_N, b_N) with
  a_N = a₀ + N/2
  b_N = b₀ + (1/2) Σₙ (xₙ − μ)² = b₀ + (N/2) σ²_ML

Bayesian Inference for the Gaussian (9)
If both μ and λ are unknown, the joint likelihood function is given by
  p(X | μ, λ) = Πₙ (λ/2π)^(1/2) exp{ −(λ/2)(xₙ − μ)² }
            ∝ [λ^(1/2) exp(−λμ²/2)]^N exp{ λμ Σₙ xₙ − (λ/2) Σₙ xₙ² }
We need a prior with the same functional dependence on μ and λ.

Bayesian Inference for the Gaussian (10)
The Gaussian-gamma distribution:
  p(μ, λ) ∝ [λ^(1/2) exp(−λμ²/2)]^β exp{ cλμ − dλ }
         = exp{ −(βλ/2)(μ − c/β)² } λ^(β/2) exp{ −(d − c²/2β) λ }
• Quadratic in μ.
• Linear in λ.
• Gamma distribution over λ.
• Independent of μ.

Bayesian Inference for the Gaussian (11)
The Gaussian-gamma distribution:
  p(μ, λ) = N(μ | μ₀, (βλ)⁻¹) Gam(λ | a, b)

Bayesian Inference for the Gaussian (12)
Multivariate conjugate priors:
• μ unknown, Λ known: p(μ) Gaussian.
• Λ unknown, μ known: p(Λ) Wishart,
  W(Λ | W, ν) = B(W, ν) |Λ|^((ν−D−1)/2) exp{ −(1/2) Tr(W⁻¹Λ) }
• Λ and μ unknown: p(μ, Λ) Gaussian-Wishart,
  p(μ, Λ | μ₀, β, W, ν) = N(μ | μ₀, (βΛ)⁻¹) W(Λ | W, ν)

Student's t-Distribution (1)
If we integrate out the precision of a Gaussian with a Gamma prior, we obtain
  p(x | μ, a, b) = ∫₀^∞ N(x | μ, τ⁻¹) Gam(τ | a, b) dτ
Setting ν = 2a and λ = a/b, we have
  St(x | μ, λ, ν) = (Γ((ν+1)/2) / Γ(ν/2)) (λ/(πν))^(1/2) [1 + λ(x − μ)²/ν]^(−(ν+1)/2)

Student's t-Distribution (2)
(figure)

Student's t-Distribution (3)
Robustness to outliers: Gaussian vs t-distribution. (figure)

Student's t-Distribution (4)
The D-variate case:
  St(x | μ, Λ, ν) = (Γ(D/2 + ν/2) / Γ(ν/2)) (|Λ|^(1/2) / (πν)^(D/2)) [1 + Δ²/ν]^(−D/2 − ν/2)
where Δ² = (x − μ)ᵀ Λ (x − μ).
Properties:
  E[x] = μ for ν > 1
  cov[x] = (ν/(ν − 2)) Λ⁻¹ for ν > 2
  mode[x] = μ

Mixtures of Gaussians (1)
Old Faithful data set: single Gaussian vs mixture of two Gaussians. (figure)

Mixtures of Gaussians (2)
Combine simple models into a complex model:
  p(x) = Σ_{k=1}^K π_k N(x | μ_k, Σ_k)
with components N(x | μ_k, Σ_k) and mixing coefficients π_k (example: K = 3).

Mixtures of Gaussians (3)
(figure)

Mixtures of Gaussians (4)
Determining the parameters μ, Σ, and π using maximum log likelihood:
  ln p(X | π, μ, Σ) = Σₙ ln { Σ_k π_k N(xₙ | μ_k, Σ_k) }
Solution: use standard, iterative, numeric optimization methods or the expectation maximization algorithm (Chapter 9).
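The mixture log likelihood can be evaluated directly; here is a minimal numpy sketch for the one-dimensional case (the function name and the example parameters are mine, not from the slides):

```python
import numpy as np

def log_likelihood(X, pi, mu, sigma2):
    """Log likelihood of a 1-D Gaussian mixture:
    sum_n ln { sum_k pi_k N(x_n | mu_k, sigma2_k) }."""
    X = np.asarray(X)[:, None]               # shape (N, 1), broadcasts against K components
    norm = np.exp(-0.5 * (X - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
    return np.sum(np.log(norm @ pi))         # log of a sum over components

pi = np.array([0.5, 0.5])                    # mixing coefficients, sum to 1
mu = np.array([0.0, 4.0])                    # component means
sigma2 = np.array([1.0, 1.0])                # component variances
print(log_likelihood([0.1, 3.9, 2.0], pi, mu, sigma2))
```

Because the sum over components sits inside the logarithm, setting derivatives to zero yields coupled equations rather than a closed form, which is why iterative methods such as EM are needed.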
(The log likelihood involves the log of a sum, so there is no closed-form maximum.)

The Exponential Family (1)
  p(x|η) = h(x) g(η) exp{ ηᵀ u(x) }
where η is the natural parameter and
  g(η) ∫ h(x) exp{ ηᵀ u(x) } dx = 1
so g(η) can be interpreted as a normalization coefficient.

The Exponential Family (2.1)
The Bernoulli distribution:
  p(x|μ) = Bern(x|μ) = μˣ (1 − μ)^(1−x)
         = exp{ x ln μ + (1 − x) ln(1 − μ) }
         = (1 − μ) exp{ ln(μ/(1 − μ)) x }
Comparing with the general form we see that
  η = ln(μ/(1 − μ))
and so
  μ = σ(η) = 1/(1 + exp(−η))   (the logistic sigmoid)

The Exponential Family (2.2)
The Bernoulli distribution can hence be written as
  p(x|η) = σ(−η) exp(ηx)
where
  u(x) = x,  h(x) = 1,  g(η) = σ(−η)
Reminder: σ(−η) = 1 − σ(η).

The Exponential Family (3.1)
The multinomial distribution:
  p(x|μ) = Π_{k=1}^M μ_k^{x_k} = exp{ Σ_k x_k ln μ_k } = exp(ηᵀ x)
where η_k = ln μ_k, u(x) = x, h(x) = 1, and g(η) = 1.
NOTE: The η_k parameters are not independent, since the corresponding μ_k must satisfy Σ_k μ_k = 1.

The Exponential Family (3.2)
Let μ_M = 1 − Σ_{k=1}^{M−1} μ_k. This leads to
  η_k = ln( μ_k / (1 − Σ_{j=1}^{M−1} μ_j) )
and
  μ_k = exp(η_k) / (1 + Σ_{j=1}^{M−1} exp(η_j))   (the softmax function)
Here the η_k parameters are independent. Note that
  0 ≤ μ_k ≤ 1  and  Σ_{k=1}^{M−1} μ_k ≤ 1

The Exponential Family (3.3)
The multinomial distribution can then be written as
  p(x|η) = (1 + Σ_{k=1}^{M−1} exp(η_k))^(−1) exp(ηᵀ x)
where
  u(x) = x,  h(x) = 1,  g(η) = (1 + Σ_{k=1}^{M−1} exp(η_k))^(−1)

The Exponential Family (4)
The Gaussian distribution:
  p(x | μ, σ²) = (2πσ²)^(−1/2) exp{ −(1/2σ²)(x − μ)² } = h(x) g(η) exp{ ηᵀ u(x) }
where
  η = (μ/σ², −1/(2σ²))ᵀ,  u(x) = (x, x²)ᵀ
  h(x) = (2π)^(−1/2),  g(η) = (−2η₂)^(1/2) exp(η₁²/(4η₂))

ML for the Exponential Family (1)
From the definition of g(η) we get
  g(η) ∫ h(x) exp{ ηᵀ u(x) } dx = 1
Differentiating both sides with respect to η, thus
  −∇ ln g(η) = E[u(x)]

ML for the Exponential Family (2)
Given a data set X = {x₁, …, x_N}, the likelihood function is given by
  p(X|η) = ( Πₙ h(xₙ) ) g(η)^N exp{ ηᵀ Σₙ u(xₙ) }
Setting ∇ ln p(X|η) = 0, we thus have
  −∇ ln g(η_ML) = (1/N) Σₙ u(xₙ)
Sufficient statistic: Σₙ u(xₙ).

Conjugate priors
For any member of the exponential family, there exists a prior
  p(η | χ, ν) = f(χ, ν) g(η)^ν exp{ ν ηᵀ χ }
Combining with the likelihood function, we get
  p(η | X, χ, ν) ∝ g(η)^(ν+N) exp{ ηᵀ ( Σₙ u(xₙ) + νχ ) }
The prior corresponds to ν pseudo-observations with value χ.

Posterior of Gaussian mean
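The sigmoid and softmax links above, and the sufficient-statistic form of the ML estimate, can be checked numerically; a small sketch (function names are mine, numpy assumed):

```python
import numpy as np

def sigmoid(eta):
    """Logistic sigmoid: recovers the Bernoulli mean mu from eta = ln(mu/(1-mu))."""
    return 1.0 / (1.0 + np.exp(-eta))

def softmax(eta):
    """Softmax over the M-1 free multinomial natural parameters:
    mu_k = exp(eta_k) / (1 + sum_j exp(eta_j))."""
    e = np.exp(np.asarray(eta))
    return e / (1.0 + e.sum())

# The sigmoid inverts the Bernoulli natural parameter.
mu = 0.3
eta = np.log(mu / (1.0 - mu))
print(sigmoid(eta))          # ~0.3, up to float rounding

# ML via the sufficient statistic: for Bernoulli, u(x) = x,
# so -grad ln g(eta_ML) = (1/N) sum u(x_n) gives mu_ML = sample mean.
x = np.array([1, 0, 0, 1, 1, 0, 1, 1])
print("mu_ML =", x.mean())   # -> mu_ML = 0.625
```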
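The pseudo-observation reading of the conjugate prior is easiest to see in the Bernoulli case, where the exponential-family conjugate prior is a Beta distribution; a minimal sketch (the function name and prior values are mine):

```python
import numpy as np

def beta_posterior(x, a0, b0):
    """Conjugate Beta update for a Bernoulli parameter.
    The prior Beta(a0, b0) behaves like a0 + b0 pseudo-observations
    (a0 ones and b0 zeros), matching the nu pseudo-observation
    reading of the exponential-family conjugate prior."""
    x = np.asarray(x)
    return a0 + x.sum(), b0 + len(x) - x.sum()

x = np.array([1, 1, 0, 1])
a, b = beta_posterior(x, a0=2.0, b0=2.0)
print(a, b, "posterior mean:", a / (a + b))   # -> 5.0 3.0 posterior mean: 0.625
```

The posterior counts are simply prior pseudo-counts plus observed counts, so the prior's influence fades as N grows.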