Machine Learning    Srihari

Probability Distributions
Sargur N. Srihari

Distributions: Landscape
• Discrete, binary: Bernoulli, Binomial, Beta
• Discrete, multi-valued: Multinomial, Dirichlet
• Continuous: Gaussian, Gamma, Wishart, Student's-t, Exponential, Uniform
• Angular: Von Mises

Distributions: Relationships
• Bernoulli: a single binary variable
• Binomial: N samples of a Bernoulli (the Bernoulli is the N=1 case; for large N the Binomial approaches a Gaussian)
• Beta: a continuous variable in [0,1]; conjugate prior of the Bernoulli/Binomial parameter
• Multinomial: one of K values, represented as a K-dimensional binary vector (K=2 recovers the Bernoulli/Binomial)
• Dirichlet: K random variables in [0,1]; conjugate prior of the multinomial parameters
• Gamma: conjugate prior of the univariate Gaussian precision
• Wishart: conjugate prior of the multivariate Gaussian precision matrix
• Gaussian-Gamma: conjugate prior of a univariate Gaussian with unknown mean and precision
• Gaussian-Wishart: conjugate prior of a multivariate Gaussian with unknown mean and precision matrix
• Student's-t: a generalization of the Gaussian that is robust to outliers; an infinite mixture of Gaussians
• Exponential: a special case of the Gamma
• Uniform

Binary Variables
Bernoulli, Binomial and Beta

Bernoulli Distribution    (Jacob Bernoulli, 1654-1705)
• Expresses the distribution of a single binary-valued random variable x ∈ {0,1}
• The probability of x=1 is denoted by the parameter µ, i.e., p(x=1|µ) = µ
• Therefore p(x=0|µ) = 1−µ
• The probability distribution has the form
    Bern(x|µ) = µ^x (1−µ)^(1−x)
• Mean: E[x] = µ
• Variance: var[x] = µ(1−µ)
• Likelihood of N observations x_1,..,x_N drawn independently from p(x|µ):
    p(D|µ) = ∏_{n=1}^{N} µ^{x_n} (1−µ)^{1−x_n}
• Log-likelihood:
    ln p(D|µ) = Σ_{n=1}^{N} { x_n ln µ + (1−x_n) ln(1−µ) }
• The maximum likelihood estimator, obtained by setting the derivative of ln p(D|µ) with respect to µ to zero, is
    µ_ML = (1/N) Σ_{n=1}^{N} x_n
• If the number of observations with x=1 is m, then µ_ML = m/N
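As a minimal sketch of the estimator above (the function name and data are hypothetical, not from the slides), the maximum likelihood estimate for a Bernoulli parameter is just the fraction of observations equal to 1:

```python
def bernoulli_mle(xs):
    """ML estimate mu_ML = m/N for binary observations xs."""
    m = sum(xs)          # m = number of observations with x = 1
    return m / len(xs)   # mu_ML = m / N

# Hypothetical coin-flip data: m = 5 heads out of N = 8 trials
data = [1, 0, 1, 1, 0, 1, 0, 1]
mu_ml = bernoulli_mle(data)   # 5/8 = 0.625
```

This directly illustrates the over-fitting problem the Beta prior addresses later: with a data set of three heads, `bernoulli_mle([1, 1, 1])` returns 1.0, predicting that tails can never occur.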
Binomial Distribution
• Related to the Bernoulli distribution
• Expresses the distribution of m, the number of observations for which x=1 out of N trials
• Proportional to Bern(x|µ); the binomial coefficient adds up all the ways of obtaining m heads:
    Bin(m|N,µ) = (N choose m) µ^m (1−µ)^(N−m)
• Mean and variance:
    E[m] = Nµ,  var[m] = Nµ(1−µ)
(Figure: histogram of the Binomial distribution for N=10 and µ=0.25)

Beta Distribution
• The Beta distribution is
    Beta(µ|a,b) = [Γ(a+b) / (Γ(a)Γ(b))] µ^(a−1) (1−µ)^(b−1)
  where the Gamma function is defined as
    Γ(x) = ∫_0^∞ u^(x−1) e^(−u) du
• a and b are hyperparameters that control the distribution of the parameter µ
• Mean and variance:
    E[µ] = a/(a+b),  var[µ] = ab / [(a+b)^2 (a+b+1)]
(Figure: the Beta distribution as a function of µ for hyperparameter settings a=0.1, b=0.1; a=1, b=1; a=2, b=3; a=8, b=4)

Bayesian Inference with Beta
• The MLE of µ in the Bernoulli is the fraction of observations with x=1
  – Severely over-fitted for small data sets
• The likelihood function is a product of factors of the form µ^x (1−µ)^(1−x)
• If the prior distribution of µ is chosen proportional to powers of µ and 1−µ, the posterior will have the same functional form as the prior
  – This property is called conjugacy
• The Beta distribution has the right functional form to serve as a prior p(µ)
• The posterior, obtained by multiplying the Beta prior by the binomial likelihood, is
    p(µ|m,l,a,b) = [Γ(m+a+l+b) / (Γ(m+a)Γ(l+b))] µ^(m+a−1) (1−µ)^(l+b−1)
  – where m is the number of heads and l = N−m is the number of tails
• It is another Beta distribution
  – Observing the data effectively increases the value of a by m and of b by l
  – As the number of observations increases, the distribution becomes more sharply peaked
(Figure: one step of sequential inference — a Beta prior with a=2, b=2; a single observation N=m=1 with x=1, i.e., likelihood µ^1(1−µ)^0; the resulting Beta posterior with a=3, b=2)
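The conjugate update above needs no integration in practice: the posterior hyperparameters are obtained by simple addition. A short sketch (function names are mine, not from the slides):

```python
def beta_posterior(a, b, m, l):
    """Posterior hyperparameters of Beta(a, b) after observing
    m heads and l tails: the posterior is Beta(a + m, b + l)."""
    return a + m, b + l

def beta_mean(a, b):
    """Mean of Beta(a, b): E[mu] = a / (a + b)."""
    return a / (a + b)

# The slide's illustration: prior a=2, b=2; one observation x=1 (m=1, l=0)
a_post, b_post = beta_posterior(2, 2, 1, 0)   # -> (3, 2)
```

Note how the hyperparameters a and b act as fictitious prior counts of heads and tails, which is why the posterior mean shrinks the raw fraction m/N toward the prior mean.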
Predicting the Next Trial Outcome
• We need the predictive distribution of x given the observed data D
  – From the sum and product rules:
    p(x=1|D) = ∫_0^1 p(x=1,µ|D) dµ = ∫_0^1 p(x=1|µ) p(µ|D) dµ = ∫_0^1 µ p(µ|D) dµ = E[µ|D]
• The expected value of the posterior distribution can be shown to be
    p(x=1|D) = (m+a) / (m+a+l+b)
  – which is the fraction of observations (both fictitious and real) corresponding to x=1
• The maximum likelihood and Bayesian results agree in the limit of infinitely many observations
  – On average, uncertainty (variance) decreases as more data are observed

Summary
• The distribution of a single binary variable is the Bernoulli
• The Binomial is related to the Bernoulli
  – It expresses the distribution of the number of occurrences of 1 (or 0) in N trials
• The Beta distribution is a conjugate prior for the Bernoulli
  – Both have the same functional form

Multinomial Variables
Generalized Bernoulli and Dirichlet

Generalization of the Bernoulli
• A discrete variable that takes one of K values (instead of 2)
• Represent it with the 1-of-K scheme
  – Represent x as a K-dimensional vector
  – If the variable takes the third of K=6 values, we represent it as x = (0,0,1,0,0,0)^T
  – Such vectors satisfy Σ_k x_k = 1
• If the probability of x_k=1 is denoted µ_k, the distribution of x (a generalized Bernoulli) is
    p(x|µ) = ∏_{k=1}^{K} µ_k^{x_k},  where Σ_k µ_k = 1 and µ_k ≥ 0

Likelihood Function
• Given a data set D of N independent observations x_1,..,x_N
• The likelihood function has the form
    p(D|µ) = ∏_{n=1}^{N} ∏_{k=1}^{K} µ_k^{x_nk} = ∏_{k=1}^{K} µ_k^{m_k}
  where m_k = Σ_n x_nk is the number of observations with x_k=1
• The maximum likelihood solution (from the log-likelihood, maximized subject to the constraint Σ_k µ_k = 1) is
    µ_k^ML = m_k / N
  which is the fraction of the N observations for which x_k=1

Generalized Binomial Distribution
• The multinomial distribution is
    Mult(m_1,..,m_K | µ, N) = (N choose m_1 m_2 .. m_K) ∏_{k=1}^{K} µ_k^{m_k}
• The normalization coefficient is the number of ways of partitioning N objects into K groups of sizes m_1,..,m_K, given by
    (N choose m_1 m_2 .. m_K) = N! / (m_1! m_2! .. m_K!)
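The multinomial ML solution above can be sketched in a few lines (names and data are hypothetical): compute the counts m_k from 1-of-K coded observations, then divide by N.

```python
def multinomial_mle(one_hot_data):
    """ML estimates mu_k = m_k / N for 1-of-K coded data.

    one_hot_data: list of K-dimensional 0/1 lists, each summing to 1.
    """
    N = len(one_hot_data)
    K = len(one_hot_data[0])
    m = [sum(x[k] for x in one_hot_data) for k in range(K)]  # counts m_k
    return [mk / N for mk in m]

# Hypothetical K=3 data: the observed states are 1, 3, 1, 2
data = [[1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
mu_ml = multinomial_mle(data)   # counts m = (2, 1, 1), so (0.5, 0.25, 0.25)
```

With K=2 this reduces exactly to the Bernoulli estimate m/N from the binary case.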
Dirichlet Distribution    (Lejeune Dirichlet, 1805-1859)
• A family of prior distributions for the parameters µ_k of the multinomial distribution
• By inspection of the multinomial, the conjugate prior has the form
    p(µ|α) ∝ ∏_{k=1}^{K} µ_k^{α_k−1}
• The normalized form is the Dirichlet distribution
    Dir(µ|α) = [Γ(α_0) / (Γ(α_1)..Γ(α_K))] ∏_{k=1}^{K} µ_k^{α_k−1},  where α_0 = Σ_{k=1}^{K} α_k

Dirichlet over 3 Variables
• Due to the summation constraint Σ_k µ_k = 1
  – The distribution over the space of {µ_k} is confined to a simplex of dimensionality K−1
  – For K=3 this simplex is a triangle
(Figure: plots of the Dirichlet distribution over the simplex for parameter settings α_k = 0.1, α_k = 1, and α_k = 10)

Dirichlet Posterior Distribution
• Multiplying the prior by the likelihood:
    p(µ|D,α) ∝ p(D|µ) p(µ|α) ∝ ∏_{k=1}^{K} µ_k^{α_k+m_k−1}
• This has the form of a Dirichlet distribution:
    p(µ|D,α) = Dir(µ|α+m) = [Γ(α_0+N) / (Γ(α_1+m_1)..Γ(α_K+m_K))] ∏_{k=1}^{K} µ_k^{α_k+m_k−1}

Summary
• The multinomial is a generalization of the Bernoulli
  – The variable takes one of K values instead of 2
• The conjugate prior of the multinomial is the Dirichlet distribution
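As with the Beta, the Dirichlet posterior update is pure bookkeeping: add the observed counts m_k to the prior parameters α_k. A sketch with hypothetical names and data:

```python
def dirichlet_posterior(alpha, counts):
    """Posterior parameters of Dir(alpha) after multinomial counts m:
    the posterior is Dir(alpha + m), elementwise."""
    return [a + m for a, m in zip(alpha, counts)]

def dirichlet_mean(alpha):
    """Mean of Dir(alpha): E[mu_k] = alpha_k / alpha_0."""
    a0 = sum(alpha)
    return [a / a0 for a in alpha]

# Symmetric prior alpha_k = 1 over K=3 states; hypothetical counts m = (2, 1, 1)
post = dirichlet_posterior([1, 1, 1], [2, 1, 1])   # -> [3, 2, 2]
```

The K=2 case of this update is exactly the Beta-Binomial update from the binary-variable slides, with alpha = (a, b) and counts = (m, l).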