UCSD ECE 271A - Bayesian Parameter Estimation

ECE-271A Statistical Learning I: Bayesian parameter estimation
Nuno Vasconcelos, ECE Department, UCSD

Outline: Bayesian parameter estimation, Bayesian predictions, The Gaussian case, Properties, Conjugate priors, Exponential family, Predictive distribution, Priors, Example, Invariant non-informative priors, Location parameters, Scale parameters, Selecting priors, Regularization

Slide 2: Bayesian parameter estimation

basic concepts:
• explicitly define the set of training random variables, $T = \{X_1, \ldots, X_n\}$, from which the training set $D = \{x_1, \ldots, x_n\}$ is drawn
• the parameter $\Theta$ is a random variable with prior density $P_\Theta(\theta)$, which encodes the observer's beliefs about the experimental outcomes
• the likelihood function $P_{X|\Theta}(x|\theta)$ gives the distribution of the observations given the parameter value

goal:
• estimate the complete posterior distribution $P_{\Theta|T}(\theta|D)$, to obtain a complete characterization of $\Theta$

Slide 3: Bayesian predictions

e.g. the class-conditional likelihoods that we use in the BDR are based on the predictive distribution

  $P_{X|T}(x|D) = \int P_{X|\Theta}(x|\theta)\, P_{\Theta|T}(\theta|D)\, d\theta = E_{\Theta|T}\big[P_{X|\Theta}(x|\Theta) \mid D\big]$

• an average over all models (unlike the ML estimate, which commits to a single model)
• each model is weighted by how plausible it is, given the training set D

all predictions are conditioned on T:
• no information is lost
• the posterior is a compact representation of the training set
• computationally efficient when the integral has a closed form

Slide 4: The Gaussian case

last class we considered the Gaussian problem

  $P_{X|\mu}(x|\mu) = G(x, \mu, \sigma^2)$, with $\sigma^2$ known
  $P_\mu(\mu) = G(\mu, \mu_0, \sigma_0^2)$

and showed that

  $P_{\mu|T}(\mu|D) = G(\mu, \mu_n, \sigma_n^2)$

with

  $\mu_n = \dfrac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\,\hat\mu_{ML} + \dfrac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,\mu_0, \qquad \dfrac{1}{\sigma_n^2} = \dfrac{n}{\sigma^2} + \dfrac{1}{\sigma_0^2}$

this is a good example of various properties that are typical of Bayesian parameter estimates

Slide 5: Properties

• the likelihood dominates as $n \to \infty$; the prior dominates as $n \to 0$
• $\mu_n$ is a linear interpolant between the two solutions:

  $\mu_n = \alpha_n\,\hat\mu_{ML} + (1 - \alpha_n)\,\mu_0$, with $\alpha_n \to 1$ as $n \to \infty$ and $\alpha_n \to 0$ as $n \to 0$

• the variance of the posterior goes to zero as $n \to \infty$
• Bayes is strictly more precise than either ML or the prior:

  $\dfrac{1}{\sigma_n^2} = \dfrac{n}{\sigma^2} + \dfrac{1}{\sigma_0^2} \;\Leftrightarrow\; \mathrm{prec}_n^{Bayes} = \mathrm{prec}_{ML} + \mathrm{prec}_{prior}$

• intuitive balance between prior and likelihood parameters:

  $\sigma_0^2 \ll \sigma^2/n \;\Rightarrow\; \alpha_n \approx 0$: the prior dominates
  $\sigma_0^2 \gg \sigma^2/n \;\Rightarrow\; \alpha_n \approx 1$: the likelihood dominates

Slide 6: Properties

regularization:
• if $\sigma_0^2 = \sigma^2$, then Bayes is equal to ML on a virtual sample with extra points
• in this case, one additional point equal to the mean of the prior:

  $\mu_n = \dfrac{1}{n+1}\Big(\sum_{i=1}^{n} X_i + \mu_0\Big) = \hat\mu_{ML}^{(n+1)}$, with $X_{n+1} = \mu_0$

• for large n, the extra point is irrelevant
• for small n, it regularizes the Bayes estimate by
  • directing the posterior mean towards the prior mean
  • reducing the variance of the posterior
(a numerical sketch of this appears after slide 8 below)
HW: this interpretation holds for all conjugate priors

Slide 7: Conjugate priors

note that
• the prior $P_\mu(\mu) = G(\mu, \mu_0, \sigma_0^2)$ is Gaussian
• the posterior $P_{\mu|T}(\mu|D) = G(\mu, \mu_n, \sigma_n^2)$ is Gaussian

whenever this is the case (the posterior is in the same family as the prior) we say that
• $P_\mu(\mu)$ is a conjugate prior for the likelihood $P_{X|\mu}(x|\mu)$
• the posterior $P_{\mu|T}(\mu|D)$ is the reproducing density

HW: a number of likelihoods have conjugate priors

  Likelihood                  Conjugate prior
  Bernoulli                   Beta
  Poisson                     Gamma
  Exponential                 Gamma
  Normal (known $\sigma^2$)   Normal

Slide 8: Exponential family

you will also show that all of these likelihoods are members of the exponential family

  $P_{X|\Theta}(x|\theta) = f(x)\, g(\theta)\, e^{\phi(\theta)^T u(x)}$

for this family, the interpretation of Bayesian parameter estimation as "ML on a properly augmented sample" always holds (whenever the prior is the conjugate)
this is one of the reasons why the exponential family is "special" (but there are others)
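To make slides 4 through 6 concrete, here is a minimal numerical sketch (my own, not from the lecture; the helper name `posterior_params` and all constants are illustrative). It computes $\mu_n$ and $\sigma_n^2$ from the formulas on slide 4 and verifies the virtual-sample interpretation of slide 6: with $\sigma_0^2 = \sigma^2$, the posterior mean equals the ML mean of the sample augmented with one extra point $\mu_0$.

```python
import numpy as np

def posterior_params(x, sigma2, mu0, sigma02):
    """Posterior over the Gaussian mean (slide 4):
    mu_n interpolates between the ML estimate and the prior mean;
    precisions add: 1/sigma_n^2 = n/sigma^2 + 1/sigma_0^2."""
    n = len(x)
    mu_ml = np.mean(x)
    w = n * sigma02 / (n * sigma02 + sigma2)      # alpha_n, the weight on ML
    mu_n = w * mu_ml + (1.0 - w) * mu0
    sigma_n2 = 1.0 / (n / sigma2 + 1.0 / sigma02)
    return mu_n, sigma_n2

rng = np.random.default_rng(0)
sigma2, mu0 = 1.0, 0.0
x = rng.normal(2.0, np.sqrt(sigma2), size=5)      # small n: the prior matters

# virtual-sample interpretation (slide 6): with sigma_0^2 = sigma^2 the
# Bayes mean is the ML mean of the sample augmented with x_{n+1} = mu0
mu_n, sigma_n2 = posterior_params(x, sigma2, mu0, sigma02=sigma2)
mu_virtual = np.mean(np.append(x, mu0))
assert np.isclose(mu_n, mu_virtual)
print(mu_n, sigma_n2)
```

Shrinking $\sigma_0^2$ pulls $\mu_n$ towards $\mu_0$; growing it, or growing $n$, pulls $\mu_n$ towards the ML estimate, which is exactly the interpolation of slide 5.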
Slide 9: Predictive distribution

we have seen that $P_{\mu|T}(\mu|D) = G(\mu, \mu_n, \sigma_n^2)$
we can now compute the predictive distribution

  $P_{X|T}(x|D) = \int P_{X|\mu}(x|\mu)\, P_{\mu|T}(\mu|D)\, d\mu = \int \dfrac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \cdot \dfrac{1}{\sqrt{2\pi\sigma_n^2}}\, e^{-\frac{(\mu-\mu_n)^2}{2\sigma_n^2}}\, d\mu$

this integral is a convolution,

  $\int f(x - \mu)\, h(\mu)\, d\mu = (f * h)(x)$, with $f(x) = G(x, 0, \sigma^2)$ and $h(x) = G(x, \mu_n, \sigma_n^2)$

i.e. X|T is the random variable that results from adding two independent Gaussians with these parameters

Slide 10: Predictive distribution

hence X|T is Gaussian,

  $P_{X|T}(x|D) = G(x, \mu_n, \sigma^2 + \sigma_n^2)$

• the mean is that of the posterior
• the variance is increased by $\sigma^2$, to account for the uncertainty of the observations
(a numerical check of this closed form appears after slide 15 below)

note:
• we will not go over the multivariate case in class, but the expressions are straightforward generalizations
• make sure you are comfortable with them

Slide 11: Priors

a potential problem of the Bayesian framework:
• "I don't really have a strong belief about what the most likely parameter configuration is"

in these cases it is usual to adopt a non-informative prior
the most obvious choice is the uniform distribution

  $P_\Theta(\theta) = \alpha$

there are, however, problems with this choice:
• if $\theta$ is unbounded this is an improper distribution, since $\int_{-\infty}^{\infty} P_\Theta(\theta)\, d\theta = \infty \neq 1$
• the prior is not invariant to all reparametrizations

Slide 12: Example

consider $\Theta$ and a new random variable $\eta$ with $\eta = e^\Theta$
since this is a 1-to-1 transformation, it should not affect the outcome of the inference process
we check this by using the well-known fact that if $y = f(x)$ then

  $P_Y(y) = \left|\dfrac{\partial f^{-1}(y)}{\partial y}\right| P_X\!\left(f^{-1}(y)\right)$

in this case

  $P_\eta(\eta) = \left|\dfrac{\partial \log\eta}{\partial \eta}\right| P_\Theta(\log\eta) = \dfrac{1}{\eta}\, P_\Theta(\log\eta)$

Slide 13: Invariant non-informative priors

for uniform $P_\Theta(\theta) = \alpha$ this means that $P_\eta(\eta) = \alpha/\eta$, i.e. not constant (the second sketch below illustrates this numerically)
this means that:
• there is no consistency between $\Theta$ and $\eta$
• a 1-to-1 transformation changes the non-informative prior into an informative one

to avoid this problem, the non-informative prior has to be invariant
e.g. consider a location parameter:
• a parameter that simply shifts the density
• e.g. the mean of a Gaussian
a non-informative prior for a location parameter has to be invariant to shifts, i.e. to the transformation $Y = \mu + c$

Slide 14: Location parameters

in this case

  $P_Y(y) = \left|\dfrac{\partial (y - c)}{\partial y}\right| P_\mu(y - c) = P_\mu(y - c)$

and, since invariance has to hold for all c,

  $P_Y(y) = P_\mu(y)$

hence

  $P_\mu(y - c) = P_\mu(y)$

which is valid for all c if and only if $P_\mu(\mu)$ is uniform
the non-informative prior for a location parameter is $P_\mu(\mu) \propto 1$

Slide 15: Scale parameters

a scale parameter is one that controls the scale of the density
• e.g. the variance of a Gaussian distribution
it can be shown that, in this case, the non-informative prior invariant to scale transformations is

  $P_\sigma(\sigma) \propto \dfrac{1}{\sigma}$

note that, as for location, this is an improper prior
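The closed form on slide 10 is easy to sanity-check numerically: the integral on slide 9 is an expectation of the likelihood under the posterior, so averaging $G(x, \mu, \sigma^2)$ over posterior draws of $\mu$ should reproduce $G(x, \mu_n, \sigma^2 + \sigma_n^2)$. A minimal sketch with made-up posterior parameters (`gauss` is my own helper, not lecture code):

```python
import numpy as np

def gauss(x, mu, var):
    """Gaussian density G(x, mu, var)."""
    return np.exp(-(x - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

rng = np.random.default_rng(1)
sigma2, mu_n, sigma_n2 = 1.0, 1.5, 0.2      # illustrative posterior parameters

x = np.linspace(-2.0, 5.0, 9)

# closed form (slide 10): mean of the posterior, variance sigma^2 + sigma_n^2
closed = gauss(x, mu_n, sigma2 + sigma_n2)

# Monte Carlo version of the integral on slide 9: draw mu from the posterior
# and average the likelihood G(x, mu, sigma^2) over the draws
mus = rng.normal(mu_n, np.sqrt(sigma_n2), size=100_000)
mc = gauss(x[:, None], mus[None, :], sigma2).mean(axis=1)

print(np.max(np.abs(closed - mc)))          # small; shrinks with more draws
```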

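The reparametrization issue of slides 12 and 13 can also be seen numerically: pushing a uniform prior on $\Theta$ through $\eta = e^\Theta$ yields a density proportional to $1/\eta$, i.e. no longer non-informative. A small sketch (my own construction; the interval $[0, 3]$, sample size, and bin count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

# uniform (non-informative) prior on Theta over [0, 3]; transform eta = exp(Theta)
alpha = 1.0 / 3.0                                  # uniform density value on [0, 3]
theta = rng.uniform(0.0, 3.0, size=500_000)
eta = np.exp(theta)

# change of variables (slide 12): P_eta(eta) = (1/eta) * P_Theta(log eta) = alpha/eta
hist, edges = np.histogram(eta, bins=50, range=(1.0, np.exp(3.0)), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.allclose(hist, alpha / centers, rtol=0.1))   # True: alpha/eta, not flat

# by contrast, the scale prior of slide 15, P_sigma(sigma) ∝ 1/sigma, keeps its
# form under sigma -> c*sigma: P_Y(y) = (1/c) * P_sigma(y/c) = (1/c) * k*c/y = k/y
```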
