UCSD ECE 271A - Maximum Likelihood Estimation

Maximum likelihood estimation
Nuno Vasconcelos, UCSD

Maximum likelihood
• parameter estimation in three steps:
  – 1) choose a parametric model for the probabilities; to make this clear, we denote the vector of parameters by Θ and write

        P_X(x; \Theta)

    note that this means that Θ is NOT a random variable
  – 2) assemble a set D^{(i)} = \{x_1^{(i)}, \ldots, x_n^{(i)}\} of examples drawn independently
  – 3) select the parameters that maximize the probability of the data:

        \Theta^* = \arg\max_\Theta P_X(D; \Theta) = \arg\max_\Theta \log P_X(D; \Theta)

• P_X(D; \Theta) is the likelihood of the parameter Θ with respect to the data

Maximum likelihood
• in summary, given a sample, we need to solve

        \Theta^* = \arg\max_\Theta P_X(D; \Theta)

• the solutions are the parameters such that

        \nabla_\Theta P_X(x; \Theta) = 0
        \theta^t \, \nabla_\Theta^2 P_X(x; \Theta) \, \theta \le 0, \quad \forall \theta \in \mathbb{R}^n

• note that you always have to check the second-order condition!

Maximum likelihood
• let's consider the Gaussian example
• given a sample \{T_1, \ldots, T_N\} of independent points
• the log-likelihood is

        \log P_X(D; \mu, \sigma^2) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(T_i - \mu)^2

• the ML estimates of the mean and variance are

        \hat\mu = \frac{1}{N}\sum_{i=1}^{N} T_i, \qquad \hat\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(T_i - \hat\mu)^2

Estimators
• when we talk about estimators, it is important to keep in mind that
  – an estimate is a number
  – an estimator is a random variable

        \hat\theta = f(X_1, \ldots, X_n)

• an estimate is the value of the estimator for a given sample
• if D = \{x_1, \ldots, x_n\}, when we say

        \hat\mu = \frac{1}{n}\sum_j x_j

  what we mean is the value of f(X_1, \ldots, X_n) evaluated at X_1 = x_1, \ldots, X_n = x_n, where

        f(X_1, \ldots, X_n) = \frac{1}{n}\sum_j X_j

  and the X_i are random variables
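As a concrete companion to the three steps above, here is a minimal Python sketch (not part of the original slides; the seed, sample values, and function names such as gaussian_ml_estimates are illustrative) that fits a 1-D Gaussian by maximum likelihood using the closed-form estimates:

```python
import numpy as np

def gaussian_log_likelihood(x, mu, var):
    """log P_X(D; mu, sigma^2) for an i.i.d. Gaussian sample D = {x_1, ..., x_N}."""
    n = len(x)
    return -0.5 * n * np.log(2 * np.pi * var) - ((x - mu) ** 2).sum() / (2 * var)

def gaussian_ml_estimates(x):
    """Closed-form maximizers of the log-likelihood above (step 3)."""
    mu_hat = x.mean()                     # (1/N) sum_i x_i
    var_hat = ((x - mu_hat) ** 2).mean()  # (1/N) sum_i (x_i - mu_hat)^2
    return mu_hat, var_hat

# step 2: an illustrative sample drawn from a "true" model N(10, 2^2)
rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=2.0, size=1000)

mu_hat, var_hat = gaussian_ml_estimates(x)
print(mu_hat, var_hat)  # close to 10 and 4

# the estimates do attain the maximum: perturbing mu lowers the log-likelihood
assert gaussian_log_likelihood(x, mu_hat, var_hat) >= \
       gaussian_log_likelihood(x, mu_hat + 0.1, var_hat)
```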
Bias and variance
• we know how to produce estimators (by ML)
• how do we evaluate an estimator?
• Q1: is the expected value equal to the true value?
• this is measured by the bias: if \hat\theta = f(X_1, \ldots, X_n), then

        \text{Bias} = E_{X_1, \ldots, X_n}\big[f(X_1, \ldots, X_n)\big] - \theta

  – an estimator that has bias will usually not converge to the perfect estimate θ, no matter how large the sample is
  – e.g. if θ is negative and the estimator is

        f(X_1, \ldots, X_n) = \frac{1}{n}\sum_j X_j^2

    the bias is clearly non-zero

Bias and variance
• such an estimator is said to be biased
  – this means that it is not expressive enough to approximate the true value arbitrarily well
  – this will be clearer when we talk about density estimation
• Q2: assuming that the estimator converges to the true value, how many sample points do we need?
  – this can be measured by the variance

        \text{Var}(\hat\theta) = E_{X_1, \ldots, X_n}\Big[\big(f(X_1, \ldots, X_n) - E_{X_1, \ldots, X_n}[f(X_1, \ldots, X_n)]\big)^2\Big]

  – the variance usually decreases as one collects more training examples

Example
• ML estimator for the mean of a Gaussian N(µ, σ²):

        \text{Bias} = E[\hat\mu] - \mu
                    = E\Big[\frac{1}{n}\sum_i X_i\Big] - \mu
                    = \frac{1}{n}\sum_i E[X_i] - \mu
                    = \frac{1}{n}\sum_i \mu - \mu = 0

• the estimator is unbiased

Example
• ML estimator for the mean of a Gaussian N(µ, σ²):

        \text{Var}(\hat\mu) = E\big[(\hat\mu - E[\hat\mu])^2\big]
                            = E\Big[\Big(\frac{1}{n}\sum_i X_i - \mu\Big)^2\Big]
                            = \frac{1}{n^2}\, E\Big[\Big(\sum_i (X_i - \mu)\Big)^2\Big]
                            = \frac{1}{n^2}\sum_{i,j} E\big[(X_i - \mu)(X_j - \mu)\big]
                            = \frac{1}{n^2}\sum_{i,j} \sigma_{ij}

• and since the X_i, X_j are independent, \sigma_{ij} = 0, \forall i \ne j, so

        \text{Var}(\hat\mu) = \frac{1}{n^2}\sum_i \sigma^2 = \frac{\sigma^2}{n}

• the variance goes to zero as n increases!

Example
• in summary, for the ML estimator of the mean of a Gaussian N(µ, σ²):

        E[\hat\mu] = \mu, \qquad \text{Var}(\hat\mu) = \frac{\sigma^2}{n}

• this means that if I have a large sample, the value of the estimate will be close to the true value with high probability

Example
• is this always true?
• ML estimator for the variance of a Gaussian N(µ, σ²):

        \hat\sigma^2 = \frac{1}{n}\sum_i (X_i - \hat\mu)^2
                     = \frac{1}{n}\sum_i X_i^2 - 2\hat\mu\,\frac{1}{n}\sum_i X_i + \hat\mu^2
                     = \frac{1}{n}\sum_i X_i^2 - \hat\mu^2

• the expected value is

        E[\hat\sigma^2] = \frac{1}{n}\sum_i E[X_i^2] - E[\hat\mu^2]

Example
• using

        E[\hat\mu^2] = E\Big[\frac{1}{n^2}\sum_{i,j} X_i X_j\Big]
                     = \frac{1}{n^2}\sum_{i,j} E[X_i X_j]
                     = \frac{1}{n^2}\Big(\sum_i E[X_i^2] + \sum_{i \ne j} E[X_i]\,E[X_j]\Big)
                     = \frac{1}{n^2}\big(n\,E[X^2] + n(n-1)\,\mu^2\big)
                     = \frac{1}{n}E[X^2] + \frac{n-1}{n}\,\mu^2

Example
• we get

        E[\hat\sigma^2] = E[X^2] - \frac{1}{n}E[X^2] - \frac{n-1}{n}\,\mu^2
                        = \frac{n-1}{n}\big(E[X^2] - \mu^2\big)
                        = \Big(1 - \frac{1}{n}\Big)\sigma^2

Example
• in summary,

        E[\hat\sigma^2] = \Big(1 - \frac{1}{n}\Big)\sigma^2

• the estimator is biased
• Q: do we care?
  – clearly

        \lim_{n \to \infty} E[\hat\sigma^2] = \sigma^2

  – so, for large samples it is (for all practical purposes) unbiased
  – what about small samples? the variance is likely to be large to start with, so a little bit of bias is not going to make much difference
  – so, in practice, it is fine
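The (1 - 1/n)σ² factor can be checked numerically. The following Python sketch (an illustration added here, not from the slides; the seed, parameter values, and trial count are arbitrary) averages the ML variance estimator over many independent samples:

```python
import numpy as np

# Monte Carlo check of E[sigma_hat^2] = (1 - 1/n) * sigma^2 for the
# ML variance estimator of a Gaussian N(mu, sigma^2).
rng = np.random.default_rng(0)
mu, sigma2 = 10.0, 4.0
n_trials = 100_000  # independent samples per sample size

for n in (2, 5, 20, 100):
    x = rng.normal(mu, np.sqrt(sigma2), size=(n_trials, n))
    # ddof=0 (the default) gives the biased ML estimator (1/n) sum (x_i - mu_hat)^2
    var_ml = x.var(axis=1)
    print(f"n={n:3d}  E[var_ml] ~ {var_ml.mean():.3f}  (1-1/n)*sigma2 = {(1 - 1/n) * sigma2:.3f}")
```

For n = 2 the empirical mean lands near 2.0 rather than the true 4.0, and the gap shrinks as n grows, matching the limit on the slide above.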
Important note
• since the estimator is a random variable,
  – we can never say that an estimate obtained with more samples is "better" than an estimate from fewer samples
  – e.g., if we measure

        \hat\mu_1 = \frac{1}{100}\sum_{i=1}^{100} X_i, \qquad \hat\mu_2 = \frac{1}{10{,}000}\sum_{i=1}^{10{,}000} X_i

    and obtain \hat\mu_1 = 10.5 and \hat\mu_2 = 10.3, is 10.3 a better estimate of µ than 10.5?
  – we can never know; all we know is that

        \hat\mu_1 \sim N\Big(\mu, \frac{\sigma^2}{100}\Big), \qquad \hat\mu_2 \sim N\Big(\mu, \frac{\sigma^2}{10{,}000}\Big)

Important note
  – and we can use this to compute

        P\big(|\hat\mu_2 - \mu| < |\hat\mu_1 - \mu|\big)

  – but there is always a probability that the estimate produced by \hat\mu_1 is better than that produced by \hat\mu_2, even though \hat\mu_2 has much smaller variance
  – all that we can hope for is to make the estimator better in a probabilistic sense
  – this means making P_{\hat\Theta}(\theta) as concentrated as possible
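That comparison probability is easy to approximate by simulation. The following Python sketch (again an added illustration, with arbitrary values for µ, σ, and the trial count) draws both estimators from the sampling distributions given above:

```python
import numpy as np

# Monte Carlo estimate of P(|mu_hat_2 - mu| < |mu_hat_1 - mu|): how often
# the 10,000-sample estimate is closer to mu than the 100-sample estimate.
rng = np.random.default_rng(0)
mu, sigma = 10.0, 2.0  # illustrative true parameters
n_trials = 100_000

# sampling distributions from the slide:
# mu_hat_1 ~ N(mu, sigma^2/100), mu_hat_2 ~ N(mu, sigma^2/10,000)
mu_hat_1 = rng.normal(mu, sigma / np.sqrt(100), size=n_trials)
mu_hat_2 = rng.normal(mu, sigma / np.sqrt(10_000), size=n_trials)

p = np.mean(np.abs(mu_hat_2 - mu) < np.abs(mu_hat_1 - mu))
print(p)  # about 0.94: mu_hat_1 still wins a non-trivial fraction of the time
```

The probability stays well below 1 even though Var(\hat\mu_2) is 100 times smaller, which is exactly the point of the slide: more samples improve the estimator only in a probabilistic sense.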

