CMU CS 10708 - Lecture


Probabilistic Graphical Models, 10-708
Statistical Learning with Basic Graphical Models
Eric Xing
Lecture 9, Oct 10, 2005
Reading: MJ, Chap. 5, 6

Learning Graphical Models

The goal: given a set of independent samples (assignments of the random variables), find the best (e.g., the most likely) Bayesian network, i.e., both the DAG and the CPDs. For example, from cases such as

    (B,E,A,C,R) = (T,F,F,T,F)
    (B,E,A,C,R) = (T,F,T,T,F)
    ........
    (B,E,A,C,R) = (F,T,T,T,F)

we want to recover the network structure and its conditional probability tables. (Figure: the five-node Earthquake/Radio/Burglary/Alarm/Call network, with a CPT for P(A | E, B) whose rows are (0.9, 0.1), (0.2, 0.8), (0.01, 0.99), (0.9, 0.1) over the four parent configurations.) The problem splits into structural learning (the DAG) and parameter learning (the CPDs).

Parameter Learning

- Assume the graph G is known and fixed,
  - from expert design, or
  - from an intermediate outcome of iterative structure learning.
- Goal: estimate the parameters from a dataset of N independent, identically distributed (iid) training cases D = {x_1, ..., x_N}.
- In general, each training case x_n = (x_{n,1}, ..., x_{n,M}) is a vector of M values, one per node;
  - the model can be completely observable, i.e., every element of x_n is known (no missing values, no hidden variables),
  - or partially observable, i.e., there exists an i such that x_{n,i} is not observed.
- Frequentist vs. Bayesian estimates.
- Initially we consider learning the parameters of a single node; then we consider how to learn the parameters of larger GMs.

Bayesian Parameter Estimation

- Bayesians treat the unknown parameters as a random variable whose distribution can be inferred using Bayes' rule:

    p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta}

- This crucial equation can be written in words:

    posterior = (likelihood x prior) / marginal likelihood

- For iid data, the likelihood is

    p(D \mid \theta) = \prod_{n=1}^{N} p(x_n \mid \theta)

- The prior p(\theta) encodes our prior knowledge about the domain;
  - therefore Bayesian estimation has been criticized for being "subjective";
  - empirical Bayes: fit the prior from the data.

Frequentist Parameter Estimation

- Two people with different priors p(\theta) will end up with different estimates p(\theta \mid D). Frequentists dislike this "subjectivity".
- Frequentists think of the parameter as a fixed, unknown constant, not a random variable.
- Hence they have to come up with different "objective" estimators (ways of computing \hat{\theta} from data), instead of using Bayes' rule.
- These estimators have different properties, such as being "unbiased", "minimum variance", etc.
- A very popular estimator is the maximum likelihood estimator, which is simple and has good statistical properties.

Maximum Likelihood Estimation

- The log-likelihood is monotonically related to the likelihood:

    \ell(\theta; D) = \log p(D \mid \theta) = \sum_{n=1}^{N} \log p(x_n \mid \theta)

- The idea underlying maximum likelihood estimation (MLE): pick the setting of parameters most likely to have generated the data we saw:

    \hat{\theta}_{ML} = \arg\max_{\theta} \ell(\theta; D)

- Problems with the MLE:
  - Often the MLE overfits the training data, so it is common to maximize a regularized log-likelihood instead:

      \hat{\theta}' = \arg\max_{\theta} \left[ \ell(\theta; D) - c(\theta) \right]

  - Sometimes the MLE underfits the training data (e.g., certain possible values are never observed because of data sparsity), so it is common to smooth the estimated parameters.

Being a Pragmatic Frequentist

- Maximum a posteriori (MAP) estimation:

    \hat{\theta}_{MAP} = \arg\max_{\theta} p(\theta \mid D) = \arg\max_{\theta} \left[ \ell(\theta; D) + \log p(\theta) \right]

- Smoothing with pseudo-counts:
  - Recall that for the binomial distribution we have

      \hat{\theta}^{ML}_{head} = \frac{n_{head}}{n_{head} + n_{tail}}

  - What if we tossed too few times and saw zero heads? Then \hat{\theta}^{ML}_{head} = 0, and we would predict that the probability of seeing a head next is zero!
  - The rescue:

      \hat{\theta}_{head} = \frac{n_{head} + n'}{n_{head} + n_{tail} + n'}

    where n' is known as the pseudo- (imaginary) count.
- But are we still objective?
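To make the pseudo-count rescue concrete, here is a minimal sketch in Python (not from the slides; the toss counts and the default pseudo-count of 1 are illustrative choices):

```python
# Minimal sketch: MLE vs. pseudo-count smoothing for a coin-flip parameter.
# The counts and the default pseudo-count below are illustrative, not from the lecture.

def mle_head_prob(n_head, n_tail):
    """Maximum likelihood estimate: n_head / (n_head + n_tail)."""
    return n_head / (n_head + n_tail)

def smoothed_head_prob(n_head, n_tail, n_prime=1.0):
    """Smoothed estimate mirroring (n_head + n') / (n_head + n_tail + n')."""
    return (n_head + n_prime) / (n_head + n_tail + n_prime)

if __name__ == "__main__":
    # Suppose we tossed 3 times and saw zero heads.
    n_head, n_tail = 0, 3
    print(mle_head_prob(n_head, n_tail))       # 0.0  -> heads judged impossible
    print(smoothed_head_prob(n_head, n_tail))  # 0.25 -> the pseudo-count rescues the estimate
```

The smoothed estimate never assigns zero probability to an unseen outcome, which is exactly the failure mode of the bare MLE on sparse data.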
How Should Estimators Be Used?

- \hat{\theta}_{MAP} is not Bayesian (even though it uses a prior), since it is a point estimate.
- Consider predicting the future. A sensible way is to combine the predictions made under all possible values of \theta, weighted by their posterior probability; this is what a Bayesian does:

    p(x_{new} \mid \mathbf{x}) = \int p(x_{new}, \theta \mid \mathbf{x})\, d\theta
                               = \int p(x_{new} \mid \theta, \mathbf{x})\, p(\theta \mid \mathbf{x})\, d\theta
                               = \int p(x_{new} \mid \theta)\, p(\theta \mid \mathbf{x})\, d\theta

- A frequentist will typically use a "plug-in" estimator such as the ML or MAP estimate:

    p(x_{new} \mid \mathbf{x}) = p(x_{new} \mid \hat{\theta}_{ML}), \quad \text{or} \quad p(x_{new} \mid \mathbf{x}) = p(x_{new} \mid \hat{\theta}_{MAP})

- The Bayesian estimate collapses to the MAP estimate when the posterior is concentrated.

Frequentist vs. Bayesian

- This is a "theological" war.
- Advantages of the Bayesian approach:
  - Mathematically elegant.
  - Works well when the amount of data is much smaller than the number of parameters (e.g., one-shot learning).
  - Easy to do incremental (sequential) learning.
  - Can be used for model selection (maximum likelihood will always pick the most complex model).
- Advantages of the frequentist approach:
  - Mathematically/computationally simpler.
  - "Objective", unbiased, invariant to reparameterization.
- As |D| \to \infty, the two approaches become the same:

    p(\theta \mid D) \to \delta(\theta, \hat{\theta}_{ML})

MLE for General BNs

- If we assume the parameters of each CPD are globally independent, and all nodes are fully observed, then the log-likelihood function decomposes into a sum of local terms, one per node (see the counting sketch at the end of these notes):

    \ell(\theta; D) = \log p(D \mid \theta)
                    = \log \prod_{n} \prod_{i} p(x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i)
                    = \sum_{i} \left( \sum_{n} \log p(x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i) \right)

- Example: the Earthquake/Radio/Burglary/Alarm/Call network. Factorization:

    p(\mathbf{x}) = \prod_{i=1}^{M} p(x_i \mid \mathbf{x}_{\pi_i})

  Local distributions defined by, e.g., multinomial parameters:

    p(x_i^k \mid \mathbf{x}_{\pi_i}^j) = \theta_{x_i^k \mid \mathbf{x}_{\pi_i}^j}

How to Define the Parameter Prior?

Assumptions (Geiger & Heckerman 97, 99) on p(\theta \mid G):
- Complete model equivalence
- Global parameter independence
- Local parameter independence
- Likelihood and prior modularity

Global parameter independence: for every DAG model,

    p(\theta \mid G) = \prod_{i=1}^{M} p(\theta_i \mid G)

Local parameter independence: for every node,

    p(\theta_i \mid G) = \prod_{j=1}^{q_i} p(\theta_{x_i \mid \pi_i^j} \mid G)

For example, in the alarm network, \theta_{Call \mid Alarm=YES} is independent of \theta_{Call \mid Alarm=NO}.

Parameter Independence, Graphical View

Provided all variables are observed in all cases, we can perform the Bayesian update of each parameter independently. (Figure: a plate over the M samples of a two-node network X1 -> X2, with explicit parameter nodes \theta_1 and \theta_{2|1}; global and local parameter independence keep these parameter nodes disconnected from one another.)

Which PDFs Satisfy Our Assumptions? (Geiger & Heckerman 97, 99)

- Discrete DAG models: x_i \mid \mathbf{x}_{\pi_i}^j \sim \text{Multinomial}(\theta), with a Dirichlet prior

    P(\theta) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_k \theta_k^{\alpha_k - 1} = C(\alpha) \prod_k \theta_k^{\alpha_k - 1}

- Gaussian DAG models: x_i \mid \mathbf{x}_{\pi_i}^j \sim \text{Normal}(\mu, \Sigma), with a Normal prior

    p(\mu \mid \nu, \Psi) = (2\pi)^{-n/2}\, |\Psi|^{-1/2} \exp\left\{ -\tfrac{1}{2} (\mu - \nu)' \Psi^{-1} (\mu - \nu) \right\}

  or a Normal-Wishart prior

    p(\mu \mid \nu, \alpha_\mu, W) = \text{Normal}\left(\nu, (\alpha_\mu W)^{-1}\right),
    p(W \mid \alpha_w, T) = c(n, \alpha_w)\, |T|^{\alpha_w/2}\, |W|^{(\alpha_w - n - 1)/2} \exp\left\{ -\tfrac{1}{2} \text{tr}(T W) \right\},
    where W = \Sigma^{-1}.

An (Incomplete) Genealogy of Graphical Models

(Figure: a family tree of graphical model variants.) The structures of most GMs (e.g., all of those listed in the figure) are not learned from data, but designed by humans. But such designs are useful and ...
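As promised above, here is a minimal Python sketch of the counting estimator that the log-likelihood decomposition licenses for fully observed discrete BNs. It is not from the slides; the function name fit_cpt, the tiny Burglary/Earthquake/Alarm data, and the Dirichlet pseudo-count alpha are illustrative choices.

```python
# Minimal sketch: MLE of CPT parameters for a fully observed discrete Bayesian
# network. Because the log-likelihood decomposes into one local term per node,
# each CPT is estimated from counts of (parent-configuration, child-value)
# pairs, independently of every other CPT. Illustrative example, not the
# lecture's code.

from collections import Counter, defaultdict
from itertools import product

def fit_cpt(data, child, parents, values, alpha=0.0):
    """Estimate p(child | parents) by normalized counts.

    data   : list of dicts mapping variable name -> observed value
    values : dict mapping variable name -> list of possible values
    alpha  : pseudo-count added to every cell (0.0 gives the plain MLE)
    """
    counts = defaultdict(Counter)
    for case in data:
        pa = tuple(case[p] for p in parents)
        counts[pa][case[child]] += 1

    cpt = {}
    for pa in product(*(values[p] for p in parents)):
        total = sum(counts[pa].values()) + alpha * len(values[child])
        cpt[pa] = {
            v: (counts[pa][v] + alpha) / total if total > 0 else 1.0 / len(values[child])
            for v in values[child]
        }
    return cpt

if __name__ == "__main__":
    # Fully observed cases over Burglary (B), Earthquake (E), Alarm (A).
    data = [
        {"B": "T", "E": "F", "A": "T"},
        {"B": "F", "E": "F", "A": "F"},
        {"B": "T", "E": "T", "A": "T"},
        {"B": "F", "E": "F", "A": "F"},
    ]
    values = {"B": ["T", "F"], "E": ["T", "F"], "A": ["T", "F"]}
    # Each CPT is fit from its own local counts; alpha > 0 smooths unseen cells.
    print(fit_cpt(data, child="A", parents=["B", "E"], values=values, alpha=1.0))
```

With alpha set to 0 this is exactly the count-and-normalize MLE; a positive alpha plays the same role as the pseudo-counts discussed earlier and corresponds to a Dirichlet prior on each multinomial CPT column.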

