Slide 1: Parameter Learning
Graphical Models – 10-708
Carlos Guestrin, Carnegie Mellon University
September 27th, 2006
Readings: K&F: 14.1, 14.2, 14.3, 14.4, 15.1, 15.2, 15.3.1, 15.4.1

Slide 2: Your first learning algorithm
- Set the derivative of the log likelihood to zero. For the thumbtack (binomial) example, this gives the MLE: θ̂ = mH / (mH + mT).

Slide 3: Learning the CPTs
- Data: x(1), …, x(m)
- For each discrete variable Xi, estimate its CPT from counts in the data.

Slide 4: Maximum likelihood estimation (MLE) of BN parameters – general case
- Data: x(1), …, x(m)
- Notation: x(j)[PaXi] → the assignment to PaXi in sample x(j)
- Given the structure, the log likelihood of the data decomposes over samples and variables:
  log P(D | θ) = Σ_j Σ_i log P(x_i(j) | x(j)[PaXi])

Slide 5: Taking derivatives of MLE of BN parameters – general case

Slide 6: General MLE for a CPT
- Take a CPT: P(X | U)
- Log likelihood term for this CPT: Σ_{x,u} Count(x, u) · log θ_{X=x|U=u}
- Maximizing over the parameters θ_{X=x|U=u} gives normalized counts:
  θ̂_{X=x|U=u} = Count(x, u) / Count(u)

Slide 7: Announcements
- Late homeworks: 3 late days for the semester
  - One late day corresponds to 24 hours (i.e., 3 late days means due Saturday by noon)
  - Give late homeworks to Monica Hopes, Wean Hall 4619
  - If she is not in her office, time-stamp (date and time) your homework, sign it, and put it under her door
- After late days are used up:
  - Half credit within 48 hours
  - Zero credit after 48 hours
  - All homeworks must be handed in, even for zero credit
- Homework 2 out later today
- Recitation tomorrow: review perfect maps, parameter learning

Slide 8: Can we really trust MLE?
- Which estimate is better?
  - 3 heads, 2 tails
  - 30 heads, 20 tails
  - 3×10^23 heads, 2×10^23 tails
- All three give the same MLE. Many possible answers; we need distributions over possible parameters.

Slide 9: Bayesian Learning
- Use Bayes rule: P(θ | D) = P(D | θ) P(θ) / P(D)
- Or equivalently: P(θ | D) ∝ P(D | θ) P(θ)

Slide 10: Bayesian Learning for Thumbtack
- Likelihood function is simply binomial: P(D | θ) = θ^mH (1 − θ)^mT
- What about the prior?
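The MLE recipe above (each CPT parameter is a normalized count) can be sketched as follows. This is an illustrative sketch, not code from the lecture; the function and variable names are made up, and data is assumed fully observed.

```python
from collections import Counter

def mle_cpt(data, child, parents):
    """MLE for one CPT P(child | parents): normalized counts.

    data: list of dicts mapping variable name -> value (fully observed).
    Returns a dict mapping (parent_assignment, child_value) -> probability.
    Hypothetical helper for illustration only.
    """
    joint = Counter()   # counts of (u, x) pairs
    marg = Counter()    # counts of parent assignments u
    for sample in data:
        u = tuple(sample[p] for p in parents)
        joint[(u, sample[child])] += 1
        marg[u] += 1
    # theta_{X=x|U=u} = Count(x, u) / Count(u)
    return {(u, x): c / marg[u] for (u, x), c in joint.items()}

# Thumbtack example: 3 heads, 2 tails, no parents -> theta_hat = 3/5
flips = [{"X": "h"}] * 3 + [{"X": "t"}] * 2
print(mle_cpt(flips, "X", []))  # {((), 'h'): 0.6, ((), 't'): 0.4}
```

Because the log likelihood decomposes over CPTs, running this per variable optimizes each CPT separately, which is exactly why MLE for BNs is so cheap.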
- Represent expert knowledge
- Simple posterior form
- Conjugate priors: closed-form representation of the posterior (more details soon)
- For the binomial likelihood, the conjugate prior is the Beta distribution

Slide 11: Beta prior distribution – P(θ)
- Prior: P(θ) = Beta(αH, αT), i.e., P(θ) ∝ θ^(αH−1) (1 − θ)^(αT−1)
- Likelihood function: P(D | θ) = θ^mH (1 − θ)^mT
- Posterior: P(θ | D) ∝ θ^(αH+mH−1) (1 − θ)^(αT+mT−1)

Slide 12: Posterior distribution
- Prior: Beta(αH, αT)
- Data: mH heads and mT tails
- Posterior distribution: Beta(αH + mH, αT + mT)

Slide 13: Conjugate prior
- Given a likelihood function P(D | θ), a (parametric) prior of the form P(θ | α) is conjugate to the likelihood if the posterior is of the same parametric family and can be written as P(θ | α′) for some new set of parameters α′.
- Prior: Beta(αH, αT); data: mH heads and mT tails (binomial likelihood)
- Posterior distribution: Beta(αH + mH, αT + mT)

Slide 14: Using the Bayesian posterior
- Posterior distribution: P(θ | D)
- Bayesian inference: no longer a single parameter; predictions average over θ, e.g., ∫ f(θ) P(θ | D) dθ
- The integral is often hard to compute

Slide 15: Bayesian prediction of a new coin flip
- Prior: Beta(αH, αT)
- Having observed mH heads and mT tails, the probability that the (m+1)th flip is heads:
  P(x(m+1) = heads | D) = (αH + mH) / (αH + αT + mH + mT)

Slide 16: Asymptotic behavior and equivalent sample size
- The Beta prior is equivalent to extra thumbtack flips
- As m → ∞, the prior is "forgotten"
- But for small sample sizes, the prior is important!
- Equivalent sample size: prior parameterized by αH, αT, or by m′ (the equivalent sample size) and α
  - Fix m′, change α
  - Fix α, change m′

Slide 17: Bayesian learning corresponds to smoothing
- m = 0 ⇒ prediction equals the prior parameter
- m → ∞ ⇒ prediction approaches the MLE

Slide 18: Bayesian learning for a multinomial
- What if you have a k-sided coin?
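The Beta-Binomial update and the posterior predictive above amount to adding prior pseudo-counts to the observed counts. A minimal sketch, with names of my own choosing (not from the lecture):

```python
def beta_posterior(alpha_h, alpha_t, m_h, m_t):
    # Conjugacy: Beta(aH, aT) prior + binomial data -> Beta(aH + mH, aT + mT)
    return alpha_h + m_h, alpha_t + m_t

def predict_heads(alpha_h, alpha_t, m_h, m_t):
    # P(next flip = heads | D) = (aH + mH) / (aH + aT + mH + mT)
    a, b = beta_posterior(alpha_h, alpha_t, m_h, m_t)
    return a / (a + b)

# Beta(1, 1) prior, 3 heads and 2 tails: prediction is 4/7, smoothed
# away from the MLE of 3/5
print(predict_heads(1, 1, 3, 2))
# With huge counts the prior is "forgotten": prediction approaches the MLE 0.6
print(predict_heads(1, 1, 3e23, 2e23))
```

This makes the equivalent-sample-size view concrete: the prior behaves exactly like αH + αT previously seen flips, and its influence vanishes as m grows.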
- Likelihood function for the multinomial: P(D | θ) = Π_i θ_i^(m_i)
- The conjugate prior for the multinomial is the Dirichlet: P(θ) ∝ Π_i θ_i^(α_i − 1)
- Observe m data points, m_i from assignment i; posterior: Dirichlet(α_1 + m_1, …, α_k + m_k)
- Prediction: P(x(m+1) = i | D) = (α_i + m_i) / Σ_j (α_j + m_j)

Slide 19: Bayesian learning for a two-node BN
- Parameters θX, θY|X
- Priors: P(θX), P(θY|X)

Slide 20: A very important assumption on the prior: global parameter independence
- Global parameter independence: the prior over the parameters is the product of the priors over the individual CPTs:
  P(θ) = Π_i P(θ_{Xi|PaXi})

Slide 21: Global parameter independence, d-separation, and local prediction
- Example network: Flu, Allergy → Sinus → Headache, Nose
- Independencies in the meta-BN (data plus parameter nodes)
- Proposition: for fully observable data D, if the prior satisfies global parameter independence, then the posterior decomposes: P(θ | D) = Π_i P(θ_{Xi|PaXi} | D)

Slide 22: Within a CPT
- Meta-BN including the CPT parameters
- Are θ_{Y|X=t} and θ_{Y|X=f} d-separated given D?
- Are θ_{Y|X=t} and θ_{Y|X=f} independent given D? Yes: context-specific independence!
- The posterior decomposes: P(θ_{Y|X} | D) = P(θ_{Y|X=t} | D) · P(θ_{Y|X=f} | D)

Slide 23: Priors for BN CPTs (more when we talk about structure learning)
- Consider each CPT: P(X | U = u)
- Conjugate prior: Dirichlet(α_{X=1|U=u}, …, α_{X=k|U=u})
- More intuitive view: a "prior data set" D′ with equivalent sample size m′
  - "Prior counts": the α's play the role of counts observed in D′
  - Prediction: P(X = x | U = u, D) = (α_{X=x|U=u} + Count(x, u)) / (Σ_{x′} α_{X=x′|U=u} + Count(u))

Slide 24: An example

Slide 25: What you need to know about parameter learning
- MLE:
  - The score decomposes according to the CPTs
  - Optimize each CPT separately
- Bayesian parameter learning:
  - Motivation for the Bayesian approach
  - Bayesian prediction
  - Conjugate priors, equivalent sample size
  - Bayesian learning ⇒ smoothing
- Bayesian learning for BN parameters:
  - Global parameter independence
  - Decomposition of the prediction according to CPTs
  - Decomposition within a CPT

Slide 26: Where are we with learning BNs?
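The Dirichlet-multinomial prediction above generalizes the Beta case to k outcomes: add prior counts to data counts and normalize. A small sketch, with illustrative names not taken from the lecture:

```python
def dirichlet_predict(alphas, counts):
    """Posterior predictive for a k-sided coin under a Dirichlet prior.

    posterior: Dirichlet(alpha_1 + m_1, ..., alpha_k + m_k)
    P(next = i | D) = (alpha_i + m_i) / sum_j (alpha_j + m_j)
    Hypothetical helper for illustration only.
    """
    post = [a + m for a, m in zip(alphas, counts)]
    total = sum(post)
    return [p / total for p in post]

# Uniform Dirichlet(1, 1, 1) prior (equivalent sample size m' = 3)
# with observed counts (5, 3, 1) gives predictions (1/2, 1/3, 1/6)
print(dirichlet_predict([1, 1, 1], [5, 3, 1]))
```

The same update, applied per parent assignment u with the Dirichlet(α_{X=1|U=u}, …, α_{X=k|U=u}) prior, gives the CPT prediction formula on slide 23.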
- Given the structure, estimate the parameters:
  - Maximum likelihood estimation
  - Bayesian learning
- What about learning the structure?

Slide 27: Learning the structure of a BN
- Data: <x1(1), …, xn(1)>, …, <x1(m), …, xn(m)>; learn structure and parameters (e.g., the Flu/Allergy/Sinus/Headache/Nose network)
- Constraint-based approach:
  - A BN encodes conditional independencies
  - Test conditional independencies in the data
  - Find an I-map
- Score-based approach:
  - Finding a structure and parameters is a density-estimation task
  - Evaluate the model as we evaluated parameters: maximum likelihood, Bayesian, etc.

Slide 28: Remember: obtaining a P-map?
- Given the independence assertions that are true for P:
  - Obtain the skeleton
  - Obtain the immoralities
  - From the skeleton and immoralities, obtain every (and any) BN structure from the equivalence class
- Constraint-based approach: use the Learn-PDAG algorithm
- Key question: the independence test

Slide 29: Independence tests
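The score-based approach above relies on the fact that a structure's maximum-likelihood score decomposes into one term per CPT, each evaluated at its MLE. A minimal sketch under that view; the function names and the toy data are mine, not the lecture's:

```python
import math
from collections import Counter

def loglik_score(data, structure):
    """Maximum-likelihood score of a BN structure on fully observed data.

    structure: dict mapping each variable to a tuple of its parents.
    Each CPT contributes sum over observed (u, x) of Count(x, u) *
    log(Count(x, u) / Count(u)), its log likelihood at the MLE.
    Illustrative sketch only.
    """
    score = 0.0
    for child, parents in structure.items():
        joint, marg = Counter(), Counter()
        for s in data:
            u = tuple(s[p] for p in parents)
            joint[(u, s[child])] += 1
            marg[u] += 1
        score += sum(c * math.log(c / marg[u]) for (u, _), c in joint.items())
    return score

data = [{"X": 0, "Y": 0}, {"X": 0, "Y": 0}, {"X": 1, "Y": 1}, {"X": 1, "Y": 0}]
# Here Y depends on X, so the structure with the edge X -> Y scores higher
print(loglik_score(data, {"X": (), "Y": ("X",)}))
print(loglik_score(data, {"X": (), "Y": ()}))
```

Note that the raw likelihood score always favors denser graphs, which is why structure learning (next lectures) needs either independence tests or a penalized/Bayesian score.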