Expectation Maximization and Learning from Partly Unobserved Data
Machine Learning 10-601
March 3, 2008
Tom M. Mitchell
Machine Learning Department, Carnegie Mellon University

1. Learning Bayes net parameters from partly unobserved data, and reviewing Bayes nets concepts for the midterm!

Bayes Net with 5 Boolean Variables (Flu, Allergy, Sinus, Headache, Nose)
• How many parameters does this network have in its CPTs?
• How would you learn them from fully observed data?
• What if you could only observe F, A, H, and N?

Learning CPTs
• The MLE (Maximum Likelihood Estimate) is, e.g.,
    θ_{S=1|F=i,A=j} = Σ_k δ(s^(k)=1, f^(k)=i, a^(k)=j) / Σ_k δ(f^(k)=i, a^(k)=j)
  where k indexes the k-th training example.
• Partly unobserved data → use EM!

Estimating θ from partly observed data
• What if F, A, H, N are observed, but not S? Then we can't calculate the MLE.
• Let X be all observed variable values (over all examples).
• Let Z be all unobserved variable values.
• Can't calculate the MLE:
    θ* = argmax_θ log P(X, Z | θ)
• EM seeks* the estimate:
    θ* = argmax_θ E_{Z|X,θ}[log P(X, Z | θ)]
  (* EM is guaranteed to find only a local maximum.)

EM Algorithm
EM is a general procedure for solving such problems.
Given observed variables X and unobserved Z (here X = {F, A, H, N}, Z = {S}), define
    Q(θ′ | θ) = E_{Z|X,θ}[log P(X, Z | θ′)]
Iterate until convergence:
• E Step: Use X and the current θ to estimate P(Z | X, θ).
• M Step: Replace the current θ by
    θ ← argmax_{θ′} Q(θ′ | θ)
Guaranteed to find a local maximum; each iteration increases E_{Z|X,θ}[log P(X, Z | θ)].

E Step: Use X and θ to calculate P(Z | X, θ)
• How? This is a Bayes net inference problem.

EM and estimating θ_{S|ij}
More generally, given an observed set X and an unobserved set Z of boolean values:
• E step: For each training example k, calculate the expected value of each unobserved variable.
• M step: Calculate estimates similar to the MLE, but replacing each count by its expected count.

2. Unsupervised clustering: K-means and Mixtures of Gaussians

Clustering
• Given a set of data points, group them.
• Unsupervised learning.
• Which patients are similar?
(Or which earthquakes, customers, faces, web pages, …?)

K-means Clustering
Given data <x1 … xn> and K, assign each xi to one of K clusters, C1 … CK, minimizing
    J = Σ_j Σ_{i ∈ Cj} ||x_i − μ_j||²
where μ_j is the mean over all points in cluster Cj.

K-Means Algorithm:
Initialize the means μ_1 … μ_K randomly.
Repeat until convergence:
1. Assign each point xi to the cluster with the closest mean μj.
2. Calculate the new mean for each cluster.

Mixtures of Gaussians
K-means is EM-ish, but it makes 'hard' assignments of xi to clusters. Let's derive a real EM algorithm for clustering.
What objective function shall we optimize?
• Maximize the data likelihood!
What form of P(X) should we assume?
• A mixture of Gaussians.

Mixture of Gaussians:
• Assume P(x) is a mixture of K different Gaussians.
• Then each data point x is generated by a 2-step process:
  1. z ← choose one of the K Gaussians, according to π_1 … π_K
  2. Generate x according to the Gaussian N(μ_z, Σ_z)

EM for Mixture of Gaussians
Simplify to make this easier:
1. Assume X = <X1 … Xn>, and that the Xi are conditionally independent given Z.
2. Assume only 2 mixture components, and
3. Assume σ is known; π_1 … π_K and the μ_1i … μ_Ki are unknown.
Observed: X = <X1 … Xn>.  Unobserved: Z.

EM
Given observed variables X and unobserved Z, define
    Q(θ′ | θ) = E_{Z|X,θ}[log P(X, Z | θ′)]
where θ is the set of unknown parameters (here π and the component means).
Iterate until convergence:
• E Step: Calculate P(Z(n) | X(n), θ) for each example X(n).
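The E step just stated computes the posterior P(Z(n) | X(n), θ). Under the slides' simplifying assumptions (two components, conditionally independent features, known shared σ), it reduces to Bayes rule over the two weighted component likelihoods. Here is a minimal sketch; the function name and the log-space normalization are illustrative choices, not from the lecture:

```python
import numpy as np

def e_step(X, pi, mu0, mu1, sigma):
    """Posterior P(z(n)=1 | x(n), theta) for each row of X, for a
    2-component mixture of spherical Gaussians with known, shared sigma.
    The per-dimension terms factor because the features are assumed
    conditionally independent given z."""
    def log_lik(mu_j):
        # log N(x | mu_j, sigma^2 I), up to a constant shared by both components
        return -0.5 * np.sum((X - mu_j) ** 2, axis=1) / sigma**2
    log_p1 = np.log(pi) + log_lik(mu1)        # log [ pi * P(x | z=1) ]
    log_p0 = np.log(1.0 - pi) + log_lik(mu0)  # log [ (1-pi) * P(x | z=0) ]
    # normalize in log space for numerical stability
    m = np.maximum(log_p0, log_p1)
    p1 = np.exp(log_p1 - m)
    p0 = np.exp(log_p0 - m)
    return p1 / (p0 + p1)
```

Working in log space avoids underflow when an example is far from both means; the result is exactly the Bayes-rule ratio π P(x|z=1) / [π P(x|z=1) + (1−π) P(x|z=0)].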
Use this to construct Q(θ′ | θ).
• M Step: Replace the current θ by
    θ ← argmax_{θ′} Q(θ′ | θ)

EM – E Step
Calculate P(Z(n) | X(n), θ) for each observed example X(n) = <x_1(n), x_2(n), … x_T(n)>:
    P(z(n)=1 | x(n), θ) = π P(x(n) | z=1) / [ π P(x(n) | z=1) + (1−π) P(x(n) | z=0) ]

EM – M Step
First consider the update for π (the other parameters in θ′ have no influence on this term). Count the examples with z(n)=1 by their expected count:
    π′ = (1/N) Σ_n E[z(n)]

Now consider the update for μ_ji (π′ has no influence on this term):
    μ′_ji = Σ_n E[δ(z(n)=j)] x_i(n) / Σ_n E[δ(z(n)=j)]
Compare the above to the MLE if Z were observable:
    μ_ji = Σ_n δ(z(n)=j) x_i(n) / Σ_n δ(z(n)=j)

EM – putting it together
Given observed variables X and unobserved Z, define Q(θ′ | θ) = E_{Z|X,θ}[log P(X, Z | θ′)].
Iterate until convergence:
• E Step: For each observed example X(n), calculate P(Z(n) | X(n), θ).
• M Step: Update π and the μ_ji using the expected counts above.

Mixture of Gaussians applet
• Run the applet at http://www.neurosci.aist.go.jp/%7Eakaho/MixtureEM.html

K-Means vs. Mixture of Gaussians
• Both are iterative algorithms that assign points to clusters.
• Objective function:
  – K-means: minimize Σ_j Σ_{i ∈ Cj} ||x_i − μ_j||²
  – Mixture of Gaussians: maximize P(X | θ)
• Mixture of Gaussians is the more general formulation:
  – It is equivalent to K-means when Σ_k = σ I and σ → 0.

Using Unlabeled Data to Help Train a Naïve Bayes Classifier
Learn P(Y | X) from labeled and unlabeled examples:

    Y  X1 X2 X3 X4
    1  0  0  1  1
    0  0  1  0  0
    0  0  0  1  0
    ?  0  1  1  0
    ?  0  1  0  1

From [Nigam et al., 2000]:
• E Step: Estimate the class probabilities P(Y | X) of each unlabeled document.
• M Step: Re-estimate the parameters, replacing each count by its expected count; here w_t is the t-th word in the vocabulary.

Elaboration 1: Downweight the influence of unlabeled examples by a factor λ in a new M step, with λ chosen by cross-validation.

Using one labeled example per class.

Experimental Evaluation
• Newsgroup postings – 20 newsgroups, 1000 documents per group.
• Web page classification – student, faculty, course, project – 4,199 web pages.
• Reuters newswire articles – 12,902 articles, 90 topic categories.

20 Newsgroups (results figures)

What you should know about EM
• For learning from partly unobserved data:
  – MLE estimate of θ: argmax_θ log P(X, Z | θ)
  – EM estimate: θ = argmax_θ E_{Z|X,θ}[log P(X, Z | θ)]
    where X is the observed part of the data and Z is the unobserved part.
• EM for training Bayes networks.
• Can also develop a MAP version of EM.
• Given some time, be able to
derive your own EM algorithm for your own problem.
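As a concrete exercise in deriving your own EM algorithm, here is a minimal end-to-end sketch for the two-component spherical Gaussian mixture used in these slides (σ known and shared; π and the two mean vectors unknown). The initialization scheme, iteration count, and function name are arbitrary illustrative choices, not from the lecture:

```python
import numpy as np

def em_mixture(X, sigma=1.0, iters=100, seed=0):
    """EM for a 2-component mixture of spherical Gaussians with known,
    shared sigma; estimates the mixing weight pi and the two means."""
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    # initialize the means at two distinct random data points (arbitrary choice)
    mu0, mu1 = X[rng.choice(n, size=2, replace=False)].astype(float)
    pi = 0.5
    for _ in range(iters):
        # E step: gamma[n] = P(z(n)=1 | x(n), theta)
        log_p1 = np.log(pi) - 0.5 * np.sum((X - mu1) ** 2, axis=1) / sigma**2
        log_p0 = np.log(1 - pi) - 0.5 * np.sum((X - mu0) ** 2, axis=1) / sigma**2
        gamma = 1.0 / (1.0 + np.exp(np.clip(log_p0 - log_p1, -30, 30)))
        # M step: the MLE formulas with each count replaced by its expected count
        pi = gamma.mean()                                        # pi' = (1/N) sum_n E[z(n)]
        mu1 = (gamma[:, None] * X).sum(axis=0) / gamma.sum()
        mu0 = ((1 - gamma)[:, None] * X).sum(axis=0) / (1 - gamma).sum()
    return pi, mu0, mu1
```

Note how the M step mirrors the fully observed MLE exactly, with the hard indicator δ(z(n)=j) replaced by the soft responsibility γ computed in the E step.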