Expectation Maximization, and Learning from Partly Unobserved Data (part 2)
Machine Learning 10-701, April 2005
Tom M. Mitchell, Carnegie Mellon University

Outline
• Clustering
  – K-means
  – EM: Mixture of Gaussians
• Training classifiers with partially unlabeled data
  – Naïve Bayes and EM
  – Reweighting the labeled examples using P(X)
  – Co-training
  – Regularization based on …

1. Unsupervised Clustering: K-means and Mixtures of Gaussians

Clustering
• Given a set of data points, group them
• Unsupervised learning
• Which patients are similar? (or which earthquakes, customers, faces, …)

K-means Clustering
Given data $\langle x_1 \dots x_n \rangle$ and K, assign each $x_i$ to one of K clusters $C_1 \dots C_K$, minimizing
  $J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \| x_i - \mu_j \|^2$
where $\mu_j$ is the mean over all points in cluster $C_j$.

K-Means Algorithm:
Initialize the means $\mu_1 \dots \mu_K$ randomly.
Repeat until convergence:
1. Assign each point $x_i$ to the cluster with the closest mean $\mu_j$
2. Calculate the new mean for each cluster

K-Means applet
• Run applet
• Try 3 clusters, 15 points

Mixtures of Gaussians
K-means is EM-ish, but makes 'hard' assignments of $x_i$ to clusters. Let's derive a real EM algorithm for clustering.
What objective function shall we optimize?
• Maximize data likelihood!
What form of P(X) should we assume?
• Mixture of Gaussians

Mixture distribution:
• Assume P(x) is a mixture of K different Gaussians
• Assume each data point x is generated by a 2-step process:
  1. $z \leftarrow$ choose one of the K Gaussians, according to $\pi_1 \dots \pi_K$
  2. Generate x according to the Gaussian $N(\mu_z, \Sigma_z)$

EM for Mixture of Gaussians
Simplify to make this easier:
1. Assume the $X_i$ are conditionally independent given Z.
2. Assume only 2 classes: $z \in \{0, 1\}$, with $P(z{=}1) = \pi$.
3. Assume σ known; $\pi_1 \dots \pi_K$ and $\mu_{1i} \dots \mu_{Ki}$ unknown.
Observed: X. Unobserved: Z.
(Graphical model: Z → X1, X2, X3, X4.)

EM
Given observed variables X, unobserved Z, define
  $Q(\theta' \mid \theta) = E_{Z \sim P(Z \mid X, \theta)}\big[\log P(X, Z \mid \theta')\big]$
Iterate until convergence:
• E Step: Calculate $P(Z^{(n)} \mid X^{(n)}, \theta)$ for each example $X^{(n)}$. Use this to construct $Q(\theta' \mid \theta)$.
• M Step: Replace the current θ by $\theta \leftarrow \arg\max_{\theta'} Q(\theta' \mid \theta)$.
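To make the E/M loop concrete, here is a minimal NumPy sketch of EM for the simplified model above (two classes, conditionally independent features, shared known σ). Function and variable names (em_mixture, r, and so on) are illustrative choices, not from the lecture.

```python
import numpy as np

def em_mixture(X, sigma=1.0, n_iter=50, seed=0):
    """EM for a 2-component mixture of axis-aligned Gaussians with shared
    known variance sigma^2, matching the simplified model in the slides.
    Returns the mixing weight pi = P(z=1) and the two mean vectors."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = 0.5
    mu = X[rng.choice(n, size=2, replace=False)].astype(float)  # mu[0], mu[1]

    for _ in range(n_iter):
        # E step: responsibility r[n] = P(z=1 | x^(n), theta).
        # The shared Gaussian normalizers cancel, so only the exponents matter.
        log1 = np.log(pi) - ((X - mu[1]) ** 2).sum(axis=1) / (2 * sigma ** 2)
        log0 = np.log(1 - pi) - ((X - mu[0]) ** 2).sum(axis=1) / (2 * sigma ** 2)
        m = np.maximum(log0, log1)  # subtract the max for numerical stability
        r = np.exp(log1 - m) / (np.exp(log0 - m) + np.exp(log1 - m))

        # M step: expected counts stand in for the unobservable hard counts.
        pi = r.mean()
        mu[1] = (r[:, None] * X).sum(axis=0) / r.sum()
        mu[0] = ((1 - r)[:, None] * X).sum(axis=0) / (1 - r).sum()
    return pi, mu

# Two well-separated blobs; EM should recover means near (0,0) and (4,4).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
print(em_mixture(X))
```

Note how the E step's posteriors act as fractional counts in the M step; this is exactly the pattern derived on the next slides.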
EM – E Step
Calculate $P(Z^{(n)} \mid X^{(n)}, \theta)$ for each observed example $X^{(n)} = \langle x_1^{(n)}, x_2^{(n)}, \dots, x_T^{(n)} \rangle$. Under the model assumptions (conditionally independent features, shared known σ), the shared normalizing constants cancel:
  $P(z^{(n)}{=}1 \mid x^{(n)}, \theta) = \dfrac{\pi \prod_i \exp\!\big({-\tfrac{(x_i^{(n)} - \mu_{1i})^2}{2\sigma^2}}\big)}{\pi \prod_i \exp\!\big({-\tfrac{(x_i^{(n)} - \mu_{1i})^2}{2\sigma^2}}\big) + (1-\pi) \prod_i \exp\!\big({-\tfrac{(x_i^{(n)} - \mu_{0i})^2}{2\sigma^2}}\big)}$

EM – M Step
First consider the update for π. Terms of Q that do not involve π′ have no influence on the maximization; what remains is the expected count of examples with $z^{(n)} = 1$:
  $\pi \leftarrow \frac{1}{N} \sum_n P(z^{(n)}{=}1 \mid x^{(n)}, \theta)$

EM – M Step (continued)
Now consider the update for $\mu_{ji}$. Again, terms of Q that do not involve $\mu_{ji}'$ have no influence:
  $\mu_{ji} \leftarrow \dfrac{\sum_n P(z^{(n)}{=}j \mid x^{(n)}, \theta)\, x_i^{(n)}}{\sum_n P(z^{(n)}{=}j \mid x^{(n)}, \theta)}$
Compare this to the MLE if Z were observable:
  $\mu_{ji} = \dfrac{\sum_n \delta(z^{(n)}{=}j)\, x_i^{(n)}}{\sum_n \delta(z^{(n)}{=}j)}$
EM simply replaces the hard indicator δ with the soft posterior.

Mixture of Gaussians applet
• Run applet
• Try 2 clusters
• See different local minima with different random starts

K-Means vs Mixture of Gaussians
• Both are iterative algorithms that assign points to clusters
• Objective function:
  – K-means: minimize $\sum_j \sum_{x_i \in C_j} \| x_i - \mu_j \|^2$
  – Mixture of Gaussians: maximize $P(X \mid \theta)$
• Mixture of Gaussians is the more general formulation
  – Equivalent to K-means when $\Sigma_k = \sigma I$ and $\sigma \to 0$

Using Unlabeled Data to Help Train a Naïve Bayes Classifier
(Graphical model: Y → X1, X2, X3, X4.)

  Y  X1  X2  X3  X4
  1   0   0   1   1
  0   0   1   0   0
  0   0   0   1   0
  ?   0   1   1   0
  ?   0   1   0   1

Learn P(Y|X).

From [Nigam et al., 2000]:
• E Step: estimate $P(y_j \mid d_i; \hat\theta)$ for each document $d_i$ (labeled documents keep their given labels).
• M Step: re-estimate the Naïve Bayes parameters — the class priors and the word probabilities $P(w_t \mid y_j)$ — using these posteriors as fractional counts, where $w_t$ is the t-th word in the vocabulary.

Elaboration 1: Downweight the influence of unlabeled examples by factor λ
New M step: weight each unlabeled document's fractional counts by λ, with λ chosen by cross-validation.
(Results plot: using one labeled example per class.)

Experimental Evaluation
• Newsgroup postings: 20 newsgroups, 1,000 per group
• Web page classification: student, faculty, course, project; 4,199 web pages
• Reuters newswire articles: 12,902 articles; 90 topic categories

20 Newsgroups
(Results plots for the 20 Newsgroups task.)

Combining Labeled and Unlabeled Data
How else can unlabeled data be useful for supervised learning / function approximation?
1. Use U to reweight the labeled examples: the true error of h is $\sum_x P(x)\, \delta(h(x) \neq f(x))$, where δ is 1 if hypothesis h disagrees with the true function f on x, else 0; U provides an estimate of P(x) with which to reweight L.

3. If the Problem Setting Provides Redundantly Sufficient Features, use CoTraining
• In some settings the available data features are redundant, and we can train two classifiers using different features
• In this case the two classifiers should at least agree on the classification of each unlabeled example
• Therefore we can use the unlabeled data to constrain the training of both classifiers, forcing them to agree

Redundantly Sufficient Features
(Example: a faculty home page. The hyperlink words pointing to it — "Professor Faloutsos", "my advisor" — and the words on the page itself are each sufficient to classify it.)

CoTraining Algorithm #1 [Blum & Mitchell, 1998]
Given: labeled data L, unlabeled data U
Loop:
  Train g1 (hyperlink classifier) using L
  Train g2 (page classifier) using L
  Allow g1 to label p positive, n negative examples from U
  Allow g2 to label p positive, n negative examples from U
  Add these self-labeled examples to L
(A runnable sketch of this loop appears at the end of the section.)

CoTraining: Experimental Results
• Begin with 12 labeled web pages (academic course)
• Provide 1,000 additional unlabeled web pages
• Average error learning from the labeled data alone: 11.1%
• Average error with co-training: 5.0%
(Plot of a typical run.)

CoTraining Setting
Learn $f : X \to Y$, where $X = X_1 \times X_2$, x is drawn from an unknown distribution, and $\exists\, g_1, g_2 \;\; \forall x:\; g_1(x_1) = g_2(x_2) = f(x)$.
• If
  – $x_1$, $x_2$ are conditionally independent given y
  – f is PAC learnable from noisy labeled data
• Then
  – f is PAC learnable from a weak initial classifier plus unlabeled data

Co-Training Rote Learner
(Figure: a bipartite graph with hyperlink phrases such as "my advisor" on one side and pages on the other. Each rote learner is a lookup table over its view, so a label (+/−) on one example propagates to every example in the same connected component.)

Expected Rote CoTraining error given m examples
  $E[\text{error}] = \sum_j P(x \in g_j)\,\big(1 - P(x \in g_j)\big)^m$
where $g_j$ is the j-th connected component of the graph of labeled and unlabeled examples.
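The expected-error formula is straightforward to evaluate. Below is a minimal sketch, assuming we are handed the probability mass of each connected component; the component probabilities are made-up numbers for illustration.

```python
def expected_rote_cotrain_error(p_components, m):
    """E[error] = sum_j P(x in g_j) * (1 - P(x in g_j))^m.
    A test example is misclassified exactly when it lands in a connected
    component g_j that none of the m training examples happened to hit."""
    return sum(p * (1 - p) ** m for p in p_components)

# Four hypothetical components; more unlabeled data would merge them into
# fewer, larger components, driving the error down faster in m.
for m in (1, 5, 10, 50):
    print(m, expected_rote_cotrain_error([0.4, 0.3, 0.2, 0.1], m))
```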
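Finally, the sketch promised above for CoTraining Algorithm #1. Only the control flow follows [Blum & Mitchell, 1998]; the classifier objects and the confidence-based selection are assumptions (any estimators with fit/predict_proba, e.g. scikit-learn's, would do). The defaults p=1, n=3 echo the ratios used in the original web-page experiments.

```python
import numpy as np

def cotrain(g1, g2, X1_l, X2_l, y_l, X1_u, X2_u, p=1, n=3, rounds=30):
    """Co-training loop (control flow per Blum & Mitchell, 1998).
    g1, g2: classifiers over the two views, with fit/predict_proba.
    (X1_l, X2_l, y_l): the two views of the labeled set L (binary labels 0/1).
    (X1_u, X2_u): the two views of the unlabeled pool U.
    Each round, each classifier labels its p most confident positives and
    n most confident negatives from U; all are added to L."""
    X1_l, X2_l = np.asarray(X1_l, float), np.asarray(X2_l, float)
    X1_u, X2_u = np.asarray(X1_u, float), np.asarray(X2_u, float)
    y_l = np.asarray(y_l)
    pool = list(range(len(X1_u)))            # indices still unlabeled
    for _ in range(rounds):
        if not pool:
            break
        g1.fit(X1_l, y_l)
        g2.fit(X2_l, y_l)
        chosen = {}                          # pool index -> self-assigned label
        for g, Xu in ((g1, X1_u), (g2, X2_u)):
            probs = g.predict_proba(Xu[pool])[:, 1]
            order = np.argsort(probs)
            for i in order[-p:]:             # most confident positives
                chosen[pool[i]] = 1
            for i in order[:n]:              # most confident negatives
                chosen[pool[i]] = 0
            # If g1 and g2 pick the same example with different labels,
            # the later choice wins (a simplification in this sketch).
        for i, label in chosen.items():      # grow L, shrink U
            X1_l = np.vstack([X1_l, X1_u[i:i + 1]])
            X2_l = np.vstack([X2_l, X2_u[i:i + 1]])
            y_l = np.append(y_l, label)
        pool = [i for i in pool if i not in chosen]
    return g1, g2
```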