CMU CS 10-701 - Expectation Maximization, and Learning from Partly Unobserved Data

Expectation Maximization and Learning from Partly Unobserved Data (part 2)
Machine Learning 10-701, April 2005
Tom M. Mitchell, Carnegie Mellon University

Outline
- Clustering: K-means, EM, mixtures of Gaussians
- Training classifiers with partially unlabeled data: Naive Bayes and EM; reweighting the labeled examples using P(X); co-training; regularization based on ...

1. Unsupervised Clustering: K-means and Mixtures of Gaussians

Clustering
Given a set of data points, group them. Unsupervised learning.
Which patients are similar? (or which earthquakes, customers, faces, ...)

K-means Clustering
Given data x_1, ..., x_n and K, assign each x_i to one of K clusters C_1, ..., C_K, minimizing
J = \sum_{j=1}^{K} \sum_{i \in C_j} \lVert x_i - \mu_j \rVert^2
where \mu_j is the mean over all points in cluster C_j.

K-Means Algorithm
Initialize the cluster means \mu_1, ..., \mu_K randomly. Repeat until convergence:
1. Assign each point x_i to the cluster with the closest mean \mu_j.
2. Recalculate the mean \mu_j of each cluster.

K-means applet: run the applet; try 3 clusters, 15 points.

Mixtures of Gaussians
K-means is EM'ish, but it makes hard assignments of each x_i to a cluster. Let's derive a real EM algorithm for clustering.
What objective function shall we optimize? Maximize the data likelihood.
What form of P(X) should we assume? A mixture of Gaussians.

Mixture distribution
Assume P(x) is a mixture of K different Gaussians, and that each data point x is generated by a 2-step process:
1. Choose one of the K Gaussians, z, according to mixing weights \pi_1, ..., \pi_K.
2. Generate x according to the Gaussian N(\mu_z, \Sigma_z).

EM for Mixture of Gaussians
Simplify to make this easier:
1. Assume the features X_i are conditionally independent given Z.
2. Assume only 2 classes, so Z is Bernoulli.
3. Assume \sigma is known; the mixing weights \pi_1, ..., \pi_K and the means \mu_{1i}, ..., \mu_{Ki} are unknown.
Observed: X = <X_1, X_2, X_3, X_4>. Unobserved: Z. (Graphical model: Z is the single parent of X_1, ..., X_4.)

EM
Given observed variables X and unobserved Z, define
Q(\theta' \mid \theta) = E_{Z \mid X, \theta} [\log P(X, Z \mid \theta')]
Iterate until convergence:
E-step: calculate P(Z(n) \mid X(n), \theta) for each training example X(n); use this to construct Q(\theta' \mid \theta).
M-step: replace the current \theta by \arg\max_{\theta'} Q(\theta' \mid \theta).

EM E-step
Calculate P(Z(n) = 1 \mid X(n), \theta) for each observed example X(n) = <x_1(n), x_2(n), ..., x_T(n)>:
P(Z(n)=1 \mid X(n), \theta) = \frac{\pi \prod_i P(x_i(n) \mid Z=1)}{\pi \prod_i P(x_i(n) \mid Z=1) + (1-\pi) \prod_i P(x_i(n) \mid Z=0)}

EM M-step
First consider the update for \pi:
\pi \leftarrow \frac{1}{N} \sum_n P(Z(n)=1 \mid X(n), \theta)
i.e., the expected count of examples with z(n) = 1, divided by the number of examples N.
Now consider the update for \mu_{ji}:
\mu_{ji} \leftarrow \frac{\sum_n P(Z(n)=j \mid X(n), \theta) \, x_i(n)}{\sum_n P(Z(n)=j \mid X(n), \theta)}
Compare this to the MLE if Z were observable: the same ratio, but with the indicator \delta(z(n)=j) in place of P(Z(n)=j \mid X(n), \theta).

Mixture of Gaussians applet: run the applet; try 2 clusters; note the different local minima reached from different random starts.

K-Means vs. Mixture of Gaussians
Both are iterative algorithms that assign points to clusters.
Objective function: K-means minimizes \sum_j \sum_{i \in C_j} \lVert x_i - \mu_j \rVert^2; mixture of Gaussians maximizes P(X \mid \theta).
Mixture of Gaussians is the more general formulation: it is equivalent to K-means when \Sigma_k = \sigma^2 I and \sigma \to 0.
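To make the E-step/M-step updates above concrete, here is a minimal Python sketch of EM for a two-component Gaussian mixture. It is not from the slides: it is one-dimensional for brevity (the lecture's version has several conditionally independent features X_i, which would simply multiply per-feature likelihoods in the E-step), the shared \sigma is assumed known, and the function and variable names are my own.

```python
# Minimal EM sketch for a 2-component Gaussian mixture (1-D, known shared sigma,
# unknown means and mixing weight pi = P(z = 1)). Illustrative, not the slides' code.
import numpy as np

def em_two_gaussians(x, sigma=1.0, iters=100, seed=0):
    """x: 1-D array of observations. Returns (pi, means, responsibilities)."""
    rng = np.random.default_rng(seed)
    pi = 0.5                                    # initial P(z = 1)
    mu = rng.choice(x, size=2, replace=False)   # initialize means at two data points
    for _ in range(iters):
        # E-step: responsibility r_n = P(z_n = 1 | x_n) via Bayes rule
        # (the Gaussian normalizing constants cancel because sigma is shared).
        like1 = pi * np.exp(-(x - mu[1]) ** 2 / (2 * sigma ** 2))
        like0 = (1 - pi) * np.exp(-(x - mu[0]) ** 2 / (2 * sigma ** 2))
        r = like1 / (like0 + like1)
        # M-step: responsibility-weighted re-estimates, the soft-count analogues
        # of the MLE updates we would use if z were observed.
        pi = r.mean()
        mu[1] = (r * x).sum() / r.sum()
        mu[0] = ((1 - r) * x).sum() / (1 - r).sum()
    return pi, mu, r

# Example: two well-separated clusters; EM should recover means near -2 and +3.
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])
pi_hat, mu_hat, _ = em_two_gaussians(data)
print(pi_hat, mu_hat)
```

If the responsibilities r were rounded to hard 0/1 assignments, the M-step would reduce to the K-means mean update, which is the sense in which K-means is the \sigma \to 0 special case noted above.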
Using Unlabeled Data to Help Train a Naive Bayes Classifier
Learn P(Y | X), with class label Y and features X = <X_1, X_2, X_3, X_4>.
(Figure: a training set with a few rows fully labeled with Y and many rows whose Y value is missing, i.e., unlabeled examples.)

From Nigam et al., 2000:
E-step: estimate P(Y(n) = c | X(n)) for each unlabeled document, using the current classifier.
M-step: re-estimate the Naive Bayes parameters P(Y) and P(w_t | Y) from the labeled data plus the probabilistically labeled unlabeled data, where w_t is the t-th word in the vocabulary.

Elaboration 1: downweight the influence of the unlabeled examples by a factor \lambda in a new M-step, with \lambda chosen by cross-validation.

(Slide: using one labeled example per class.)

Experimental Evaluation
- Newsgroup postings: 20 newsgroups, 1000 postings per group.
- Web page classification: student / faculty / course / project, 4199 web pages.
- Reuters newswire articles: 12,902 articles, 90 topic categories.

(Two results figures: 20 Newsgroups.)

Combining Labeled and Unlabeled Data
How else can unlabeled data be useful for supervised learning (function approximation)?
1. Use U to reweight the labeled examples, estimating the true error \sum_x P(x)\, \delta(h(x) \neq f(x)), where \delta is 1 if hypothesis h disagrees with the true function f, else 0.
3. If the problem setting provides redundantly sufficient features, use co-training.

In some settings the available data features are redundant, and we can train two classifiers using different features. In this case the two classifiers should at least agree on the classification of each unlabeled example. Therefore we can use the unlabeled data to constrain the training of both classifiers, forcing them to agree.

Redundantly Sufficient Features
(Example figure, repeated across several slides: a faculty home page can be classified either from the words on the page itself or from the words on hyperlinks pointing to it, e.g., "Professor Faloutsos", "my advisor".)

CoTraining Algorithm #1 (Blum & Mitchell, 1998)
Given labeled data L and unlabeled data U, loop:
- Train g1 (the hyperlink classifier) using L.
- Train g2 (the page classifier) using L.
- Allow g1 to label p positive and n negative examples from U.
- Allow g2 to label p positive and n negative examples from U.
- Add these self-labeled examples to L.

CoTraining: Experimental Results
Begin with 12 labeled web pages (academic course); provide 1,000 additional unlabeled web pages.
Average error learning from the labeled data alone: 11.1%. Average error with co-training: 5.0%.
(Figure: a typical run.)

CoTraining Setting
Learn f : X \to Y, where X = X_1 \times X_2, x is drawn from an unknown distribution, and there exist g_1, g_2 such that g_1(x_1) = g_2(x_2) = f(x) for all x.
If x_1 and x_2 are conditionally independent given y, and f is PAC learnable from noisy labeled data, then f is PAC learnable from a weak initial classifier plus unlabeled data.

Co-Training Rote Learner
(Figure: a graph linking pages and the hyperlinks that point to them, e.g., "my advisor".)

Expected rote co-training error, given m examples, in the co-training setting above:
E[\text{error}] = \sum_j P(x \in g_j) (1 - P(x \in g_j))^m
where g_j is the j-th connected component of the graph.

How many unlabeled examples suffice?
We want to assure that the connected components in the underlying distribution G_D are also connected components in the observed sample G_S.
O(\log(N)/\alpha) examples assure that, with high probability, G_S has the same connected components as G_D (Karger, 1994), where N is the size of G_D and \alpha is the min cut over all connected components of G_D.

PAC Generalization Bounds on CoTraining: Dasgupta et al., NIPS 2001.

What if the CoTraining Assumption Is Not Perfectly Satisfied?
Idea: we want classifiers that produce a maximally consistent labeling of the data. If learning is an optimization problem, what function should we optimize?

What Objective Function?
E = E_1 + E_2 + c_3 E_3 + c_4 E_4
E_1 = \sum_{(x,y) \in L} (y - \hat{g}_1(x_1))^2 — error of g_1 on the labeled examples
E_2 = \sum_{(x,y) \in L} (y - \hat{g}_2(x_2))^2 — error of g_2 on the labeled examples
E_3 = \sum_{x \in U} (\hat{g}_1(x_1) - \hat{g}_2(x_2))^2 — disagreement over the unlabeled examples
E_4: misfit to the estimated class priors — compares the mean label over L with the mean classifier output over L \cup U.

What Function Approximators?
\hat{g}_1(x) = \frac{1}{1 + e^{-\sum_j w_{j1} x_j}}, \quad \hat{g}_2(x) = \frac{1}{1 + e^{-\sum_j w_{j2} x_j}}
Same functional form as Naive Bayes and maximum entropy.
Use gradient descent to simultaneously learn g_1 and g_2, directly minimizing E = E_1 + E_2 + E_3 + E_4.
No word-independence assumption; uses both labeled and unlabeled data.

Classifying Jobs for FlipDog
X1 = job title, X2 = job description.

Gradient CoTraining: classifying FlipDog job descriptions (SysAdmin vs. WebProgrammer).
Final accuracy: labeled data alone 86%; co-training 96%.

Gradient CoTraining: classifying upper-case sequences as person names.
Error rates compared for: using labeled data only, co-training, and co-training without fitting class priors (E_4); with 25 labeled + 5,000 unlabeled examples, and with 2,300 labeled + 5,000 unlabeled examples. ...
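As a concrete illustration of the loop in CoTraining Algorithm #1 above, here is a short Python sketch. It is not the Blum & Mitchell implementation: the base learner is passed in as a pair of caller-supplied functions (train and prob_positive), the two feature views are kept as parallel lists, and the values of p, n, and rounds are placeholders.

```python
# Sketch of the CoTraining loop (Blum & Mitchell, 1998): two classifiers trained
# on different feature views repeatedly self-label confident unlabeled examples.
import numpy as np

def cotrain(view1, view2, y, u1, u2, train, prob_positive, p=1, n=3, rounds=30):
    """view1/view2: labeled examples under the two feature views (x1 and x2).
    u1/u2: the same unlabeled pool under each view (kept parallel).
    train(X, y) -> classifier; prob_positive(clf, X) -> array of P(y=1 | x)."""
    L1, L2, y = list(view1), list(view2), list(y)
    U1, U2 = list(u1), list(u2)
    for _ in range(rounds):
        if not U1:
            break
        g1 = train(L1, y)   # e.g., hyperlink classifier
        g2 = train(L2, y)   # e.g., page classifier
        # Let each classifier self-label its p most confident positives and
        # n most confident negatives from U, then add them to L.
        for g, view in ((g1, U1), (g2, U2)):
            if not view:
                break
            probs = np.asarray(prob_positive(g, view))
            order = np.argsort(probs)
            picks = set(order[-p:]) | set(order[:n])
            for i in sorted(picks, reverse=True):   # delete from the end first
                L1.append(U1[i])
                L2.append(U2[i])
                y.append(int(probs[i] >= 0.5))      # classifier's own prediction
                del U1[i], U2[i]
    return train(L1, y), train(L2, y)
```

For the web-page task above, view1 would hold the hyperlink features and view2 the page features, with any probabilistic classifier (for example Naive Bayes) supplying train and prob_positive.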

