CMU CS 10701 - Data likelihood for BNs

EM for Bayes Nets [slide 1]
Machine Learning 10-701/15-781, Carlos Guestrin, Carnegie Mellon University, April 17th, 2006.

Data likelihood for BNs [slide 2]
[Running example: a BN with Flu and Allergy as parents of Sinus, and Sinus as parent of Nose and Headache.]
Given the structure, the log likelihood of fully observed data decomposes into one term per CPT.

Marginal likelihood [slide 3]
What if S (Sinus) is hidden? We can only write the marginal likelihood, summing out the hidden variable.

Log likelihood for BNs with hidden data [slide 4]
Marginal likelihood, where O is observed and H is hidden.

E-step for BNs [slide 5]
The E-step computes the probability of the hidden variables h given the observations o. This corresponds to inference in the BN.

The M-step for BNs [slide 6]
Maximization step: use expected counts instead of counts. If learning requires Count(h, o), use E_{Q^(t+1)}[Count(h, o)].

M-step for each CPT [slide 7]
The M-step decomposes per CPT; the standard M-step (MLE) just uses expected counts.

Computing expected counts [slide 8]
The M-step requires expected counts: for a set of variables A, we must compute ExCount(A = a). Some of A in example j will be observed, denoted A_O = a_O^(j); some of A will be hidden, denoted A_H. Use inference (the E-step computes the expected counts):
    ExCount^(t+1)(A_O = a_O^(j), A_H = a_H)  <-  P(A_H = a_H | A_O = a_O^(j), theta^(t))

Data need not be hidden in the same way [slide 9]
When data is fully observed, a data point is a complete assignment; when data is partially observed, a data point is a partial assignment, and the unobserved variables can differ from one data point to the next (e.g., a different subset of variables may be missing in each record). Same framework, just change the definition of the expected counts:
    ExCount^(t+1)(A_O = a_O^(j), A_H = a_H)  <-  P(A_H = a_H | A_O = a_O^(j), theta^(t))

What you need to know: EM for Bayes Nets [slide 10]
- E-step: inference computes the expected counts; only expected counts over each X_i and its parents Pa_{X_i} are needed.
- M-step: the expected counts are used to estimate the parameters.
- Hidden variables can change per data point.
- Use labeled and unlabeled data: some data points are complete, some include hidden variables.
(A minimal code sketch of this E-step/M-step bookkeeping follows immediately below.)
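
To make the E-step/M-step bookkeeping above concrete, here is a minimal sketch (not from the slides) of one EM iteration for a tiny chain network F -> S -> N in which F and N are observed and S is hidden. The network, the toy data, and the dictionary-based CPT representation are illustrative assumptions; the point is only that the E-step is per-record inference over the hidden variable and the M-step is a per-CPT MLE with expected counts in place of counts.

```python
from collections import defaultdict

# Toy binary data for the assumed chain F -> S -> N: each record observes
# (F, N); the middle variable S is hidden in every record.
data = [(1, 1), (1, 0), (0, 0), (0, 1), (1, 1)]

# Current parameter estimates theta^(t), stored as dictionaries.
p_S_given_F = {(1, 1): 0.7, (0, 1): 0.3,   # P(S=s | F=1), keyed by (s, f)
               (1, 0): 0.2, (0, 0): 0.8}   # P(S=s | F=0)
p_N_given_S = {(1, 1): 0.9, (0, 1): 0.1,   # P(N=n | S=1), keyed by (n, s)
               (1, 0): 0.3, (0, 0): 0.7}   # P(N=n | S=0)

# E-step: per-record inference gives q(s) = P(S=s | F=f, N=n, theta^(t));
# accumulate expected counts over each variable and its parents.
ex_SF = defaultdict(float)   # ExCount(S=s, F=f)
ex_NS = defaultdict(float)   # ExCount(N=n, S=s)
cnt_F = defaultdict(float)   # Count(F=f); F is fully observed here

for f, n in data:
    joint = {s: p_S_given_F[(s, f)] * p_N_given_S[(n, s)] for s in (0, 1)}
    z = sum(joint.values())
    for s in (0, 1):
        q = joint[s] / z              # posterior over the hidden variable
        ex_SF[(s, f)] += q
        ex_NS[(n, s)] += q
    cnt_F[f] += 1.0

# M-step: MLE per CPT, with expected counts used exactly like ordinary counts.
# (P(F) would be re-estimated from plain counts, since F is always observed.)
new_p_S_given_F = {(s, f): ex_SF[(s, f)] / cnt_F[f]
                   for s in (0, 1) for f in (0, 1)}
ex_S = {s: ex_SF[(s, 0)] + ex_SF[(s, 1)] for s in (0, 1)}
new_p_N_given_S = {(n, s): ex_NS[(n, s)] / ex_S[s]
                   for n in (0, 1) for s in (0, 1)}

print(new_p_S_given_F)
print(new_p_N_given_S)
```

Iterating the two steps until the marginal log likelihood stops improving gives the usual EM loop; for records where S happens to be observed, q would simply be a point mass on the observed value, which is how the same framework handles data that is hidden in different ways per record.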
Co-Training for Semi-supervised Learning [slide 11]
Reading: Blum & Mitchell 1999 (see class website).
Machine Learning 10-701/15-781, Carlos Guestrin, Carnegie Mellon University, April 17th, 2006.

Redundant information [slides 12-14]
Two redundant views of the same example: the text on a web page, and the anchor text of the hyperlinks pointing to it.

Exploiting redundant information in semi-supervised learning [slide 15]
We want to predict Y from features X, i.e. learn f: X -> Y; we have some labeled data L and lots of unlabeled data U. Co-training assumption: X is very expressive, X = (X1, X2), and we can learn g1: X1 -> Y and g2: X2 -> Y.

Co-Training: key idea [slide 16]
Classifier 1 and Classifier 2 must (a) correctly classify the labeled data and (b) agree on the unlabeled data.

Co-Training algorithm [Blum & Mitchell '99] [slide 17]
(A hedged code sketch of a co-training loop is given at the end of these notes.)

Co-Training experimental results [slide 18]

Co-Training theory [slide 19]
We want to predict Y from features X, f: X -> Y. Co-training assumption: X is very expressive, X = (X1, X2); we want to learn g1: X1 -> Y and g2: X2 -> Y. Assumption: there exist g1, g2 such that g1(x1) = f(x) = g2(x2) for all x. Questions: Does unlabeled data always help? How many labeled examples do I need? How many unlabeled examples do I need?

Understanding Co-Training: a simple setting [slide 20]
Suppose X1 and X2 are discrete, with |X1| = |X2| = N, and there is no label noise. Without unlabeled data, how hard is it to learn g1 (or g2)?

Co-Training in the simple setting [slides 21-23]
Iteration 0, iteration 1, and after convergence.

Co-Training in the simple setting: connected components [slide 24]
Suppose we have infinite unlabeled data. Co-training must have at least one labeled example in each connected component of the graph over L and U. What's the probability of making an error? For k connected components, how much labeled data is needed?

How much unlabeled data? [slide 25]

Co-Training theory, continued [slide 26]
Same setup: f: X -> Y, X = (X1, X2), g1: X1 -> Y, g2: X2 -> Y, with g1(x1) = f(x) = g2(x2) for all x. One co-training result [Blum & Mitchell '99]: if X1 and X2 are conditionally independent given Y, and g1, g2 are PAC-learnable from noisy data (and thus so is f), then f is PAC-learnable from a weak initial classifier plus unlabeled data.

What you need to know about co-training [slide 27]
- Unlabeled data can help supervised learning a lot when there are mostly independent, redundant features.
- One theoretical result: if X1 and X2 are conditionally independent given Y, and g1, g2 are PAC-learnable from noisy data (and thus so is f), then f is PAC-learnable from a weak initial classifier plus unlabeled data. Disagreement between g1 and g2 provides a bound on the error of the final classifier.
- Applied in many real-world settings: semantic lexicon generation [Riloff & Jones '99; Collins & Singer '99; Jones '05], web page classification [Blum & Mitchell '99], word sense disambiguation [Yarowsky '95], speech recognition [de Sa & Ballard '98], visual classification of cars [Levin, Viola & Freund '03].

Acknowledgement [slide 28]
I would like to thank Tom Mitchell for some of the material used in this presentation of co-training.

Transductive SVMs [slide 29]
Reading: Vapnik 1998; Joachims 1999 (see class website).
Machine Learning 10-701/15-781, Carlos Guestrin, Carnegie Mellon University, April 17th, 2006.

Semi-supervised learning and discriminative models [slide 30]
We have seen semi-supervised learning for generative models (EM). What can we do for discriminative models? Not regular EM, since we can't compute P(x). But there are discriminative versions of EM, there is co-training, and there are many other tricks; let's see an example.

Linear classifiers [slide 31]
Which line is better? Data: examples (x^(j), y^(j)); the classifier scores each example with w . x = sum_i w_i x_i.

Support vector machines (SVMs) [slide 32]
[Figure: separating hyperplane w . x + b = 0 with margin hyperplanes w . x + b = +1 and w . x + b = -1.]
Solve efficiently by quadratic programming (QP); well-studied solution algorithms. The hyperplane is defined by the support vectors.

What if we have unlabeled data? [slide 33]
Now there are n_L labeled examples and n_U unlabeled examples; the classifier is still w . x = sum_i w_i x_i.

Transductive support vector machines (TSVMs) [slides 34-35]
[Figure: the same margin picture, hyperplanes w . x + b = +1, 0, -1, now drawn over labeled and unlabeled points.]

What's the difference between transductive learning and semi-supervised learning? [slide 36]
Not much, and a lot. Semi-supervised learning: use labeled and unlabeled data to learn w, then use w on test data. Transductive learning: same algorithms for labeled and unlabeled data, but the unlabeled data IS the test data. You are learning on the test data; that is OK because you never look at the labels of the test data. You can get better classification, but be very, very careful: never use test-data prediction accuracy to tune parameters, select kernels, etc.

Adding slack variables [slide 37]
[Figure: the margin picture again, with slack for points that violate the margin.]

Transductive SVMs, now with slack variables [Vapnik '98] [slide 38]
[Figure: the margin picture again, with slack variables for both labeled and unlabeled points.]
(A standard way to write the resulting objective is given right after these notes.)

Learning transductive SVMs is hard [slide 39]
The optimization is an integer program and is NP-hard. There are well-studied solution algorithms, but they will not scale up …
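
For reference, the slack-variable TSVM objective sketched on the last two slides is usually written in the following form. The notation here is mine, chosen to match the figures above: xi_j are slacks for the n_L labeled points, xi*_k and the unknown labels y*_k belong to the n_U unlabeled/test points, and C, C* trade off the two kinds of slack. The discrete choice of the y*_k is exactly what turns the problem into the integer program mentioned on the final slide.

```latex
\min_{w,\,b,\,\xi,\,\xi^{*},\,y^{*}} \quad
  \tfrac{1}{2}\lVert w\rVert^{2}
  + C \sum_{j=1}^{n_L} \xi_j
  + C^{*} \sum_{k=1}^{n_U} \xi^{*}_{k}

\text{subject to} \quad
  y_j \, (w \cdot x_j + b) \ge 1 - \xi_j, \qquad \xi_j \ge 0,
  \qquad j = 1, \dots, n_L,

\phantom{\text{subject to}} \quad
  y^{*}_{k} \, (w \cdot x^{*}_{k} + b) \ge 1 - \xi^{*}_{k}, \qquad
  \xi^{*}_{k} \ge 0, \quad y^{*}_{k} \in \{-1, +1\},
  \qquad k = 1, \dots, n_U.
```

Dropping the unlabeled terms recovers the ordinary soft-margin SVM of slide 32; practical TSVM solvers (e.g., Joachims 1999) start from a guessed labeling of the unlabeled points and locally improve it rather than solving the integer program exactly.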

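Finally, returning to the co-training algorithm of slide 17: below is a minimal sketch of a Blum-Mitchell-style loop, not the exact procedure from the paper. Two classifiers, one per view, are trained on the labeled pool; each round, the most confidently predicted unlabeled examples are given the corresponding label and moved into the labeled pool for both views. The scikit-learn-style interface (fit / predict_proba / classes_), the per-round budget, and the "more confident view wins" rule are all assumptions made for this sketch.

```python
import numpy as np

def co_train(clf1, clf2, X1_lab, X2_lab, y_lab, X1_unlab, X2_unlab,
             rounds=10, per_round=5):
    """Grow the labeled pool by letting each view label confident examples."""
    X1_l, X2_l, y_l = list(X1_lab), list(X2_lab), list(y_lab)
    X1_u, X2_u = list(X1_unlab), list(X2_unlab)
    for _ in range(rounds):
        if not X1_u:
            break
        clf1.fit(np.array(X1_l), np.array(y_l))
        clf2.fit(np.array(X2_l), np.array(y_l))
        p1 = clf1.predict_proba(np.array(X1_u))
        p2 = clf2.predict_proba(np.array(X2_u))
        # Pick the unlabeled examples that either view is most confident about.
        conf = np.maximum(p1.max(axis=1), p2.max(axis=1))
        picks = np.argsort(-conf)[:per_round]
        for i in sorted(picks, reverse=True):
            # Use the label from the more confident view, then move the example
            # (in both views) from the unlabeled pool to the labeled pool.
            if p1[i].max() >= p2[i].max():
                label = clf1.classes_[p1[i].argmax()]
            else:
                label = clf2.classes_[p2[i].argmax()]
            X1_l.append(X1_u.pop(i))
            X2_l.append(X2_u.pop(i))
            y_l.append(label)
    # Final fit of both view-specific classifiers on the grown labeled pool.
    clf1.fit(np.array(X1_l), np.array(y_l))
    clf2.fit(np.array(X2_l), np.array(y_l))
    return clf1, clf2
```

In the original Blum-Mitchell algorithm, each classifier instead adds a fixed number of its own most confident positive and negative examples per round, drawn from a small random subsample of U, which keeps the class balance of the growing labeled pool under control.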
