Co Training for Semisupervised learning cont Machine Learning 10701 15781 Carlos Guestrin Carnegie Mellon University April 23rd 2007 2005 2007 Carlos Guestrin 1 Exploiting redundant information in semi supervised learning Want to predict Y from features X f X aY have some labeled data L lots of unlabeled data U Co training assumption X is very expressive X X1 X2 can learn g1 X1 a Y g2 X2 a Y 2 2005 2007 Carlos Guestrin Co Training Algorithm Blum Mitchell 99 3 2005 2007 Carlos Guestrin Understanding Co Training A simple setting Suppose X1 and X2 are discrete X1 X2 N No label noise Without unlabeled data how hard is it to learn g1 or g2 4 2005 2007 Carlos Guestrin Co Training in simple setting Iteration 0 5 2005 2007 Carlos Guestrin Co Training in simple setting Iteration 1 6 2005 2007 Carlos Guestrin Co Training in simple setting after convergence 7 2005 2007 Carlos Guestrin Co Training in simple setting Connected components Suppose infinite unlabeled data Co training must have at least one labeled example in each connected component of L U graph What s probability of making an error For k Connected components how much labeled data 8 2005 2007 Carlos Guestrin How much unlabeled data 9 2005 2007 Carlos Guestrin Co Training theory Want to predict Y from features X Co training assumption X is very expressive f X a Y X X1 X2 want to learn g1 X1 a Y and g2 X2 a Y Assumption g1 g2 x g1 x1 f x g2 x2 f x One co training result Blum Mitchell 99 If X1 X2 Y g1 g2 are PAC learnable from noisy data and thus f Then f is PAC learnable from weak initial classifier plus unlabeled data 10 2005 2007 Carlos Guestrin What you need to know about cotraining Unlabeled data can help supervised learning a lot when there are mostly independent redundant features One theoretical result If X1 X2 Y and g1 g2 are PAC learnable from noisy data and thus f Then f is PAC learnable from weak initial classifier plus unlabeled data Disagreement between g1 and g2 provides bound on error of final classifier Applied in many real world settings Semantic lexicon generation Riloff Jones 99 Collins Singer 99 Jones 05 Web page classification Blum Mitchell 99 Word sense disambiguation Yarowsky 95 Speech recognition de Sa Ballard 98 Visual classification of cars Levin Viola Freund 03 2005 2007 Carlos Guestrin 11 Transductive SVMs Machine Learning 10701 15781 Carlos Guestrin Carnegie Mellon University April 23rd 2007 2005 2007 Carlos Guestrin 12 Semi supervised learning and discriminative models We have seen semi supervised learning for generative models EM What can we do for discriminative models Not regular EM we can t compute P x But there are discriminative versions of EM Co Training Many other tricks let s see an example 13 2005 2007 Carlos Guestrin Linear classifiers Which line is better Data Example i w x j w j x j 14 2005 2007 Carlos Guestrin 1 w x b 0 w x b w x b 1 Support vector machines SVMs Solve efficiently by quadratic programming QP Well studied solution algorithms Hyperplane defined by support vectors m a r g in 15 2005 2007 Carlos Guestrin What if we have unlabeled data nL Labeled Data Example i nU Unlabeled Data w x j w j x j 16 2005 2007 Carlos Guestrin Transductive support vector machines TSVMs x w b x w m ar gi n 1 b x w 0 b 1 17 2005 2007 Carlos Guestrin Transductive support vector machines TSVMs x w b x w m ar gi n 1 b x w 0 b 1 18 2005 2007 Carlos Guestrin What s the difference between transductive learning and semi supervised learning Not much and A lot Semi supervised learning Transductive learning labeled and unlabeled data learn w use w on test data same algorithms for labeled and unlabeled data but unlabeled data is test data You are learning on the test data OK because you never look at the labels of the test data can get better classification but be very very very very very very very very careful never use test data prediction accuracy to tune parameters select kernels etc 19 2005 2007 Carlos Guestrin Adding slack variables x w b x w m ar gi n 1 b x w 0 b 1 20 2005 2007 Carlos Guestrin Transductive SVMs now with slack variables Vapnik 98 w x b 1 x w w m ar gi n x b b 1 0 21 2005 2007 Carlos Guestrin Learning Transductive SVMs is hard w x b x w 1 0b x b w 1 Integer Program NP hard Well studied m ar gi solution algorithms but will not scale up to very large problems n 22 2005 2007 Carlos Guestrin A heuristic learning algorithm for Transductive SVMs Joachims 99 w x b x w 1 0b x b w 1 If you set to zero ignore unlabeled data Intuition of algorithm m ar gi n start with small add labels to some unlabeled data based on classifier prediction slowly increase keep on labeling unlabeled data and re running classifier 23 2005 2007 Carlos Guestrin Some results classifying news articles from Joachims 99 24 2005 2007 Carlos Guestrin What you need to know about transductive SVMs What is transductive v semi supervised learning Formulation for transductive SVM can also be used for semi supervised learning Optimization is hard Integer program There are simple heuristic solution methods that work well here 25 2005 2007 Carlos Guestrin Dimensionality reduction Machine Learning 10701 15781 Carlos Guestrin Carnegie Mellon University April 23rd 2007 2005 2007 Carlos Guestrin 26 Dimensionality reduction Input data may have thousands or millions of dimensions e g text data has Dimensionality reduction represent data with fewer dimensions easier learning fewer parameters visualization hard to visualize more than 3D or 4D discover intrinsic dimensionality of data high dimensional data that is truly lower dimensional 27 2005 2007 Carlos Guestrin Feature selection Want to learn f XaY X X1 Xn but some features are more important than others Approach select subset of features to be used by learning algorithm Score each feature or sets of features Select set of features with best score 28 2005 2007 Carlos Guestrin Simple greedy forward feature selection algorithm Pick a dictionary of features e g polynomials for linear regression Greedy heuristic Start from empty or simple set of features F0 Run learning algorithm for current set of features Ft Obtain ht Select next best feature Xi e g Xj that results in lowest crossvalidation error learner when learning with Ft Xj Ft 1 Ft Xi Recurse 29 2005 2007 Carlos Guestrin Simple greedy backward feature selection algorithm Pick a dictionary of features e g polynomials for linear regression Greedy heuristic Start from all features F0 F Run learning algorithm for current set of features Ft
View Full Document