©2005-2007 Carlos Guestrin

Co-Training for Semi-supervised Learning (cont.)
Machine Learning – 10701/15781
Carlos Guestrin, Carnegie Mellon University
April 23rd, 2007

Exploiting redundant information in semi-supervised learning
- Want to predict Y from features X: f(X) → Y
  - have some labeled data L
  - lots of unlabeled data U
- Co-training assumption: X is very expressive
  - X = (X1, X2)
  - can learn g1(X1) → Y and g2(X2) → Y

Co-Training Algorithm [Blum & Mitchell ’99]

Understanding Co-Training: A simple setting
- Suppose X1 and X2 are discrete, with |X1| = |X2| = N
- No label noise
- Without unlabeled data, how hard is it to learn g1 (or g2)?

Co-Training in the simple setting – Iteration 0

Co-Training in the simple setting – Iteration 1

Co-Training in the simple setting – after convergence

Co-Training in the simple setting – Connected components
- Suppose infinite unlabeled data
- Co-training must have at least one labeled example in each connected component of the L+U graph
- What’s the probability of making an error?
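The "Iteration 0 / Iteration 1 / after convergence" pictures in this simple setting can be sketched in code: each classifier g_i is just a lookup table over the discrete values of X_i, and co-training repeatedly copies labels from one view to the other until nothing changes. This is an illustrative sketch (the function name and toy data are made up for this example, not taken from the lecture):

```python
def co_train_discrete(labeled, unlabeled):
    """Co-training in the simple setting above: X1, X2 discrete, no label
    noise. Each classifier g_i is a lookup table from values of X_i to
    labels; an unlabeled pair (x1, x2) lets a label known for x1 spread
    to x2 and back."""
    g1, g2 = {}, {}
    for (x1, x2), y in labeled:
        g1[x1] = y
        g2[x2] = y
    changed = True
    while changed:  # iterate until convergence (iterations 0, 1, ... above)
        changed = False
        for x1, x2 in unlabeled:
            y = g1.get(x1, g2.get(x2))
            if y is not None and (x1 not in g1 or x2 not in g2):
                g1[x1] = y
                g2[x2] = y
                changed = True
    return g1, g2

# Toy data (made up): one labeled pair, three unlabeled pairs.
g1, g2 = co_train_discrete([(("a", "p"), 1)],
                           [("a", "q"), ("b", "q"), ("c", "r")])
# ("c", "r") sits in a connected component of the L+U graph with no
# labeled example, so it never receives a label.
```

Note how labels propagate exactly within connected components of the L+U graph: any value reachable from a labeled example through shared pairs ends up labeled, and any component with no labeled example stays unlabeled, which is the condition stated above.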
- For k connected components, how much labeled data?

How much unlabeled data?

Co-Training theory
- Want to predict Y from features X: f(X) → Y
- Co-training assumption: X is very expressive
  - X = (X1, X2)
  - want to learn g1(X1) → Y and g2(X2) → Y
  - Assumption: ∃ g1, g2 such that ∀x, g1(x1) = f(x) and g2(x2) = f(x)
- One co-training result [Blum & Mitchell ’99]:
  - If (X1 ⊥ X2 | Y) and g1 & g2 are PAC learnable from noisy data (and thus f)
  - Then f is PAC learnable from a weak initial classifier plus unlabeled data

What you need to know about co-training
- Unlabeled data can help supervised learning (a lot) when there are (mostly) independent, redundant features
- One theoretical result:
  - If (X1 ⊥ X2 | Y) and g1 & g2 are PAC learnable from noisy data (and thus f)
  - Then f is PAC learnable from a weak initial classifier plus unlabeled data
- Disagreement between g1 and g2 provides a bound on the error of the final classifier
- Applied in many real-world settings:
  - Semantic lexicon generation [Riloff, Jones ’99], [Collins, Singer ’99], [Jones ’05]
  - Web page classification [Blum, Mitchell ’99]
  - Word sense disambiguation [Yarowsky ’95]
  - Speech recognition [de Sa, Ballard ’98]
  - Visual classification of cars [Levin, Viola, Freund ’03]

Announcements
- Poster session: Friday, May 4th, 2–5pm, NSH Atrium
  - It will be fun… ☺
  - There will be a prize!!! (popular vote)
  - Print your posters early!!!
- Please, please, please fill in your FCEs

Transductive SVMs
Machine Learning – 10701/15781
Carlos Guestrin, Carnegie Mellon University
April 23rd, 2007

Semi-supervised learning and discriminative models
- We have seen semi-supervised learning for generative models: EM
- What can we do for discriminative models?
  - Not regular EM – we can’t compute P(x)
  - But there are discriminative versions of EM
  - Co-Training!
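Returning to the connected-components question above (for k components, how much labeled data?): if we assume each labeled example lands in one of the k components uniformly at random (an assumption not made explicit on the slides), covering every component is the classic coupon-collector problem, so about k ln k labeled examples are needed in expectation:

```python
def expected_labeled_for_k_components(k):
    """Expected number of labeled examples needed so that every one of k
    connected components contains at least one labeled example, assuming
    each labeled example lands in a uniformly random component. This is
    the coupon-collector expectation: k * H_k = k*(1 + 1/2 + ... + 1/k),
    which grows like k ln k."""
    return k * sum(1.0 / i for i in range(1, k + 1))

# Hypothetical example: 4 components need about 8.3 labeled examples
# in expectation.
need = expected_labeled_for_k_components(4)  # 4 * (1 + 1/2 + 1/3 + 1/4) = 25/3
```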
  - Many other tricks… let’s see an example

Linear classifiers – Which line is better?
- Data: example i
- w·x = Σ_j w^(j) x^(j)

Support vector machines (SVMs)
[figure: hyperplanes w·x + b = +1, −1, 0; margin γ]
- Solve efficiently by quadratic programming (QP)
  - Well-studied solution algorithms
- Hyperplane defined by support vectors

What if we have unlabeled data?
- n_L labeled data: example i: w·x = Σ_j w^(j) x^(j)
- n_U unlabeled data

Transductive support vector machines (TSVMs)
[figure slides: hyperplanes w·x + b = +1, −1, 0; margin γ]

What’s the difference between transductive learning and semi-supervised learning?
- Not much, and a lot!!!
- Semi-supervised learning:
  - labeled and unlabeled data → learn w
  - use w on test data
- Transductive learning:
  - same algorithms for labeled and unlabeled data, but…
  - unlabeled data is the test data!!! You are learning on the test data!!!
  - OK, because you never look at the labels of the test data
  - can get better classification
  - but be very, very careful!!!
    - never use test-data prediction accuracy to tune parameters, select kernels, etc.

Adding slack variables
[figure: hyperplanes w·x + b = +1, −1, 0; margin γ]

Transductive SVMs – now with slack variables! [Vapnik ’98]

Learning Transductive SVMs is hard!
- Integer program – NP-hard!!!
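The integer program referred to above can be written out. This is a standard statement of the TSVM with slack [Vapnik ’98]; the symbols C and C* for the penalties on labeled and unlabeled slack are a common notational choice and may differ from the original slides:

```latex
\min_{\mathbf{w},\,b,\;y^*_1,\dots,y^*_{n_U},\;\boldsymbol{\xi},\,\boldsymbol{\xi}^*}
  \;\tfrac{1}{2}\|\mathbf{w}\|^2
  + C \sum_{i=1}^{n_L} \xi_i
  + C^* \sum_{j=1}^{n_U} \xi^*_j
\quad\text{s.t.}\quad
  y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0
  \quad (i = 1,\dots,n_L),
\qquad
  y^*_j(\mathbf{w}\cdot\mathbf{x}^*_j + b) \ge 1 - \xi^*_j,\;\;
  \xi^*_j \ge 0,\;\; y^*_j \in \{-1,+1\}
  \quad (j = 1,\dots,n_U).
```

The minimization over the binary labels y*_j of the unlabeled examples — 2^{n_U} possible labelings — is what makes this an integer program rather than a QP; setting C* = 0 recovers the ordinary soft-margin SVM on the labeled data alone.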
- Well-studied solution algorithms exist, but they will not scale up to very large problems

A (heuristic) learning algorithm for Transductive SVMs [Joachims ’99]
[figure: hyperplanes w·x + b = +1, −1, 0; margin γ]
- If you set the penalty on the unlabeled slack variables to zero → ignore unlabeled data
- Intuition of the algorithm:
  - start with a small penalty
  - add labels to some unlabeled data based on the classifier’s predictions
  - slowly increase the penalty
  - keep labeling unlabeled data and re-running the classifier

Some results classifying news articles – from [Joachims ’99]

What you need to know about transductive SVMs
- What transductive vs. semi-supervised learning means
- The transductive SVM formulation can also be used for semi-supervised learning
- Optimization is hard! Integer program
- There are simple heuristic solution methods that work well here

Dimensionality reduction
Machine Learning – 10701/15781
Carlos Guestrin, Carnegie Mellon University
April 23rd, 2007

Dimensionality reduction
- Input data may have thousands or millions of dimensions!
  - e.g., text data has …
- Dimensionality reduction: represent data with fewer dimensions
  - easier learning – fewer parameters
  - visualization – hard to visualize more than 3D or 4D
  - discover the “intrinsic dimensionality” of the data
    - high-dimensional data that is truly lower dimensional

Feature selection
- Want to learn f: X → Y, where X = <X1, …, Xn>, but some
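The annealing heuristic of [Joachims ’99] sketched above can be written as a short loop. This is a simplified, hypothetical stand-in, not the real SVM-light implementation: the QP solver is replaced by pure-Python subgradient descent on a weighted hinge loss, the inputs are 2-D for illustration, and the function names are made up:

```python
def train_svm(data, weights, lam=0.01, epochs=200, lr=0.1):
    """Weighted linear classifier via subgradient descent on the hinge loss
    (a toy stand-in for the SVM QP solver; 2-D inputs only).
    data: list of (x, y) with x a 2-tuple and y in {-1, +1};
    weights: per-example penalty (C for labeled, C* for pseudo-labeled)."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x, y), c in zip(data, weights):
            margin = y * (w[0] * x[0] + w[1] * x[1] + b)
            if margin < 1:  # hinge-loss subgradient step
                w[0] += lr * (c * y * x[0] - lam * w[0])
                w[1] += lr * (c * y * x[1] - lam * w[1])
                b += lr * c * y
            else:           # only the regularizer pulls w toward zero
                w[0] -= lr * lam * w[0]
                w[1] -= lr * lam * w[1]
    return w, b

def tsvm_heuristic(labeled, unlabeled, c_star_max=1.0, steps=5):
    """Joachims-style annealing sketch: start with a tiny penalty C* on the
    unlabeled examples (C* = 0 would ignore them entirely), label them with
    the current classifier, and slowly raise C* while re-training."""
    pseudo = []
    c_star = c_star_max / 2 ** steps
    w, b = [0.0, 0.0], 0.0
    while c_star <= c_star_max:
        data = labeled + pseudo
        weights = [1.0] * len(labeled) + [c_star] * len(pseudo)
        w, b = train_svm(data, weights)
        # Re-label the unlabeled points with the current classifier.
        pseudo = [(x, 1 if w[0] * x[0] + w[1] * x[1] + b >= 0 else -1)
                  for x in unlabeled]
        c_star *= 2
    return w, b, [y for _, y in pseudo]

# Hypothetical toy data: the sign of the first coordinate gives the class.
labeled = [((1.0, 0.0), 1), ((-1.0, 0.0), -1)]
unlabeled = [(2.0, 0.5), (-2.0, -0.5)]
w, b, pseudo_labels = tsvm_heuristic(labeled, unlabeled)
```

The actual algorithm in [Joachims ’99] does more than this sketch: it fixes the fraction of unlabeled examples assigned to the positive class and improves the labeling by swapping pairs of pseudo-labels that reduce the objective, rather than simply re-thresholding.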