©2005-2007 Carlos Guestrin

Co-Training for Semi-supervised Learning (cont.)
Machine Learning – 10701/15781
Carlos Guestrin
Carnegie Mellon University
April 23rd, 2007

Exploiting redundant information in semi-supervised learning
- Want to predict Y from features X: f(X) → Y
  - have some labeled data L
  - lots of unlabeled data U
- Co-training assumption: X is very expressive
  - X = (X1, X2)
  - can learn g1(X1) → Y and g2(X2) → Y

Co-Training Algorithm [Blum & Mitchell '99]

Understanding Co-Training: A simple setting
- Suppose X1 and X2 are discrete
  - |X1| = |X2| = N
- No label noise
- Without unlabeled data, how hard is it to learn g1 (or g2)?

Co-Training in simple setting – Iteration 0

Co-Training in simple setting – Iteration 1

Co-Training in simple setting – after convergence

Co-Training in simple setting – Connected components
- Suppose infinite unlabeled data
- Co-training must have at least one labeled example in each connected component of the L+U graph
- What's the probability of making an error?
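The connected-components picture above can be made concrete in code. The sketch below (an illustration, not the lecture's own code) builds the L+U graph of the simple discrete setting, where each example (x1, x2) links its X1 value to its X2 value; after convergence, a single labeled example anywhere in a component determines the label of every example in that component.

```python
# Sketch of co-training in the simple discrete setting: one labeled
# example per connected component of the L+U graph suffices.
# Function and variable names here are illustrative assumptions.

def find(parent, a):
    # Union-find root lookup with path compression.
    while parent[a] != a:
        parent[a] = parent[parent[a]]
        a = parent[a]
    return a

def union(parent, a, b):
    parent.setdefault(a, a)
    parent.setdefault(b, b)
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[ra] = rb

def cotrain_components(labeled, unlabeled):
    """labeled: list of ((x1, x2), y); unlabeled: list of (x1, x2).
    Returns predicted labels for the unlabeled examples (None when the
    example's component contains no labeled example)."""
    parent = {}
    # Each example links its X1 value to its X2 value in the graph.
    for (x1, x2), _ in labeled:
        union(parent, ("v1", x1), ("v2", x2))
    for x1, x2 in unlabeled:
        union(parent, ("v1", x1), ("v2", x2))
    # One labeled example labels its whole component.
    component_label = {}
    for (x1, x2), y in labeled:
        component_label[find(parent, ("v1", x1))] = y
    return [component_label.get(find(parent, ("v1", x1)))
            for x1, x2 in unlabeled]
```

For example, with one labeled example (("a","p"), 1), the unlabeled examples ("a","q") and ("b","q") both end up in its component and get label 1, while ("c","r") sits in a separate, unlabeled component.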
- For k connected components, how much labeled data?

How much unlabeled data?

Co-Training theory
- Want to predict Y from features X: f(X) → Y
- Co-training assumption: X is very expressive
  - X = (X1, X2)
  - want to learn g1(X1) → Y and g2(X2) → Y
  - Assumption: ∃ g1, g2 such that ∀ x: g1(x1) = f(x), g2(x2) = f(x)
- One co-training result [Blum & Mitchell '99]:
  - If
    - (X1 ⊥ X2 | Y)
    - g1 & g2 are PAC learnable from noisy data (and thus f)
  - Then
    - f is PAC learnable from a weak initial classifier plus unlabeled data

What you need to know about co-training
- Unlabeled data can help supervised learning (a lot) when there are (mostly) independent redundant features
- One theoretical result:
  - If (X1 ⊥ X2 | Y) and g1 & g2 are PAC learnable from noisy data (and thus f)
  - Then f is PAC learnable from a weak initial classifier plus unlabeled data
  - Disagreement between g1 and g2 provides a bound on the error of the final classifier
- Applied in many real-world settings:
  - Semantic lexicon generation [Riloff, Jones 99] [Collins, Singer 99] [Jones 05]
  - Web page classification [Blum, Mitchell 99]
  - Word sense disambiguation [Yarowsky 95]
  - Speech recognition [de Sa, Ballard 98]
  - Visual classification of cars [Levin, Viola, Freund 03]

Transductive SVMs
Machine Learning – 10701/15781
Carlos Guestrin
Carnegie Mellon University
April 23rd, 2007

Semi-supervised learning and discriminative models
- We have seen semi-supervised learning for generative models
  - EM
- What can we do for discriminative models?
  - Not regular EM
    - we can't compute P(x)
  - But there are discriminative versions of EM
    - Co-Training!
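The co-training loop itself, as described by Blum & Mitchell '99, can be sketched roughly as follows. This is a minimal illustration: `fit` and `predict_conf` stand in for any supervised learner returning (label, confidence), and the confidence threshold and round count are assumed parameters, not part of the original algorithm statement.

```python
# Minimal co-training loop sketch (after Blum & Mitchell '99).
# Two classifiers, one per view, alternately label confident
# unlabeled examples for each other. Names are illustrative.

def cotrain(labeled, unlabeled, fit, predict_conf, rounds=10, threshold=0.9):
    """labeled: list of ((x1, x2), y); unlabeled: list of (x1, x2)."""
    labeled = list(labeled)
    unlabeled = list(unlabeled)
    for _ in range(rounds):
        # Train one classifier per view on the current labeled pool.
        g1 = fit([(x1, y) for (x1, _), y in labeled])
        g2 = fit([(x2, y) for (_, x2), y in labeled])
        newly_labeled = []
        for x in unlabeled:
            x1, x2 = x
            y1, c1 = predict_conf(g1, x1)
            y2, c2 = predict_conf(g2, x2)
            # Confident predictions from either view become new labels.
            if max(c1, c2) >= threshold:
                newly_labeled.append((x, y1 if c1 >= c2 else y2))
        if not newly_labeled:
            break  # no classifier is confident on anything new
        labeled += newly_labeled
        for x, _ in newly_labeled:
            unlabeled.remove(x)
    return labeled
```

The key property the sketch preserves is that each view's classifier labels examples the *other* view can then learn from, which is how unlabeled data spreads label information across views.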
    - Many other tricks… let's see an example

Linear classifiers – Which line is better?
- Data: examples i
- w.x = ∑_j w(j) x(j)

Support vector machines (SVMs)
[figure: decision boundary w.x + b = 0 between hyperplanes w.x + b = +1 and w.x + b = -1; margin γ]
- Solve efficiently by quadratic programming (QP)
  - Well-studied solution algorithms
- Hyperplane defined by support vectors

What if we have unlabeled data?
- nL labeled examples, nU unlabeled examples
- w.x = ∑_j w(j) x(j)

Transductive support vector machines (TSVMs)
[figure: the same hyperplanes and margin γ, now with unlabeled points]

What's the difference between transductive learning and semi-supervised learning?
- Not much, and a lot!!!
- Semi-supervised learning:
  - labeled and unlabeled data → learn w
  - use w on test data
- Transductive learning:
  - same algorithms for labeled and unlabeled data, but…
  - unlabeled data is test data!!!
  - You are learning on the test data!!!
    - OK, because you never look at the labels of the test data
    - can get better classification
    - but be very, very careful!!!
    - never use test-data prediction accuracy to tune parameters, select kernels, etc.

Adding slack variables
[figure: margin γ with slack for points violating it]

Transductive SVMs – now with slack variables! [Vapnik 98]

Learning Transductive SVMs is hard!
- Integer program – NP-hard!!!
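Why an integer program? Each unlabeled point must be assigned a binary label jointly with learning w, so a brute-force solver has to search all 2^nU labelings. The 1-D toy below (an illustration, with assumed names; a real TSVM solves an SVM QP per labeling rather than a threshold rule) makes that combinatorial blowup visible.

```python
from itertools import product

# Brute-force view of the TSVM integer program in 1-D: enumerate every
# binary labeling of the unlabeled points and keep the one allowing the
# widest-margin threshold classifier. Cost: 2^(number of unlabeled points).

def best_threshold_margin(points):
    """points: list of (x, y) with y in {-1, +1} and x a real number.
    Returns the widest gap a threshold classifier (negatives below,
    positives above) can achieve, or None if inseparable."""
    neg = [x for x, y in points if y == -1]
    pos = [x for x, y in points if y == +1]
    if not neg or not pos or max(neg) >= min(pos):
        return None
    return min(pos) - max(neg)

def brute_force_tsvm_1d(labeled, unlabeled):
    """labeled: list of (x, y); unlabeled: list of x.
    Returns (best labeling of unlabeled points, its margin)."""
    best = (None, -1.0)
    for ys in product([-1, +1], repeat=len(unlabeled)):
        pts = labeled + list(zip(unlabeled, ys))
        m = best_threshold_margin(pts)
        if m is not None and m > best[1]:
            best = (ys, m)
    return best
```

With labeled points at 0 (negative) and 4 (positive) and unlabeled points at 1 and 3, the search prefers labeling them -1 and +1 respectively: that choice routes the decision boundary through the gap between 1 and 3, giving the widest margin.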
- Well-studied solution algorithms exist, but they will not scale up to very large problems

A (heuristic) learning algorithm for Transductive SVMs [Joachims 99]
- If you set the penalty on unlabeled-example slack to zero → ignore unlabeled data
- Intuition of algorithm:
  - start with a small penalty
  - add labels to some unlabeled data based on classifier prediction
  - slowly increase the penalty
  - keep on labeling unlabeled data and re-running classifier

Some results classifying news articles – from [Joachims 99]

What you need to know about transductive SVMs
- What is transductive vs. semi-supervised learning
- Formulation for transductive SVM
  - can also be used for semi-supervised learning
- Optimization is hard!
  - Integer program
- There are simple heuristic solution methods that work well here

Dimensionality reduction
Machine Learning – 10701/15781
Carlos Guestrin
Carnegie Mellon University
April 23rd, 2007

Dimensionality reduction
- Input data may have thousands or millions of dimensions!
  - e.g., text data
- Dimensionality reduction: represent data with fewer dimensions
  - easier learning – fewer parameters
  - visualization – hard to visualize more than 3D or 4D
  - discover "intrinsic dimensionality" of data
    - high-dimensional data that is truly lower dimensional

Feature selection
- Want to learn f: X → Y
  - X = <X1, …, Xn>
  - but some features are more important than others
- Approach: select subset of features to be used by learning algorithm
  - Score each feature (or sets of features)
  - Select set of features with best score

Simple greedy forward feature selection
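The score-then-select approach above can be sketched as a greedy forward loop: repeatedly add the single feature that most improves the score, and stop when nothing helps. This is a minimal illustration; `score` stands in for any subset evaluation (e.g. cross-validated accuracy), and the names are assumptions.

```python
# Greedy forward feature selection sketch: grow the selected set one
# feature at a time, always taking the best-scoring addition.

def greedy_forward_selection(features, score):
    """features: list of feature names; score: callable on a subset.
    Returns the greedily selected subset, in order of selection."""
    selected = []
    best_score = score(selected)
    while True:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            break
        # Evaluate each remaining feature added to the current set.
        scored = [(score(selected + [f]), f) for f in candidates]
        new_score, best_f = max(scored)
        if new_score <= best_score:
            break  # no remaining feature improves the score
        selected.append(best_f)
        best_score = new_score
    return selected
```

Note the greedy stopping rule: the loop halts at the first round where no single addition improves the score, so it evaluates O(n^2) subsets rather than all 2^n, at the cost of possibly missing feature combinations that only help jointly.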