UMD CMSC 828 - Supervised Learning

Contents: Announcements, Supervised Learning, Slide 3, Linear Discriminants, Why linear discriminants?, Linearly Separable Classes, Slide 7, Perceptron Algorithm, Perceptron Intuition, Support Vector Machines, Maximum Margin, Geometric Intuitions, Slide 13, Non-statistical learning, On-Line Learning, VC Dimension, VC Dimension and worst case learning, Winnow, Winnow Algorithm, Some intuitions, Theorem, Winnow Summary

Announcements
• See Chapter 5 of Duda, Hart, and Stork.
• Tutorial by Burges linked to on the web page.
• "Learning quickly when irrelevant attributes abound," by Littlestone, Machine Learning 2:285-318, 1988.

Supervised Learning
• Classification with labeled examples.
• Images are vectors in a high-dimensional space.

Supervised Learning
• Labeled examples are called the training set.
• Query examples are called the test set.
• The training and test sets must come from the same distribution.

Linear Discriminants
• Images are represented as vectors x1, x2, ....
• Use these to find a hyperplane, defined by a vector w and offset w0: x is on the hyperplane when wTx + w0 = 0.
• Notation: aT = [w0, w1, ...] and yT = [1, x1, x2, ...], so the hyperplane is aTy = 0.
• A query q is classified based on whether aTq > 0 or aTq < 0.

Why linear discriminants?
• Optimal if the classes are Gaussian with the same covariances.
• Linear separators are easier to find.
• Hyperplanes have few parameters, which prevents overfitting.
  – They have lower VC dimension, but we don't have time for this.

Linearly Separable Classes
• For one set of classes, aTy > 0; for the other set, aTy < 0.
• It is notationally convenient, for the second class, to replace y with -y.
• Then finding a linear separator means finding a such that, for all i, aTyi > 0.
• Note that this is a linear program.
  – The problem is convex, so descent algorithms converge to the global optimum.

Perceptron Algorithm
The Perceptron error function, where Y is the set of misclassified vectors, is
  J_P(a) = Σ_{y in Y} (-aTy),   with gradient   ∇J_P = Σ_{y in Y} (-y).
So update a by
  a(k+1) = a(k) + c Σ_{y in Y} y.
Simplify by cycling through the y's and, whenever one is misclassified, updating a ← a + cy. This converges after a finite number of steps (a code sketch follows the next slide).

Perceptron Intuition
[Figure: the update step moves the weight vector from a(k) to a(k+1).]
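The cycling update above is easy to state in code. Below is a minimal sketch, not from the slides: the function name perceptron_train, the learning rate c, and the max_epochs cap are illustrative assumptions. It augments each sample with a leading 1 and negates the second class, so a correct classification is simply aTy > 0, then applies a ← a + cy on each mistake.

```python
import numpy as np

def perceptron_train(X, labels, c=1.0, max_epochs=1000):
    """Cycling perceptron: whenever a sample is misclassified, a <- a + c*y.

    X      : (m, d) array of feature vectors.
    labels : (m,) array with entries +1 or -1.
    Returns the augmented weight vector a = [w0, w1, ..., wd].
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels, dtype=float)

    # Augment each sample with a leading 1 and negate the second class,
    # so the goal becomes a^T y > 0 for every y.
    Y = np.hstack([np.ones((X.shape[0], 1)), X]) * labels[:, None]

    a = np.zeros(Y.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for y in Y:                 # cycle through the samples
            if a @ y <= 0:          # misclassified (or on the boundary)
                a = a + c * y       # move a toward the misclassified y
                mistakes += 1
        if mistakes == 0:           # all a^T y > 0: a is a linear separator
            return a
    return a                        # may not have converged if not separable
```

For linearly separable data the loop terminates, matching the finite-step convergence claim above; max_epochs only guards against non-separable input.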
Support Vector Machines
• Extremely popular.
• Find the linear separator with maximum margin.
  – There are some guarantees that this generalizes well.
• Can work in high-dimensional spaces without overfitting.
  – Nonlinear map to a high-dimensional space, then find a linear separator there.
  – Special tricks allow efficiency in ridiculously high-dimensional spaces.
• Can also handle non-separable classes.
  – This is less important if the space is very high-dimensional.

Maximum Margin
Maximize the minimum distance from the hyperplane to the points. Points at this minimum distance are the support vectors.

Geometric Intuitions
• Maximum margin between points -> maximum margin between convex sets.
Any point on the segment between p1 and p2 is p = k p1 + (1-k) p2 with 0 ≤ k ≤ 1, so
  aTp = k aTp1 + (1-k) aTp2,
and therefore
  aTp1 ≤ aTp ≤ aTp2  or  aTp2 ≤ aTp ≤ aTp1.
So the minimum distance from the hyperplane to a convex hull is attained at one of the original points. This implies the max margin hyperplane is orthogonal to the vector connecting the nearest points of the convex sets, and lies halfway between them.

Non-statistical learning
• There is a class of functions that could label the data.
• Our goal is to select the correct function with as little information as possible.
• Don't think of the data as coming from a class described by probability distributions.
• Look at worst-case performance.
  – This is the CS-style approach.
  – In a statistical model, the worst case is not meaningful.

On-Line Learning
• Let X be a set of objects (e.g., vectors in a high-dimensional space).
• Let C be a class of possible classifying functions (e.g., hyperplanes).
  – Each f in C maps X -> {0, 1}.
  – One of these correctly classifies all the data.
• The learner is asked to classify an item in X, then told the correct class.
• Eventually, the learner determines the correct f.
  – Measure the number of mistakes made.
  – We want a worst-case bound for the learning strategy.

VC Dimension
• S, a subset of X, is shattered by C if, for any subset U of S, there exists f in C such that f is 1 on U and 0 on S-U.
• The VC dimension of C is the size of the largest set shattered by C.

VC Dimension and worst case learning
• Any learning strategy can be forced to make VCdim(C) mistakes in the worst case.
  – If S is shattered by C,
  – then for any assignment of values to S there is an f in C that makes this assignment,
  – so any set of choices the learner makes for S can be entirely wrong.

Winnow
• The elements of x have binary values.
• Find weights w. Classify x by whether wTx > 1 or wTx < 1.
• The algorithm requires that there exist weights u such that:
  – uTx > 1 when f(x) = 1,
  – uTx < 1 - δ when f(x) = 0.
  – That is, there is a margin of δ.

Winnow Algorithm
• Initialize w = (1, 1, ..., 1).
• Let α > 1 be the update factor.
• Decision: predict 1 if wTx > θ.
• If the learner is correct, the weights don't change. If wrong (see the code sketch after the Winnow Summary below):

Learner's Prediction | Correct Response | Update Action                                | Update Name
1                    | 0                | wi := wi/α if xi = 1; wi unchanged if xi = 0 | Demotion step
0                    | 1                | wi := wi·α if xi = 1; wi unchanged if xi = 0 | Promotion step

Some intuitions
• Note that this is just like the Perceptron, but with a multiplicative change, not an additive one.
• It moves the weights in the right direction:
  – if wTx was too big, shrink the weights that affect the inner product;
  – if wTx was too small, make those weights bigger.
• Weights change more rapidly (exponentially with the number of mistakes, not linearly).

Theorem
Set θ = n. The number of errors is bounded by
  (8/δ²) Σ_{i=1..n} ui² + (5/δ) ln n + 14.
Note: if xi is an irrelevant feature, then ui = 0, so errors grow only logarithmically with the number of irrelevant features. Empirically, Perceptron errors grow linearly.
This is optimal for k-monotone disjunctions: f(x1, ..., xn) = xi1 ∨ xi2 ∨ ... ∨ xik.

Winnow Summary
• Simple algorithm; a variation on the Perceptron.
• Great with irrelevant attributes.
• Optimal for monotone disjunctions.
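As a companion to the table in the Winnow Algorithm slide, here is a minimal sketch of the online loop. It is not from the slides: the function name, the stream interface, and the default α = 2 are illustrative assumptions; θ = n follows the theorem above.

```python
import numpy as np

def winnow(stream, n, alpha=2.0, theta=None):
    """Online Winnow over binary feature vectors x in {0,1}^n.

    stream yields (x, label) pairs with label in {0, 1}.
    Predict 1 iff w^T x > theta; on a mistake, multiplicatively promote or
    demote exactly the weights whose feature satisfies x_i = 1.
    Returns the final weight vector and the number of mistakes made.
    """
    theta = float(n) if theta is None else theta   # theorem above sets theta = n
    w = np.ones(n)                                 # initialize w = (1, 1, ..., 1)
    mistakes = 0
    for x, label in stream:
        x = np.asarray(x)
        pred = 1 if w @ x > theta else 0
        if pred == label:
            continue                               # correct: weights unchanged
        mistakes += 1
        active = x == 1                            # only features present in x change
        if pred == 1 and label == 0:
            w[active] /= alpha                     # demotion step
        else:
            w[active] *= alpha                     # promotion step
    return w, mistakes
```

The multiplicative, per-feature update is exactly the contrast with the Perceptron's additive step that the "Some intuitions" slide points out.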

