Rutgers University CS 536 - Introduction to Machine Learning


Introduction to Machine Learning
Ethem Alpaydın, © The MIT Press, 2004
Edited for CS 536, Fall 2005, Rutgers University, by Ahmed Elgammal.
This is a draft version; the final version will be posted at [email protected]
http://www.cmpe.boun.edu.tr/~ethem/i2ml

Lecture Slides for Chapter 10: Linear Discrimination, Kernel Methods, Support Vector Machines
(Lecture notes for E. Alpaydın, 2004, Introduction to Machine Learning, © The MIT Press, V1.1)

Recall: Discriminant Functions
- Discriminant functions g_i(x), i = 1, ..., K
- Choose C_i if g_i(x) = max_k g_k(x)
- K decision regions R_1, ..., R_K, with R_i = { x | g_i(x) = max_k g_k(x) }
- g_i(x) can be taken as -R(α_i | x), P(C_i | x), or p(x | C_i) P(C_i)

Likelihood- vs. Discriminant-based Classification
- Likelihood-based: assume a model for p(x | C_i) and use Bayes' rule to calculate P(C_i | x); g_i(x) = log P(C_i | x)
- Discriminant-based: assume a model for g_i(x | Φ_i); no assumption about the densities, no density estimation
- Discriminant-based classification is non-parametric with respect to the class densities
- Estimating the boundaries is enough; there is no need to accurately estimate the densities inside the boundaries
- Learning: optimize the parameters Φ_i to maximize classification accuracy on labeled training data (or minimize an error function)
- Inductive bias?

Linear Discriminant
- Linear discriminant: g_i(x | w_i, w_{i0}) = w_i^T x + w_{i0} = \sum_{j=1}^{d} w_{ij} x_j + w_{i0}
- Advantages:
  - Simple: O(d) space and computation
  - Knowledge extraction: a weighted sum of attributes; positive/negative weights and their magnitudes are interpretable (e.g., credit scoring)
  - Optimal when the p(x | C_i) are Gaussian with a shared covariance matrix; useful when classes are (almost) linearly separable

Generalized Linear Model
- Quadratic discriminant: g_i(x | W_i, w_i, w_{i0}) = x^T W_i x + w_i^T x + w_{i0}
- Higher-order (product) terms, e.g. z_1 = x_1, z_2 = x_2, z_3 = x_1^2, z_4 = x_2^2, z_5 = x_1 x_2
- Map from x to z using nonlinear basis functions and use a linear discriminant in z-space: g(x) = \sum_{j=1}^{k} w_j φ_j(x)

Extension to Non-linear
- Key idea: transform the x_i to a higher-dimensional space to "make life easier"
  - Input space: the space containing the x_i
  - Feature space: the space of the φ(x_i) after the transformation
- Why transform?
  - A linear operation in the feature space is equivalent to a non-linear operation in the input space
  - The classification task can be "easier" with a proper transformation; example: XOR (see the sketch after these slides)
- Kernel trick for efficient computation
[Figure: the nonlinear map φ(·) sends points from the input space to the feature space]

Kernel Trick
- The relationship between the kernel function K and the mapping φ(·) is K(x, y) = φ(x)^T φ(y); this is known as the kernel trick
- In practice we specify K, thereby specifying φ(·) indirectly, instead of choosing φ(·) directly
- Intuitively, K(x, y) represents our desired notion of similarity between the data points x and y, and this comes from our prior knowledge
- K(x, y) needs to satisfy a technical condition (the Mercer condition) in order for φ(·) to exist
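The "Generalized Linear Model" and "Extension to Non-linear" slides above argue that a linear discriminant in a nonlinearly mapped z-space implements a non-linear decision rule in the original x-space, with XOR as the standard example. The short sketch below is not part of the slides: it uses the basis expansion z = (x_1, x_2, x_1^2, x_2^2, x_1 x_2) from the slide together with a hand-picked weight vector, both illustrative choices, and assumes NumPy is available.

```python
import numpy as np

# XOR in the original 2-D input space is not linearly separable.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
labels = np.array([-1, +1, +1, -1])   # XOR: +1 iff exactly one input is 1

def to_z(x):
    """Basis expansion from the slides: z = (x1, x2, x1^2, x2^2, x1*x2)."""
    x1, x2 = x
    return np.array([x1, x2, x1**2, x2**2, x1 * x2])

# A hand-picked linear discriminant in z-space (illustrative, not fitted):
# g(x) = w^T z + w0 = x1 + x2 - 2*x1*x2 - 0.5
w = np.array([1.0, 1.0, 0.0, 0.0, -2.0])
w0 = -0.5

predictions = np.sign(np.array([w @ to_z(x) + w0 for x in X]))
print(predictions)                     # [-1.  1.  1. -1], matching the XOR labels
print(np.all(predictions == labels))   # True
```

The same decision rule is quadratic in x (it contains the product x_1 x_2), which is exactly the point of the slide: linear in z-space, non-linear in the input space.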
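The kernel-trick slide states the relationship K(x, y) = φ(x)^T φ(y); the continuation of that slide below walks through it symbolically. As a minimal numerical check of my own (not from the slides, assuming NumPy), the degree-2 polynomial kernel K(x, y) = (x^T y)^2 on 2-D inputs and the explicit map φ(x) = (x_1^2, √2·x_1 x_2, x_2^2) give the same inner product, yet the kernel never forms φ explicitly.

```python
import numpy as np

# Illustrative degree-2 polynomial kernel on 2-D inputs:
# K(x, y) = (x . y)^2, with explicit feature map
# phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), so that K(x, y) = phi(x) . phi(y).

def phi(x):
    """Explicit feature map into 3-D feature space."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def poly_kernel(x, y):
    """Kernel evaluated directly in input space: no explicit mapping needed."""
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

print(np.dot(phi(x), phi(y)))   # inner product computed in feature space
print(poly_kernel(x, y))        # same value, computed in input space
```

The saving is modest here (the feature space is only 3-dimensional), but for higher-degree polynomials or the RBF kernel the feature space becomes huge or infinite-dimensional, and evaluating K directly is the only practical option.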
Kernel Trick (continued)
- Define the kernel function K(x, y) and the corresponding transformation φ(·) (the concrete formulas on this slide were not captured in the text preview)
- The inner product can be computed by K without going through the map φ(·)

Example Kernel Functions
- Polynomial kernel with degree d
- Radial basis function (RBF) kernel with width σ; closely related to radial basis function neural networks
- Sigmoid kernel with parameters κ and θ; it does not satisfy the Mercer condition for all κ and θ
- Research on different kernel functions for different applications is very active
(The formulas themselves were not captured in the preview; conventional forms are sketched in the code example after the slides.)

Two Classes
- For two classes a single discriminant suffices:
  g(x) = g_1(x) - g_2(x)
       = (w_1^T x + w_{10}) - (w_2^T x + w_{20})
       = (w_1 - w_2)^T x + (w_{10} - w_{20})
       = w^T x + w_0
- Choose C_1 if g(x) > 0, C_2 otherwise

Geometry
[Figure: geometry of the linear discriminant; not reproduced in the preview]

Multiple Classes
- g_i(x | w_i, w_{i0}) = w_i^T x + w_{i0}, one discriminant per class; classes are linearly separable
- Choose C_i if g_i(x) = max_{j=1,...,K} g_j(x)

Pairwise Separation
- g_{ij}(x | w_{ij}, w_{ij0}) = w_{ij}^T x + w_{ij0}, one discriminant per pair of classes
- g_{ij}(x) > 0 if x ∈ C_i; g_{ij}(x) ≤ 0 if x ∈ C_j; don't care otherwise
- Choose C_i if g_{ij}(x) > 0 for all j ≠ i

Linearly Separable Case
[Figure: two point clouds, Class 1 and Class 2, with several candidate separating lines]
- Many decision boundaries can separate these two classes
- Which one should we choose?

Bad Decision Boundaries
[Figure: two examples of poorly chosen boundaries between Class 1 and Class 2]

Margin Should Be Large
- The decision boundary should be as far away from the data as possible
- We should maximize the margin m
[Figure: Class 1 and Class 2 separated by a boundary with margin m]

Optimal Separating Hyperplane (Cortes and Vapnik, 1995; Vapnik, 1995)
- Training set X = { x^t, r^t }, where r^t = +1 if x^t ∈ C_1 and r^t = -1 if x^t ∈ C_2
- Find w and w_0 such that
  w^T x^t + w_0 ≥ +1 for r^t = +1
  w^T x^t + w_0 ≤ -1 for r^t = -1
- which can be rewritten as r^t (w^T x^t + w_0) ≥ +1

Margin
- The margin is the distance from the discriminant to the closest instances on either side
- The distance of x^t to the hyperplane is |w^T x^t + w_0| / ||w||
- We require r^t (w^T x^t + w_0) / ||w|| ≥ ρ for all t
- For a unique solution, fix ρ||w|| = 1; then, to maximize the margin:
  min (1/2) ||w||^2  subject to  r^t (w^T x^t + w_0) ≥ +1, ∀t
- This is a standard quadratic optimization problem, whose complexity depends on d
- Introducing Lagrange multipliers α^t ≥ 0 turns it into the unconstrained problem
  L_p = (1/2) ||w||^2 - \sum_t α^t [ r^t (w^T x^t + w_0) - 1 ]
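The "Example Kernel Functions" slide above names three kernels, but their formulas did not survive the text extraction. As a hedged reference, the sketch below (my own illustration, assuming NumPy) implements the conventional textbook forms with the same parameter names: degree d, width σ, and parameters κ and θ; the exact constants in the original slides may differ.

```python
import numpy as np

def polynomial_kernel(x, y, d=2):
    """Conventional polynomial kernel of degree d: K(x, y) = (x^T y + 1)^d."""
    return (np.dot(x, y) + 1.0) ** d

def rbf_kernel(x, y, sigma=1.0):
    """Radial basis function kernel with width sigma: exp(-||x - y||^2 / (2*sigma^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=1.0, theta=0.0):
    """Sigmoid kernel tanh(kappa * x^T y + theta); a Mercer kernel only for some kappa, theta."""
    return np.tanh(kappa * np.dot(x, y) + theta)

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])
print(polynomial_kernel(x, y, d=3), rbf_kernel(x, y, sigma=2.0), sigmoid_kernel(x, y))
```

As the slide notes, the polynomial and RBF kernels satisfy the Mercer condition for any valid parameter choice, whereas the sigmoid kernel is a proper kernel only for some κ and θ.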
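The last slides set up the hard-margin optimization min (1/2)||w||^2 subject to r^t (w^T x^t + w_0) ≥ +1. As a concrete illustration that is not part of the slides, the sketch below fits a linear SVM with scikit-learn (an assumed dependency) on a tiny made-up separable data set; a very large C approximates the hard-margin problem, and the margin is recovered as 2/||w||.

```python
import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable 2-D data set (labels r^t in {+1, -1}), made up for illustration.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [0.0, 0.0], [1.0, 0.5], [0.5, -1.0]])
r = np.array([+1, +1, +1, -1, -1, -1])

# Hard-margin behaviour approximated with a very large C.
svm = SVC(kernel="linear", C=1e6)
svm.fit(X, r)

w = svm.coef_[0]                    # weight vector w
w0 = svm.intercept_[0]              # bias w_0
margin = 2.0 / np.linalg.norm(w)    # distance between the hyperplanes w^T x + w_0 = +1 and -1

print("w =", w, "w0 =", w0)
print("margin =", margin)
print("support vectors:\n", svm.support_vectors_)
```

The support vectors reported are the training points whose constraints are (approximately) active, i.e. r^t (w^T x^t + w_0) = 1; only they determine w and w_0.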

