Rutgers University CS 536 - Introduction to Machine Learning


Introduction to Machine Learning
Ethem Alpaydın, © The MIT Press, 2004
Edited for CS 536, Fall 2005, Rutgers University, by Ahmed Elgammal.
This is a draft version; the final version will be posted at [email protected]
http://www.cmpe.boun.edu.tr/~ethem/i2ml

Lecture Slides for Chapter 10: Linear Discrimination, Kernel Methods, Support Vector Machines
(Lecture notes for E. Alpaydın, 2004, Introduction to Machine Learning, © The MIT Press, V1.1)

Recall: Discriminant Functions
- Discriminant functions g_i(x), i = 1, ..., K
- Choose C_i if g_i(x) = max_k g_k(x)
- K decision regions R_1, ..., R_K, with R_i = { x | g_i(x) = max_k g_k(x) }
- g_i(x) can be taken as -R(α_i | x), P(C_i | x), or p(x | C_i) P(C_i)

Likelihood- vs. Discriminant-based Classification
- Likelihood-based: assume a model for p(x | C_i) and use Bayes' rule to calculate P(C_i | x); g_i(x) = log P(C_i | x)
- Discriminant-based: assume a model for g_i(x | Φ_i); no assumption about the densities, no density estimation
- Discriminant-based classification is non-parametric with respect to the class densities
- Estimating the boundaries is enough; there is no need to accurately estimate the densities inside the boundaries
- Learning: optimize the parameters Φ_i to maximize classification accuracy on labeled training data (or minimize an error function)
- Inductive bias?

Linear Discriminant
- Linear discriminant: g_i(x | w_i, w_{i0}) = w_i^T x + w_{i0} = \sum_{j=1}^{d} w_{ij} x_j + w_{i0}
- Advantages:
  - Simple: O(d) space and computation
  - Knowledge extraction: a weighted sum of attributes; positive/negative weights and their magnitudes are interpretable (e.g., credit scoring)
  - Optimal when the p(x | C_i) are Gaussian with a shared covariance matrix; useful when classes are (almost) linearly separable

Generalized Linear Model
- Quadratic discriminant: g_i(x | W_i, w_i, w_{i0}) = x^T W_i x + w_i^T x + w_{i0}
- Higher-order (product) terms, e.g. z_1 = x_1, z_2 = x_2, z_3 = x_1^2, z_4 = x_2^2, z_5 = x_1 x_2
- Map from x to z using nonlinear basis functions and use a linear discriminant in z-space: g(x) = \sum_{j=1}^{k} w_j φ_j(x)

Extension to Non-linear
- Key idea: transform the x_i to a higher-dimensional space to "make life easier"
  - Input space: the space containing the x_i
  - Feature space: the space of the φ(x_i) after the transformation
- Why transform?
  - A linear operation in the feature space is equivalent to a non-linear operation in the input space
  - The classification task can be "easier" with a proper transformation; example: XOR (see the sketch after these slides)
- Kernel trick for efficient computation
[Figure: the nonlinear map φ(·) sends points from the input space to the feature space]

Kernel Trick
- The relationship between the kernel function K and the mapping φ(·) is K(x, y) = φ(x)^T φ(y); this is known as the kernel trick
- In practice we specify K, thereby specifying φ(·) indirectly, instead of choosing φ(·) directly
- Intuitively, K(x, y) represents our desired notion of similarity between the data points x and y, and this comes from our prior knowledge
- K(x, y) needs to satisfy a technical condition (the Mercer condition) in order for φ(·) to exist
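The "Generalized Linear Model" and "Extension to Non-linear" slides above argue that a linear discriminant in a nonlinearly mapped z-space implements a non-linear decision rule in the original x-space, with XOR as the standard example. The short sketch below is not part of the slides: it uses the basis expansion z = (x_1, x_2, x_1^2, x_2^2, x_1 x_2) from the slide together with a hand-picked weight vector, both illustrative choices, and assumes NumPy is available.

```python
import numpy as np

# XOR in the original 2-D input space is not linearly separable.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
labels = np.array([-1, +1, +1, -1])   # XOR: +1 iff exactly one input is 1

def to_z(x):
    """Basis expansion from the slides: z = (x1, x2, x1^2, x2^2, x1*x2)."""
    x1, x2 = x
    return np.array([x1, x2, x1**2, x2**2, x1 * x2])

# A hand-picked linear discriminant in z-space (illustrative, not fitted):
# g(x) = w^T z + w0 = x1 + x2 - 2*x1*x2 - 0.5
w = np.array([1.0, 1.0, 0.0, 0.0, -2.0])
w0 = -0.5

predictions = np.sign(np.array([w @ to_z(x) + w0 for x in X]))
print(predictions)                     # [-1.  1.  1. -1], matching the XOR labels
print(np.all(predictions == labels))   # True
```

The same decision rule is quadratic in x (it contains the product x_1 x_2), which is exactly the point of the slide: linear in z-space, non-linear in the input space.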
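The kernel-trick slide states the relationship K(x, y) = φ(x)^T φ(y); the continuation of that slide below walks through it symbolically. As a minimal numerical check of my own (not from the slides, assuming NumPy), the degree-2 polynomial kernel K(x, y) = (x^T y)^2 on 2-D inputs and the explicit map φ(x) = (x_1^2, √2·x_1 x_2, x_2^2) give the same inner product, yet the kernel never forms φ explicitly.

```python
import numpy as np

# Illustrative degree-2 polynomial kernel on 2-D inputs:
# K(x, y) = (x . y)^2, with explicit feature map
# phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), so that K(x, y) = phi(x) . phi(y).

def phi(x):
    """Explicit feature map into 3-D feature space."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def poly_kernel(x, y):
    """Kernel evaluated directly in input space: no explicit mapping needed."""
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

print(np.dot(phi(x), phi(y)))   # inner product computed in feature space
print(poly_kernel(x, y))        # same value, computed in input space
```

The saving is modest here (the feature space is only 3-dimensional), but for higher-degree polynomials or the RBF kernel the feature space becomes huge or infinite-dimensional, and evaluating K directly is the only practical option.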
Kernel Trick (continued)
- Define the kernel function K(x, y) and the corresponding transformation φ(·) (the concrete formulas on this slide were not captured in the text preview)
- The inner product can be computed by K without going through the map φ(·)

Example Kernel Functions
- Polynomial kernel with degree d
- Radial basis function (RBF) kernel with width σ; closely related to radial basis function neural networks
- Sigmoid kernel with parameters κ and θ; it does not satisfy the Mercer condition for all κ and θ
- Research on different kernel functions for different applications is very active
(The formulas themselves were not captured in the preview; conventional forms are sketched in the code example after the slides.)

Two Classes
- For two classes a single discriminant suffices:
  g(x) = g_1(x) - g_2(x)
       = (w_1^T x + w_{10}) - (w_2^T x + w_{20})
       = (w_1 - w_2)^T x + (w_{10} - w_{20})
       = w^T x + w_0
- Choose C_1 if g(x) > 0, C_2 otherwise

Geometry
[Figure: geometry of the linear discriminant; not reproduced in the preview]

Multiple Classes
- g_i(x | w_i, w_{i0}) = w_i^T x + w_{i0}, one discriminant per class; classes are linearly separable
- Choose C_i if g_i(x) = max_{j=1,...,K} g_j(x)

Pairwise Separation
- g_{ij}(x | w_{ij}, w_{ij0}) = w_{ij}^T x + w_{ij0}, one discriminant per pair of classes
- g_{ij}(x) > 0 if x ∈ C_i; g_{ij}(x) ≤ 0 if x ∈ C_j; don't care otherwise
- Choose C_i if g_{ij}(x) > 0 for all j ≠ i

Linearly Separable Case
[Figure: two point clouds, Class 1 and Class 2, with several candidate separating lines]
- Many decision boundaries can separate these two classes
- Which one should we choose?

Bad Decision Boundaries
[Figure: two examples of poorly chosen boundaries between Class 1 and Class 2]

Margin Should Be Large
- The decision boundary should be as far away from the data as possible
- We should maximize the margin m
[Figure: Class 1 and Class 2 separated by a boundary with margin m]

Optimal Separating Hyperplane (Cortes and Vapnik, 1995; Vapnik, 1995)
- Training set X = { x^t, r^t }, where r^t = +1 if x^t ∈ C_1 and r^t = -1 if x^t ∈ C_2
- Find w and w_0 such that
  w^T x^t + w_0 ≥ +1 for r^t = +1
  w^T x^t + w_0 ≤ -1 for r^t = -1
- which can be rewritten as r^t (w^T x^t + w_0) ≥ +1

Margin
- The margin is the distance from the discriminant to the closest instances on either side
- The distance of x^t to the hyperplane is |w^T x^t + w_0| / ||w||
- We require r^t (w^T x^t + w_0) / ||w|| ≥ ρ for all t
- For a unique solution, fix ρ||w|| = 1; then, to maximize the margin:
  min (1/2) ||w||^2  subject to  r^t (w^T x^t + w_0) ≥ +1, ∀t
- This is a standard quadratic optimization problem, whose complexity depends on d
- Introducing Lagrange multipliers α^t ≥ 0 turns it into the unconstrained problem
  L_p = (1/2) ||w||^2 - \sum_t α^t [ r^t (w^T x^t + w_0) - 1 ]
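The "Example Kernel Functions" slide above names three kernels, but their formulas did not survive the text extraction. As a hedged reference, the sketch below (my own illustration, assuming NumPy) implements the conventional textbook forms with the same parameter names: degree d, width σ, and parameters κ and θ; the exact constants in the original slides may differ.

```python
import numpy as np

def polynomial_kernel(x, y, d=2):
    """Conventional polynomial kernel of degree d: K(x, y) = (x^T y + 1)^d."""
    return (np.dot(x, y) + 1.0) ** d

def rbf_kernel(x, y, sigma=1.0):
    """Radial basis function kernel with width sigma: exp(-||x - y||^2 / (2*sigma^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=1.0, theta=0.0):
    """Sigmoid kernel tanh(kappa * x^T y + theta); a Mercer kernel only for some kappa, theta."""
    return np.tanh(kappa * np.dot(x, y) + theta)

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])
print(polynomial_kernel(x, y, d=3), rbf_kernel(x, y, sigma=2.0), sigmoid_kernel(x, y))
```

As the slide notes, the polynomial and RBF kernels satisfy the Mercer condition for any valid parameter choice, whereas the sigmoid kernel is a proper kernel only for some κ and θ.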
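The last slides set up the hard-margin optimization min (1/2)||w||^2 subject to r^t (w^T x^t + w_0) ≥ +1. As a concrete illustration that is not part of the slides, the sketch below fits a linear SVM with scikit-learn (an assumed dependency) on a tiny made-up separable data set; a very large C approximates the hard-margin problem, and the margin is recovered as 2/||w||.

```python
import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable 2-D data set (labels r^t in {+1, -1}), made up for illustration.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [0.0, 0.0], [1.0, 0.5], [0.5, -1.0]])
r = np.array([+1, +1, +1, -1, -1, -1])

# Hard-margin behaviour approximated with a very large C.
svm = SVC(kernel="linear", C=1e6)
svm.fit(X, r)

w = svm.coef_[0]                    # weight vector w
w0 = svm.intercept_[0]              # bias w_0
margin = 2.0 / np.linalg.norm(w)    # distance between the hyperplanes w^T x + w_0 = +1 and -1

print("w =", w, "w0 =", w0)
print("margin =", margin)
print("support vectors:\n", svm.support_vectors_)
```

The support vectors reported are the training points whose constraints are (approximately) active, i.e. r^t (w^T x^t + w_0) = 1; only they determine w and w_0.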

