UB CSE 574 – Linear Models for Classification: Discriminant Functions

Sargur N. Srihari
University at Buffalo, State University of New York, USA

Topics
• Linear discriminant functions
  – Definition (2-class), geometry
  – Generalization to K > 2 classes
• Methods to learn the parameters
  1. Least squares classification
  2. Fisher's linear discriminant
  3. Perceptrons

Discriminant Function
• Assigns input vector x to one of K classes, denoted Ck
• Attention is restricted to linear discriminants
  – Decision surfaces are hyperplanes
• First consider K = 2, then extend to K > 2

Geometry of Linear Discriminant Functions
• Two-class linear discriminant function: y(x) = wT x + w0, a linear function of the input vector
  – w is the weight vector and w0 is the bias, also called the threshold
• Assign x to class C1 if y(x) ≥ 0, else to class C2
  – This defines the decision boundary as y(x) = 0
• w determines the orientation of the surface

Distance of Origin to Surface
• Let xA and xB be points on the surface
  – Because y(xA) = y(xB) = 0, we have wT(xA − xB) = 0
  – Thus w is orthogonal to every vector lying on the decision surface, so w determines the orientation of the surface
• If x is a point on the surface, then y(x) = 0, i.e., wT x = −w0
• Normalized distance from the origin to the surface:
    wT x / ||w|| = −w0 / ||w||
  where ||w|| is the norm defined by ||w||² = wT w = w1² + .. + wM−1²
  – The elements of w are normalized by dividing by its norm ||w|| (by definition of a normalized vector)
  – Hence w0 sets the distance of the origin to the surface

Distance of an Arbitrary Point x to the Surface
• Let x be an arbitrary point
• y(x) gives a signed measure of the perpendicular distance r from x to the surface, as follows:
  – If xp is the orthogonal projection of x onto the surface, then by vector addition
      x = xp + r (w / ||w||)
  – The second term is a vector normal to the surface; it is parallel to w, which is normalized by its length ||w||. Since a normalized vector has length 1, it must be scaled by r.
  – Multiplying both sides by wT, adding w0, and using y(xp) = 0 gives y(x) = r ||w||, from which
      r = y(x) / ||w||
[Figure: geometry of the two-class discriminant in the (x1, x2) plane, showing w, a point x, its projection xp, the distances y(x)/||w|| and −w0/||w||, the boundary y = 0, and the regions R1 (y > 0) and R2 (y < 0)]

Augmented Vector
• With a dummy input x0 = 1 and augmented vectors x = (x0, xT) and w = (w0, wT), the discriminant becomes y(x) = wT x
  – The decision surface then passes through the origin of the augmented (D + 1)-dimensional space
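To make the two-class geometry concrete, here is a minimal NumPy sketch (not from the slides); the weight vector, bias, and test point are arbitrary illustrative values.

import numpy as np

# Illustrative parameters (arbitrary values chosen for this sketch)
w = np.array([2.0, 1.0])   # weight vector: sets the orientation of the decision surface
w0 = -4.0                  # bias (threshold): sets the offset of the surface from the origin

def y(x):
    # Two-class linear discriminant y(x) = wT x + w0
    return w @ x + w0

x = np.array([3.0, 2.0])
print("y(x) =", y(x))                          # 4.0
print("class:", "C1" if y(x) >= 0 else "C2")   # assigned to C1 since y(x) >= 0

# Signed perpendicular distance from x to the surface: r = y(x) / ||w||
print("r =", y(x) / np.linalg.norm(w))

# Distance of the origin to the surface: -w0 / ||w||
print("origin to surface:", -w0 / np.linalg.norm(w))

# Augmented form: absorb the bias with a dummy input x0 = 1, so y(x) = wT x in D+1 dimensions
w_aug = np.concatenate(([w0], w))
x_aug = np.concatenate(([1.0], x))
print("augmented y(x) =", w_aug @ x_aug)       # same value as y(x)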
Extension to Multiple Classes
• Two approaches:
  – Use several two-class classifiers (this leads to serious difficulties)
  – Use K linear functions

Multiple Classes with 2-Class Classifiers
• Build a K-class discriminant from several 2-class classifiers:
  – One-versus-the-rest: K − 1 classifiers, each solving a two-class problem
  – One-versus-one: K(K − 1)/2 binary discriminant functions, one for every pair of classes
• Both approaches result in ambiguous regions of input space
[Figure: one-versus-the-rest (C1 vs. not C1, C2 vs. not C2) and one-versus-one (C1/C2, C1/C3, C2/C3) constructions; in both, the regions R1, R2, R3 leave an ambiguous region marked "?"]

Multiple Classes with K Discriminants
• Consider a single K-class discriminant of the form yk(x) = wkT x + wk0
• Assign a point x to class Ck if yk(x) > yj(x) for all j ≠ k
• The decision boundary between classes Ck and Cj is given by yk(x) = yj(x)
  – This corresponds to a (D − 1)-dimensional hyperplane defined by
      (wk − wj)T x + (wk0 − wj0) = 0
  – Same form as the decision boundary for the 2-class case, wT x + w0 = 0
• The decision regions of such a discriminant are always singly connected and convex
  – Proof follows

Convexity of Decision Regions (Proof)
• Consider two points xA and xB, both lying in decision region Rk
• Any point x̂ on the line connecting xA and xB can be expressed as
    x̂ = λ xA + (1 − λ) xB, where 0 < λ < 1
• From the linearity of the discriminant functions yk(x) = wkT x + wk0,
    yk(x̂) = λ yk(xA) + (1 − λ) yk(xB)
• Because xA and xB lie inside Rk, it follows that yk(xA) > yj(xA) and yk(xB) > yj(xB) for all j ≠ k
• Combining the two, yk(x̂) > yj(x̂) for all j ≠ k, hence x̂ also lies inside Rk
• Thus Rk is singly connected and convex (a single straight line connects any two points in the region)
[Figure: regions Ri, Rj, Rk, with the line segment from xA to xB lying entirely within Rk]

Learning the Parameters of Linear Discriminant Functions
• Three methods:
  – Least squares
  – Fisher's linear discriminant
  – Perceptrons
• Each is simple, but each has several disadvantages

Least Squares for Classification
• Analogous to regression: a simple closed-form solution exists for the parameters
• Each class Ck, k = 1,..,K is described by its own linear model yk(x) = wkT x + wk0
• Create augmented vectors x = (1, xT) and wk = (wk0, wkT)
• Grouping into vector notation: y(x) = WT x, where W is the parameter matrix whose kth column is the (D + 1)-dimensional vector wk (including the bias)
• A new input vector x is assigned to the class for which the output yk = wkT x is largest
• Determine W by minimizing a squared error

Parameters Using Least Squares
• Training data {xn, tn}, n = 1, ..., N, where tn is a K-dimensional column vector in 1-of-K coding
• Define matrices:
  – T, whose nth row is the vector tnT
  – X, whose nth row is xnT
• Sum-of-squares error function:
    ED(W) = ½ Tr{(XW − T)T (XW − T)}
  – (XW − T) is the matrix of errors; the diagonal elements of (XW − T)T (XW − T) are the sums of squared errors for each of the K outputs, and the trace adds up these diagonal elements

Minimizing the Sum of Squares
• Setting the derivative of ED(W) with respect to W to zero gives the solution
    W = (XT X)−1 XT T = X†T
  where X† is the pseudo-inverse of the matrix X
• The discriminant function, after rearranging, is
    y(x) = WT x = TT (X†)T x
• This is an exact closed-form solution for W, with which we classify x to the class k for which yk is maximum, but it has severe limitations

Least Squares is Sensitive to Outliers
• The sum of squared errors penalizes predictions that are "too correct", i.e., a long way from the decision boundary on the correct side
• SVMs use an alternative error function (the hinge function) that does not have this limitation
[Figure: two-class data without and with additional outlier points; magenta: least squares, green: logistic regression (more robust). The least-squares boundary shifts markedly when the outliers are added, while the logistic regression boundary barely changes.]

Disadvantages of Least Squares
• Lack of robustness to outliers
• Certain datasets are unsuitable for least squares classification
• The decision boundary corresponds to the maximum likelihood solution under an assumed Gaussian conditional distribution, which binary target vectors do not follow
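The closed-form least-squares solution above maps directly onto a few lines of NumPy. The sketch below uses synthetic Gaussian clusters as stand-in data; the dataset, class means, and variable names are illustrative assumptions, not taken from the slides.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D data for K = 3 classes (illustrative stand-in data)
K, N_per_class, D = 3, 50, 2
means = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 4.0]])
X_raw = np.vstack([rng.normal(m, 1.0, size=(N_per_class, D)) for m in means])
labels = np.repeat(np.arange(K), N_per_class)

# 1-of-K target matrix T (N x K) and augmented design matrix X (N x (D+1), first column of 1s)
T = np.eye(K)[labels]
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

# Closed-form solution W = (XT X)^-1 XT T = X† T, computed via the pseudo-inverse
W = np.linalg.pinv(X) @ T          # shape (D+1, K); the kth column holds (wk0, wk)

def classify(points):
    # y(x) = WT x on augmented inputs; assign each point to the class with the largest yk
    points_aug = np.hstack([np.ones((points.shape[0], 1)), points])
    return np.argmax(points_aug @ W, axis=1)

print("training accuracy:", np.mean(classify(X_raw) == labels))

Appending a handful of extra points far from the boundary (but on the correct side) and recomputing W is a quick way to see the outlier sensitivity discussed above: the least-squares boundary shifts even though the added points were already classified correctly.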

