STEVENS CS 559 - 13th Set of Notes (Recap of Sets 8-12)


CS 559: Machine Learning Fundamentals and Applications
13th Set of Notes (Recap of Sets 8-12)
Instructor: Philippos Mordohai
Webpage: www.cs.stevens.edu/~mordohai
E-mail: [email protected]
Office: Lieb 215

Outline
• All topics covered in the 7th Set of Notes (midterm recap)
• Non-parametric Classification
• Linear Discriminant Functions
• Support Vector Machines (SVM)
• Ensemble Methods
• Unsupervised Learning
• Other notes

Density Estimation (Pattern Classification, Chapter 4)
– Basic idea: the probability that a vector x will fall in a region R is

    P = \int_R p(x') \, dx'    (1)

– P is a smoothed (or averaged) version of the density function p(x). If we have a sample of size n, the probability that exactly k of the n points fall in R is binomial:

    P_k = \binom{n}{k} P^k (1 - P)^{n-k}    (2)

and the expected value of k is

    E[k] = nP    (3)

– The maximum-likelihood estimate of P, \max_P P(k \mid P), is reached for \hat{P} = k/n. Therefore the ratio k/n is a good estimate of the probability P, and hence of the density function p(x), for large n.
– If p(x) is continuous and the region R is so small that p does not vary significantly within it, we can write

    \int_R p(x') \, dx' \approx p(x) V    (4)

where x is a point within R and V is the volume enclosed by R. Combining the estimate k/n for P with (4) gives the basic density estimate

    p_n(x) = \frac{k_n / n}{V_n}    (5)

There are two different ways of obtaining sequences of regions that satisfy the conditions for converging to the true probability given a large number of samples:
(a) Shrink an initial region so that V_n = 1/\sqrt{n} and show that p_n(x) \to p(x). This is called the Parzen-window estimation method.
(b) Specify k_n as some function of n, such as k_n = \sqrt{n}, and grow the volume V_n until it encloses k_n neighbors of x. This is called the k_n-nearest-neighbor estimation method.
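The fixed-region estimate of equation (5) is easy to demonstrate. The following minimal NumPy sketch (not part of the original slides; the function name and the choice of edge length h are assumptions for illustration) counts the k samples falling inside a small hypercube around x and returns (k/n)/V:

```python
import numpy as np

def density_estimate(x, samples, h):
    """Estimate p(x) as (k/n)/V: count the k samples that fall inside a
    d-dimensional hypercube of edge length h centered at x (equation (5))."""
    n, d = samples.shape
    # A sample lies in the cube iff every coordinate is within h/2 of x.
    inside = np.all(np.abs(samples - x) <= h / 2.0, axis=1)
    k = np.count_nonzero(inside)
    V = h ** d  # volume of the region R
    return (k / n) / V

# Usage: 1-D standard normal samples; the estimate at x = 0 should be
# close to the true density value 1/sqrt(2*pi) ~ 0.399.
rng = np.random.default_rng(0)
samples = rng.standard_normal((1000, 1))
print(density_estimate(np.array([0.0]), samples, h=0.5))
```

The two methods below differ only in which quantity this sketch holds fixed: Parzen windows fix the volume V and let k vary, while k-nearest neighbors fix k and let V vary.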
Parzen Windows
– The Parzen-window approach to density estimation assumes that the region R_n is a d-dimensional hypercube with volume

    V_n = h_n^d    (h_n: length of the edge of R_n)

Let \varphi(u) be the following window function:

    \varphi(u) = 1 if |u_j| \le 1/2 for j = 1, ..., d, and 0 otherwise

– \varphi((x - x_i)/h_n) is equal to unity if x_i falls within the hypercube of volume V_n centered at x, and equal to zero otherwise.
– The number of samples in this hypercube is therefore

    k_n = \sum_{i=1}^{n} \varphi\!\left(\frac{x - x_i}{h_n}\right)

By substituting k_n into equation (5), we obtain the estimate

    p_n(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{V_n} \varphi\!\left(\frac{x - x_i}{h_n}\right)    (6)

p_n(x) estimates p(x) as an average of functions of x and the samples x_i (i = 1, ..., n). These window functions can be general.

Window Functions
• Conditions for \varphi to yield a legitimate density estimate:
– Non-negative: \varphi(x) \ge 0
– Integrates to 1: \int \varphi(x) \, dx = 1
• In other words, the window function should itself be a probability density function.

Illustration
• The behavior of the Parzen-window method in the case where p(x) \sim N(0, 1):
– Let \varphi(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2} and h_n = h_1/\sqrt{n} (n > 1), where h_1 is a known parameter.
– Thus

    p_n(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h_n} \varphi\!\left(\frac{x - x_i}{h_n}\right)

is an average of normal densities centered at the samples x_i.
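This Gaussian-window illustration translates directly into code. Here is a minimal 1-D NumPy sketch (not from the slides; the function name is invented) using the shrinking width h_n = h_1/\sqrt{n} from above:

```python
import numpy as np

def parzen_gaussian(x, samples, h1):
    """Parzen-window estimate p_n(x) with the Gaussian window
    phi(u) = exp(-u^2/2)/sqrt(2*pi) and shrinking width h_n = h1/sqrt(n)."""
    n = samples.shape[0]
    hn = h1 / np.sqrt(n)
    u = (x - samples) / hn
    phi = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    # Average of normal densities centered at the samples x_i.
    return np.mean(phi / hn)

# Usage: samples drawn from N(0,1); the estimate at x = 0 approaches
# the true value 1/sqrt(2*pi) ~ 0.399 as n grows.
rng = np.random.default_rng(0)
samples = rng.standard_normal(1000)
print(parzen_gaussian(0.0, samples, h1=1.0))
```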
K-Nearest Neighbor Estimation
• Goal: a solution to the problem of the unknown "best" window function
– Let the cell volume be a function of the training data
– Center a cell about x and let it grow until it captures k_n samples, where k_n = f(n)
– These k_n samples are the k_n nearest neighbors of x
• Benefits
– If the density is high near x, the cell will be small, which provides good resolution
– If the density is low, the cell will grow large and stop only when higher-density regions are reached
• We can obtain a family of estimates by setting k_n = k_1 \sqrt{n} and choosing different values for k_1.

Estimation of Posterior Probabilities
• Goal: estimate P(\omega_i \mid x) from a set of n labeled samples
• Place a cell of volume V around x and capture k samples
• If k_i samples among the k turn out to be labeled \omega_i, then

    p_n(x, \omega_i) = \frac{k_i}{n V}

and an estimate for the posterior P_n(\omega_i \mid x) is

    P_n(\omega_i \mid x) = \frac{p_n(x, \omega_i)}{\sum_{j=1}^{c} p_n(x, \omega_j)} = \frac{k_i}{k}

Linear Discriminant Functions: Basic Idea (Pattern Classification, Chapter 5)
• Have samples from 2 classes: x_1, x_2, ..., x_n
• Assume the 2 classes can be separated by a linear boundary l(\theta) with some unknown parameters \theta
• Fit the "best" boundary to the data by optimizing over the parameters \theta
• What is "best"?
– Minimize the classification error on the training data?
– This does not guarantee a small testing error.

Parametric Methods vs. Discriminant Functions
• Parametric methods: assume the shape of the densities for the classes is known, p_1(x | \theta_1), p_2(x | \theta_2), ...; estimate \theta_1, \theta_2, ... from data; use a Bayesian classifier to find the decision regions.
• Discriminant functions: assume the discriminant functions are of known shape l(\theta_1), l(\theta_2), with parameters \theta_1, \theta_2, ...; estimate \theta_1, \theta_2, ... from data; use the discriminant functions directly for classification.

LDF: Two Classes
• A discriminant function is linear if it can be written as

    g(x) = w^t x + w_0

• w is called the weight vector and w_0 is called the bias or threshold; a minimal sketch of this decision rule follows below.

LDF: Multiple Classes
• Suppose we ...
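To make the two-class rule above concrete, here is a minimal NumPy sketch (not from the slides); the function name and the numeric values of w and w_0 are invented for illustration:

```python
import numpy as np

def ldf_classify(x, w, w0):
    """Two-class linear discriminant: g(x) = w^t x + w_0.
    Decide class 1 if g(x) > 0, class 2 otherwise."""
    g = np.dot(w, x) + w0
    return 1 if g > 0 else 2

# The weight vector and bias below are made-up values for illustration;
# in practice w and w_0 are learned from the training data (e.g., by the
# perceptron or MSE procedures covered in this set of notes).
w = np.array([1.0, -2.0])
w0 = 0.5
x = np.array([3.0, 1.0])
print(ldf_classify(x, w, w0))  # g(x) = 3 - 2 + 0.5 = 1.5 > 0 -> class 1
```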

