Recitation 3: Naive Bayes and Logistic Regression... and a surprise (decision trees!)
Ekaterina Spriggs, 10-701/15-781 Fall 2009
Friday, September 25, 2009

Classifiers

Discriminative: model P(Y | X_1, ..., X_n) directly.
Generative: model P(X_1, ..., X_n | Y) and P(Y), then use Bayes rule:

  P(Y | X) = P(X | Y) P(Y) / P(X)

Generative vs discriminative models

[Figure: generative vs discriminative models; see class slides.]

NB decision rule

We want to learn f : X -> Y and predict the most probable value of f(x) = y.
The training data are N examples with n features X_1, ..., X_n each, observed
over time and labeled, e.g. video features labeled with basketball moves
("crossover dribble", "shoot", ...):

  x_11 ... x_1n    y_1
  x_21 ... x_2n    y_2
  ...
  x_N1 ... x_Nn    y_N

At test time we compute f(x_1, ..., x_n) for each new example.

Decision rule

Most probable value of f(x) = y:

Bayes optimal (MAP):

  Y^{MAP} = \arg\max_{y_j \in Y} P(Y = y_j | X_1 = x_1, ..., X_n = x_n)

MLE:

  Y^{MLE} = \arg\max_{y_j \in Y} P(X_1 = x_1, ..., X_n = x_n | Y = y_j)

Naive Bayes:

  Y^{NB} = \arg\max_{y_j \in Y} \prod_i P(X_i = x_i | Y = y_j) P(Y = y_j)

NB decision rule

P(X_i = x_i | Y = y_j) = ?  P(Y = y_j) = ?  (you know this from class)

Matlab structures...

% example for NB - how to count

% number of data points
num_data_points = 9;
% number of possible values for the features
x_num_outcomes = 4;
% feature dimensionality
x_dim = 3;
% number of possible outcomes for Y
y_num_outcomes = 2;

Y = [0 0 0 1 1 1 0 0 0]';
X = [ 4 3 4 3 4 3 4 3 4;
      1 4 3 1 2 1 2 3 1;
      3 1 2 4 1 4 1 3 2 ]';

fprintf('Data:\n');
[X, Y]

% class conditional probability table
% (each_feature, y_outcomes, x_outcomes)
table = zeros(x_dim, y_num_outcomes, x_num_outcomes);

% P(X_i == k | Y == j) = ?
% what is this: X(Y == 1, 3)

% some code....
% ....
i = 1; % dimension of X
j = 1; % value of Y
k = 1; % value of X
% the smoothing terms ("something") are left for you -- you know this from class
table(i, j, k) = (sum( X(Y==j, i) == k ) + something ) / (sum( Y == j ) + something' );
% more code....

% once you have your conditional probability table,
% how do you make a decision?

NB decision rule

  Y^{NB} = \arg\max_{y_j \in Y} \prod_i P(X_i = x_i | Y = y_j) P(Y = y_j)

In the homework: Y = 0 or Y = 1. Comparing:

  \prod_i P(X_i = x_i | Y = 1) P(Y = 1)  >=  \prod_i P(X_i = x_i | Y = 0) P(Y = 0)

Problem? Hint: the "argmax" of a monotonic function of the decision rule gives
the same answer, and log is monotonic.

  0.2 * 0.3 = alright
  0.2 * 0.3 * 0.8 * 0.1 * 0.02 * 0.7 * ... = trouble (the product underflows)
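So instead of multiplying the probabilities, sum their logs. Below is a minimal
sketch of that decision (not the homework solution): it assumes `table` has been
filled in as above, plus a hypothetical prior vector `prior` with
prior(j) = P(Y = j-1), and a test point `x` of length x_dim with values in
1..x_num_outcomes.

% Sketch: log-space NB decision for one test point x.
% Assumes table(i, j, k) ~ P(X_i = k | Y = j-1) and prior(j) ~ P(Y = j-1).
log_score = zeros(y_num_outcomes, 1);
for j = 1:y_num_outcomes
    log_score(j) = log(prior(j));
    for i = 1:x_dim
        % add log P(X_i = x(i) | Y = j-1) instead of multiplying probabilities
        log_score(j) = log_score(j) + log(table(i, j, x(i)));
    end
end
[best, j_best] = max(log_score);
y_predict = j_best - 1;   % map the Matlab index back to the label 0/1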
NB: Gaussian inputs vs discrete inputs

Gaussian inputs: in class. Discrete inputs: in the homework.
* See suggested reading for Gaussian NB.

NB: performance

Classification performance is not the same as error rate. With prediction
f(X_1, X_2) and truth Y, performance on a dataset is the fraction of correct
predictions:

  \frac{\sum I(f(X_1, X_2) = Y)}{|Y|}

NB: error rate

The error rate weights each mistake f(X_1, X_2) != Y (prediction vs truth) by
the probability of that example under P(X_1, X_2, Y):

  \sum_{X_1, X_2, Y} I(f(X_1, X_2) \neq Y) \, P(X_1, X_2, Y)

Logistic regression

  P(Y = 1 | X, w) = g(w_0 + \sum_i w_i x_i),  where  g(z) = \frac{1}{1 + e^{-z}}

With the convention x_0 = 1, the bias term folds into the sum:

  g(w_0 + \sum_{i=1}^n w_i x_i) = g(\sum_{i=0}^n w_i x_i)

LR: learning the weights

MLE: maximize the (conditional) log-likelihood

  \ln P(D_Y | D_X, w) = \sum_j \ln P(y^j | x^j, w)

Gradient ascent: wikipedia, or class on Monday. The log-likelihood is concave,
so gradient ascent reaches the global maximum.

LR: gradient ascent

  w \leftarrow w + \eta \frac{\partial}{\partial w} \sum_j \ln P(y^j | x^j, w)

For all features i:

  w_i^{t+1} \leftarrow w_i^t + \eta \sum_j x_i^j [ y^j - P(Y^j = 1 | x^j, w^t) ]

Repeat until the estimate doesn't change "much".
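The update above is easy to write in the same Matlab setting as the NB snippet.
Below is a minimal batch gradient-ascent sketch, not code from class: it assumes
X is num_data_points x x_dim with one example per row, Y is a 0/1 column vector,
and the step size eta, iteration cap, and stopping tolerance are arbitrary
placeholder choices.

% Sketch: batch gradient ascent for logistic regression.
Xa  = [ones(size(X,1), 1), X];   % prepend x_0 = 1 to absorb w_0
w   = zeros(size(Xa, 2), 1);     % weights; w(1) is the bias w_0
eta = 0.1;                       % step size (arbitrary)
for t = 1:1000
    p = 1 ./ (1 + exp(-Xa * w));   % P(Y = 1 | x^j, w) for every example
    grad = Xa' * (Y - p);          % sum_j x_i^j [ y^j - P(Y^j = 1 | x^j, w) ]
    w_new = w + eta * grad;
    if norm(w_new - w) < 1e-6      % stop when the estimate barely changes
        w = w_new;
        break;
    end
    w = w_new;
end

With w learned, P(Y = 1 | x, w) for a new example x (row vector) is
1 ./ (1 + exp(-[1, x] * w)).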
LR: classification rule

Classification rule: see class notes.

Information theory

Quick intro; this will be covered in class.

Entropy:

  H(Y) = -\sum_{i=1}^k P(Y = y_i) \log_2 P(Y = y_i)

On average, the smallest number of bits needed to transmit values drawn from
Y's distribution.

Information gain:

  IG(Y | X) = H(Y) - H(Y | X)

How much better can we do if we share knowledge about X?

Conditional entropy:

  H(Y | X) = -\sum_{j=1}^v P(X = x_j) \sum_{i=1}^k P(Y = y_i | X = x_j) \log_2 P(Y = y_i | X = x_j)

Decision trees

ID3: check wikipedia.

Credits

Andrew Moore's lectures: http://www.autonlab.org/tutorials/
Carlos Guestrin's lectures: http://select.cs.cmu.edu/class/10701-F09/schedule.html
Tommi Jaakkola's lectures
Videos: CMU graphics lab

Homework questions?