# Pitt CS 2750 - Linear regression

## Linear regression


- Pages: 13
- School: University of Pittsburgh
- Course: CS 2750 - Machine Learning


CS 2750 Machine Learning, Lecture 8: Linear regression (cont.) and linear methods for classification.
Milos Hauskrecht, milos@cs.pitt.edu, 5329 Sennott Square

### Coefficient shrinkage

- The least squares estimates often have low bias but high variance.
- The prediction accuracy can often be improved by setting some coefficients to zero: this increases the bias but reduces the variance of the estimates.
- Solutions: subset selection, ridge regression, principal component regression.
- Next: ridge regression.

### Ridge regression

Error function for the standard least squares estimates:

$$J_n(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \mathbf{w}^T \mathbf{x}_i\right)^2$$

We seek:

$$\mathbf{w}^* = \arg\min_{\mathbf{w}} \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \mathbf{w}^T \mathbf{x}_i\right)^2$$

Ridge regression:

$$J_n(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \mathbf{w}^T \mathbf{x}_i\right)^2 + \lambda \lVert\mathbf{w}\rVert^2$$

where $\lVert\mathbf{w}\rVert^2 = \sum_{i=0}^{d} w_i^2$ and $\lambda \ge 0$. What does the new error function do?

- The term $\lambda \lVert\mathbf{w}\rVert^2$ penalizes non-zero weights with a cost proportional to the shrinkage coefficient $\lambda$.
- If an input attribute $x_j$ has a small effect on improving the error function, it is "shut down" by the penalty term.
- Inclusion of a shrinkage penalty is often referred to as regularization.

### Supervised learning

- Data: $D = \{d_1, d_2, \ldots, d_n\}$, a set of $n$ examples $d_i = \langle \mathbf{x}_i, y_i \rangle$, where $\mathbf{x}_i$ is an input vector and $y_i$ is the desired output (given by a teacher).
- Objective: learn the mapping $f: X \to Y$ such that $y_i \approx f(\mathbf{x}_i)$ for all $i = 1, \ldots, n$.
- Two types of problems:
  - Regression: $Y$ is continuous. Example: predicting a company's stock price from earnings and product orders.
  - Classification: $Y$ is discrete. Example: predicting a disease from temperature and heart rate.
- Today: binary classification problems.

### Binary classification

- Two classes: $Y = \{0, 1\}$.
- Our goal is to learn to classify correctly two types of examples: Class 0 (labeled as 0) and Class 1 (labeled as 1).
- We would like to learn $f: X \to \{0, 1\}$.
- Zero-one error (loss) function:

$$\mathrm{Error}_1(\mathbf{x}_i, y_i) = \begin{cases} 1 & \text{if } f(\mathbf{x}_i; \mathbf{w}) \ne y_i \\ 0 & \text{if } f(\mathbf{x}_i; \mathbf{w}) = y_i \end{cases}$$

- The error we would like to minimize: $E_{(\mathbf{x}, y)}\left[\mathrm{Error}_1(\mathbf{x}, y)\right]$.
- First step: we need to devise a model of the … (preview truncated)
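The ridge objective above has a closed-form minimizer: setting the gradient of $J_n(\mathbf{w})$ to zero gives $(X^T X / n + \lambda I)\,\mathbf{w} = X^T \mathbf{y} / n$. The sketch below, on hypothetical synthetic data (the function name and example values are illustrative, not from the lecture), solves this system and shows the shrinkage effect:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: minimizes
    (1/n) * sum_i (y_i - w^T x_i)^2 + lam * ||w||^2.
    Setting the gradient to zero gives the linear system
    (X^T X / n + lam * I) w = X^T y / n."""
    n, d = X.shape
    A = X.T @ X / n + lam * np.eye(d)
    b = X.T @ y / n
    return np.linalg.solve(A, b)

# Hypothetical synthetic data: two of five true coefficients are zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w_true = np.array([2.0, -1.0, 0.0, 0.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=200)

w_ols = ridge_fit(X, y, lam=0.0)     # lam = 0 recovers ordinary least squares
w_ridge = ridge_fit(X, y, lam=10.0)  # large lam shrinks weights toward zero

# The penalty term reduces the overall weight magnitude:
assert np.linalg.norm(w_ridge) < np.linalg.norm(w_ols)
```

With `lam=0.0` the solution matches standard least squares; increasing `lam` trades a little bias for lower variance, as the slide describes.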
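The zero-one loss in the last slide simply counts misclassified examples; averaging it over the dataset estimates the expected error $E_{(\mathbf{x}, y)}[\mathrm{Error}_1]$. A minimal sketch (function name and sample labels are illustrative):

```python
import numpy as np

def zero_one_error(y_true, y_pred):
    """Mean zero-one loss: each example contributes 1 when the
    prediction f(x_i; w) differs from y_i, and 0 otherwise; the
    mean over the dataset estimates the expected error."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return np.mean(y_true != y_pred)

# Hypothetical example: predictions disagree on 2 of 5 labels.
print(zero_one_error([0, 1, 1, 0, 1], [0, 1, 0, 1, 1]))  # 0.4
```

Because this loss is piecewise constant (and hence not differentiable), classifiers are typically trained by minimizing a smooth surrogate instead, which motivates the models introduced next in the lecture.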
