CS 2710 Foundations of AI
Lecture 22
Supervised learning. Linear and logistic regression.
Milos ([email protected]), Sennott Square

Supervised learning
• Data: a set of n examples D = {D_1, D_2, .., D_n}, where D_i = <x_i, y_i>
  – x_i = (x_{i,1}, x_{i,2}, .., x_{i,d}) is an input vector of size d
  – y_i is the desired output (given by a teacher)
• Objective: learn the mapping f: X → Y s.t. y_i ≈ f(x_i) for all i = 1, .., n
• Regression: Y is continuous
  Example: earnings, product orders → company stock price
• Classification: Y is discrete
  Example: handwritten digit → digit label

Linear regression
• The function f: X → Y is a linear combination of the input components:
  f(x) = w_0 + w_1 x_1 + w_2 x_2 + .. + w_d x_d = w_0 + Σ_{j=1..d} w_j x_j
  where w_0, w_1, .., w_d are the parameters (weights) and w_0 is the bias term.
• Shorter (vector) definition of the model: include the bias constant 1 in the
  input vector, x = (1, x_1, x_2, .., x_d), so that
  f(x) = w_0 x_0 + w_1 x_1 + .. + w_d x_d = w^T x

Linear regression. Error.
• Data: D_i = <x_i, y_i>
• Function: x_i → f(x_i)
• We would like to have y_i ≈ f(x_i) for all i = 1, .., n
• Error function: measures how much our predictions deviate from the desired
  answers. Mean squared error:
  J_n = (1/n) Σ_{i=1..n} (y_i − f(x_i))^2
• Learning: we want to find the weights minimizing the error!

Linear regression. Example.
• 1-dimensional input, x = (x_1)
  [Figure: data points and a fitted regression line for a 1-D input.]
• 2-dimensional input, x = (x_1, x_2)
  [Figure: data points and a fitted regression plane for a 2-D input.]

Linear regression. 
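The linear model and the mean squared error above can be written directly in code. A minimal sketch assuming NumPy; the helper names `predict` and `mse` are my own, not from the slides:

```python
import numpy as np

def predict(w, X):
    # f(x) = w^T x, with the bias constant 1 prepended to each input vector
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return Xb @ w

def mse(w, X, y):
    # J_n = (1/n) * sum_i (y_i - f(x_i))^2  (the mean squared error)
    return float(np.mean((y - predict(w, X)) ** 2))
```

For example, with weights w = (w_0, w_1) = (1, 2) and 1-D inputs 0, 1, 2, the model predicts 1, 3, 5.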
Optimization.
• We want the weights minimizing the error:
  J_n = (1/n) Σ_{i=1..n} (y_i − f(x_i))^2 = (1/n) Σ_{i=1..n} (y_i − w^T x_i)^2
• For the optimal set of parameters, the derivative of the error with respect to
  each parameter must be 0:
  ∂J_n/∂w_j = −(2/n) Σ_{i=1..n} (y_i − w_0 x_{i,0} − w_1 x_{i,1} − .. − w_d x_{i,d}) x_{i,j} = 0
• Vector of derivatives:
  grad_w(J_n(w)) = ∇_w J_n(w) = −(2/n) Σ_{i=1..n} (y_i − w^T x_i) x_i = 0

Solving linear regression
• Rearranging the terms gives a system of linear equations with d + 1 unknowns,
  one equation for each j = 0, 1, .., d (with x_{i,0} = 1 for the bias term):
  w_0 Σ_i x_{i,0} x_{i,j} + w_1 Σ_i x_{i,1} x_{i,j} + .. + w_j Σ_i x_{i,j} x_{i,j} + .. + w_d Σ_i x_{i,d} x_{i,j} = Σ_i y_i x_{i,j}
• This is a system of linear equations (SLE) of the form A w = b
• Solution to the SLE by matrix inversion: w = A^{−1} b

Gradient descent solution
• Goal: the weight optimization in the linear regression model
  J_n = Error(w) = (1/n) Σ_{i=1..n} (y_i − f(x_i, w))^2
• An alternative to the SLE solution: gradient descent
• Idea: adjust the weights in the direction that improves the Error; the
  gradient tells us the right direction:
  w ← w − α ∇_w Error(w)
  where α > 0 is a learning rate (it scales the gradient changes).

Gradient descent method
• Descend using the gradient information: change the value of w in the
  direction opposite to the gradient (the direction of the descent).
• New value of each parameter, for all j:
  w_j ← w_j* − α ∂Error(w)/∂w_j |_{w*}
• Iterating produces a sequence w^(0), w^(1), w^(2), w^(3), .. that converges
  to the optimum of the Error function.
  [Figure: error curve with successive gradient descent steps toward the minimum.]

Online gradient algorithm
• The 
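The system A w = b above can be formed and solved directly. A sketch assuming NumPy; it uses `np.linalg.solve` instead of explicitly forming A^{−1}, which is numerically safer but gives the same weights:

```python
import numpy as np

def fit_linear_regression(X, y):
    # Prepend the bias constant 1 to each input vector
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    A = Xb.T @ Xb   # A[j, k] = sum_i x_{i,j} x_{i,k}
    b = Xb.T @ y    # b[j]    = sum_i y_i x_{i,j}
    return np.linalg.solve(A, b)   # solves the SLE  A w = b
```

On noiseless data generated by y = 2 + 3x this recovers w = (2, 3) exactly (up to floating-point error).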
error function above is defined for the whole dataset D:
  J_n = Error(w) = (1/n) Σ_{i=1..n} (y_i − f(x_i, w))^2
• The error for a single sample D_i = <x_i, y_i> is:
  J_online = Error_i(w) = (1/2) (y_i − f(x_i, w))^2
• The online gradient method changes the weights after every sample:
  w_j ← w_j − α ∂Error_i(w)/∂w_j      (vector form: w ← w − α ∇_w Error_i(w))
  where α > 0 is a learning rate that depends on the number of updates.

Online gradient method
• Linear model: f(x) = w^T x
• On-line error: J_online = Error_i(w) = (1/2) (y_i − f(x_i, w))^2
• The on-line algorithm generates a sequence of online updates. The i-th update
  step with D_i = <x_i, y_i> changes the j-th weight as:
  w_j^(i) ← w_j^(i−1) − α(i) ∂Error_i(w)/∂w_j |_{w^(i−1)} = w_j^(i−1) + α(i) (y_i − f(x_i, w^(i−1))) x_{i,j}
• Annealed learning rate: α(i) ≈ 1/i, which gradually rescales the changes in
  the weights.

Online regression algorithm
  Online-linear-regression (D, number of iterations)
    initialize weights w = (w_0, w_1, .., w_d)
    for i = 1 : number of iterations
      select a data point D_i = <x_i, y_i> from D
      set α = 1/i
      update the weight vector: w ← w + α (y_i − f(x_i, w)) x_i
    end for
    return weights w
• Advantages: very easy to implement, works on continuous data streams, and
  adapts to changes in the model over time.

On-line learning. 
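The Online-linear-regression procedure above can be sketched as follows. The slides leave the weight initialization and the point-selection rule unspecified, so the zero initialization and uniform random selection here are my assumptions:

```python
import numpy as np

def online_linear_regression(X, y, n_iterations, seed=0):
    rng = np.random.default_rng(seed)
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # bias constant 1
    w = np.zeros(Xb.shape[1])                      # initialize weights (assumption)
    for i in range(1, n_iterations + 1):
        k = rng.integers(len(y))                   # select a data point from D (assumption)
        alpha = 1.0 / i                            # annealed learning rate alpha(i) = 1/i
        w = w + alpha * (y[k] - Xb[k] @ w) * Xb[k] # w <- w + alpha (y_i - f(x_i, w)) x_i
    return w
```

With the annealed 1/i rate the early updates are large and later ones small, so the weights settle down while still reducing the error over the stream.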
Example
  [Figure: four snapshots (panels 1–4) of the fitted regression line after
  successive on-line updates.]

Logistic regression

Binary classification
• Two classes: Y = {0, 1}
• Our goal is to learn how to correctly classify two types of examples:
  – Class 0, labeled as 0
  – Class 1, labeled as 1
• We would like to learn f: X → {0, 1}
• First step: we need to devise a model of the function f
• Inspiration: the neuron (nerve cell)

Neuron
• The neuron (nerve cell) and its activities.
  [Figure: schematic of a biological neuron.]

Neuron-based binary classification model
• The inputs x(1), x(2), .., x(k) and a bias constant 1 are weighted by
  w_1, .., w_k and w_0, summed into z, and passed through a threshold function
  to produce the output y.
• This models the function we want to learn, f: X → {0, 1}.

Logistic regression model
• A function model with smooth switching:
  f(x) = g(w_0 + w_1 x(1) + .. + w_k x(k))
  where w are the parameters of the model and g(z) is the logistic function
  g(z) = 1 / (1 + e^{−z})
• Note that f(x) ∈ [0, 1].

Logistic function
• Also referred to as the sigmoid function.
• It replaces the ideal threshold function with a smooth switching function.
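The logistic function and the logistic regression model above are a few lines of code. A minimal sketch; the helper names `sigmoid` and `logistic_predict` are my own:

```python
import math

def sigmoid(z):
    # logistic (sigmoid) function g(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + math.exp(-z))

def logistic_predict(w, x):
    # f(x) = g(w_0 + w_1 x(1) + .. + w_k x(k)), a value in [0, 1]
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return sigmoid(z)
```

Note that g(0) = 0.5, g(z) → 1 as z → +∞, and g(z) → 0 as z → −∞, which is exactly the smooth replacement for the 0/1 threshold.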