Linear Regression and Artificial Neural Networks
Nan Li
2011.09.27

What is Linear Regression?
• Assume that Y (target) is a linear function of X (features):
  \hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_k x_k
• E.g. predicting rent from living area
  [Figure: scatter plot of rent vs. living area]
• There are other forms, but let's consider this case for now
• Matrix form:
  \hat{y} = x^T \theta

What is a good LR model?
• The model that best predicts the target Y given the features X
• i.e. we want each residual
  \hat{y}_i(x_i) - y_i = x_i^T \theta - y_i
  to be close to zero
• Our goal is to seek the \theta that minimizes the following cost function:
  J(\theta) = \frac{1}{2} \sum_{i=1}^{n} (x_i^T \theta - y_i)^2
• Least Mean Squares (LMS)

How to find the optimal \theta?
• Gradient descent (a NumPy sketch appears at the end of these notes)
• Converges to the global optimum, since J(\theta) is convex
• Batch: guaranteed convergence
• Online: fast
  \theta_j^{t+1} = \theta_j^t - \alpha \frac{\partial}{\partial \theta_j} J(\theta^t)
                = \theta_j^t - \alpha \frac{\partial}{\partial \theta_j} \frac{1}{2} \sum_{i=1}^{n} (x_i^T \theta^t - y_i)^2
                = \theta_j^t + \alpha \sum_{i=1}^{n} (y_i - x_i^T \theta^t)\, x_{ij}

How to find the optimal \theta? (cont.)
• Directly minimize J(\theta)
• Take the derivative and set it to zero

Derivation…
• Useful trace identities:
  \mathrm{tr}\, ABC = \mathrm{tr}\, CAB = \mathrm{tr}\, BCA
  \nabla_A \mathrm{tr}\, AB = B^T
  \nabla_A \mathrm{tr}\, ABA^T C = CAB + C^T A B^T
• Elements:
  \nabla_\theta \mathrm{tr}\, \theta^T X^T X \theta = \nabla_\theta \mathrm{tr}\, \theta \theta^T X^T X = X^T X \theta + (X^T X)^T \theta = X^T X \theta + X^T X \theta
  \nabla_\theta \mathrm{tr}\, y^T X \theta = \nabla_\theta \mathrm{tr}\, \theta y^T X = (y^T X)^T = X^T y

More Derivation…
  \nabla_\theta J = \frac{1}{2} \nabla_\theta \mathrm{tr}\left( \theta^T X^T X \theta - \theta^T X^T y - y^T X \theta + y^T y \right)
                 = \frac{1}{2} \left( \nabla_\theta \mathrm{tr}\, \theta^T X^T X \theta - 2 \nabla_\theta \mathrm{tr}\, y^T X \theta + \nabla_\theta \mathrm{tr}\, y^T y \right)
                 = \frac{1}{2} \left( X^T X \theta + X^T X \theta - 2 X^T y \right)
                 = X^T X \theta - X^T y = 0
  \Rightarrow X^T X \theta = X^T y, the normal equations, whose solution is \theta = (X^T X)^{-1} X^T y

Equivalence of LMS and MLE
• Assume
  y_i = \theta^T x_i + \varepsilon_i
  where \varepsilon follows a Gaussian with mean 0 and variance \sigma^2
• Then each y_i given x_i is Gaussian with mean \theta^T x_i and variance \sigma^2

Equivalence of LMS and MLE, cont.
• By the independence assumption:
  L(\theta) = \prod_{i=1}^{n} p(y_i \mid x_i; \theta)
            = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y_i - \theta^T x_i)^2}{2\sigma^2} \right)
            = \left( \frac{1}{\sqrt{2\pi}\,\sigma} \right)^n \exp\left( -\frac{\sum_{i=1}^{n} (y_i - \theta^T x_i)^2}{2\sigma^2} \right)
• The log-likelihood is:
  \ell(\theta) = \log L(\theta) = n \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{i=1}^{n} (y_i - \theta^T x_i)^2
• Recall that J(\theta) = \frac{1}{2} \sum_{i=1}^{n} (x_i^T \theta - y_i)^2
• Maximizing \ell(\theta) is therefore equivalent to minimizing J(\theta)

Ridge Regression vs Lasso
• Ridge regression penalizes the squared \ell_2 norm of \beta; Lasso penalizes its \ell_1 norm
• Lasso (\ell_1 penalty) results in sparse solutions, i.e. vectors with more zero coordinates
• Good for high-dimensional problems: you don't have to store all the coordinates!
• (Both estimators are sketched in code at the end of these notes)
  [Figure: level sets of J(\beta) in the (\beta_1, \beta_2) plane, together with the sets of \beta with constant \ell_1 norm and with constant \ell_2 norm; the solutions are marked where the constraint sets touch the level sets]

Bayesian Interpretation
• Ridge regression
  • Gaussian prior: the prior belief that \beta is Gaussian with zero mean biases the solution toward "small" \beta
• Lasso regression
  • Laplace prior: the prior belief that \beta is Laplace with zero mean biases the solution toward "small" \beta

Something More…
• LR with non-linear basis functions
  • LR does not mean we can only deal with linear relationships
  • "Linear" means linear in \theta; the features can be non-linear:
    \hat{y} = \sum_j \theta_j \phi_j(x)
    where the \phi_j(x) are fixed basis functions (and we define \phi_0(x) = 1)
• Weighting points
  • Locally weighted LR: higher weights for training examples closer to the query point (see the sketch at the end of these notes)
  • Robust regression: higher weights for the training examples that fit well

Artificial Neural Networks
• A set of connected perceptrons that makes predictions based on unit inputs
• Let's start with a single perceptron

Perceptron Learning
• Find the weight vector w that minimizes the sum of squared training errors
• How? Gradient descent!
• What's the decision boundary? Is it non-linear? Why?

Multi-layer Neural Networks
• Highly flexible: non-linear decision boundary
  [Figure: a feed-forward network mapping independent variables (inputs Age = 34, Gender = 2, Stage = 4) through weights into a hidden layer of summation units, and on to the dependent variable (output), the prediction "probability of being alive" = 0.6]

How to learn?
• Backpropagation, an iterative method (a from-scratch sketch appears at the end of these notes)
• For each example:
  • Perform forward propagation
  • Start from the output layer
  • Compute the gradient of each node with respect to its parents
  • Update each weight based on its gradient
• Convergence
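To make the batch LMS update and the normal equations concrete, here is a minimal NumPy sketch. The synthetic data, learning rate, and iteration count are made up for illustration and are not from the slides.

```python
import numpy as np

# Synthetic data (assumed): n examples, k features, with a column of
# ones prepended so theta_0 acts as the intercept.
rng = np.random.default_rng(0)
n, k = 100, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, k))])
true_theta = np.array([1.0, 2.0, -0.5, 0.3])
y = X @ true_theta + rng.normal(scale=0.1, size=n)

# Batch gradient descent on J(theta) = 1/2 * sum_i (x_i^T theta - y_i)^2,
# i.e. theta_j <- theta_j + alpha * sum_i (y_i - x_i^T theta) * x_ij
theta = np.zeros(k + 1)
alpha = 0.005                      # illustrative step size
for _ in range(1000):
    theta += alpha * X.T @ (y - X @ theta)

# Closed form via the normal equations: X^T X theta = X^T y
theta_closed = np.linalg.solve(X.T @ X, X.T @ y)

print(np.allclose(theta, theta_closed, atol=1e-3))  # the two should agree
```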
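The next sketch contrasts ridge and lasso on a sparse ground truth. Ridge has a closed form; lasso does not, so the sketch uses proximal gradient descent (ISTA) with soft-thresholding, a standard solver that the slides do not mention. The data, penalty strength, and iteration count are assumed values.

```python
import numpy as np

# Sparse ground truth (assumed): most true coefficients are exactly zero.
rng = np.random.default_rng(1)
n, d = 50, 10
X = rng.normal(size=(n, d))
beta_true = np.zeros(d)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + rng.normal(scale=0.1, size=n)

lam = 5.0   # assumed penalty strength

# Ridge: minimize J(beta) + (lam/2) * ||beta||_2^2 has a closed form
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def soft_threshold(v, t):
    # proximal operator of the l1 norm: shrinks coordinates toward zero
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# Lasso: minimize J(beta) + lam * ||beta||_1 via proximal gradient (ISTA)
beta_lasso = np.zeros(d)
step = 1.0 / np.linalg.norm(X, ord=2) ** 2   # 1 / Lipschitz constant of grad J
for _ in range(5000):
    grad = X.T @ (X @ beta_lasso - y)
    beta_lasso = soft_threshold(beta_lasso - step * grad, step * lam)

print(np.round(beta_ridge, 3))   # all coordinates shrunk but nonzero
print(np.round(beta_lasso, 3))   # irrelevant coordinates driven exactly to zero
```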
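For locally weighted LR, one common concrete choice (assumed here; the slides do not specify a kernel) is a Gaussian weight over the distance to the query point, which turns the fit into a weighted least-squares problem:

```python
import numpy as np

def lwlr_predict(x_query, X, y, tau=0.5):
    """Predict at x_query by fitting a linear model whose training examples
    are weighted by closeness to the query point; tau (the kernel
    bandwidth) is an assumed hyperparameter."""
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2.0 * tau ** 2))
    WX = X * w[:, None]                            # row-weighted design matrix
    theta = np.linalg.solve(WX.T @ X, WX.T @ y)    # weighted normal equations
    return x_query @ theta
```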
View Full Document
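Finally, a from-scratch sketch of backpropagation for a one-hidden-layer network of sigmoid units, trained on XOR (which no single perceptron can separate). The architecture, learning rate, and iteration count are illustrative choices, not from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_xor = np.array([[0], [1], [1], [0]], dtype=float)  # not linearly separable

W1 = rng.normal(size=(2, 4))   # input -> hidden weights
b1 = np.zeros(4)
W2 = rng.normal(size=(4, 1))   # hidden -> output weights
b2 = np.zeros(1)
alpha = 0.5                    # illustrative learning rate

for _ in range(10000):
    # forward propagation
    h = sigmoid(X_xor @ W1 + b1)            # hidden activations
    out = sigmoid(h @ W2 + b2)              # network output
    # backward pass: start from the output layer (squared-error loss) ...
    d_out = (out - y_xor) * out * (1 - out)
    # ... then propagate the gradient back to the hidden layer
    d_h = (d_out @ W2.T) * h * (1 - h)
    # update each weight based on its gradient
    W2 -= alpha * h.T @ d_out
    b2 -= alpha * d_out.sum(axis=0)
    W1 -= alpha * X_xor.T @ d_h
    b1 -= alpha * d_h.sum(axis=0)

print(out.round(2))   # typically close to [[0], [1], [1], [0]] after training
```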