Machine Learning 10-701/15-781, Fall 2008
Introduction to Regression
Eric Xing
Lecture 4, January 28, 2006
Reading: Chap. 3, CB

Functional Approximation, cont.

Machine learning for apartment hunting

- Now you've moved to Pittsburgh! And you want to find the most reasonably priced apartment satisfying your needs: square footage, number of bedrooms, distance to campus, ...

| Living area (ft²) | # bedrooms | Rent ($) |
|---|---|---|
| 230 | 1 | 600 |
| 506 | 2 | 1000 |
| 433 | 2 | 1100 |
| 109 | 1 | 500 |
| ... | ... | ... |
| 150 | 1 | ? |
| 270 | 1.5 | ? |

The learning problem

- Features: living area, distance to campus, # bedrooms, ...; denote these as $x = [x_1, x_2, \ldots, x_k]$.
- Target: rent; denoted as $y$.
- Training set: [figures: scatter plots of rent vs. living area, and rent vs. living area and location]

$$X = \begin{bmatrix} -\ \mathbf{x}_1\ - \\ -\ \mathbf{x}_2\ - \\ \vdots \\ -\ \mathbf{x}_n\ - \end{bmatrix} = \begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,k} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,k} \\ \vdots & \vdots & & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,k} \end{bmatrix}, \qquad \vec{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$

Linear Regression

- Assume that $Y$ (target) is a linear function of $X$ (features), e.g.:
$$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2$$
- Let's assume a vacuous "feature" $x_0 = 1$ (this is the intercept term — why?), and define the feature vector to be $x = [x_0, x_1, \ldots, x_k]$.
- Then we have the following general representation of the linear function:
$$\hat{y} = \theta^T \mathbf{x}$$
- Our goal is to pick the optimal $\hat{\theta}$. How?
- We seek the $\theta$ that minimizes the following cost function:
$$J(\theta) = \frac{1}{2}\sum_{i=1}^{n}\left(\hat{y}(\mathbf{x}_i) - y_i\right)^2$$

The Least-Mean-Square (LMS) method

- The cost function:
$$J(\theta) = \frac{1}{2}\sum_{i=1}^{n}\left(\mathbf{x}_i^T\theta - y_i\right)^2$$
- Consider a gradient descent algorithm:
$$\theta_j^{t+1} = \theta_j^t - \alpha\,\frac{\partial J(\theta)}{\partial \theta_j}\bigg|_{\theta^t}$$
- Now we have the following descent rule:
$$\theta_j^{t+1} = \theta_j^t + \alpha \sum_{i=1}^{n}\left(y_i - \mathbf{x}_i^T\theta^t\right) x_{i,j}$$
- For a single training point, we have:
$$\theta_j^{t+1} = \theta_j^t + \alpha\left(y_i - \mathbf{x}_i^T\theta^t\right) x_{i,j}$$
- This is known as the LMS update rule, or the Widrow-Hoff learning rule.
- This is actually a "stochastic", "coordinate" descent algorithm.
- It can be used as an online algorithm.

Steepest descent

- Note that:
$$\nabla_\theta J = \left[\frac{\partial J}{\partial \theta_1}, \ldots, \frac{\partial J}{\partial \theta_k}\right]^T = -\sum_{i=1}^{n}\left(y_i - \mathbf{x}_i^T\theta\right)\mathbf{x}_i$$
- which gives the update
$$\theta^{t+1} = \theta^t + \alpha \sum_{i=1}^{n}\left(y_i - \mathbf{x}_i^T\theta^t\right)\mathbf{x}_i$$
- This is a batch gradient descent algorithm.

Some matrix derivatives

- For $f: \mathbb{R}^{m \times n} \to \mathbb{R}$, define:
$$\nabla_A f(A) = \begin{bmatrix} \frac{\partial f}{\partial A_{11}} & \cdots & \frac{\partial f}{\partial A_{1n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial A_{m1}} & \cdots & \frac{\partial f}{\partial A_{mn}} \end{bmatrix}$$
- Trace: $\operatorname{tr} A = \sum_{i=1}^{n} A_{ii}$, and $\operatorname{tr} a = a$ for a scalar $a$.
- Some facts about matrix derivatives (without proof):
$$\operatorname{tr} ABC = \operatorname{tr} CAB = \operatorname{tr} BCA$$
$$\nabla_A \operatorname{tr} AB = B^T$$
$$\nabla_A \operatorname{tr} ABA^TC = CAB + C^TAB^T$$
$$\nabla_A |A| = |A|\left(A^{-1}\right)^T$$

The normal equations

- Write the cost function in matrix form (with $X$ and $\vec{y}$ as defined above):
$$J(\theta) = \frac{1}{2}\sum_{i=1}^{n}\left(\mathbf{x}_i^T\theta - y_i\right)^2 = \frac{1}{2}\left(X\theta - \vec{y}\right)^T\left(X\theta - \vec{y}\right) = \frac{1}{2}\left(\theta^TX^TX\theta - \theta^TX^T\vec{y} - \vec{y}^TX\theta + \vec{y}^T\vec{y}\right)$$
- To minimize $J(\theta)$, take the derivative (using the trace facts above) and set it to zero:
$$\nabla_\theta J = \frac{1}{2}\nabla_\theta \operatorname{tr}\left(\theta^TX^TX\theta - 2\vec{y}^TX\theta + \vec{y}^T\vec{y}\right) = \frac{1}{2}\left(2X^TX\theta - 2X^T\vec{y}\right) = 0$$
- This gives the normal equations:
$$X^TX\theta = X^T\vec{y}$$
- and hence
$$\theta^* = \left(X^TX\right)^{-1}X^T\vec{y}$$
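Before the recap of all three approaches, here is a minimal NumPy sketch of the two descent variants above (batch steepest descent and the stochastic LMS rule). The synthetic data, the learning rate `alpha`, and the iteration counts are illustrative assumptions, not from the slides:

```python
import numpy as np

# Hypothetical synthetic data: n examples, k features, plus the intercept column x0 = 1.
rng = np.random.default_rng(0)
n, k = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # n x (k+1) design matrix
theta_true = np.array([1.0, 2.0, -3.0])
y = X @ theta_true + rng.normal(scale=0.1, size=n)

def steepest_descent(X, y, alpha=1e-3, iters=1000):
    """Batch rule: theta <- theta + alpha * sum_i (y_i - x_i^T theta) x_i."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta += alpha * X.T @ (y - X @ theta)  # one full-batch gradient step
    return theta

def lms(X, y, alpha=1e-3, epochs=200):
    """Stochastic (Widrow-Hoff) rule: update on one training point at a time."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in range(X.shape[0]):
            theta += alpha * (y[i] - X[i] @ theta) * X[i]
    return theta
```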
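Continuing the same sketch, the closed-form normal-equations solution. Using `np.linalg.solve` avoids forming the inverse explicitly; `np.linalg.lstsq` (a pseudo-inverse based solver) is one standard way to cope with the singular $X^TX$ case mentioned in the recap below:

```python
def normal_equations(X, y):
    """theta* = (X^T X)^{-1} X^T y, solved without forming the inverse explicitly."""
    # If X^T X is singular or ill-conditioned, prefer:
    #   np.linalg.lstsq(X, y, rcond=None)[0]
    return np.linalg.solve(X.T @ X, X.T @ y)

# All three estimates should agree (up to optimization error and noise):
print(steepest_descent(X, y))
print(lms(X, y))
print(normal_equations(X, y))
```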
A recap

- LMS update rule: $\theta_j^{t+1} = \theta_j^t + \alpha\left(y_i - \mathbf{x}_i^T\theta^t\right)x_{i,j}$
  - Pros: online, low per-step cost.
  - Cons: coordinate descent, maybe slow to converge.
- Steepest descent: $\theta^{t+1} = \theta^t + \alpha\sum_{i=1}^{n}\left(y_i - \mathbf{x}_i^T\theta^t\right)\mathbf{x}_i$
  - Pros: fast-converging, easy to implement.
  - Cons: a batch algorithm.
- Normal equations: $\theta^* = \left(X^TX\right)^{-1}X^T\vec{y}$
  - Pros: a single-shot algorithm! Easiest to implement.
  - Cons: need to compute the pseudo-inverse $(X^TX)^{-1}$, which is expensive and has numerical issues (e.g., when $X^TX$ is singular).

Geometric interpretation of LMS

- The predictions on the training data are:
$$\hat{\vec{y}} = X\theta^* = X\left(X^TX\right)^{-1}X^T\vec{y}$$
- Note that
$$\hat{\vec{y}} - \vec{y} = \left(X\left(X^TX\right)^{-1}X^T - I\right)\vec{y}$$
and
$$X^T\left(\hat{\vec{y}} - \vec{y}\right) = \left(X^TX\left(X^TX\right)^{-1}X^T - X^T\right)\vec{y} = 0\,!!$$
- So $\hat{\vec{y}}$ is the orthogonal projection of $\vec{y}$ onto the space spanned by the columns of $X$.

Probabilistic interpretation of LMS

- Let us assume that the target variable and the inputs are related by the equation
$$y_i = \theta^T\mathbf{x}_i + \varepsilon_i,$$
where $\varepsilon_i$ is an error term capturing unmodeled effects or random noise.
- Now assume that $\varepsilon_i$ follows a Gaussian $N(0, \sigma^2)$; then we have:
$$p(y_i \mid \mathbf{x}_i; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{\left(y_i - \theta^T\mathbf{x}_i\right)^2}{2\sigma^2}\right)$$
- By the independence assumption:
$$L(\theta) = \prod_{i=1}^{n} p(y_i \mid \mathbf{x}_i; \theta) = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{n}\exp\!\left(-\frac{\sum_{i=1}^{n}\left(y_i - \theta^T\mathbf{x}_i\right)^2}{2\sigma^2}\right)$$
- Hence the log-likelihood is:
$$\ell(\theta) = n\log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2}\cdot\frac{1}{2}\sum_{i=1}^{n}\left(y_i - \theta^T\mathbf{x}_i\right)^2$$
- Do you recognize the last term? Yes, it is:
$$J(\theta) = \frac{1}{2}\sum_{i=1}^{n}\left(\mathbf{x}_i^T\theta - y_i\right)^2$$
- Maximizing $\ell(\theta)$ is therefore the same as minimizing $J(\theta)$: under the independence assumption, LMS is equivalent to MLE of $\theta$!

Beyond basic LR

- LR with non-linear basis functions
- Locally weighted linear regression
- Regression trees and multilinear interpolation

LR with non-linear basis functions

- LR does not mean we can only deal with linear relationships.
- We are free to design (non-linear) features under LR:
$$y = \theta_0 + \sum_{j=1}^{m}\theta_j\,\phi_j(x) = \theta^T\phi(x),$$
where the $\phi_j(x)$ are fixed basis functions (and we define $\phi_0(x) = 1$).
- Example: polynomial regression:
$$\phi(x) := \left[1, x, x^2, x^3\right]$$
- We will be concerned with estimating (distributions over) the weights $\theta$ and choosing the model order $M$. (A code sketch of basis-function regression appears at the end of these notes.)

Basis functions

- There are many basis functions, e.g.:
  - Polynomial: $\phi_j(x) = x^{j-1}$
  - Radial basis functions: $\phi_j(x) = \exp\!\left(-\dfrac{(x - \mu_j)^2}{2s^2}\right)$
  - Sigmoidal: $\phi_j(x) = \sigma\!\left(\dfrac{x - \mu_j}{s}\right)$
  - Splines, Fourier, wavelets, etc.

1D and 2D RBFs

[figures: a 1D RBF, and the fit after training; 2D RBFs]

Good and bad RBFs

[figures: a good 2D RBF; two bad 2D RBFs]

Locally weighted linear regression

- Overfitting and underfitting:
[figures: fits of $y = \theta_0 + \theta_1 x$, of $y = \theta_0 + \theta_1 x + \theta_2 x^2$, and of $y = \sum_{j=0}^{5}\theta_j x^j$]

Bias and variance

- We define the bias of a model to be the expected generalization error even if we were to fit it to a very (say, infinitely) large training set.
- By fitting "spurious" patterns in the training set, we might again obtain a model with large generalization error. In this case, we say the model has large variance.

Locally weighted linear regression

- The algorithm: instead of minimizing
$$J(\theta) = \frac{1}{2}\sum_{i=1}^{n}\left(\mathbf{x}_i^T\theta - y_i\right)^2,$$
we now fit $\theta$ to minimize
$$J(\theta) = \frac{1}{2}\sum_{i=1}^{n} w_i\left(\mathbf{x}_i^T\theta - y_i\right)^2.$$
- Where do the $w_i$'s come from?
$$w_i = \exp\!\left(-\frac{(\mathbf{x}_i - \mathbf{x})^2}{2\tau^2}\right),$$
where $\mathbf{x}$ is the query point for which we'd like to know the corresponding $y$.
- Essentially, we put higher weights on (errors on) training examples that are close to the query point than on those that are further away. (A code sketch appears at the end of these notes.)
- Do we also have a probabilistic interpretation here (as we did for LR)?

Parametric vs. non-parametric

- Locally weighted linear regression is the first example we are running into of a non-parametric algorithm.
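A minimal sketch of the basis-function idea from the "LR with non-linear basis functions" section above: the RBF features are non-linear in $x$, but the model stays linear in $\theta$, so ordinary least squares still applies. The 1D data, the centers $\mu_j$, and the width $s$ are illustrative assumptions:

```python
import numpy as np

# Hypothetical 1D data with a non-linear trend.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=50)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=50)

def rbf_features(x, centers, s=0.1):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), with phi_0(x) = 1 prepended."""
    Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))
    return np.column_stack([np.ones(len(x)), Phi])

centers = np.linspace(0.0, 1.0, 9)              # illustrative choice of mu_j
Phi = rbf_features(x, centers)
theta = np.linalg.lstsq(Phi, y, rcond=None)[0]  # same linear fit, new features
y_hat = rbf_features(np.array([0.5]), centers) @ theta
```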
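And a sketch of locally weighted linear regression on the same synthetic 1D data: each query point gets its own weighted least-squares fit, which is why the whole training set must be kept around (the non-parametric flavor noted above). The bandwidth `tau` is an illustrative assumption:

```python
def lwr_predict(x_query, x, y, tau=0.1):
    """Fit theta minimizing (1/2) sum_i w_i (x_i^T theta - y_i)^2 for one query."""
    Xd = np.column_stack([np.ones(len(x)), x])        # intercept + 1D feature
    w = np.exp(-(x - x_query) ** 2 / (2 * tau ** 2))  # Gaussian weights around the query
    XtW = Xd.T * w                                    # X^T W without building diag(w)
    theta = np.linalg.solve(XtW @ Xd, XtW @ y)        # weighted normal equations
    return np.array([1.0, x_query]) @ theta

y_at_half = lwr_predict(0.5, x, y)  # re-fits locally around the query x = 0.5
```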