CMU CS 10701 - Lecture

Machine Learning 10-701/15-781, Fall 2008
Introduction to Regression
Eric Xing
Lecture 4, January 28, 2006
Reading: Chap. 3, CB

Functional Approximation, cont.

Machine learning for apartment hunting
- Now you have moved to Pittsburgh!! And you want to find the most reasonably priced apartment satisfying your needs: square footage, # of bedrooms, distance to campus, ...

  Living area (ft²) | # bedroom | Rent ($)
  230               | 1         | 600
  506               | 2         | 1000
  433               | 2         | 1100
  109               | 1         | 500
  150               | 1         | ?
  270               | 1.5       | ?

The learning problem
- Features: living area, distance to campus, # bedroom, ...; denote them as x = [x_1, x_2, ..., x_k]
- Target: rent; denoted as y
- Training set:
  $$X = \begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_n^T \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1k} \\ x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nk} \end{bmatrix}, \qquad \vec{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$
  [Figures: scatter plots of rent vs. living area, and of rent vs. living area and location; omitted from this preview.]

Linear Regression
- Assume that Y (the target) is a linear function of X (the features), e.g.:
  $$\hat{y} = \theta_0 + \theta_1 x^{(1)} + \theta_2 x^{(2)}$$
- Let us assume a vacuous "feature" $x^{(0)} = 1$ (this is the intercept term; why?), and define the feature vector to be $\mathbf{x} = [x^{(0)}, x^{(1)}, \ldots, x^{(k)}]$
- Then we have the following general representation of the linear function:
  $$\hat{y} = \theta^T \mathbf{x}$$
- Our goal is to pick the optimal θ. How?
- We seek the θ that minimizes the following cost function:
  $$J(\theta) = \frac{1}{2}\sum_{i=1}^{n}\bigl(\hat{y}(\mathbf{x}_i) - y_i\bigr)^2$$

The Least-Mean-Square (LMS) method
- The cost function:
  $$J(\theta) = \frac{1}{2}\sum_{i=1}^{n}\bigl(\mathbf{x}_i^T\theta - y_i\bigr)^2$$
- Consider a gradient descent algorithm:
  $$\theta_j^{t+1} = \theta_j^{t} - \alpha\,\frac{\partial}{\partial\theta_j} J(\theta^t)$$
- Now we have the following descent rule:
  $$\theta_j^{t+1} = \theta_j^{t} + \alpha\sum_{i=1}^{n}\bigl(y_i - \mathbf{x}_i^T\theta^t\bigr)\,x_{i,j}$$
- For a single training point, we have:
  $$\theta_j^{t+1} = \theta_j^{t} + \alpha\,\bigl(y_i - \mathbf{x}_i^T\theta^t\bigr)\,x_{i,j}$$
- This is known as the LMS update rule, or the Widrow-Hoff learning rule
- It is actually a "stochastic", "coordinate" descent algorithm
- It can be used as an online algorithm

The Least-Mean-Square (LMS) method: steepest descent
- Note that
  $$\nabla_\theta J = \left[\frac{\partial J}{\partial\theta_1}, \ldots, \frac{\partial J}{\partial\theta_k}\right]^T = -\sum_{i=1}^{n}\bigl(y_i - \mathbf{x}_i^T\theta\bigr)\,\mathbf{x}_i\,,$$
  so the full-vector update is
  $$\theta^{t+1} = \theta^{t} + \alpha\sum_{i=1}^{n}\bigl(y_i - \mathbf{x}_i^T\theta^t\bigr)\,\mathbf{x}_i$$
- This is a batch gradient descent algorithm

Some matrix derivatives
- For $f:\mathbb{R}^{m\times n}\to\mathbb{R}$, define:
  $$\nabla_A f(A) = \begin{bmatrix} \frac{\partial f}{\partial A_{11}} & \cdots & \frac{\partial f}{\partial A_{1n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial A_{m1}} & \cdots & \frac{\partial f}{\partial A_{mn}} \end{bmatrix}$$
- Trace: $\operatorname{tr}A = \sum_{i=1}^{n} A_{ii}$, $\operatorname{tr}a = a$ for a scalar $a$, and $\operatorname{tr}ABC = \operatorname{tr}CAB = \operatorname{tr}BCA$
- Some facts about matrix derivatives (without proof):
  $$\nabla_A \operatorname{tr}AB = B^T, \qquad \nabla_A \operatorname{tr}ABA^TC = CAB + C^TAB^T, \qquad \nabla_A |A| = |A|\,\bigl(A^{-1}\bigr)^T$$

The normal equations
- Write the cost function in matrix form:
  $$J(\theta) = \frac{1}{2}\sum_{i=1}^{n}\bigl(\mathbf{x}_i^T\theta - y_i\bigr)^2 = \frac{1}{2}\,(X\theta - \vec{y})^T(X\theta - \vec{y}) = \frac{1}{2}\bigl(\theta^TX^TX\theta - \theta^TX^T\vec{y} - \vec{y}^TX\theta + \vec{y}^T\vec{y}\bigr)\,,$$
  where $X$ stacks the training inputs $\mathbf{x}_i^T$ as rows and $\vec{y} = [y_1, \ldots, y_n]^T$.
- To minimize J(θ), take the derivative and set it to zero:
  $$\nabla_\theta J = \frac{1}{2}\,\nabla_\theta\bigl(\theta^TX^TX\theta - \theta^TX^T\vec{y} - \vec{y}^TX\theta + \vec{y}^T\vec{y}\bigr) = X^TX\theta - X^T\vec{y} = 0\,,$$
  which yields the normal equations
  $$X^TX\theta = X^T\vec{y}$$
  and hence
  $$\theta^* = \bigl(X^TX\bigr)^{-1}X^T\vec{y}$$
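To make the two fitting strategies concrete before the recap, here is a minimal numpy sketch (not part of the original slides) that estimates θ both by the batch gradient-descent rule above and by the normal equations; the synthetic data, the learning rate alpha, and the iteration count are illustrative assumptions.

```python
import numpy as np

# Synthetic regression problem: n examples, k features, plus an intercept column x0 = 1.
rng = np.random.default_rng(0)
n, k = 100, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, k))])  # n x (k+1) design matrix
theta_true = np.array([2.0, -1.0, 0.5, 3.0])               # arbitrary "true" parameters
y = X @ theta_true + 0.1 * rng.normal(size=n)              # targets with a little noise

# Batch gradient descent on J(theta) = 1/2 * sum_i (x_i^T theta - y_i)^2:
#   theta <- theta + alpha * sum_i (y_i - x_i^T theta) x_i
alpha = 1e-3                                               # learning rate (illustrative)
theta_gd = np.zeros(k + 1)
for _ in range(5000):
    residual = y - X @ theta_gd                            # y_i - x_i^T theta for all i
    theta_gd = theta_gd + alpha * (X.T @ residual)         # steepest-descent step

# Normal equations: theta* = (X^T X)^{-1} X^T y.
# Solving the linear system avoids forming the inverse explicitly.
theta_ne = np.linalg.solve(X.T @ X, X.T @ y)

print(theta_gd)   # both estimates should agree closely with theta_true
print(theta_ne)
```

Both routes minimize the same quadratic cost, so they should agree up to the tolerance of the gradient-descent iterations.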
A recap
- LMS update rule: $\theta_j^{t+1} = \theta_j^{t} + \alpha\,(y_i - \mathbf{x}_i^T\theta^t)\,x_{i,j}$
  Pros: online, low per-step cost. Cons: coordinate-wise; may converge slowly.
- Steepest descent: $\theta^{t+1} = \theta^{t} + \alpha\sum_{i=1}^{n}(y_i - \mathbf{x}_i^T\theta^t)\,\mathbf{x}_i$
  Pros: fast-converging, easy to implement. Cons: a batch algorithm.
- Normal equations: $\theta^* = (X^TX)^{-1}X^T\vec{y}$
  Pros: a single-shot algorithm; easiest to implement. Cons: needs the pseudo-inverse $(X^TX)^{-1}$, which is expensive and can run into numerical issues (e.g., a singular matrix).

Geometric interpretation of LMS
- The predictions on the training data are:
  $$\hat{\vec{y}} = X\theta^* = X\bigl(X^TX\bigr)^{-1}X^T\vec{y}$$
- Note that
  $$\hat{\vec{y}} - \vec{y} = \bigl(X(X^TX)^{-1}X^T - I\bigr)\vec{y}$$
  and
  $$X^T\bigl(\hat{\vec{y}} - \vec{y}\bigr) = \bigl(X^TX(X^TX)^{-1}X^T - X^T\bigr)\vec{y} = 0\,,$$
  so $\hat{\vec{y}}$ is the orthogonal projection of $\vec{y}$ onto the space spanned by the columns of X.

Probabilistic interpretation of LMS
- Let us assume that the target variable and the inputs are related by the equation
  $$y_i = \theta^T\mathbf{x}_i + \varepsilon_i\,,$$
  where ε_i is an error term capturing unmodeled effects or random noise.
- Now assume that ε_i follows a Gaussian distribution N(0, σ). Then we have:
  $$p(y_i \mid \mathbf{x}_i; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(y_i - \theta^T\mathbf{x}_i)^2}{2\sigma^2}\right)$$
- By the independence assumption:
  $$L(\theta) = \prod_{i=1}^{n} p(y_i \mid \mathbf{x}_i; \theta) = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{\!n}\exp\!\left(-\frac{\sum_{i=1}^{n}(y_i - \theta^T\mathbf{x}_i)^2}{2\sigma^2}\right)$$

Probabilistic interpretation of LMS, cont.
- Hence the log-likelihood is:
  $$\ell(\theta) = n\log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2}\cdot\frac{1}{2}\sum_{i=1}^{n}\bigl(y_i - \theta^T\mathbf{x}_i\bigr)^2$$
- Do you recognize the last term? Yes, it is
  $$J(\theta) = \frac{1}{2}\sum_{i=1}^{n}\bigl(\mathbf{x}_i^T\theta - y_i\bigr)^2$$
- Thus, under the independence assumption, LMS is equivalent to MLE of θ!

Beyond basic LR
- LR with non-linear basis functions
- Locally weighted linear regression
- Regression trees and multilinear interpolation

LR with non-linear basis functions
- LR does not mean we can only deal with linear relationships
- We are free to design (non-linear) features under LR:
  $$y = \theta_0 + \sum_{j=1}^{m}\theta_j\,\phi_j(x) = \theta^T\phi(x)\,,$$
  where the φ_j(x) are fixed basis functions (and we define φ_0(x) = 1).
- Example: polynomial regression:
  $$\phi(x) := [1, x, x^2, x^3]$$
- We will be concerned with estimating (distributions over) the weights θ and choosing the model order M.

Basis functions
- There are many possible basis functions, e.g.:
  - Polynomial: $\phi_j(x) = x^{j-1}$
  - Radial basis functions: $\phi_j(x) = \exp\!\left(-\frac{(x-\mu_j)^2}{2s^2}\right)$
  - Sigmoidal: $\phi_j(x) = \sigma\!\left(\frac{x-\mu_j}{s}\right)$
  - Splines, Fourier, wavelets, etc.
  (A small feature-expansion sketch appears at the end of this preview.)

1D and 2D RBFs
- [Figures: a 1D RBF basis and the fit obtained with it; omitted from this preview.]

Good and bad RBFs
- [Figures: a good 2D RBF and two bad 2D RBFs; omitted from this preview.]

Locally weighted linear regression
- Overfitting and underfitting:
  $$y = \theta_0 + \theta_1 x\,, \qquad y = \theta_0 + \theta_1 x + \theta_2 x^2\,, \qquad y = \sum_{j=0}^{5}\theta_j x^j$$

Bias and variance
- We define the bias of a model to be the expected generalization error even if we were to train it on a very (say, infinitely) large training set.
- By fitting "spurious" patterns in the training set, we might again obtain a model with large generalization error. In this case, we say the model has large variance.

Locally weighted linear regression
- The algorithm: instead of minimizing
  $$J(\theta) = \frac{1}{2}\sum_{i=1}^{n}\bigl(\mathbf{x}_i^T\theta - y_i\bigr)^2\,,$$
  we now fit θ to minimize
  $$J(\theta) = \frac{1}{2}\sum_{i=1}^{n} w_i\,\bigl(\mathbf{x}_i^T\theta - y_i\bigr)^2.$$
- Where do the w_i come from?
  $$w_i = \exp\!\left(-\frac{(\mathbf{x}_i - \mathbf{x})^2}{2\tau^2}\right)\,,$$
  where x is the query point for which we'd like to know the corresponding y.
  → Essentially, we put higher weights on (errors on) training examples that are close to the query point than on those that are further away.
- Do we also have a probabilistic interpretation here (as we did for LR)?

Parametric vs. non-parametric
- Locally weighted linear regression is yet another example we are running into of a non-parametric algorithm. The …
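To illustrate the weighting scheme just described, here is a minimal numpy sketch of locally weighted linear regression (my own illustration, not from the slides): for each query point it solves the weighted normal equations $(X^TWX)\theta = X^TW\vec{y}$. The bandwidth tau, the toy data, and the function name lwr_predict are assumptions for the example.

```python
import numpy as np

def lwr_predict(x_query, X, y, tau=0.5):
    """Locally weighted linear regression prediction at a single query point.

    Minimizes 1/2 * sum_i w_i (x_i^T theta - y_i)^2 with Gaussian weights
    w_i = exp(-||x_i - x_query||^2 / (2 tau^2)), then returns x_query^T theta.
    X is an n x d design matrix whose first column is the intercept feature 1.
    """
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2.0 * tau ** 2))
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)   # weighted normal equations
    return x_query @ theta

# Toy 1D example: a noisy sine curve, with the intercept feature prepended.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0 * np.pi, 100)
y = np.sin(x) + 0.1 * rng.normal(size=x.size)
X = np.column_stack([np.ones_like(x), x])               # features [1, x]

query = np.array([1.0, np.pi / 2.0])                    # query near the peak of sin(x)
print(lwr_predict(query, X, y, tau=0.5))                # should be close to 1.0
```

Because θ is re-solved for every query point, the fit adapts locally, which is also why the method is non-parametric in the sense raised on the last slide.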

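Returning to the non-linear basis functions above: once the inputs are mapped through φ, the same least-squares machinery applies unchanged. The sketch below (again an illustration, not from the slides) builds polynomial and Gaussian RBF features and fits θ by the normal equations; the centers μ_j, the width s, and the model order M are assumed values.

```python
import numpy as np

def polynomial_features(x, M):
    """phi_j(x) = x^(j-1): columns [1, x, x^2, ..., x^M]."""
    return np.vander(x, N=M + 1, increasing=True)

def rbf_features(x, centers, s):
    """phi_0(x) = 1 and phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))."""
    Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2.0 * s ** 2))
    return np.column_stack([np.ones_like(x), Phi])

# Toy non-linear target.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
y = np.sin(3.0 * x) + 0.1 * rng.normal(size=x.size)

# Fit y = theta^T phi(x) by the normal equations for each feature map.
Phi_poly = polynomial_features(x, M=3)                                # assumed model order
Phi_rbf = rbf_features(x, centers=np.linspace(-1.0, 1.0, 10), s=0.3)  # assumed centers/width
theta_poly = np.linalg.solve(Phi_poly.T @ Phi_poly, Phi_poly.T @ y)
theta_rbf = np.linalg.solve(Phi_rbf.T @ Phi_rbf, Phi_rbf.T @ y)

print(np.mean((Phi_poly @ theta_poly - y) ** 2))   # training error, polynomial basis
print(np.mean((Phi_rbf @ theta_rbf - y) ** 2))     # training error, RBF basis
```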
