UT Dallas CS 6375 - 01 - cs229-notes1

CS229 Lecture notes
Andrew Ng

Supervised learning

Let's start by talking about a few examples of supervised learning problems. Suppose we have a dataset giving the living areas and prices of 47 houses from Portland, Oregon:

    Living area (feet^2)   Price (1000$s)
    2104                   400
    1600                   330
    2400                   369
    1416                   232
    3000                   540
    ...                    ...

We can plot this data:

[Figure: scatter plot of housing prices; horizontal axis is living area in square feet, vertical axis is price in $1000s.]

Given data like this, how can we learn to predict the prices of other houses in Portland, as a function of the size of their living areas?

To establish notation for future use, we'll use $x^{(i)}$ to denote the "input" variables (living area in this example), also called input features, and $y^{(i)}$ to denote the "output" or target variable that we are trying to predict (price). A pair $(x^{(i)}, y^{(i)})$ is called a training example, and the dataset that we'll be using to learn, a list of m training examples $\{(x^{(i)}, y^{(i)});\ i = 1, \ldots, m\}$, is called a training set. Note that the superscript "(i)" in the notation is simply an index into the training set, and has nothing to do with exponentiation. We will also use $X$ to denote the space of input values, and $Y$ the space of output values. In this example, $X = Y = \mathbb{R}$.

To describe the supervised learning problem slightly more formally, our goal is, given a training set, to learn a function $h : X \mapsto Y$ so that $h(x)$ is a "good" predictor for the corresponding value of y. For historical reasons, this function h is called a hypothesis. Seen pictorially, the process is therefore like this:

[Diagram: the training set is fed to a learning algorithm, which outputs the hypothesis h; a new input x (living area of a house) is fed to h, which outputs the predicted y (predicted price of the house).]

When the target variable that we're trying to predict is continuous, such as in our housing example, we call the learning problem a regression problem. When y can take on only a small number of discrete values (such as if, given the living area, we wanted to predict whether a dwelling is a house or an apartment, say), we call it a classification problem.

Part I
Linear Regression

To make our housing example more interesting, let's consider a slightly richer dataset in which we also know the number of bedrooms in each house:

    Living area (feet^2)   #bedrooms   Price (1000$s)
    2104                   3           400
    1600                   3           330
    2400                   3           369
    1416                   2           232
    3000                   4           540
    ...                    ...         ...

Here, the x's are two-dimensional vectors in $\mathbb{R}^2$. For instance, $x_1^{(i)}$ is the living area of the i-th house in the training set, and $x_2^{(i)}$ is its number of bedrooms. (In general, when designing a learning problem, it will be up to you to decide what features to choose, so if you are out in Portland gathering housing data, you might also decide to include other features such as whether each house has a fireplace, the number of bathrooms, and so on. We'll say more about feature selection later, but for now let's take the features as given.)

To perform supervised learning, we must decide how we're going to represent functions/hypotheses h in a computer. As an initial choice, let's say we decide to approximate y as a linear function of x:

$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$

Here, the $\theta_i$'s are the parameters (also called weights) parameterizing the space of linear functions mapping from $X$ to $Y$. When there is no risk of confusion, we will drop the $\theta$ subscript in $h_\theta(x)$, and write it more simply as $h(x)$.
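As a quick illustration of this hypothesis (not part of the original notes), the sketch below evaluates $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$ on the rows of the table above; the particular values of theta, and the helper function h, are made up purely for illustration.

```python
import numpy as np

# Features from the table above: [living area (ft^2), #bedrooms]
X = np.array([[2104, 3],
              [1600, 3],
              [2400, 3],
              [1416, 2],
              [3000, 4]], dtype=float)

# Hypothetical parameters [theta_0, theta_1, theta_2]; values chosen only for illustration.
theta = np.array([50.0, 0.12, 25.0])

def h(theta, x):
    """Linear hypothesis h_theta(x) = theta_0 + theta_1 * x_1 + theta_2 * x_2."""
    return theta[0] + theta[1] * x[0] + theta[2] * x[1]

for x in X:
    print(f"living area {x[0]:.0f} ft^2, {x[1]:.0f} bedrooms -> predicted price {h(theta, x):.1f} ($1000s)")
```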
To simplify our notation, we also introduce the convention of letting $x_0 = 1$ (this is the intercept term), so that

$h(x) = \sum_{i=0}^{n} \theta_i x_i = \theta^T x ,$

where on the right-hand side above we are viewing $\theta$ and $x$ both as vectors, and here n is the number of input variables (not counting $x_0$).

Now, given a training set, how do we pick, or learn, the parameters $\theta$? One reasonable method seems to be to make h(x) close to y, at least for the training examples we have. To formalize this, we will define a function that measures, for each value of the $\theta$'s, how close the $h(x^{(i)})$'s are to the corresponding $y^{(i)}$'s. We define the cost function:

$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 .$

If you've seen linear regression before, you may recognize this as the familiar least-squares cost function that gives rise to the ordinary least squares regression model. Whether or not you have seen it previously, let's keep going, and we'll eventually show this to be a special case of a much broader family of algorithms.

1  LMS algorithm

We want to choose $\theta$ so as to minimize $J(\theta)$. To do so, let's use a search algorithm that starts with some "initial guess" for $\theta$, and that repeatedly changes $\theta$ to make $J(\theta)$ smaller, until hopefully we converge to a value of $\theta$ that minimizes $J(\theta)$. Specifically, let's consider the gradient descent algorithm, which starts with some initial $\theta$, and repeatedly performs the update:

$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) .$

(This update is simultaneously performed for all values of $j = 0, \ldots, n$.) Here, $\alpha$ is called the learning rate. This is a very natural algorithm that repeatedly takes a step in the direction of steepest decrease of J.

In order to implement this algorithm, we have to work out what the partial derivative term on the right-hand side is. Let's first work it out for the case where we have only one training example (x, y), so that we can neglect the sum in the definition of J. We have:

$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{\partial}{\partial \theta_j} \frac{1}{2} \left( h_\theta(x) - y \right)^2$
$= 2 \cdot \frac{1}{2} \left( h_\theta(x) - y \right) \cdot \frac{\partial}{\partial \theta_j} \left( h_\theta(x) - y \right)$
$= \left( h_\theta(x) - y \right) \cdot \frac{\partial}{\partial \theta_j} \left( \sum_{i=0}^{n} \theta_i x_i - y \right)$
$= \left( h_\theta(x) - y \right) x_j$

For a single training example, this gives the update rule:

$\theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)} .$

This rule is called the LMS update rule (LMS stands for "least mean squares"), and is also known as the Widrow-Hoff learning rule. It has several properties that seem natural and intuitive. For instance, the magnitude of the update is proportional to the error term $\left( y^{(i)} - h_\theta(x^{(i)}) \right)$; thus, if we encounter a training example on which our prediction nearly matches the actual value of $y^{(i)}$, there is little need to change the parameters; in contrast, a larger change to the parameters will be made if our prediction $h_\theta(x^{(i)})$ has a large error (i.e., if it is very far from $y^{(i)}$).

We derived the LMS rule for the case of a single training example. There are two ways to modify this method for a training set of more than one example. The first is to replace it with the following algorithm:

    Repeat until convergence {
        $\theta_j := \theta_j + \alpha \sum_{i=1}^{m} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}$    (for every j)
    }

The reader can easily verify that the quantity in the summation in the update rule above is just $\frac{\partial J(\theta)}{\partial \theta_j}$ (for the original definition of J), so this is simply gradient descent on the original cost function J.
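To make the batch update above concrete, here is a minimal sketch (my own illustration, not code from the notes) that runs this rule on the five-house dataset from the table above, using the $x_0 = 1$ intercept convention. The feature standardization, the learning rate $\alpha = 0.1$, and the fixed number of passes are assumptions chosen so that a single fixed step size converges; the notes have not discussed feature scaling at this point.

```python
import numpy as np

# Training set from the table above: living area (ft^2), #bedrooms -> price ($1000s).
X_raw = np.array([[2104, 3],
                  [1600, 3],
                  [2400, 3],
                  [1416, 2],
                  [3000, 4]], dtype=float)
y = np.array([400, 330, 369, 232, 540], dtype=float)

# Standardize the features so a single fixed learning rate behaves well
# (a practical assumption of this sketch, not something introduced in the notes yet).
X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

# Prepend the x_0 = 1 intercept column, so h_theta(x) = theta^T x.
X = np.hstack([np.ones((X.shape[0], 1)), X])

def J(theta):
    """Least-squares cost J(theta) = 1/2 * sum_i (h_theta(x^(i)) - y^(i))^2."""
    residuals = X @ theta - y
    return 0.5 * residuals @ residuals

alpha = 0.1                   # learning rate (illustrative choice)
theta = np.zeros(X.shape[1])  # initial guess theta = 0

for _ in range(500):          # "repeat until convergence", here a fixed number of passes
    # Batch LMS update: theta_j := theta_j + alpha * sum_i (y^(i) - h_theta(x^(i))) * x_j^(i)
    theta = theta + alpha * (X.T @ (y - X @ theta))

print("theta:", theta)
print("J(theta):", J(theta))
```

With only five examples this converges quickly; because the features are mean-centered here, the learned theta_0 is simply the average price, and the other two components weight living area and bedrooms.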

