CMU CS 10601 - Linear Classifiers and the Perceptron

Linear Classifiers and the Perceptron
William Cohen
February 4, 2008

1 Linear classifiers

Let's assume that every instance is an n-dimensional vector of real numbers x ∈ R^n, and that there are only two possible classes, y = +1 and y = −1, so every example is a pair (x, y). (Notation: I will use boldface to indicate vectors here, so x = ⟨x_1, ..., x_n⟩.) A linear classifier is a vector w that makes the prediction

    \hat{y} = \mathrm{sign}\left( \sum_{i=1}^{n} w_i x_i \right)

where sign(z) = +1 if z ≥ 0 and sign(z) = −1 if z < 0. (This is a little different from the usual definition, where sign(0) = 0, but I'd rather not have to deal with the question of what to predict when the weighted sum is exactly zero.) If you remember your linear algebra, this weighted sum of the x_i's is called the inner product of w and x, and it is usually written w · x, so the classifier can be written even more compactly as

    \hat{y} = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x})

Visually, x · w is (up to a scaling by the length of w) the length you get if you project x onto w (see Figure 1).

Figure 1: A geometric view of the inner product.

It might seem that representing examples as real-number vectors is somewhat constraining. It is fine if your attributes are numeric (e.g., "Temperature=72"), but what if you have an attribute "Outlook" with three possible discrete values, "Rainy", "Sunny", and "Cloudy"? One answer is to replace this single attribute with three binary attributes: one that is set to 1 when the outlook is rainy and zero otherwise; one that is set to 1 when the outlook is sunny and zero otherwise; and one that is set to 1 when the outlook is cloudy and zero otherwise. So a dataset like the one below would be converted to examples in R^4 as shown:

           Outlook  Temp  PlayTennis?
    Day1   Rainy    85    No    →  (⟨1, 0, 0, 85⟩, −1)
    Day2   Sunny    87    No    →  (⟨0, 1, 0, 87⟩, −1)
    Day3   Cloudy   75    Yes   →  (⟨0, 0, 1, 75⟩, +1)

Another answer, of course, is to abandon these discrete values and instead focus on numeric attributes: e.g., let "Outlook" be encoded as a real number representing the probability of rain, so that Day2 would be encoded as ⟨0.01, 87⟩, where the 0.01 means a 1/100 chance of rain.

However, encoding your examples as pure numeric vectors has one small problem. As I've defined it, a linear classifier is doomed to predict ŷ = +1 on a perfectly sunny day (Outlook=0) if the temperature is also zero: regardless of which weight vector w you pick, w · ⟨0, 0⟩ will be zero, and ŷ will be +1. Since zero-degree weather isn't conducive to playing tennis, no matter how clear it is, we can add one more trick to our encoding of examples: give each vector an extra dimension whose value is always one. Let's label that extra dimension 0; then

    \hat{y} = \mathrm{sign}\left( w_0 x_0 + \sum_{i=1}^{n} w_i x_i \right) = \mathrm{sign}\left( w_0 + \sum_{i=1}^{n} w_i x_i \right)        (1)

the second half of the equation holding because for every example x, x_0 = 1.

This trick gives our linear classifier a bit more expressive power, and we can still write the classifier using the super-compact notation ŷ = sign(w · x) if we like. The weight w_0 is sometimes called a bias term.

2 Naive Bayes is a linear classifier

How do you learn a linear classifier? Well, you already know one way. To keep things simple, in the discussion below I'll assume that x is not just a real-valued vector but a binary vector. You remember that Naive Bayes can be written as follows:

    \hat{y} = \mathrm{argmax}_y \, P(y \mid \mathbf{x}) = \mathrm{argmax}_y \, P(\mathbf{x} \mid y)\,P(y) = \mathrm{argmax}_y \, \prod_{i=1}^{n} P(x_i \mid y)\,P(y)

Since the log function is monotonic, we can write this as

    \hat{y} = \mathrm{argmax}_y \, \log\!\left( \prod_{i=1}^{n} P(x_i \mid y)\,P(y) \right) = \mathrm{argmax}_y \left( \sum_{i=1}^{n} \log P(x_i \mid y) + \log P(y) \right)

And if there are only two classes, y = +1 and y = −1, we can write this as

    \hat{y} = \mathrm{sign}\left[ \left( \sum_{i=1}^{n} \log P(x_i \mid Y=+1) + \log P(Y=+1) \right) - \left( \sum_{i=1}^{n} \log P(x_i \mid Y=-1) + \log P(Y=-1) \right) \right]

which we can rearrange as

    \hat{y} = \mathrm{sign}\left( \sum_{i=1}^{n} \big( \log P(x_i \mid Y=+1) - \log P(x_i \mid Y=-1) \big) + \big( \log P(Y=+1) - \log P(Y=-1) \big) \right)

and, using the fact that log x − log y = log(x/y), we can write this as

    \hat{y} = \mathrm{sign}\left( \sum_{i=1}^{n} \log \frac{P(x_i \mid Y=+1)}{P(x_i \mid Y=-1)} + \log \frac{P(Y=+1)}{P(Y=-1)} \right)        (2)

(As an aside, expressions like o = log [P(Y=+1)/P(Y=−1)] are called log odds, and they mean something. If the logs are base 2 and o = 3, then the event Y=+1 is 2^3 = 8 times as likely as the event Y=−1, while if o = −4 then the event Y=−1 is 2^4 = 16 times as likely as the event Y=+1.)

This is starting to look a little more linear! To finish the job, let's think about what this means. When we write log [P(x_i|Y=+1)/P(x_i|Y=−1)], we are describing a function of x_i, which could be written out as

    \log \frac{P(x_i \mid Y=+1)}{P(x_i \mid Y=-1)} \equiv f(x_i) \equiv \begin{cases} \log \frac{P(X_i=1 \mid Y=+1)}{P(X_i=1 \mid Y=-1)} & \text{if } x_i = 1 \\ \log \frac{P(X_i=0 \mid Y=+1)}{P(X_i=0 \mid Y=-1)} & \text{if } x_i = 0 \end{cases}        (3)

To keep the next few equations uncluttered, let's define p_i and q_i as

    p_i \equiv \log \frac{P(X_i=1 \mid Y=+1)}{P(X_i=1 \mid Y=-1)} \qquad q_i \equiv \log \frac{P(X_i=0 \mid Y=+1)}{P(X_i=0 \mid Y=-1)}

A slightly tricky way to get rid of the if-thens in Equation 3 is to write it as

    f(x_i) \equiv x_i p_i + (1 - x_i) q_i

(This is essentially the same trick as Tom used in deriving the MLE for a binomial; do you see why?) This of course can be written as

    f(x_i) = x_i (p_i - q_i) + q_i        (4)

and plugging Equation 4 into Equation 2 we get

    \hat{y} = \mathrm{sign}\left( \sum_{i=1}^{n} \big( x_i (p_i - q_i) + q_i \big) + \log \frac{P(Y=+1)}{P(Y=-1)} \right)        (5)

Now we're almost done. Let's define

    w_i \equiv p_i - q_i \qquad w_0 \equiv \sum_i q_i + \log \frac{P(Y=+1)}{P(Y=-1)}

where w_0 is that "bias term" we used in Equation 1. The Naive Bayes prediction from Equation 5 then becomes

    \hat{y} = \mathrm{sign}\left( \sum_{i=1}^{n} x_i w_i + w_0 \right)

Putting it all together: for a binary vector ⟨x_1, ..., x_n⟩, Naive Bayes can be implemented as a linear classifier, applied to the augmented example x = ⟨x_0 = 1, x_1, ..., x_n⟩. The weight vector w has this form:

    w_i \equiv \log \frac{P(X_i=1 \mid Y=+1)}{P(X_i=1 \mid Y=-1)} - \log \frac{P(X_i=0 \mid Y=+1)}{P(X_i=0 \mid Y=-1)}

    w_0 \equiv \sum_i \log \frac{P(X_i=0 \mid Y=+1)}{P(X_i=0 \mid Y=-1)} + \log \frac{P(Y=+1)}{P(Y=-1)}

and the Naive Bayes classification is

    \hat{y} = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x})
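To make the construction above concrete, here is a short Python sketch. It is not from the notes: the function names, the tiny made-up dataset, and the Laplace smoothing constant alpha (added only so that no log of zero is ever taken) are all illustrative assumptions. The sketch estimates p_i and q_i from binary training data, assembles w and w_0 exactly as in the formulas above, and predicts with sign(w_0 + w · x).

import numpy as np

def naive_bayes_linear_weights(X, y, alpha=1.0):
    """Estimate Naive Bayes parameters from binary features X (0/1) and labels y (+1/-1),
    and return the equivalent linear-classifier weights (w, w0).
    alpha is a Laplace smoothing count, so no log of zero is taken."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    pos, neg = X[y == +1], X[y == -1]

    # P(X_i = 1 | Y = +1) and P(X_i = 1 | Y = -1), with Laplace smoothing
    p1_pos = (pos.sum(axis=0) + alpha) / (len(pos) + 2 * alpha)
    p1_neg = (neg.sum(axis=0) + alpha) / (len(neg) + 2 * alpha)

    # p_i = log P(X_i=1|+1)/P(X_i=1|-1),  q_i = log P(X_i=0|+1)/P(X_i=0|-1)
    p = np.log(p1_pos) - np.log(p1_neg)
    q = np.log(1 - p1_pos) - np.log(1 - p1_neg)

    prior_pos = (len(pos) + alpha) / (len(y) + 2 * alpha)
    log_odds_prior = np.log(prior_pos) - np.log(1 - prior_pos)

    w = p - q                        # w_i = p_i - q_i
    w0 = q.sum() + log_odds_prior    # w_0 = sum_i q_i + log P(Y=+1)/P(Y=-1)
    return w, w0

def predict(w, w0, x):
    """Linear prediction sign(w0 + w . x); ties (exactly 0) are called +1, as in Section 1."""
    return +1 if w0 + np.dot(w, x) >= 0 else -1

if __name__ == "__main__":
    # Tiny made-up dataset with 3 binary features
    X = [[1, 0, 1],
         [1, 1, 1],
         [0, 1, 0],
         [0, 0, 1],
         [0, 0, 0]]
    y = [+1, +1, -1, -1, -1]
    w, w0 = naive_bayes_linear_weights(X, y)
    print("w =", w, "w0 =", w0)
    print([predict(w, w0, x) for x in X])

With alpha = 0 the weights are exactly the maximum-likelihood quantities used in the derivation above; the smoothing only keeps the logs finite when some feature value never occurs within a class.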
3 Online learning for classification

3.1 Online learning

Bayesian probability is the most-studied mathematical model of learning. But there are other models that are also experimentally successful and give useful insight. Sometimes these other models are also mathematically more appropriate: e.g., even if all the probabilistic assumptions are correct, a MAP estimate might not minimize prediction errors.

Another useful model is on-line learning. In this document, I'll consider on-line learners that can do two things. The first is to make a prediction ŷ on an example x, where ŷ ∈ {−1, +1}. The second is to update the learner's state, by "accepting" a new example ⟨x, y⟩.

As some background, remember that if θ is the angle between x
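As a concrete instance of the on-line learner interface just described (make a prediction on x, then update state by accepting a labeled example ⟨x, y⟩), here is a minimal Python sketch of the standard mistake-driven perceptron update, w ← w + y x on each mistake. This is the textbook rule, not something quoted from these notes (which may develop a different variant), and the class and method names are my own.

import numpy as np

class Perceptron:
    """On-line linear classifier: predict with sign(w . x), update only on mistakes.
    Assumes examples are already augmented with a constant x_0 = 1 for the bias term."""

    def __init__(self, n_features):
        self.w = np.zeros(n_features)

    def predict(self, x):
        # sign convention from Section 1: a score of exactly 0 is predicted as +1
        return +1 if np.dot(self.w, x) >= 0 else -1

    def accept(self, x, y):
        """Update state with a labeled example (x, y), y in {-1, +1}."""
        if self.predict(x) != y:            # mistake-driven: change w only when wrong
            self.w = self.w + y * np.asarray(x, dtype=float)

if __name__ == "__main__":
    # Toy stream of augmented examples ⟨1, x_1, x_2⟩, made up for illustration
    stream = [(np.array([1.0, 2.0, 1.0]), +1),
              (np.array([1.0, -1.0, -2.0]), -1),
              (np.array([1.0, 1.5, 0.5]), +1),
              (np.array([1.0, -0.5, -1.0]), -1)]
    learner = Perceptron(n_features=3)
    for x, y in stream:
        print("predict", learner.predict(x), "truth", y)
        learner.accept(x, y)

Note that accept changes w only when the current prediction is wrong and never revisits earlier examples, which matches the two-operation interface described above.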

