CMU CS 10601 - Linear Classifiers and the Perceptron

Linear Classifiers and the Perceptron
William Cohen
February 4, 2008

1 Linear classifiers

Let's assume that every instance is an n-dimensional vector of real numbers x ∈ R^n, and that there are only two possible classes, y = +1 and y = −1, so every example is a pair (x, y). (Notation: I will use boldface to indicate vectors here, so x = ⟨x_1, ..., x_n⟩.) A linear classifier is a vector w that makes the prediction

    \hat{y} = \mathrm{sign}\left( \sum_{i=1}^{n} w_i x_i \right)

where sign(z) = +1 if z ≥ 0 and sign(z) = −1 if z < 0. (This is a little different from the usual definition, where sign(0) = 0, but I'd rather not have to deal with the question of what to predict when the weighted sum is exactly zero.) If you remember your linear algebra, this weighted sum of the x_i's is called the inner product of w and x, and it is usually written w · x, so the classifier can be written even more compactly as

    \hat{y} = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x})

Visually, x · w is (up to a scaling by the length of w) the length you get if you project x onto w (see Figure 1).

Figure 1: A geometric view of the inner product.

It might seem that representing examples as real-number vectors is somewhat constraining. It is fine if your attributes are numeric (e.g., "Temperature=72"), but what if you have an attribute "Outlook" with three possible discrete values, "Rainy", "Sunny", and "Cloudy"? One answer is to replace this single attribute with three binary attributes: one that is set to 1 when the outlook is rainy and zero otherwise; one that is set to 1 when the outlook is sunny and zero otherwise; and one that is set to 1 when the outlook is cloudy and zero otherwise. So a dataset like the one below would be converted to examples in R^4 as shown:

           Outlook  Temp  PlayTennis?
    Day1   Rainy    85    No    →  (⟨1, 0, 0, 85⟩, −1)
    Day2   Sunny    87    No    →  (⟨0, 1, 0, 87⟩, −1)
    Day3   Cloudy   75    Yes   →  (⟨0, 0, 1, 75⟩, +1)

Another answer, of course, is to abandon these discrete values and instead focus on numeric attributes: e.g., let "Outlook" be encoded as a real number representing the probability of rain, so that Day2 would be encoded as ⟨0.01, 87⟩, where the 0.01 means a 1/100 chance of rain.

However, encoding your examples as pure numeric vectors has one small problem. As I've defined it, a linear classifier is doomed to predict ŷ = +1 on a perfectly sunny day (Outlook=0) if the temperature is also zero: regardless of which weight vector w you pick, w · ⟨0, 0⟩ will be zero, and ŷ will be +1. Since zero-degree weather isn't conducive to playing tennis, no matter how clear it is, we can add one more trick to our encoding of examples: give each vector an extra dimension whose value is always one. Let's label that extra dimension 0; then

    \hat{y} = \mathrm{sign}\left( w_0 x_0 + \sum_{i=1}^{n} w_i x_i \right) = \mathrm{sign}\left( w_0 + \sum_{i=1}^{n} w_i x_i \right)        (1)

the second half of the equation holding because for every example x, x_0 = 1.

This trick gives our linear classifier a bit more expressive power, and we can still write the classifier using the super-compact notation ŷ = sign(w · x) if we like. The weight w_0 is sometimes called a bias term.

2 Naive Bayes is a linear classifier

How do you learn a linear classifier? Well, you already know one way. To keep things simple, in the discussion below I'll assume that x is not just a real-valued vector but a binary vector. You remember that Naive Bayes can be written as follows:

    \hat{y} = \mathrm{argmax}_y \, P(y \mid \mathbf{x}) = \mathrm{argmax}_y \, P(\mathbf{x} \mid y)\,P(y) = \mathrm{argmax}_y \, \prod_{i=1}^{n} P(x_i \mid y)\,P(y)

Since the log function is monotonic, we can write this as

    \hat{y} = \mathrm{argmax}_y \, \log\!\left( \prod_{i=1}^{n} P(x_i \mid y)\,P(y) \right) = \mathrm{argmax}_y \left( \sum_{i=1}^{n} \log P(x_i \mid y) + \log P(y) \right)

And if there are only two classes, y = +1 and y = −1, we can write this as

    \hat{y} = \mathrm{sign}\left[ \left( \sum_{i=1}^{n} \log P(x_i \mid Y=+1) + \log P(Y=+1) \right) - \left( \sum_{i=1}^{n} \log P(x_i \mid Y=-1) + \log P(Y=-1) \right) \right]

which we can rearrange as

    \hat{y} = \mathrm{sign}\left( \sum_{i=1}^{n} \big( \log P(x_i \mid Y=+1) - \log P(x_i \mid Y=-1) \big) + \big( \log P(Y=+1) - \log P(Y=-1) \big) \right)

and, using the fact that log x − log y = log(x/y), we can write this as

    \hat{y} = \mathrm{sign}\left( \sum_{i=1}^{n} \log \frac{P(x_i \mid Y=+1)}{P(x_i \mid Y=-1)} + \log \frac{P(Y=+1)}{P(Y=-1)} \right)        (2)

(As an aside, expressions like o = log [P(Y=+1)/P(Y=−1)] are called log odds, and they mean something. If the logs are base 2 and o = 3, then the event Y=+1 is 2^3 = 8 times as likely as the event Y=−1, while if o = −4 then the event Y=−1 is 2^4 = 16 times as likely as the event Y=+1.)

This is starting to look a little more linear! To finish the job, let's think about what this means. When we write log [P(x_i|Y=+1)/P(x_i|Y=−1)], we are describing a function of x_i, which could be written out as

    \log \frac{P(x_i \mid Y=+1)}{P(x_i \mid Y=-1)} \equiv f(x_i) \equiv \begin{cases} \log \frac{P(X_i=1 \mid Y=+1)}{P(X_i=1 \mid Y=-1)} & \text{if } x_i = 1 \\ \log \frac{P(X_i=0 \mid Y=+1)}{P(X_i=0 \mid Y=-1)} & \text{if } x_i = 0 \end{cases}        (3)

To keep the next few equations uncluttered, let's define p_i and q_i as

    p_i \equiv \log \frac{P(X_i=1 \mid Y=+1)}{P(X_i=1 \mid Y=-1)} \qquad q_i \equiv \log \frac{P(X_i=0 \mid Y=+1)}{P(X_i=0 \mid Y=-1)}

A slightly tricky way to get rid of the if-thens in Equation 3 is to write it as

    f(x_i) \equiv x_i p_i + (1 - x_i) q_i

(This is essentially the same trick as Tom used in deriving the MLE for a binomial; do you see why?) This of course can be written as

    f(x_i) = x_i (p_i - q_i) + q_i        (4)

and plugging Equation 4 into Equation 2 we get

    \hat{y} = \mathrm{sign}\left( \sum_{i=1}^{n} \big( x_i (p_i - q_i) + q_i \big) + \log \frac{P(Y=+1)}{P(Y=-1)} \right)        (5)

Now we're almost done. Let's define

    w_i \equiv p_i - q_i \qquad w_0 \equiv \sum_i q_i + \log \frac{P(Y=+1)}{P(Y=-1)}

where w_0 is that "bias term" we used in Equation 1. The Naive Bayes prediction from Equation 5 then becomes

    \hat{y} = \mathrm{sign}\left( \sum_{i=1}^{n} x_i w_i + w_0 \right)

Putting it all together: for a binary vector ⟨x_1, ..., x_n⟩, Naive Bayes can be implemented as a linear classifier, applied to the augmented example x = ⟨x_0 = 1, x_1, ..., x_n⟩. The weight vector w has this form:

    w_i \equiv \log \frac{P(X_i=1 \mid Y=+1)}{P(X_i=1 \mid Y=-1)} - \log \frac{P(X_i=0 \mid Y=+1)}{P(X_i=0 \mid Y=-1)}

    w_0 \equiv \sum_i \log \frac{P(X_i=0 \mid Y=+1)}{P(X_i=0 \mid Y=-1)} + \log \frac{P(Y=+1)}{P(Y=-1)}

and the Naive Bayes classification is

    \hat{y} = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x})
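To make the construction above concrete, here is a short Python sketch. It is not from the notes: the function names, the tiny made-up dataset, and the Laplace smoothing constant alpha (added only so that no log of zero is ever taken) are all illustrative assumptions. The sketch estimates p_i and q_i from binary training data, assembles w and w_0 exactly as in the formulas above, and predicts with sign(w_0 + w · x).

import numpy as np

def naive_bayes_linear_weights(X, y, alpha=1.0):
    """Estimate Naive Bayes parameters from binary features X (0/1) and labels y (+1/-1),
    and return the equivalent linear-classifier weights (w, w0).
    alpha is a Laplace smoothing count, so no log of zero is taken."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    pos, neg = X[y == +1], X[y == -1]

    # P(X_i = 1 | Y = +1) and P(X_i = 1 | Y = -1), with Laplace smoothing
    p1_pos = (pos.sum(axis=0) + alpha) / (len(pos) + 2 * alpha)
    p1_neg = (neg.sum(axis=0) + alpha) / (len(neg) + 2 * alpha)

    # p_i = log P(X_i=1|+1)/P(X_i=1|-1),  q_i = log P(X_i=0|+1)/P(X_i=0|-1)
    p = np.log(p1_pos) - np.log(p1_neg)
    q = np.log(1 - p1_pos) - np.log(1 - p1_neg)

    prior_pos = (len(pos) + alpha) / (len(y) + 2 * alpha)
    log_odds_prior = np.log(prior_pos) - np.log(1 - prior_pos)

    w = p - q                        # w_i = p_i - q_i
    w0 = q.sum() + log_odds_prior    # w_0 = sum_i q_i + log P(Y=+1)/P(Y=-1)
    return w, w0

def predict(w, w0, x):
    """Linear prediction sign(w0 + w . x); ties (exactly 0) are called +1, as in Section 1."""
    return +1 if w0 + np.dot(w, x) >= 0 else -1

if __name__ == "__main__":
    # Tiny made-up dataset with 3 binary features
    X = [[1, 0, 1],
         [1, 1, 1],
         [0, 1, 0],
         [0, 0, 1],
         [0, 0, 0]]
    y = [+1, +1, -1, -1, -1]
    w, w0 = naive_bayes_linear_weights(X, y)
    print("w =", w, "w0 =", w0)
    print([predict(w, w0, x) for x in X])

With alpha = 0 the weights are exactly the maximum-likelihood quantities used in the derivation above; the smoothing only keeps the logs finite when some feature value never occurs within a class.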
3 Online learning for classification

3.1 Online learning

Bayesian probability is the most-studied mathematical model of learning. But there are other models that are also experimentally successful and give useful insight. Sometimes these other models are also mathematically more appropriate: e.g., even if all the probabilistic assumptions are correct, a MAP estimate might not minimize prediction errors.

Another useful model is on-line learning. In this document, I'll consider on-line learners that can do two things. The first is to make a prediction ŷ on an example x, where ŷ ∈ {−1, +1}. The second is to update the learner's state, by "accepting" a new example ⟨x, y⟩.

As some background, remember that if θ is the angle between x
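As a concrete instance of the on-line learner interface just described (make a prediction on x, then update state by accepting a labeled example ⟨x, y⟩), here is a minimal Python sketch of the standard mistake-driven perceptron update, w ← w + y x on each mistake. This is the textbook rule, not something quoted from these notes (which may develop a different variant), and the class and method names are my own.

import numpy as np

class Perceptron:
    """On-line linear classifier: predict with sign(w . x), update only on mistakes.
    Assumes examples are already augmented with a constant x_0 = 1 for the bias term."""

    def __init__(self, n_features):
        self.w = np.zeros(n_features)

    def predict(self, x):
        # sign convention from Section 1: a score of exactly 0 is predicted as +1
        return +1 if np.dot(self.w, x) >= 0 else -1

    def accept(self, x, y):
        """Update state with a labeled example (x, y), y in {-1, +1}."""
        if self.predict(x) != y:            # mistake-driven: change w only when wrong
            self.w = self.w + y * np.asarray(x, dtype=float)

if __name__ == "__main__":
    # Toy stream of augmented examples ⟨1, x_1, x_2⟩, made up for illustration
    stream = [(np.array([1.0, 2.0, 1.0]), +1),
              (np.array([1.0, -1.0, -2.0]), -1),
              (np.array([1.0, 1.5, 0.5]), +1),
              (np.array([1.0, -0.5, -1.0]), -1)]
    learner = Perceptron(n_features=3)
    for x, y in stream:
        print("predict", learner.predict(x), "truth", y)
        learner.accept(x, y)

Note that accept changes w only when the current prediction is wrong and never revisits earlier examples, which matches the two-operation interface described above.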

