EECS 281B / STAT 241B: Advanced Topics in Statistical Learning                Spring 2009

Lecture 2 — January 26

Lecturer: Martin Wainwright                                    Scribe: Aditi Shrikumar

Note: These lecture notes are still rough and have only been mildly proofread.

2.1 Announcements

• HW1 posted, due 2/9
• Scribe materials on web page

Today:
• Binary classification
• Perceptron and max-margin

2.2 Last Time: binary classification, loss, and risk

We start with features $X \in \mathcal{X}$ and labels $Y \in \mathcal{Y}$, where $(X, Y)$ are distributed according to some unknown $\mathbb{P}$. The goal is to find a decision rule $f : \mathcal{X} \to \mathcal{Y}$.

Given a decision rule $f : \mathcal{X} \to \mathcal{Y}$, its 0-1 loss $\ell(y, f(x))$ is

    $\ell(y, f(x)) = \mathbb{I}[y \neq f(x)]$,                                        (2.1)

in other words,

    $\ell(y, f(x)) = \begin{cases} 1 & \text{if } y \neq f(x) \\ 0 & \text{otherwise,} \end{cases}$        (2.2)

and its risk $R(f)$ is

    $R(f) = \mathbb{E}[\ell(Y, f(X))] = \mathbb{P}(Y \neq f(X))$.                     (2.3)

Under 0-1 loss, the risk is exactly the probability of misclassification.

2.3 Today

2.3.1 The optimal decision function

What is the optimal decision function? That is, what is the rule $g^* : \mathcal{X} \to \mathcal{Y}$ such that $R(g^*) = \mathbb{P}(Y \neq g^*(X)) \leq \mathbb{P}(Y \neq g(X)) = R(g)$ for every other rule $g : \mathcal{X} \to \mathcal{Y}$?

$\mathbb{P}$ is unknown in "real life", but let us assume for now that it is known. Parametrize $\mathbb{P}$ over $(X, Y)$ by the marginal over features, $\mu(A) = \mathbb{P}(X \in A)$, and the conditional $\eta(x) = \mathbb{P}(Y = +1 \mid X = x)$.

Define the Bayes decision rule $g^*(x)$:

    $g^*(x) = \begin{cases} +1 & \text{if } \eta(x) \geq 1/2 \\ -1 & \text{otherwise.} \end{cases}$        (2.4)

Claim: this is the optimal rule, i.e. every other rule $g$ has $R(g) \geq R(g^*)$. Note that the condition $\eta(x) \geq 1/2$ is equivalent to a likelihood ratio test:

    $\eta(x) \geq 1/2 \iff \dfrac{\mathbb{P}(Y = +1 \mid X = x)}{\mathbb{P}(Y = -1 \mid X = x)} \geq 1$.   (2.5)

Theorem 2.1. For any decision function $g : \mathcal{X} \to \{+1, -1\}$ we have $R(g^*) \leq R(g)$.

Proof: For any fixed $x \in \mathcal{X}$ and any $g$,

    $\mathbb{P}(Y \neq g(X) \mid X = x) = 1 - \{\mathbb{P}(Y = +1, g(X) = +1 \mid X = x) + \mathbb{P}(Y = -1, g(X) = -1 \mid X = x)\}$     (2.6)
    $\qquad = 1 - \{\mathbb{I}(g(x) = +1)\,\eta(x) + \mathbb{I}(g(x) = -1)\,(1 - \eta(x))\}$.                                              (2.7)

Hence

    $\mathbb{P}(Y \neq g(X) \mid X = x) - \mathbb{P}(Y \neq g^*(X) \mid X = x)$
    $\qquad = \eta(x)\{\mathbb{I}(g^*(x) = +1) - \mathbb{I}(g(x) = +1)\} + (1 - \eta(x))\{\mathbb{I}(g^*(x) = -1) - \mathbb{I}(g(x) = -1)\}$   (2.8)
    $\qquad = (2\eta(x) - 1)\{\mathbb{I}(g^*(x) = +1) - \mathbb{I}(g(x) = +1)\} \geq 0,$                                                       (2.9)

where the inequality holds case by case: if $\eta(x) \geq 1/2$ then $g^*(x) = +1$ and both factors are nonnegative, while if $\eta(x) < 1/2$ then $g^*(x) = -1$ and both factors are nonpositive. Taking expectations with respect to $X$ gives $R(g) - R(g^*) \geq 0$ by definition of $g^*$. □

2.3.2 Example: predicting pass/fail in a class

Say the features we have are

    $X_1$ = number of hours spent sleeping per day,                 (2.10)
    $X_2$ = number of beers per day,                                (2.11)
    $X_3$ = "laziness" (unobservable),                              (2.12)

and we want to predict whether a student with those features passes the class:

    $Y = \begin{cases} +1 \text{ (pass)} & \text{if } X_1 + X_2 + X_3 \leq 7 \\ -1 & \text{otherwise.} \end{cases}$        (2.13)

This defines $\eta(x)$ implicitly. We want to find a decision rule based only on $(X_1, X_2)$.

Say the $X_i$ are i.i.d. $\mathrm{Exp}(1)$. We can compute the conditional probability

    $\eta(x_1, x_2) = \mathbb{P}(Y = +1 \mid X_1 = x_1, X_2 = x_2)$                                             (2.14)
    $\qquad = \mathbb{P}(X_3 \leq 7 - x_1 - x_2)$   (remaining steps, via the $\mathrm{Exp}(1)$ CDF, left as an exercise)   (2.15)
    $\qquad = \max\bigl(0,\ 1 - e^{-(7 - x_1 - x_2)}\bigr)$.                                                    (2.16)

This gives us the Bayes decision rule $g^*$:

    $g^*(x_1, x_2) = \begin{cases} +1 & \text{if } x_1 + x_2 \leq 7 - \log 2 \\ -1 & \text{otherwise.} \end{cases}$        (2.17)

2.3.3 Plug-in rules

In practice we do not know $\eta(\cdot)$, so we cannot compute the Bayes rule $g^*$. But perhaps we can use data to estimate $\eta$ and apply the plug-in principle:

    $g_{\hat\eta}(x) = \begin{cases} +1 & \text{if } \hat\eta(x) \geq 1/2 \\ -1 & \text{otherwise.} \end{cases}$           (2.18)

Theorem 2.2. For any plug-in rule, the excess risk satisfies $R(g_{\hat\eta}) - R(g^*) \leq 2\,\mathbb{E}\bigl[\,|\eta(X) - \hat\eta(X)|\,\bigr]$.

Proof: If $g^*(x) = g_{\hat\eta}(x)$, then $\mathbb{P}(g_{\hat\eta}(X) \neq Y \mid X = x) - \mathbb{P}(g^*(X) \neq Y \mid X = x) = 0$. Otherwise, when $g_{\hat\eta}(x) \neq g^*(x)$, the calculation from Theorem 2.1 gives

    $\mathbb{P}(g_{\hat\eta}(X) \neq Y \mid X = x) - \mathbb{P}(g^*(X) \neq Y \mid X = x)$                      (2.19)
    $\qquad = (2\eta(x) - 1)\{\mathbb{I}(g^*(x) = +1) - \mathbb{I}(g_{\hat\eta}(x) = +1)\}$                     (2.20)
    $\qquad = |2\eta(x) - 1|\,\mathbb{I}(g^*(x) \neq g_{\hat\eta}(x))$.                                         (2.21)

When $g^*(x) \neq g_{\hat\eta}(x)$, the quantities $\eta(x)$ and $\hat\eta(x)$ lie on opposite sides of the threshold $1/2$, so $|\eta(x) - \hat\eta(x)| \geq |\eta(x) - 1/2|$ and hence $|2\eta(x) - 1| \leq 2\,|\eta(x) - \hat\eta(x)|$. Taking expectations over $X$ then gives

    $R(g_{\hat\eta}) - R(g^*) \leq 2\,\mathbb{E}\bigl[\,|\eta(X) - \hat\eta(X)|\,\bigr]$,                       (2.22)

and the result follows. □
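As a sanity check on the example of Section 2.3.2 and the plug-in principle of Section 2.3.3, here is a minimal numerical sketch in Python (NumPy only). It simulates the pass/fail data, applies the Bayes rule (2.17), and compares it with a plug-in rule built from a crude k-nearest-neighbor estimate of $\eta$; the estimator, the sample sizes, and the value of k are illustrative assumptions, not anything from the lecture.

```python
# Minimal sketch of Sections 2.3.1-2.3.3 (illustrative choices throughout).
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Draw n students: X1, X2, X3 i.i.d. Exp(1); pass iff X1 + X2 + X3 <= 7."""
    X = rng.exponential(scale=1.0, size=(n, 3))
    y = np.where(X.sum(axis=1) <= 7.0, 1, -1)
    return X[:, :2], y                      # only (X1, X2) are observed

def eta(X):
    # eta(x1, x2) = P(X3 <= 7 - x1 - x2) = max(0, 1 - exp(-(7 - x1 - x2))), eq. (2.16)
    return np.maximum(0.0, 1.0 - np.exp(-(7.0 - X.sum(axis=1))))

def bayes_rule(X):
    # g*(x) = +1 iff eta(x) >= 1/2, i.e. x1 + x2 <= 7 - log 2, eq. (2.17)
    return np.where(eta(X) >= 0.5, 1, -1)

def knn_eta_hat(X_train, y_train, X_test, k=50):
    # Crude plug-in estimate of eta: fraction of +1 labels among the k nearest
    # training points (an illustrative estimator, not one from the lecture).
    dists = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return (y_train[nearest] == 1).mean(axis=1)

X_train, y_train = sample(1000)
X_test, y_test = sample(5000)

risk_bayes = np.mean(bayes_rule(X_test) != y_test)
eta_hat = knn_eta_hat(X_train, y_train, X_test)
risk_plugin = np.mean(np.where(eta_hat >= 0.5, 1, -1) != y_test)
l1_error = np.mean(np.abs(eta_hat - eta(X_test)))

print(f"empirical risk of Bayes rule   : {risk_bayes:.3f}")
print(f"empirical risk of plug-in rule : {risk_plugin:.3f}")
print(f"Theorem 2.2 bound 2*E|eta - eta_hat| (Monte Carlo): {2 * l1_error:.3f}")
```

Any other estimator of $\eta$ could be substituted for the nearest-neighbor step; Theorem 2.2 says the resulting excess risk is controlled by how well $\hat\eta$ approximates $\eta$ in expectation.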
2.3.4 The Perceptron Algorithm

The perceptron algorithm searches over linear threshold functions

    $\mathcal{F} = \{\,\mathrm{sgn}[f_\theta(\cdot)] : f_\theta(x) = \langle\theta, x\rangle,\ \theta \in \mathbb{R}^d\,\}$.      (2.23)

Given samples $D_n = \{(x^{(i)}, y^{(i)}) \in \mathcal{X} \times \{+1, -1\},\ i = 1, \dots, n\}$ and a weight vector $\theta \in \mathbb{R}^d$, define the set of mistakes

    $m(\theta) = \{\, i \in \{1, \dots, n\} : y^{(i)} \langle\theta, x^{(i)}\rangle \leq 0 \,\}$.

The perceptron algorithm generates a sequence of vectors $\theta^0, \theta^1, \theta^2, \dots$:

1. Initialize $\theta^0 = 0$.
2. For $t = 0, 1, 2, \dots$, while $m(\theta^t) \neq \emptyset$, pick some $i \in m(\theta^t)$ and update $\theta^{t+1} = \theta^t + y^{(i)} x^{(i)}$.

This algorithm requires the two classes to be separated by a hyperplane through the origin.

Definition: A data set is linearly separable if there exists $\theta \in \mathbb{R}^d$ with $m(\theta) = \emptyset$, i.e. there is a hyperplane between the two classes that makes no mistakes.

Theorem 2.3. For any linearly separable data set $D_n$, the perceptron algorithm terminates after at most $T = R^2/\delta^2$ updates, where

    $R = \max_{i=1,\dots,n} \|x^{(i)}\|_2$,                                                                                    (2.24)
    $\delta = \min_{i=1,\dots,n} \dfrac{y^{(i)} \langle\theta^*, x^{(i)}\rangle}{\|\theta^*\|_2} > 0$  for some $\theta^*$ with $m(\theta^*) = \emptyset$.      (2.25)

The quantity $\delta$ is called the margin: it is the distance from the separating hyperplane defined by $\theta^*$ to the closest data point.

Proof: Say we make a mistake on sample $i$ at step $t$. Then

    $\langle\theta^*, \theta^{t+1}\rangle = \langle\theta^*, \theta^t\rangle + y^{(i)} \langle\theta^*, x^{(i)}\rangle$         (2.26)
    $\qquad \geq \langle\theta^*, \theta^t\rangle + \delta \|\theta^*\|_2$   (by definition of $\delta$).                       (2.27)

Hence, by induction,

    $\langle\theta^*, \theta^t\rangle \geq t\,\delta\,\|\theta^*\|_2$.                                                          (2.28)

On the other hand,

    $\|\theta^{t+1}\|_2^2 = \|\theta^t\|_2^2 + \|x^{(i)}\|_2^2 + 2\, y^{(i)} \langle\theta^t, x^{(i)}\rangle$                   (2.29)
    $\qquad \leq \|\theta^t\|_2^2 + R^2$,  because $y^{(i)} \langle\theta^t, x^{(i)}\rangle \leq 0$ (a mistake),                (2.30)
    $\implies \|\theta^t\|_2^2 \leq t R^2$  by induction.                                                                       (2.31)

Combining the two bounds and applying the Cauchy-Schwarz inequality,

    $t\,\delta\,\|\theta^*\|_2 \leq \langle\theta^*, \theta^t\rangle$                                                           (2.32)
    $\qquad \leq \|\theta^*\|_2\, \|\theta^t\|_2$                                                                               (2.33)
    $\qquad \leq \|\theta^*\|_2\, \sqrt{t}\, R$                                                                                 (2.34)
    $\implies t \leq \dfrac{R^2}{\delta^2}$. □                                                                                  (2.35)
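The update rule and the quantities $R$ and $\delta$ from Theorem 2.3 are easy to exercise in code. Below is a minimal NumPy sketch in the same Python convention as the earlier example; the toy separable dataset, the reference vector $\theta^* = (1, -1)$, and the margin of 0.5 used to generate it are illustrative assumptions rather than an example from the notes.

```python
# Minimal sketch of the perceptron updates (Section 2.3.4) and the mistake
# bound T <= R^2 / delta^2 (Theorem 2.3) on an illustrative toy dataset.
import numpy as np

def perceptron(X, y, max_updates=100_000):
    """Run the perceptron on (X, y) with labels in {+1, -1}.

    Returns the final weight vector theta and the number of updates made.
    A point i counts as a mistake when y_i <theta, x_i> <= 0.
    """
    theta = np.zeros(X.shape[1])                 # step 1: theta^0 = 0
    for t in range(max_updates):
        margins = y * (X @ theta)
        mistakes = np.flatnonzero(margins <= 0)  # the mistake set m(theta^t)
        if mistakes.size == 0:                   # m(theta^t) empty: terminate
            return theta, t
        i = mistakes[0]                          # pick some i in m(theta^t)
        theta = theta + y[i] * X[i]              # theta^{t+1} = theta^t + y_i x_i
    raise RuntimeError("did not converge; data may not be linearly separable")

# Toy linearly separable data: points on either side of the hyperplane
# <theta_star, x> = 0 with theta_star = (1, -1), kept at least 0.5 away from it.
rng = np.random.default_rng(1)
theta_star = np.array([1.0, -1.0])
X = rng.uniform(-3, 3, size=(200, 2))
X = X[np.abs(X @ theta_star) / np.linalg.norm(theta_star) >= 0.5]
y = np.sign(X @ theta_star).astype(int)

theta_hat, n_updates = perceptron(X, y)

# Quantities from Theorem 2.3, computed for this particular theta_star.
R = np.max(np.linalg.norm(X, axis=1))                               # eq. (2.24)
delta = np.min(y * (X @ theta_star)) / np.linalg.norm(theta_star)   # eq. (2.25)

print(f"updates made: {n_updates}, bound R^2/delta^2: {R**2 / delta**2:.1f}")
print("separates training data:", np.all(y * (X @ theta_hat) > 0))
```

Picking the first mistake rather than a random one is only a convenience here: the proof of Theorem 2.3 places no restriction on which $i \in m(\theta^t)$ is chosen, so the bound $R^2/\delta^2$ holds for any selection rule.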

