EECS 281B / STAT 241B: Advanced Topics in Statistical Learning                Spring 2009

Lecture 2 — January 26

Lecturer: Martin Wainwright                                    Scribe: Aditi Shrikumar

Note: These lecture notes are still rough, and have only been mildly proofread.

2.1 Announcements

• HW1 posted, due 2/9
• Scribe materials on web page

Today:

• Binary classification
• Perceptron and max-margin

2.2 Last time: binary classification, loss, and risk

We start with features $X \in \mathcal{X}$ and labels $Y \in \mathcal{Y}$, where $(X, Y)$ are distributed according to some unknown distribution $\mathbb{P}$. The goal is to find a decision rule $f : \mathcal{X} \to \mathcal{Y}$.

Given a decision rule $f : \mathcal{X} \to \mathcal{Y}$, its 0-1 loss $\ell(y, f(x))$ is
$$\ell(y, f(x)) = \mathbb{I}[y \neq f(x)],  \tag{2.1}$$
in other words,
$$\ell(y, f(x)) = \begin{cases} 1 & \text{if } y \neq f(x) \\ 0 & \text{otherwise,} \end{cases}  \tag{2.2}$$
and its risk $R(f)$ is
$$R(f) = \mathbb{E}[\ell(Y, f(X))] = \mathbb{P}(Y \neq f(X)).  \tag{2.3}$$
Under 0-1 loss, the risk is the probability of misclassification.

2.3 Today

2.3.1 The optimal decision function

What is the optimal decision function? That is, what is the rule $g^* : \mathcal{X} \to \mathcal{Y}$ such that $R(g^*) = \mathbb{P}(Y \neq g^*(X)) \leq \mathbb{P}(Y \neq g(X)) = R(g)$ for every other rule $g : \mathcal{X} \to \mathcal{Y}$?

In "real life" $\mathbb{P}$ is unknown, but let us assume for now that it is known. Parametrize the distribution of $(X, Y)$ by $\mu(A) = \mathbb{P}(X \in A)$ (the marginal over features) and $\eta(x) = \mathbb{P}(Y = +1 \mid X = x)$.

Define the Bayes decision rule $g^*(x)$:
$$g^*(x) = \begin{cases} +1 & \text{if } \eta(x) \geq 1/2 \\ -1 & \text{otherwise.} \end{cases}  \tag{2.4}$$

Claim: This is the optimal rule, i.e., every other rule $g$ has $R(g) \geq R(g^*)$. Note that thresholding $\eta$ at $1/2$ is equivalent to a likelihood ratio test:
$$\eta(x) \geq 1/2 \iff \frac{\mathbb{P}(Y = +1 \mid X = x)}{\mathbb{P}(Y = -1 \mid X = x)} \geq 1.  \tag{2.5}$$

Theorem 2.1. For any decision function $g : \mathcal{X} \to \{+1, -1\}$ we have $R(g^*) \leq R(g)$.

Proof: For any fixed $x \in \mathcal{X}$ and any $g$,
$$\mathbb{P}(Y \neq g(X) \mid X = x) = 1 - \big\{\mathbb{P}(Y = +1, g(X) = +1 \mid X = x) + \mathbb{P}(Y = -1, g(X) = -1 \mid X = x)\big\}  \tag{2.6}$$
$$= 1 - \big\{\mathbb{I}(g(x) = +1)\,\eta(x) + \mathbb{I}(g(x) = -1)\,(1 - \eta(x))\big\}.  \tag{2.7}$$
Hence,
$$\mathbb{P}(Y \neq g(X) \mid X = x) - \mathbb{P}(Y \neq g^*(X) \mid X = x) = \eta(x)\big\{\mathbb{I}(g^*(x) = +1) - \mathbb{I}(g(x) = +1)\big\} + (1 - \eta(x))\big\{\mathbb{I}(g^*(x) = -1) - \mathbb{I}(g(x) = -1)\big\}  \tag{2.8}$$
$$= (2\eta(x) - 1)\big\{\mathbb{I}(g^*(x) = +1) - \mathbb{I}(g(x) = +1)\big\} \geq 0,  \tag{2.9}$$
where the last inequality follows case by case: if $\eta(x) \geq 1/2$ then $g^*(x) = +1$ and both factors are nonnegative; if $\eta(x) < 1/2$ then $g^*(x) = -1$ and both factors are nonpositive. Taking expectations with respect to $X$ gives $R(g) - R(g^*) \geq 0$, by definition of $g^*$. □

2.3.2 Example: predicting pass/fail in a class

Say the features we have are
$$X_1 = \text{number of hours spent sleeping per day}  \tag{2.10}$$
$$X_2 = \text{number of beers per day}  \tag{2.11}$$
$$X_3 = \text{``laziness'' -- unobservable}  \tag{2.12}$$
and we want to predict whether a student with those features passes the class:
$$Y = \begin{cases} +1 \text{ (pass)} & \text{if } X_1 + X_2 + X_3 \leq 7 \\ -1 & \text{otherwise.} \end{cases}  \tag{2.13}$$
This defines $\eta(x)$ implicitly. We want to find a decision rule based only on $(X_1, X_2)$.

Say the $X_i$ are i.i.d. $\mathrm{Exp}(1)$. We can compute the conditional probability
$$\eta(x_1, x_2) = \mathbb{P}(Y = +1 \mid X_1 = x_1, X_2 = x_2)  \tag{2.14}$$
$$= \mathbb{P}(X_3 \leq 7 - x_1 - x_2) \quad \text{(remaining steps left as an exercise)}  \tag{2.15}$$
$$= \max\big(0,\; 1 - e^{-(7 - x_1 - x_2)}\big).  \tag{2.16}$$
This gives the Bayes decision rule $g^*$:
$$g^*(x) = \begin{cases} +1 & \text{if } x_1 + x_2 \leq 7 - \log 2 \\ -1 & \text{otherwise.} \end{cases}  \tag{2.17}$$

2.3.3 Plug-in rules

In practice we do not know $\eta(\cdot)$, so we cannot compute the Bayes rule $g^*$. But we can use data to estimate $\eta$ and apply the plug-in principle:
$$g_{\hat{\eta}}(x) = \begin{cases} +1 & \text{if } \hat{\eta}(x) \geq 1/2 \\ -1 & \text{otherwise.} \end{cases}  \tag{2.18}$$

Theorem 2.2. For any plug-in rule, the excess risk satisfies $R(g_{\hat{\eta}}) - R(g^*) \leq 2\,\mathbb{E}\big[|\eta(X) - \hat{\eta}(X)|\big]$.

Proof: If $g^*(x) = g_{\hat{\eta}}(x)$, then $\mathbb{P}(g_{\hat{\eta}}(X) \neq Y \mid X = x) - \mathbb{P}(g^*(X) \neq Y \mid X = x) = 0$. Otherwise, when $g_{\hat{\eta}}(x) \neq g^*(x)$, the calculation from Theorem 2.1 gives
$$\mathbb{P}(g_{\hat{\eta}}(X) \neq Y \mid X = x) - \mathbb{P}(g^*(X) \neq Y \mid X = x)  \tag{2.19}$$
$$= (2\eta(x) - 1)\big\{\mathbb{I}(g^*(x) = +1) - \mathbb{I}(g_{\hat{\eta}}(x) = +1)\big\}  \tag{2.20}$$
$$= |2\eta(x) - 1|\,\mathbb{I}\big(g^*(x) \neq g_{\hat{\eta}}(x)\big).  \tag{2.21}$$
Moreover, if $g^*(x) \neq g_{\hat{\eta}}(x)$, then $\eta(x)$ and $\hat{\eta}(x)$ lie on opposite sides of $1/2$, so $|\eta(x) - \hat{\eta}(x)| \geq |\eta(x) - 1/2|$, i.e. $|2\eta(x) - 1| \leq 2\,|\eta(x) - \hat{\eta}(x)|$. Taking expectations with respect to $X$,
$$R(g_{\hat{\eta}}) - R(g^*) \leq 2\,\mathbb{E}\big[|\eta(X) - \hat{\eta}(X)|\big],  \tag{2.22}$$
and the result follows. □
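To make the pass/fail example of Section 2.3.2 concrete, here is a small Monte Carlo sketch (an added illustration, not part of the original notes; it assumes NumPy, and the variable names and sample size are arbitrary choices). It simulates the i.i.d. Exp(1) features, applies the Bayes rule (2.17), and compares it against a few other thresholds on $x_1 + x_2$; up to Monte Carlo noise, no other threshold should do better. Estimating $\eta$ from the simulated $(x_1, x_2, y)$ pairs and thresholding it at $1/2$ would give a concrete instance of the plug-in rule (2.18).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Features: X1, X2 observed, X3 ("laziness") unobserved, all i.i.d. Exp(1).
x1, x2, x3 = rng.exponential(1.0, size=(3, n))
y = np.where(x1 + x2 + x3 <= 7, 1, -1)        # pass/fail label as in (2.13)

# Bayes rule (2.17): predict +1 iff x1 + x2 <= 7 - log 2, i.e. eta(x1, x2) >= 1/2.
bayes_threshold = 7 - np.log(2)
bayes_pred = np.where(x1 + x2 <= bayes_threshold, 1, -1)
print(f"Bayes threshold {bayes_threshold:.3f}: empirical risk {np.mean(bayes_pred != y):.4f}")

# Other thresholds on x1 + x2 should do no better (thresholds near 7 - log 2
# differ only slightly, possibly within Monte Carlo noise).
for c in [5.0, 6.0, 7.0, 8.0]:
    pred = np.where(x1 + x2 <= c, 1, -1)
    print(f"threshold {c:.3f}: empirical risk {np.mean(pred != y):.4f}")
```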
2.3.4 The Perceptron Algorithm

The perceptron algorithm searches over linear threshold functions
$$\mathcal{F} = \big\{\mathrm{sgn}[f_\theta(\cdot)] \;\big|\; f_\theta(x) = \langle \theta, x \rangle,\; \theta \in \mathbb{R}^d\big\}.  \tag{2.23}$$
Given samples $D_n = \big\{(x^{(i)}, y^{(i)}) \in \mathcal{X} \times \{+1, -1\},\; i = 1, \ldots, n\big\}$ and a weight vector $\theta \in \mathbb{R}^d$, define the set of mistakes
$$m(\theta) = \big\{ i \in \{1, \ldots, n\} : y^{(i)} \langle \theta, x^{(i)} \rangle < 0 \big\}.$$

The perceptron algorithm generates a sequence of vectors $\theta^0, \theta^1, \theta^2, \ldots$:

1. Initialize $\theta^0 = 0$.
2. For $t = 0, 1, 2, \ldots$, while $m(\theta^t) \neq \emptyset$, pick some $i \in m(\theta^t)$ and update $\theta^{t+1} = \theta^t + y^{(i)} x^{(i)}$.

This algorithm requires that the two classes be separated by a hyperplane. A minimal implementation sketch is given after the proof of Theorem 2.3 below.

Definition: A data set is linearly separable if there exists $\theta \in \mathbb{R}^d$ such that $m(\theta) = \emptyset$, i.e., there is a hyperplane between the two classes that makes no mistakes.

Theorem 2.3. For any linearly separable data set $D_n$, the perceptron algorithm terminates in at most $T = R^2/\delta^2$ steps, where
$$R = \max_{i=1,\ldots,n} \|x^{(i)}\|_2  \tag{2.24}$$
$$\delta = \min_{i=1,\ldots,n} \frac{y^{(i)} \langle \theta^*, x^{(i)} \rangle}{\|\theta^*\|_2} > 0 \quad \text{for some } \theta^* \text{ with } m(\theta^*) = \emptyset.  \tag{2.25}$$
The quantity $\delta$ is called the margin: it is the distance from the separating hyperplane defined by $\theta^*$ to the closest data point.

Proof: Say we make a mistake on sample $i$ at step $t$. Then
$$\langle \theta^*, \theta^{t+1} \rangle = \langle \theta^*, \theta^t \rangle + y^{(i)} \langle \theta^*, x^{(i)} \rangle  \tag{2.26}$$
$$\geq \langle \theta^*, \theta^t \rangle + \delta \|\theta^*\|_2 \quad \text{(by definition of } \delta\text{)}.  \tag{2.27}$$
Hence, by induction,
$$\langle \theta^*, \theta^t \rangle \geq t\,\delta\,\|\theta^*\|_2.  \tag{2.28}$$
On the other hand,
$$\|\theta^{t+1}\|_2^2 = \|\theta^t\|_2^2 + \|x^{(i)}\|_2^2 + 2\, y^{(i)} \langle \theta^t, x^{(i)} \rangle  \tag{2.29}$$
$$\leq \|\theta^t\|_2^2 + R^2, \quad \text{because } y^{(i)} \langle \theta^t, x^{(i)} \rangle < 0 \text{ (mistake)}  \tag{2.30}$$
$$\implies \|\theta^t\|_2^2 \leq t R^2, \quad \text{by induction.}  \tag{2.31}$$
Combining the two bounds via Cauchy–Schwarz,
$$\delta\, t\, \|\theta^*\|_2 \leq \langle \theta^*, \theta^t \rangle  \tag{2.32}$$
$$\leq \|\theta^*\|_2\, \|\theta^t\|_2  \tag{2.33}$$
$$\leq \|\theta^*\|_2\, \sqrt{t}\, R  \tag{2.34}$$
$$\implies t \leq \frac{R^2}{\delta^2}.  \tag{2.35}$$
□
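Here is the minimal NumPy sketch of the perceptron updates referred to above (an added illustration, not from the notes; the function name, the toy data, and the margin filter are all assumptions). One small practical deviation from the definition of $m(\theta)$: mistakes are detected with $y\langle\theta, x\rangle \leq 0$ rather than $< 0$, so that the all-zeros initialization still triggers updates; the bound of Theorem 2.3 is unaffected. The script also prints $R^2/\delta^2$ for comparison with the number of updates actually made.

```python
import numpy as np

def perceptron(X, y, max_updates=100_000):
    """Perceptron updates from Section 2.3.4.

    X: (n, d) feature array; y: (n,) labels in {+1, -1}.
    Returns (theta, number of updates made before no mistakes remain).
    """
    theta = np.zeros(X.shape[1])                 # step 1: theta_0 = 0
    for t in range(max_updates):
        margins = y * (X @ theta)
        mistakes = np.flatnonzero(margins <= 0)  # m(theta_t), with <= 0 so theta = 0 counts
        if mistakes.size == 0:
            return theta, t
        i = mistakes[0]                          # pick any mistaken sample
        theta = theta + y[i] * X[i]              # step 2: theta_{t+1} = theta_t + y_i x_i
    raise RuntimeError("no separating hyperplane found within max_updates")

# Toy linearly separable data: labels come from a known theta_star, and points
# closer than 0.1 to its hyperplane are dropped to guarantee a positive margin.
rng = np.random.default_rng(1)
theta_star = np.array([1.0, -2.0])
X = rng.normal(size=(300, 2))
dist = (X @ theta_star) / np.linalg.norm(theta_star)
X = X[np.abs(dist) >= 0.1]
y = np.sign(X @ theta_star)

theta_hat, updates = perceptron(X, y)
R = np.max(np.linalg.norm(X, axis=1))
delta = np.min(y * (X @ theta_star)) / np.linalg.norm(theta_star)
print(f"updates made: {updates}; mistake bound R^2/delta^2 = {R**2 / delta**2:.1f}")
```

In practice the number of updates is usually far smaller than the worst-case bound $R^2/\delta^2$, which the printed comparison makes visible.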