LEARNING THEORY

• It provides a theoretical justification for all kinds of supervised learning methods.
• We focus on the classification problem (Y ∈ {0, 1} or Y ∈ {−1, 1}).
• The Bayes error plays a central role in learning theory.
• Concentration inequalities and empirical processes are the major tools.

General questions to be answered

• Are there any errors other than the Bayes error?
• How do we say that a learning method is "good"?
• How do we justify a specific learning method as being "good"?
• What estimates are consistent for the classification error?

CHAPTER 10 BAYES ERROR

Definitions

• The optimal classification rule: with η(X) = P(Y = 1 | X), the decision rule is g*(x) = I(η(x) > 1/2).
• The Bayes error L(g*) is
  P(g*(X) ≠ Y) = E[min(η(X), 1 − η(X))] = 1/2 − (1/2) E[|1 − 2η(X)|].
• It can also be expressed in terms of the densities of X given Y:
  ∫ min(p_0 f_0(x), p_1 f_1(x)) dμ(x) = 1/2 − (1/2) ∫ |p_0 f_0(x) − p_1 f_1(x)| dμ(x).

Interpretation of the Bayes error

• It is the minimal prediction error, the best one can do.
• It also characterizes a certain distance between the distributions of X given Y = 0 and Y = 1, i.e., the separability of the two classes in terms of the feature variables.
• The Bayes error is not the only way to characterize this separability.
• Other measures can play a similar role: the Kolmogorov variational distance, the nearest neighbor error, the Hellinger distance, the entropy, etc.

Kolmogorov variational distance

• Kolmogorov variational distance:
  (1/2) E[|P(Y = 1 | X) − P(Y = 0 | X)|].
• Since L(g*) = 1/2 − (1/2) E[|2η(X) − 1|], the negative Kolmogorov variational distance is equivalent to the Bayes error.

Nearest neighbor error

• Definition:
  L_NN = 2 E[η(X)(1 − η(X))].
• The name comes from the fact that it is the limiting prediction error of the nearest neighbor (NN) learning method.
• It holds that
  L(g*) ≤ L_NN ≤ 2 L(g*).
• The proof uses the association inequality
  E[Q_1(X) Q_2(X)] ≥ E[Q_1(X)] E[Q_2(X)]
  for monotone non-decreasing functions Q_1 and Q_2.

Bhattacharyya measure of affinity

• Defined as
  −log E[ √(η(X)(1 − η(X))) ].
• It is related to the Hellinger distance between P(X | Y = 1) and P(X | Y = 0).
• A more general version is
  −log E[ η(X)^α (1 − η(X))^{1−α} ],  where α ∈ (0, 1).
• It is associated with the Bayes error in a more complicated form.

Shannon entropy

• Definition:
  L_S ≡ −E[ η(X) log η(X) + (1 − η(X)) log(1 − η(X)) ].
• Upper bound in terms of the Bayes error:
  L_S ≤ log 2 − (1/2)(1 − 2 L(g*))^2.
• Lower bound in terms of the Bayes error:
  L_S ≥ −log(1 − L(g*)).

Negative Kullback-Leibler distance

• Definition:
  L_KL = E[(2η(X) − 1) log( η(X) / (1 − η(X)) )]
       = E[ |2η(X) − 1| log( (1 + |2η(X) − 1|) / (1 − |2η(X) − 1|) ) ].
• Lower bound in terms of the Bayes error:
  L_KL ≥ 2 (1 − 2 L(g*))^2.

General errors

• They take the form E[F(η(X))], where F is a concave function.
• Such an error increases if X is replaced by T(X) for any transformation T.
• We wish F to be close to min(x, 1 − x).

Remarks

• Errors are different from loss functions.
• Errors are ways of characterizing the separability of the two classes, so they are intrinsic to the data.
• Loss functions are ways of weighing the losses of different learning methods, so they are more user-defined.

Consistency based on the Bayes error

• Consistency of a classifier g_n (corresponding to the decision function η_n(x)):
  P(g_n(X) ≠ Y) → Bayes error.
• Strong consistency:
  P(g_n(X) ≠ Y | data) → Bayes error, almost surely.
• Universal (strong) consistency: the above consistency holds for any distribution of (X, Y).

A key inequality when g_n(x) = I(η_n(x) > 1/2)

• A key inequality:
  |L(g_n) − L(g*)| ≤ E[|2η(X) − 1| I(g_n(X) ≠ g*(X)) | data]
                  ≤ 2 E[|η_n(X) − η(X)| | data]
                  ≤ 2 E[(η_n(X) − η(X))^2 | data]^{1/2}.
• The consistency of classifiers can therefore be proved by showing the L_1- or L_2-consistency of η_n.
• Notation: E_n[g_n(X)] denotes E[g_n(X) | data].
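The quantities above can be checked numerically. Below is a minimal Monte Carlo sketch (not from the notes) that estimates the Bayes error E[min(η(X), 1 − η(X))] and the nearest neighbor error L_NN = 2 E[η(X)(1 − η(X))] under an illustrative univariate model, X | Y = 0 ~ N(0, 1) and X | Y = 1 ~ N(2, 1) with equal priors; the model, sample size, and function names are assumptions made for the example. The output should satisfy L(g*) ≤ L_NN ≤ 2 L(g*).

```python
import numpy as np

# Monte Carlo sketch (illustrative model, not from the notes):
# X | Y=0 ~ N(0,1), X | Y=1 ~ N(2,1), priors p0 = p1 = 0.5.
rng = np.random.default_rng(0)
p1, mu0, mu1 = 0.5, 0.0, 2.0

def norm_pdf(x, mu):
    """Normal density with mean mu and standard deviation 1."""
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2.0 * np.pi)

# Draw X from its marginal distribution by first drawing Y, then X | Y.
y = rng.random(1_000_000) < p1
x = rng.normal(np.where(y, mu1, mu0), 1.0)

# eta(x) = P(Y = 1 | X = x) via Bayes' rule.
eta = p1 * norm_pdf(x, mu1) / (p1 * norm_pdf(x, mu1) + (1 - p1) * norm_pdf(x, mu0))

bayes_error = np.mean(np.minimum(eta, 1 - eta))   # L(g*) = E[min(eta, 1 - eta)]
l_nn = np.mean(2 * eta * (1 - eta))               # L_NN  = 2 E[eta (1 - eta)]
print(f"L*   = {bayes_error:.4f}")
print(f"L_NN = {l_nn:.4f}   (should satisfy L* <= L_NN <= 2 L*)")
```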
CHAPTER 11 CONSISTENCY OF DIRECT LEARNING METHODS

General ideas in proving consistency

• The key inequality relates the consistency of classification errors to the L_1- or L_2-consistency of η_n.
• Since η_n often has an explicit expression in direct learning, consistency follows from the L_1- or L_2-consistency of η_n.
• For the strong consistency proof, one first obtains weak consistency, then uses concentration inequalities to conclude
  P( |E_n[|η_n(X) − η(X)|] − E[|η_n(X) − η(X)|]| > ε ) ≤ a e^{−n b ε^2}.
  Strong consistency then follows from the first Borel-Cantelli lemma.

Example 1: Histogram estimator

• We assume Y = −1 or +1.
• Decision function: η_n(x) = Σ_{k=1}^M I(x ∈ A_{nk}) Ŷ_k.
• A_{n1}, ..., A_{nM} are a partition of the feature domain into cubes of equal size, and
  Ŷ_k = Σ_{i=1}^n Y_i I(X_i ∈ A_{nk}) / Σ_{i=1}^n I(X_i ∈ A_{nk}).
• Alternatively, we write η_n(x) = Σ_{i=1}^n Y_i I(X_i ∈ A(x)) / N(x), where A(x) is the cube containing x and
  N(x) = Σ_{i=1}^n I(X_i ∈ A(x)).

Consistency results for histogram rules

• Let h_n be the bin width of each cube.
• If h_n → 0 and n h_n^p → ∞, then the histogram rule is universally and strongly consistent.

Proof of weak consistency

• Let η̄(x) = E[η(X) | X ∈ A(x)].
• Decompose
  E[|η_n(X) − η(X)|] ≤ E[|η_n(X) − η̄(X)|] + E[|η̄(X) − η(X)|].
• We deal with both terms on the right-hand side: a stochastic term plus a bias term.

Stochastic term

• Consider |η_n(X) − η̄(X)|.
• It is bounded by
  I(N(X) > 0) |Σ_{i=1}^n Y_i I(X_i ∈ A(X)) / N(X) − η̄(X)| + I(N(X) = 0).
• Given N(X), Σ_{i=1}^n Y_i I(X_i ∈ A(X)) follows a Binomial distribution with parameters (N(X), η̄(X)), so
  E[|η_n(X) − η̄(X)|] ≤ E[I(N(X) > 0) N(X)^{−1/2}] + P(N(X) = 0).
• It remains to show that these two terms vanish, which is easy to achieve.

Bias term

• The key is to construct a continuous function η_ε(x) with compact support such that
  E[|η_ε(X) − η(X)|] < ε.
• Then the bias term is bounded by
  E[|η̄(X) − η̄_ε(X)|] + E[|η̄_ε(X) − η_ε(X)|] + E[|η_ε(X) − η(X)|] ≤ 2ε + E[|η̄_ε(X) − η_ε(X)|].
• The second term on the right-hand side vanishes because of the uniform continuity of η_ε(x).

Proof of universal strong consistency

• First, the histogram rule is equivalent to g_n(x) = 2[I(η*_n(x) > 0) − 0.5], where
  η*_n(x) = Σ_{i=1}^n Y_i I(X_i ∈ A(x)) / (n P(X ∈ A(x))).
• Thus it suffices to show the strong L_1-consistency of η*_n, that is,
  E_n[|η*_n(X) − η(X)|] → 0 almost surely.
• Note that
  E_n[|η*_n(X) − η(X)|] = { E_n[|η*_n(X) − η(X)|] − E[|η*_n(X) − η(X)|] } + E[|η*_n(X) − η(X)|].
• From the weak consistency, it suffices to show
  E_n[|η*_n(X) − η(X)|] − E[|η*_n(X) − η(X)|] → 0 almost surely.

An important inequality

• Bounded difference inequality
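As a companion to Example 1, here is a small numerical sketch of the histogram rule in one dimension (p = 1). The two-Gaussian model, the bin width choice h_n = n^{−1/3} (one choice satisfying h_n → 0 and n h_n → ∞), and all function names are illustrative assumptions, not part of the notes. In line with the consistency result above, the printed test error should move toward the Bayes error of this toy model (about 0.159) as n grows.

```python
import numpy as np

# Histogram rule sketch for Example 1 (Y in {-1, +1}); illustrative 1-d model:
# X | Y=+1 ~ N(2,1), X | Y=-1 ~ N(0,1), equal priors; h_n = n^(-1/3), so that
# h_n -> 0 and n * h_n -> infinity (dimension p = 1).
rng = np.random.default_rng(1)

def draw(n):
    """Draw n labelled observations from the toy model."""
    y = np.where(rng.random(n) < 0.5, 1, -1)
    x = rng.normal(np.where(y == 1, 2.0, 0.0), 1.0)
    return x, y

def histogram_rule(x_train, y_train, x_new, h):
    """Classify x_new by the sign of eta_n(x) = sum_i Y_i I(X_i in A(x)) / N(x)."""
    bins_train = np.floor(x_train / h).astype(int)
    bins_new = np.floor(x_new / h).astype(int)
    num, cnt = {}, {}
    for b, yi in zip(bins_train, y_train):
        num[b] = num.get(b, 0) + yi     # sum of labels Y_i in the cube A(x)
        cnt[b] = cnt.get(b, 0) + 1      # N(x): number of X_i falling in A(x)
    eta = np.array([num.get(b, 0) / cnt.get(b, 1) for b in bins_new])
    return np.where(eta > 0, 1, -1)     # empty cubes (eta = 0) default to class -1

x_test, y_test = draw(50_000)
for n in [100, 1_000, 10_000, 100_000]:
    x_tr, y_tr = draw(n)
    pred = histogram_rule(x_tr, y_tr, x_test, h=n ** (-1.0 / 3.0))
    print(f"n = {n:>6}, empirical test error = {np.mean(pred != y_test):.4f}")
```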

