Real AdaBoosting

There are many extensions to AdaBoosting. One extension that was found to outperform standard AdaBoosting in many practical cases is described here. As in the previous case we assume access to weak classifiers $h_i$ that produce as output $+1$ or $-1$ values.

Input: the training data, given as $N$ pairs $(x_i, y_i)$, where $x_i$ is the attributes vector and $y_i$ is the desired output, either $+1$ or $-1$. The number of iterations $T$.

Output: A function $f_T(x)$ that can be used to classify an attributes vector $x$. If $f_T(x) < 0$ classify $x$ as $-1$. If $f_T(x) > 0$ classify $x$ as $+1$.

Initialization: Associate a probability $p_i = \frac{1}{N}$ with the $i$th example.

Iterate: For $t = 1, \ldots, T$ compute the hypothesis $h_t$, the function $g_t$, and an update to the probabilities $p_1, \ldots, p_N$ by the following steps:

a. Select (at random with replacement) a subset $S_t$ of the training examples. The $i$th example is selected with probability $p_i$.

b. For each classifier $h_j$ compute the following values:
$$P_{+r}(j) = \sum p_i \quad \text{with summation over examples } x_i \text{ satisfying } h_j(x_i) = 1,\ y_i = 1$$
$$P_{-r}(j) = \sum p_i \quad \text{with summation over examples } x_i \text{ satisfying } h_j(x_i) = -1,\ y_i = -1$$
$$P_{+w}(j) = \sum p_i \quad \text{with summation over examples } x_i \text{ satisfying } h_j(x_i) = -1,\ y_i = 1$$
$$P_{-w}(j) = \sum p_i \quad \text{with summation over examples } x_i \text{ satisfying } h_j(x_i) = 1,\ y_i = -1$$
$$G(j) = \sqrt{P_{+r}(j)\,P_{-w}(j)} + \sqrt{P_{+w}(j)\,P_{-r}(j)}$$
Choose as $h_t$ a classifier $h_j$ that minimizes $G(j)$. If $G(t) \geq 0.5$ go back to Step a.

c. Calculate the weights $c_t^+$, $c_t^-$ and the function $g_t$ as follows:
$$c_t^+ = \frac{1}{2}\ln\frac{P_{+r}(t) + \epsilon}{P_{-w}(t) + \epsilon}, \qquad c_t^- = \frac{1}{2}\ln\frac{P_{+w}(t) + \epsilon}{P_{-r}(t) + \epsilon}, \qquad g_t(x) = \begin{cases} c_t^+ & \text{if } h_t(x) = 1 \\ c_t^- & \text{if } h_t(x) = -1 \end{cases}$$
where $\epsilon$ is a small positive number. For large values of $N$ the recommended value is $\epsilon = \frac{1}{4N}$, where $N$ is the sample size.

d. Update the probabilities:
$$\text{pre-normalized } p_i = p_i\, e^{-y_i g_t(x_i)}, \qquad \text{normalized } p_i = \frac{\text{pre-normalized } p_i}{Z_t}$$
where $Z_t$ is a normalization factor chosen so that $\sum_i p_i = 1$.

Termination:
$$f_T(x) = \sum_{t=1}^{T} g_t(x)$$

As with standard AdaBoosting we can show that $f_{t+1}$ performs better than $f_t$ in terms of an upper bound on the error in classifying the training data. To measure the algorithm performance define the following error indicator:
$$\mathrm{wrong}(f, x_i, y_i) = \begin{cases} 1 & \text{if } f \text{ does not classify } x_i \text{ as } y_i \\ 0 & \text{if } f \text{ classifies } x_i \text{ as } y_i \end{cases}$$
The fractional error of $f_T$ is: $E_T = \frac{1}{N}\sum_{i=1}^{N} \mathrm{wrong}(f_T, x_i, y_i)$.

Theorem:
$$E_T \leq \prod_{t=1}^{T} 2\left(\sqrt{P_{+r}(t)\,P_{-w}(t)} + \sqrt{P_{+w}(t)\,P_{-r}(t)}\right)$$

Proof: First observe that the following inequality holds: $\mathrm{wrong}(f_T, x_i, y_i) \leq e^{-y_i f_T(x_i)}$; indeed, whenever $f_T$ misclassifies $x_i$ the exponent $-y_i f_T(x_i)$ is nonnegative, so the right-hand side is at least 1. Therefore:
$$E_T \leq \frac{1}{N} \sum_{i=1}^{N} e^{-y_i f_T(x_i)} = \frac{1}{N} \sum_{i=1}^{N} e^{-y_i \sum_{t=1}^{T} g_t(x_i)} \tag{1}$$
Let $p_i^t$ denote the value of $p_i$ at the beginning of iteration $t$. Then:
$$p_i^{T+1} = \frac{p_i^T\, e^{-y_i g_T(x_i)}}{Z_T} = \frac{p_i^{T-1}\, e^{-y_i g_{T-1}(x_i)}\, e^{-y_i g_T(x_i)}}{Z_{T-1} Z_T} = \frac{p_i^1\, e^{-y_i \sum_{t=1}^{T} g_t(x_i)}}{\prod_{t=1}^{T} Z_t} = \frac{1}{N} \cdot \frac{e^{-y_i \sum_{t=1}^{T} g_t(x_i)}}{\prod_{t=1}^{T} Z_t}$$
Combining this with (1) we have:
$$E_T \leq \sum_{i=1}^{N} p_i^{T+1} \prod_{t=1}^{T} Z_t = \prod_{t=1}^{T} Z_t \tag{2}$$
This bound on $E_T$ is in terms of the $Z_t$. It remains to express it in terms of the $P$ values. Define $Q_t(i) = p_i\, e^{-y_i g_t(x_i)}$. Then:
$$Z_t = \sum_{i=1}^{N} Q_t(i) = \sum_{\substack{h_t(x_i)=1 \\ y_i=1}} Q_t(i) + \sum_{\substack{h_t(x_i)=-1 \\ y_i=-1}} Q_t(i) + \sum_{\substack{h_t(x_i)=-1 \\ y_i=1}} Q_t(i) + \sum_{\substack{h_t(x_i)=1 \\ y_i=-1}} Q_t(i)$$
$$= e^{-c_t^+} P_{+r}(t) + e^{c_t^-} P_{-r}(t) + e^{-c_t^-} P_{+w}(t) + e^{c_t^+} P_{-w}(t)$$
Direct computation shows that $Z_t$ is minimized for $c_t^+ = \frac{1}{2}\ln\frac{P_{+r}(t)}{P_{-w}(t)}$ and $c_t^- = \frac{1}{2}\ln\frac{P_{+w}(t)}{P_{-r}(t)}$, which gives:
$$Z_t = 2\left(\sqrt{P_{+r}(t)\,P_{-w}(t)} + \sqrt{P_{+w}(t)\,P_{-r}(t)}\right)$$

Smoothing

With the above formulas for $c_t^+$, $c_t^-$ it may happen that some of the terms are too small or even zero, in which case the $c$ values will be very large or infinite. In practice, such large predictions may cause numerical problems as well as overfitting. The alternative smoothed formulas are:
$$c_t^+ = \frac{1}{2}\ln\frac{P_{+r}(t) + \epsilon}{P_{-w}(t) + \epsilon}, \qquad c_t^- = \frac{1}{2}\ln\frac{P_{+w}(t) + \epsilon}{P_{-r}(t) + \epsilon}$$
where it can be shown that with $\epsilon < \frac{1}{2N}$ this has a minimal effect on the $Z$ values.
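The following Python sketch illustrates the iteration above. It is a minimal sketch, not the exact procedure from the notes: the resampling of Step a is omitted and the $P$ values are computed directly from the current probabilities over all examples (a common weighting-based variant), so the "go back to Step a" test becomes a stopping condition. The names real_adaboost, predict, and the classifiers argument are illustrative choices, not part of the notes.

```python
import numpy as np

def real_adaboost(X, y, classifiers, T):
    """Sketch of Real AdaBoost.
    X: (N, d) array of attribute vectors; y: length-N array of +/-1 labels.
    classifiers: pool of functions h(X) -> array of +/-1 predictions.
    Returns a list of (h_t, c_t_plus, c_t_minus) triples defining f_T.
    """
    N = len(y)
    p = np.full(N, 1.0 / N)   # initialization: p_i = 1/N
    eps = 1.0 / (4 * N)       # smoothing constant recommended in the notes
    model = []
    for t in range(T):
        # Step b: compute the P values and G(j) for every candidate h_j.
        best = None
        for h in classifiers:
            pred = h(X)
            P_pr = p[(pred == 1) & (y == 1)].sum()    # P_{+r}(j)
            P_mr = p[(pred == -1) & (y == -1)].sum()  # P_{-r}(j)
            P_pw = p[(pred == -1) & (y == 1)].sum()   # P_{+w}(j)
            P_mw = p[(pred == 1) & (y == -1)].sum()   # P_{-w}(j)
            G = np.sqrt(P_pr * P_mw) + np.sqrt(P_pw * P_mr)
            if best is None or G < best[0]:
                best = (G, h, P_pr, P_mr, P_pw, P_mw)
        G, h, P_pr, P_mr, P_pw, P_mw = best
        if G >= 0.5:
            break  # no resampling in this sketch, so stop instead of retrying
        # Step c: smoothed weights c_t^+ and c_t^-.
        c_plus = 0.5 * np.log((P_pr + eps) / (P_mw + eps))
        c_minus = 0.5 * np.log((P_pw + eps) / (P_mr + eps))
        model.append((h, c_plus, c_minus))
        # Step d: reweight by exp(-y_i g_t(x_i)) and renormalize.
        g = np.where(h(X) == 1, c_plus, c_minus)
        p = p * np.exp(-y * g)
        p /= p.sum()
    return model

def predict(model, X):
    """f_T(x) = sum_t g_t(x); classify by the sign of f_T."""
    f = sum(np.where(h(X) == 1, cp, cm) for h, cp, cm in model)
    return np.sign(f)
```

Replacing the sampled subset $S_t$ with the full weighted sample keeps the sketch deterministic and matches the quantities used in the proof; a faithful implementation would instead draw $S_t$ with replacement according to the $p_i$ and fit the weak classifiers on it.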
Reference: R. E. Schapire and Y. Singer. Improved Boosting Algorithms Using Confidence-rated Predictions. Machine Learning, 37(3):297-336, 1999.