CS446: Machine Learning, Fall 2017
Lecture 9: AdaBoost
Lecturer: Sanmi Koyejo        Scribe: Chaoyun Chen, Sep. 26th, 2017

Recap

1. Decision Trees

Idea of decision trees
Construct learners by using axis-aligned splits to partition the input space into regions.

How to build a tree
The criterion used to grow the tree at each node is

    (j^*, t^*) = \arg\min_{j \in \{1,\dots,d\}} \min_{t \in \tau_j} \Big[ \mathrm{cost}\big(\{(x_i, y_i) : x_{ij} \le t\}\big) + \mathrm{cost}\big(\{(x_i, y_i) : x_{ij} > t\}\big) \Big]

where j is the split dimension, t is the threshold, and \tau_j is the set of candidate thresholds for dimension j. We minimize the combined cost of the left split and the right split.

Examples of cost functions:
- Entropy
- Gini index
- Error rate

How to make predictions
For the binary case, we apply a majority vote inside each region R_m, as shown in Figure 1.

Figure 1: Decision tree region split

For the multi-class case, we still apply a majority vote and label each region with its most popular class. For regression, we can use the square loss.

How to optimize a decision tree
Decision trees tend to have low bias and high variance, similar to the nearest-neighbor classifier. Some approaches for improving these properties:

Pruning: obtain a smaller tree by removing leaves from the tree.

Figure 2: Decision tree pruning

Why: pruning reduces over-fitting and increases bias by merging regions together. Test and validation error measure the degree of over-fitting, which helps us decide whether removing an edge will improve prediction.

How: choose an evaluation metric. Because we care about accuracy and error rate for the overall model, we can compare these on a validation set with the regions merged versus kept separate.

Randomized averaging (bagging): use a random forest to reduce variance (this will be covered in the next few weeks).

2. Random Classifier

Let h be a random classifier:

    h(x) = \begin{cases} +1 & \text{with probability } 0.5 \\ -1 & \text{with probability } 0.5 \end{cases}

Its error rate is 0.5, which we can show as follows:

    \mathrm{Err}_n(h) = \frac{1}{n} \sum_{i=1}^n 1[y_i \ne h(x_i)]
                      = 1 - \mathrm{ACC}_n(h)
                      = 1 - \frac{1}{n} \sum_{i=1}^n \big( 1[y_i = 1]\,1[h(x_i) = 1] + 1[y_i = -1]\,1[h(x_i) = -1] \big)
                      = 1 - 0.5\pi - 0.5(1 - \pi)
                      = 0.5

where \pi = \frac{1}{n} \sum_{i=1}^n 1[y_i = 1]; the second-to-last step takes the expectation over the randomness of h, since each indicator involving h(x_i) equals 1 with probability 0.5.

3. Boosting

Idea of boosting
Learn a strong classifier by stage-wise combination of weak classifiers. The key phrases here are stage-wise combination and weak classifier.

Weak classifier
By a weak classifier we usually mean one whose error rate is below, but close to, 0.5, which is the error rate of the random classifier.

Example of a weak classifier: a decision stump, which is a decision tree with only one level. This is a very simple classifier that predicts only slightly better than the random classifier.

Figure 3: Example of a decision stump

Stage-wise
The idea is to accumulate classifiers one by one, each with a weight:

    f(x) = \sum_{m=1}^{k} \alpha_m \phi_m(x)

Here \alpha_m and \phi_m are the weights and the weak classifiers. We train the model stage-wise, that is, we fit the classifiers in order from \phi_1 to \phi_k.

At the m-th stage, the loss L_m(\phi) could be the log loss, the exponential loss, or the square loss. We will focus on the exponential loss:

    \mathrm{exp\_loss}(f, D_n) = \frac{1}{n} \sum_{i=1}^n e^{-y_i f(x_i)}

Boosting with the exponential loss gives AdaBoost. AdaBoost is short for "Adaptive Boosting", and it was the first boosting algorithm. Its stage-wise objective is

    L_m(\phi) = \sum_{i=1}^n \exp\big(-y_i [f_{m-1}(x_i) + \beta \phi(x_i)]\big)        (1)

In equation (1), f_{m-1}(x) + \beta \phi(x) is the previous model with the new weak classifier \phi added with weight \beta. The stage-wise construction can be spelled out as:

    f_1(x) = \alpha_1 \phi_1(x)
    f_2(x) = f_1(x) + \alpha_2 \phi_2(x)
    f_3(x) = f_2(x) + \alpha_3 \phi_3(x)
           = f_1(x) + \alpha_2 \phi_2(x) + \alpha_3 \phi_3(x)
           = \alpha_1 \phi_1(x) + \alpha_2 \phi_2(x) + \alpha_3 \phi_3(x)
    ...
    f_k(x) = \sum_{m=1}^{k} \alpha_m \phi_m(x)

The goal at stage m is to find the \beta and \phi that minimize L_m(\beta, \phi).
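Before continuing with the derivation, here is a minimal sketch of how the stage-wise model f(x) = sum_m alpha_m phi_m(x) and its exponential loss could be evaluated in code. The decision-stump parameterization, the function names (stump_predict, ensemble_predict, exp_loss), and the toy data are illustrative assumptions, not from the notes.

```python
import numpy as np

def stump_predict(X, feature, threshold, sign=1):
    """Decision stump phi(x): predict +sign where X[:, feature] > threshold, else -sign."""
    return np.where(X[:, feature] > threshold, sign, -sign)

def ensemble_predict(X, stumps, alphas):
    """Stage-wise additive model f(x) = sum_m alpha_m * phi_m(x)."""
    f = np.zeros(X.shape[0])
    for (feature, threshold, sign), alpha in zip(stumps, alphas):
        f += alpha * stump_predict(X, feature, threshold, sign)
    return f

def exp_loss(f_values, y):
    """Exponential loss (1/n) * sum_i exp(-y_i * f(x_i))."""
    return float(np.mean(np.exp(-y * f_values)))

# Tiny illustration on made-up data with labels in {-1, +1}.
X = np.array([[0.1], [0.4], [0.6], [0.9]])
y = np.array([-1, -1, 1, 1])
stumps = [(0, 0.5, 1), (0, 0.3, 1)]   # each weak learner: (feature, threshold, sign)
alphas = [1.0, 0.5]
print(exp_loss(ensemble_predict(X, stumps, alphas), y))
```

The triple (feature, threshold, sign) is just one convenient way to encode a one-level tree; the point is that f is a weighted sum of simple +/-1 predictions, and the exponential loss heavily penalizes points the current ensemble gets wrong.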
We can rewrite equation (1) as

    L_m(\beta, \phi) = \sum_{i=1}^n w_{i,m} \exp(-\beta y_i \phi(x_i)),   where   w_{i,m} = \exp(-y_i f_{m-1}(x_i))

Because both y_i and \phi(x_i) take values in \{+1, -1\}, their product is +1 when they agree and -1 when they disagree. Splitting the sum accordingly:

    L_m = e^{-\beta} \sum_{y_i = \phi(x_i)} w_{i,m} + e^{\beta} \sum_{y_i \ne \phi(x_i)} w_{i,m}
        = (e^{\beta} - e^{-\beta}) \sum_{i=1}^n w_{i,m} 1[y_i \ne \phi(x_i)] + e^{-\beta} \sum_{i=1}^n w_{i,m}

Minimizing first over \phi and then over \beta gives the parameters:

    \phi_m = \arg\min_{\phi} \sum_{i=1}^n w_{i,m} 1[y_i \ne \phi(x_i)]
    \beta_m = \frac{1}{2} \log \frac{1 - \mathrm{error}_m}{\mathrm{error}_m},   \alpha_m = 2\beta_m
    \mathrm{error}_m = \frac{\sum_{i=1}^n w_{i,m} 1[y_i \ne \phi_m(x_i)]}{\sum_{i=1}^n w_{i,m}}

Step by step AdaBoost (binary); a minimal code sketch of this loop is given at the end of these notes:
- Initialize w_{i,1} = 1/n for i = 1, \dots, n.
- For m = 1 to k:
  - Fit a classifier \phi_m(x) by minimizing the weighted loss with weights w_{i,m}.
  - Compute the weighted error \mathrm{error}_m.
  - Compute \alpha_m = \log \frac{1 - \mathrm{error}_m}{\mathrm{error}_m}.
  - Update the weights: w_{i,m+1} = w_{i,m} \exp\big(\alpha_m 1[y_i \ne \phi_m(x_i)]\big).
- Return f(x) = \mathrm{sign}\big[\sum_{m=1}^{k} \alpha_m \phi_m(x)\big].

In practice
- Boosting works best when the \phi_m are weak classifiers.
- It is useful for reducing both bias and variance.
- It maximizes the margin, where the margin is the distance from a data point to the decision boundary.
- It is a go-to method for machine learning competitions, in particular XGBoost, a greedy algorithm.

4. Empirical Risk Minimization

    \min_{h \in \mathcal{H}} \hat{R}(h, D_n)

Square loss
The empirical risk is

    \hat{R}(h, D_n) = \frac{1}{n} \sum_{i=1}^n (y_i - h(x_i))^2

If h(x) = w^T x, this is equivalent to linear regression. The insight is that we can start from two different directions, a probabilistic model or a loss function, and end up with the same problem.

For binary classification we can rewrite the loss functions in terms of the product y h(x), as plotted in Figure 4.

Figure 4: Loss functions for binary classification

In this form, the exponential loss is e^{-y h(x)} and the log loss is \log[1 + e^{-y h(x)}].

Surrogate loss functions
\hat{R}(h, D_n) has the property that, given enough data, minimizing it yields the same result as minimizing the risk at the population level:

    \min_{h \in \mathcal{H}} \hat{R}(h, D_n) \;\to\; \min_{h \in \mathcal{H}} R(h, P)   as   n \to \infty

This will be covered more in the next lecture.
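As referenced in the AdaBoost step-by-step above, the following is a minimal sketch of the binary training loop. Using scikit-learn's DecisionTreeClassifier with max_depth=1 as the decision-stump weak learner, as well as the function names (adaboost_fit, adaboost_predict) and the toy data, are illustrative assumptions and not part of the lecture notes.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, k=50):
    """Binary AdaBoost with decision stumps; y must take values in {-1, +1}."""
    n = X.shape[0]
    w = np.full(n, 1.0 / n)                    # initial weights w_{i,1} = 1/n
    stumps, alphas = [], []
    for m in range(k):
        # Fit the weak classifier phi_m on the weighted sample.
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        # Weighted error of phi_m (clipped to avoid log(0) for a perfect stump).
        error = np.sum(w * (pred != y)) / np.sum(w)
        error = np.clip(error, 1e-10, 1 - 1e-10)
        alpha = np.log((1 - error) / error)
        # Re-weight: multiply the weight of each misclassified point by exp(alpha).
        w = w * np.exp(alpha * (pred != y))
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    """f(x) = sign( sum_m alpha_m * phi_m(x) )."""
    scores = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(scores)

# Usage sketch on toy data with labels in {-1, +1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
stumps, alphas = adaboost_fit(X, y, k=20)
print(np.mean(adaboost_predict(X, stumps, alphas) == y))   # training accuracy
```

The weights are left unnormalized here, mirroring the update listed in the notes; normalizing w after each round would give the same classifier, since error_m divides by the total weight.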

