# ILLINOIS CS 446 - 092617.1 (7 pages)



- Pages: 7
- School: University of Illinois - Urbana
- Course: CS 446 - Machine Learning


**CS 446 Machine Learning, Fall 2017. Lecture 9: AdaBoost.**
Lecturer: Sanmi Koyejo. Scribe: Chaoyun Chen. September 26, 2017.

## Recap

### 1 Decision Tree

**Idea of decision trees.** Construct learners by using axis-aligned splits to divide the input space into regions.

**How to build a tree.** At every node, the tree is grown by choosing the split

$$(j^*, t^*) = \arg\min_{j \in \{1,\dots,d\}} \; \min_{t \in \mathcal{T}_j} \Big[ \mathrm{cost}\big(\{(x_i, y_i) : x_{ij} \le t\}\big) + \mathrm{cost}\big(\{(x_i, y_i) : x_{ij} > t\}\big) \Big],$$

where $j$ ranges over dimensions, $t$ over thresholds, and $\mathcal{T}_j$ is the set of all candidate thresholds for dimension $j$. We minimize the combined cost of the left and right splits. Examples of cost functions: entropy, Gini index, error rate.

**How to make a prediction.** In the binary case we apply a majority vote within each region $R_m$ (see Figure 1). For multi-class problems we still apply a majority vote, labeling each region with its most popular class. For regression we can use the square loss.

*Figure 1: Decision Tree Region Split.*

**How to optimize a decision tree.** Decision trees tend to have low bias and high variance, similar to the nearest-neighbor classifier. Some approaches can improve these properties:

- *Pruning*: obtain a smaller tree by removing leaves from the tree (see Figure 2).
  - *Why?* Pruning reduces overfitting (at the cost of increased bias) by merging regions together. Test and validation error measure the degree of overfitting, which helps us decide whether removing an edge will help prediction.
  - *How?* Choose an evaluation metric. Because we care about the accuracy and error rate of the overall model, we can compare them on a validation set before and after merging regions.
- *Randomized averaging (bagging)*: use a random forest to reduce variance. We will cover this in the next few weeks.

*Figure 2: Decision Tree Pruning.*

### 2 Random Classifier

Let $h$ be a random classifier such that

$$h(x) = \begin{cases} +1 & \text{with probability } 0.5, \\ -1 & \text{with probability } 0.5. \end{cases}$$

Its error rate is $0.5$, which we can prove as follows:

$$\mathrm{Err}_n(h) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[y_i \neq h(x_i)] = 1 - \mathrm{ACC}_n(h),$$

$$\mathrm{ACC}_n(h) = \frac{1}{n} \sum_{i=1}^{n} \Big( \mathbb{1}[y_i = 1]\,\mathbb{1}[h(x_i) = 1] + \mathbb{1}[y_i = -1]\,\mathbb{1}[h(x_i) = -1] \Big) = \pi \cdot 0.5 + (1 - \pi) \cdot 0.5 = 0.5,$$

where $\pi = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[y_i = 1]$. Hence $\mathrm{Err}_n(h) = 1 - 0.5 = 0.5$, regardless of the class balance $\pi$.

### 3 Boosting

**Idea of boosting.** Learn a strong classifier by stagewise combining weak classifiers. The keywords here are *stagewise combining* and *weak classifier*.
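The 0.5 error rate holds for any class balance, which is easy to check numerically. A minimal simulation sketch (the class balance of 0.3, the sample size, and the seed are arbitrary choices, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Labels with an arbitrary class balance pi = P(y = +1) = 0.3.
n = 100_000
y = np.where(rng.random(n) < 0.3, 1, -1)

# Random classifier: h(x) = +1 or -1, each with probability 0.5,
# independent of the input and of y.
h = np.where(rng.random(n) < 0.5, 1, -1)

# Err_n(h) = (1/n) * sum 1[y_i != h(x_i)], which concentrates around 0.5.
err = np.mean(y != h)
```

Changing the 0.3 to any other class balance leaves `err` near 0.5, matching the proof above.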
**Weak classifier.** By a weak classifier we usually mean one whose error rate is below, but close to, $0.5$, the error rate of a random classifier. An example is the *decision stump*, a decision tree with only one level: a very simple classifier that makes slightly better predictions than a random classifier.

*Figure 3: Example of a Decision Stump.*

**Stagewise.** The idea is to accumulate classifiers one by one, each with a weight:

$$f(x) = \sum_{m=1}^{k} \alpha_m \phi_m(x),$$

where the $\alpha_m$ are weights and the $\phi_m$ are weak classifiers. We train the model stagewise, fitting the classifiers in order from $1$ to $k$. At the $m$-th stage, the loss $L_m$ could be the logarithmic loss, the exponential loss, or the square loss. We will focus on the exponential loss

$$\ell_{\exp}(f, D_n) = \frac{1}{n} \sum_{i=1}^{n} e^{-y_i f(x_i)}.$$

Boosting with the exponential loss gives **AdaBoost** (short for Adaptive Boosting), the first boosting algorithm. At stage $m$ we minimize

$$L_m = \sum_{i=1}^{n} \exp\big( -y_i \, (f_{m-1}(x_i) + \alpha \, \phi(x_i)) \big), \tag{1}$$

where $f_{m-1}$ is the model accumulated so far. In detail, the stagewise construction is

$$f_1 = \alpha_1 \phi_1(x),$$
$$f_2 = f_1 + \alpha_2 \phi_2(x),$$
$$f_3 = f_2 + \alpha_3 \phi_3(x) = \alpha_1 \phi_1(x) + \alpha_2 \phi_2(x) + \alpha_3 \phi_3(x),$$
$$\vdots$$
$$f_k = \sum_{m=1}^{k} \alpha_m \phi_m(x).$$

The goal at step $m$ is to find the $\alpha$ and $\phi$ that minimize $L_m$. We can rewrite (1) as

$$L_m = \sum_{i=1}^{n} w_i^{(m)} \exp(-y_i \, \alpha \, \phi(x_i)), \qquad w_i^{(m)} = \exp(-y_i f_{m-1}(x_i)).$$

Because both $y_i$ and $\phi(x_i)$ lie in $\{-1, +1\}$, their product is $+1$ when they agree and $-1$ when they disagree. We can therefore split the sum:

$$L_m = e^{-\alpha} \sum_{i : y_i = \phi(x_i)} w_i^{(m)} + e^{\alpha} \sum_{i : y_i \neq \phi(x_i)} w_i^{(m)} = (e^{\alpha} - e^{-\alpha}) \sum_{i=1}^{n} w_i^{(m)} \, \mathbb{1}[y_i \neq \phi(x_i)] + e^{-\alpha} \sum_{i=1}^{n} w_i^{(m)}.$$

Thus we can calculate the parameters:

$$\phi_m = \arg\min_{\phi} \sum_{i=1}^{n} w_i^{(m)} \, \mathbb{1}[y_i \neq \phi(x_i)],$$

$$\alpha_m = \frac{1}{2} \log \frac{1 - \mathrm{error}_m}{\mathrm{error}_m}, \qquad \mathrm{error}_m = \frac{\sum_{i=1}^{n} w_i^{(m)} \, \mathbb{1}[y_i \neq \phi_m(x_i)]}{\sum_{i=1}^{n} w_i^{(m)}}.$$

**AdaBoost step by step (binary).**

1. Initialize $w_i = 1/N$.
2. For $m$ from $1$ to $k$:
   - Fit a classifier $\phi_m(x)$ using the weighted loss with weights $w_i^{(m)}$.
   - Compute $\mathrm{error}_m$.
   - Compute $\alpha_m = \log \frac{1 - \mathrm{error}_m}{\mathrm{error}_m}$.
   - Set the new weights $w_i^{(m+1)} = w_i^{(m)} \exp\big( \alpha_m \, \mathbb{1}[y_i \neq \phi_m(x_i)] \big)$.
3. Return $f(x) = \mathrm{sign}\Big( \sum_{m=1}^{k} \alpha_m \phi_m(x) \Big)$.

(The algorithm's $\alpha_m$ is twice the derived one. Since $\exp(-y_i \alpha \phi(x_i)) = e^{-\alpha} \exp\big(2\alpha \, \mathbb{1}[y_i \neq \phi(x_i)]\big)$ and the common factor $e^{-\alpha}$ cancels when computing $\mathrm{error}_m$, the two weight updates are equivalent.)

**In practice**, boosting works best when the $\phi_m$ are weak classifiers.
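The step-by-step procedure above can be sketched in NumPy. This is an illustrative implementation, not the lecture's code: `fit_stump`, `stump_predict`, and `adaboost` are made-up names, and a decision stump serves as the weak learner.

```python
import numpy as np

def fit_stump(X, y, w):
    """Weighted decision stump: exhaustively search feature j, threshold t,
    and sign s to minimize the weighted error sum_i w_i * 1[y_i != phi(x_i)]."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (1, -1):
                pred = np.where(X[:, j] > t, s, -s)
                err = np.sum(w * (pred != y))
                if err < best_err:
                    best, best_err = (j, t, s), err
    return best

def stump_predict(stump, X):
    j, t, s = stump
    return np.where(X[:, j] > t, s, -s)

def adaboost(X, y, k):
    """AdaBoost (binary labels in {-1, +1}), following the steps above."""
    n = len(y)
    w = np.full(n, 1.0 / n)                    # 1. initialize w_i = 1/N
    model = []                                 # list of (alpha_m, phi_m)
    for _ in range(k):                         # 2. for m = 1..k
        stump = fit_stump(X, y, w)             #    fit phi_m with weighted loss
        miss = stump_predict(stump, X) != y
        error = np.sum(w * miss) / np.sum(w)   #    compute error_m
        if error == 0:                         # perfect weak learner: stop early
            model.append((1.0, stump))
            break
        alpha = np.log((1 - error) / error)    #    compute alpha_m
        w = w * np.exp(alpha * miss)           #    upweight misclassified points
        model.append((alpha, stump))
    def f(Xq):                                 # 3. f(x) = sign(sum alpha_m phi_m(x))
        return np.sign(sum(a * stump_predict(s, Xq) for a, s in model))
    return f
```

For instance, the 1-D pattern `y = [-1, +1, +1, -1]` cannot be fit by any single stump, but a combination of three weighted stumps classifies it correctly, which is the point of stagewise combination.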
Boosting is also:

- Useful for reducing bias and variance.
- Margin-maximizing, where the margin is the distance from a data point to the decision boundary.
- A go-to method for machine learning competitions, in particular XGBoost, a greedy algorithm.

### 4 Empirical Risk Minimization

$$\min_{h \in \mathcal{H}} R(h, D_n)$$

**Square loss.** The risk function is

$$R(h, D_n) = \frac{1}{n} \sum_{i=1}^{n} (y_i - h(x_i))^2.$$

If $h(x) = w^T x$, then this is equivalent to linear regression. The insight is that we can start from two different directions, a probabilistic model or a loss function, and end up at the same problem.

We can rewrite the loss functions in terms of the margin $y \, h(x)$ (see Figure 4). The exponential loss is $e^{-y h(x)}$ and the log loss is $\log(1 + e^{-y h(x)})$.

*Figure 4: Loss functions for binary classification.*

**Surrogate loss functions.** $R(h, D_n)$ has the property that, given enough data, minimizing it yields the same result as minimizing the risk at the population level:

$$\min_{h \in \mathcal{H}} R(h, D_n) \;\xrightarrow{n \to \infty}\; \min_{h \in \mathcal{H}} R(h, P).$$

This will be covered more in the next lecture.
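The claim that square-loss ERM over linear hypotheses is exactly linear regression can be checked directly. A minimal sketch, where `erm_square_loss` is an illustrative name and `np.linalg.lstsq` stands in for solving the least-squares problem:

```python
import numpy as np

def erm_square_loss(X, y):
    """Minimize R(h, Dn) = (1/n) * sum_i (y_i - w^T x_i)^2 over linear h.

    This empirical risk minimization problem is ordinary least-squares
    linear regression; lstsq solves it stably even when X is rank-deficient.
    """
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def empirical_risk(w, X, y):
    """Empirical risk R(h, Dn) under the square loss for h(x) = w^T x."""
    return np.mean((y - X @ w) ** 2)
```

On data generated exactly by a linear rule (here $y = 1 + 2x$, with a constant feature for the intercept), the empirical risk minimizer recovers the coefficients and drives the risk to zero.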
