CS446: Machine Learning, Fall 2017
Lecture 8: Boosting
Lecturer: Sanmi Koyejo                                Scribe: Yuyang Rao, Sep. 21, 2017

1. Recap

1.1 MAP (Maximum a Posteriori)

• MAP is an estimate of an unknown quantity that equals the mode of the posterior distribution. It is one of the simplest forms of regularization. The model is:

      \theta_{MAP} = \arg\max_\theta P(\theta \mid D_n)
                   = \arg\max_\theta \frac{P(D_n \mid \theta) P(\theta)}{P(D_n)}
                   = \arg\max_\theta P(D_n \mid \theta) P(\theta)
                   = \arg\max_\theta \left[ \log P(D_n \mid \theta) + \log P(\theta) \right]

1.2 Variable Selection

• Forward and backward selection:  \log P(\theta) \propto \|\theta\|_0
• LASSO (least absolute shrinkage and selection operator):  \log P(\theta) \propto \|\theta\|_1
• Error function:  Error = g(bias, variance)
• Example: if the error function is defined as E[(y - f(x))^2] = bias^2 + variance, then:
  – Algorithm 1: bias = 4, variance = 2, so Error = 4^2 + 2 = 18.
  – Algorithm 2: bias = 3, variance = 5, so Error = 3^2 + 5 = 14.
  Algorithm 2 achieves the lower error despite its higher variance.

2. Decision Trees (CART)

2.1 CART

• Classification and regression trees, or CART models, also called decision trees (not to be confused with the decision trees used in decision theory), are defined by recursively partitioning the input space and defining a local model in each resulting region of input space. This can be represented by a tree, with one leaf per region, as we explain below (Murphy, 2012).
• It can be difficult to label boundary examples:
  – decide based on the training data;
  – adapt the boxes to the data.
• A box may contain a mixture of data, e.g. both positive and negative examples.
• If there is no data in a box, the tree may not be able to make a decision there.
• Training cost depends on the number of tests (number of splits).
• Prediction time is O(depth of the tree).
• Memory for the model is proportional to the size of the tree (number of tests).
• Why do we care?
  – In the worst case, model size = size of the training data.
  – In general, model size << size of the training data.
• Bias and variance: in general, the bias of a decision tree is higher than that of nearest neighbor (NN) classifiers, and its variance is lower than that of NN.
• Inductive bias:
  – the data is axis-separable;
  – the labels are partitioned into regions.
• Scalable.
• Robust to outliers.
• Easy to mix discrete and continuous features.

2.2 Growing a Tree

• Finding the optimal partitioning of the data is NP-complete (Hyafil and Rivest, 1976), so it is common to use a greedy procedure to compute a locally optimal MLE. This method is used by CART (Breiman et al., 1984), C4.5 (Quinlan, 1993), and ID3 (Quinlan, 1986), three popular implementations of the method (Murphy, 2012). To grow a decision tree, we need:
  – (a) the dimension that we are splitting on;
  – (b) the thresholds and regions.
• The split function chooses the best feature, and the best value for that feature, as follows (a minimal sketch of this search appears at the end of this subsection):

      (j^*, t^*) = \arg\min_{j \in \{1, \dots, D\}} \; \min_{t \in T_j} \; \mathrm{cost}(\{(x_i, y_i) : x_{ij} \le t\}) + \mathrm{cost}(\{(x_i, y_i) : x_{ij} > t\})

• The cost function for a given dataset is defined below (Section 2.3). For notational simplicity, we have assumed all inputs are real-valued or ordinal, so it makes sense to compare a feature x_{ij} to a numeric value t. The set of possible thresholds T_j for feature j can be obtained by sorting the unique values of x_{ij}. For example, if feature 1 has the values [4.5, 12, 72, 12], then we set T_1 = [4.5, 12, 72]. In the case of categorical inputs, the most common approach is to consider splits of the form x_{ij} = c_k and x_{ij} \ne c_k, for each possible class label c_k. Although we could allow for multi-way splits (resulting in non-binary trees), this would result in data fragmentation, meaning too little data might fall into each subtree, resulting in overfitting (Murphy, 2012).
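To make the greedy split search concrete, here is a minimal Python sketch (an illustration, not code from the lecture). It scores every candidate (feature, threshold) pair with the Gini index defined in Section 2.3 below, weighting each side of the split by its size, which is one common convention for the cost; the names gini and best_split are illustrative.

```python
import numpy as np

def gini(y):
    """Gini impurity of a set of labels: 1 - sum_c pi_c^2."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    pi = counts / counts.sum()
    return 1.0 - np.sum(pi ** 2)

def best_split(X, y):
    """Greedy search for (j*, t*): for each feature j, try every unique
    value as a threshold t and keep the split with the lowest cost."""
    n, D = X.shape
    best_j, best_t, best_cost = None, None, np.inf
    for j in range(D):
        for t in np.unique(X[:, j]):        # candidate thresholds T_j
            left = X[:, j] <= t
            if left.all():                  # right side would be empty
                continue
            right = ~left
            # size-weighted impurity of the two induced regions
            cost = (left.sum() * gini(y[left]) + right.sum() * gini(y[right])) / n
            if cost < best_cost:
                best_j, best_t, best_cost = j, t, cost
    return best_j, best_t, best_cost
```

Calling best_split(X, y) on a labeled dataset returns the locally optimal first split; growing the full tree amounts to recursing on the two resulting partitions until a stopping criterion (maximum depth, minimum leaf size, or pure leaves) is met.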
2.3 Cost Functions

• In the classification setting, there are several ways to measure the quality of a split. First, we fit a multinoulli model to the data in the leaf satisfying the test X_j < t by estimating the class-conditional probabilities as follows:

      \pi_c = \frac{1}{n} \sum_{i=1}^{n} 1[y_i = c]

  There are several common error measures for evaluating a proposed partition:

• (a) Misclassification rate. We define the most probable class label as \hat{y} = \arg\max_c \pi_c. The corresponding error rate is then

      \frac{1}{n} \sum_{i=1}^{n} 1[y_i \ne f(x_i)] = 1 - \pi_{\hat{y}}

• (b) Entropy, or deviance:

      H(\pi) = -\sum_{c=1}^{C} \pi_c \log \pi_c

  Note that minimizing the entropy is equivalent to maximizing the information gain.

• (c) Information gain:

      \mathrm{infoGain} = H(\text{parent}) - H(\text{split})
                        = H(Y) - H(Y \mid X_j < t)
                        = \left( -\sum_c p(y = c) \log p(y = c) \right) + \left( \sum_c p(y = c \mid X_j < t) \log p(y = c \mid X_j < t) \right)

• (d) Gini index:

      \sum_{c=1}^{C} \pi_c (1 - \pi_c) = \sum_c \pi_c - \sum_c \pi_c^2 = 1 - \sum_c \pi_c^2

  This is the expected error rate. To see this, note that \pi_c is the probability a random entry in the leaf belongs to class c, and 1 - \pi_c is the probability it would be misclassified.

[Figure 1: Node impurity measures for binary classification. The horizontal axis corresponds to p, the probability of class 1. The entropy measure has been rescaled to pass through (0.5, 0.5). Based on Figure 9.3 of Hastie et al. (2009); figure generated by giniDemo (Murphy, 2012).]

• In the figure above, we see that the cross-entropy and Gini measures are very similar, and are more sensitive to changes in class probability than the misclassification rate.

3. Boosting (Binary Classifiers)

Boosting (Schapire and Freund, 2012) is a greedy algorithm for fitting adaptive basis-function models. The algorithm works by applying the weak learner sequentially to weighted versions of the data, where more weight is given to examples that were misclassified by earlier rounds.

• Weak classifier. The error rate of a classifier f is

      \mathrm{ErrorRate}(f) = \frac{1}{n} \sum_{i=1}^{n} 1[y_i \ne f(x_i)]

  Suppose we have a random classifier:

      f(x) = \begin{cases} +1 & \text{w.p. } 0.5 \\ -1 & \text{w.p. } 0.5 \end{cases}

  Then ErrorRate(random classifier) = 0.5, since its expected error is

      E(f) = E_X E_{Y \mid X}\left[ 1[y = 1]\, 1[f(x) = -1] + 1[y = -1]\, 1[f(x) = 1] \right]

  where

      E_{Y \mid X}[1[y = 1]] = \sum_y 1[y = 1] P(y \mid x) = P(Y = 1 \mid X)

  and the random classifier outputs \pm 1 with probability 0.5 independently of x, so the expression evaluates to 0.5. In general, a weak classifier has an error rate less than 0.5, i.e. it is only slightly better than a random classifier.

• Step-wise procedure (see the sketch at the end of these notes).
  – Suppose we have m different classifiers f_1, f_2, \dots, f_m:

      Step 1: learn f_1
      Step 2: learn f_2 \mid f_1,                            f = \frac{1}{2}(f_1 + f_2)
      Step 3: learn f_3 \mid f_2, f_1,                       f = \frac{1}{3}(f_1 + f_2 + f_3)
      ...
      Step m: learn f_m \mid f_{m-1}, f_{m-2}, \dots, f_1,   f = \frac{1}{m} \sum_{k=1}^{m} f_k

  – Prediction: \mathrm{sign}\left( \sum_{m=1}^{M} f_m(x) \right), where f_m(x) = \alpha_m \phi_m(x).
  – The combined classifier's error rate is better than that of a random classifier. Boosting can achieve ANY accuracy on the training set, depending on the number of weak classifiers. Boosting was originally derived in the computational learning theory literature (Schapire, 1990; Freund and Schapire, 1996), where the focus is binary classification. In these papers, it was proved that one could boost the performance (on the training set) of any weak learner arbitrarily high, provided the weak learner could always perform slightly better than chance. For example, in Figure 2, we plot the
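As a concrete sketch of the step-wise scheme above: the canonical instantiation with weighted data and sign(\sum_m \alpha_m \phi_m(x)) predictions is AdaBoost (Freund and Schapire, 1996), which these notes cite. The minimal Python sketch below uses decision stumps as the weak learners \phi_m and assumes binary labels in {-1, +1}; it illustrates the general reweighting idea rather than reproducing the lecture's exact derivation, and the names fit_stump, adaboost, and predict are illustrative.

```python
import numpy as np

def fit_stump(X, y, w):
    """Weighted decision stump (the weak learner phi_m): pick the
    (feature j, threshold t, sign s) rule with the lowest weighted error."""
    n, D = X.shape
    best = (0, 0.0, 1, np.inf)
    for j in range(D):
        for t in np.unique(X[:, j]):
            for s in (1, -1):
                pred = np.where(X[:, j] <= t, s, -s)
                err = np.sum(w * (pred != y))
                if err < best[3]:
                    best = (j, t, s, err)
    return best[:3]

def adaboost(X, y, M=50):
    """AdaBoost-style boosting: reweight misclassified examples each
    round and keep a weight alpha_m for every weak learner."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # start with uniform weights
    rules, alphas = [], []
    for _ in range(M):
        j, t, s = fit_stump(X, y, w)
        pred = np.where(X[:, j] <= t, s, -s)
        err = max(np.sum(w * (pred != y)), 1e-12)
        alpha = 0.5 * np.log((1.0 - err) / err)  # weight of this weak learner
        w = w * np.exp(-alpha * y * pred)        # upweight the mistakes
        w = w / w.sum()
        rules.append((j, t, s))
        alphas.append(alpha)
    return rules, alphas

def predict(X, rules, alphas):
    """Final prediction: sign of the weighted vote sum_m alpha_m phi_m(x)."""
    F = np.zeros(len(X))
    for (j, t, s), a in zip(rules, alphas):
        F = F + a * np.where(X[:, j] <= t, s, -s)
    return np.sign(F)
```

Running rules, alphas = adaboost(X_train, y_train, M=50) and then predict(X_test, rules, alphas) yields the sign(\sum_m \alpha_m \phi_m(x)) prediction described above; as more weak learners are added, the training error can be driven arbitrarily low as long as each stump stays slightly better than chance.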

