CS446: Machine Learning, Fall 2017
Lecture 8: Decision Trees
Lecturer: Sanmi Koyejo        Scribe: Chaoyue Cui, Aug. 21st, 2017

Outline
• Recap
• CART
• Boosting (AdaBoost)

Recap

MAP
The MAP estimate is the maximum a posteriori probability estimate. The posterior distribution of \theta can be derived from Bayes' theorem:

    P(\theta \mid D_n) = \frac{P(D_n \mid \theta)\, P(\theta)}{P(D_n)}.

We then estimate \theta as the mode of the posterior distribution. In some situations the calculation is easier after taking the log of the equation above, which does not change the maximizer. Besides, the denominator of the above equation is always positive and does not depend on \theta. Wiki (2017b) So we can omit it in the derivation:

    \hat{\theta} = \operatorname{argmax}_{\theta} P(\theta \mid D_n) = \operatorname{argmax}_{\theta} \big( \log P(D_n \mid \theta) + \log P(\theta) \big).

Variable Selection
Variable selection, also known as feature selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features (variables, predictors) for use in model construction. Wiki (2017a) Different choices of \log P(\theta) in the MAP objective give different selection methods:
• \log P(\theta) \propto \|\theta\|_0, where \|\cdot\|_0 is the pseudo-norm (the number of nonzero entries). Two algorithms based on this are forward selection and backward selection.
• \log P(\theta) \propto \|\theta\|_1, where \|\cdot\|_1 is the L1 norm.

Bias and Variance
Error can be written as a function of bias and variance:

    Error = g(bias, variance).

Typically the two move in opposite directions: when the bias goes up the variance goes down, and vice versa. The individual values of bias and variance matter less than how they trade off against each other.

Example: Assume that the error is the squared loss, so that

    E[(y - f(x))^2] = bias^2 + variance.

Suppose one algorithm has bias 4 and variance 3, while another has bias 3 and variance 5. In the first case E = 4^2 + 3 = 19, while in the second E = 3^2 + 5 = 14. Even though the variance is smaller in the first case, its error is larger.

Decision Tree (CART)

Figure 1: Example 1

Suppose a 2D space is divided into boxes by orthogonal lines parallel to the x_1 and x_2 axes. Each box may contain several data points, and each data point is labeled positive (+) or negative (−); see Figure 1. We want to draw a boundary separating the regions with positive labels from the regions with negative labels. We could label each "small" box as positive or negative and then split accordingly. However, a few issues (listed below) need to be addressed:
• We cannot label the boundary examples.
  – Solution 1: Decide on \le or >.
  – Solution 2: Adapt the boxes to the training data.
• Boxes may contain a mixture of labels.
  – Solution: Use majority voting to decide the label.
• A box may contain no data at all, in which case it is not clear how to make a decision there.

To avoid these problems we can use a decision tree. Decision trees are also called classification and regression trees (abbreviated "CART"). A CART model is defined by recursively partitioning the input space and defining a local model in each resulting region of input space. Murphy (2012)

Consider another "splitting problem", shown in Figure 2. First we can find a line (x_1 = 3) parallel to the x_2 axis that splits the two kinds of labels; in the figure, the region to the left of this line is labeled +. The region to the right of the line, however, contains both + and −, so we continue by drawing another line (x_2 = 1) parallel to the x_1 axis that separates + from −. The resulting decisions can be expressed more clearly as the tree in Figure 3.

Figure 2: Example 2

Figure 3: Decision Tree
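As a small illustration, the two splits from Example 2 can be written directly as nested tests, which is exactly what the tree in Figure 3 encodes. This is only a sketch: which of the two right-hand regions is + and which is − is shown only in the figure, so those branch labels below are assumptions, and the choice of \le versus < is the boundary decision discussed above.

def predict(x1, x2):
    # First split: the vertical line x1 = 3; the region to its left is labeled +.
    if x1 <= 3:
        return "+"
    # Second split (only to the right of x1 = 3): the horizontal line x2 = 1.
    # The labels on these two branches are assumed for illustration; the
    # lecture's Figures 2 and 3 fix the actual assignment.
    if x2 <= 1:
        return "-"
    return "+"

Each leaf of the tree in Figure 3 corresponds to one return statement, i.e., to one rectangular region of the input space with its local (constant) label.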
Properties of decision trees:
• Training cost depends on the number of tests (splits).
• The memory needed to store the model is proportional to the size of the decision tree. (Question: why should we care about memory? Answer: in the worst case the model is as large as the training data; in general it is much smaller.)
• Prediction speed depends on the depth of the tree.
• The bias of a decision tree is higher than that of the nearest neighbor method, while its variance is lower.
• There are two inductive biases for decision trees. First, the data is axis-separable. Second, the labels (classes) are partitioned into regions.
• Decision trees have several advantages:
  – They are scalable.
  – They are robust to outliers.
  – They easily mix discrete and continuous features.

Growing a Decision Tree
When growing a decision tree we need to settle two issues: which dimension of the space to split on, and, for each dimension, which threshold to split at, as in the previous example. For each dimension, the number of potential thresholds is equal to the number of samples. The split function chooses the best feature, and the best value for that feature, as follows Murphy (2012):

    (j^*, t^*) = \operatorname{argmin}_{j \in \{1, 2, \dots, D\}} \min_{t \in \tau_j} \; \mathrm{cost}(\{x_i, y_i : x_{ij} \le t\}) + \mathrm{cost}(\{x_i, y_i : x_{ij} > t\}),

where \tau_j is the set of all thresholds for dimension j, and cost is some abstract function used to evaluate the quality of a proposed split; its form depends on whether the setting is regression or classification. Murphy (2012)

In a classification setting there are many ways to measure the quality of a split. First, we fit a multinoulli model to the data in the leaf satisfying the test X_j < t by estimating the class-conditional probabilities Murphy (2012):

    \pi_c = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}(y_i = c),

where n is the number of data points in the leaf. Given this, there are several error measures for evaluating a proposed partition:
• Misclassification rate (error rate):

    H(\pi) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}(y_i \ne f(x_i)).

• Entropy:

    -\sum_{c=1}^{C} \pi_c \log \pi_c.

• Information gain:

    \mathrm{InfoGain} = H(\text{parent}) - H(\text{split}) = H(Y) - H(Y \mid X_j < t).

• Gini index, where \pi_c is the probability that a random entry in the leaf belongs to class c:

    \sum_{c=1}^{C} \pi_c (1 - \pi_c).

In the two-class case, where p = \pi(1), the misclassification rate is 1 - \max(p, 1 - p), the entropy is H_2(p), and the Gini index is 2p(1 - p). These are plotted in Figure 4. We can see that the cross-entropy and Gini measures are very similar, and are more sensitive to changes in class probability than the misclassification rate. Murphy (2012)

Figure 4: Node impurity measures for binary classification Murphy (2012)

Boosting
Boosting learns a strong classifier by stagewise combination of weak classifiers. The two key terms in this definition are "stagewise combining" and "weak classifier".

Weak Classifier
Suppose that the error rate for a weak classifier is only slightly better than chance (i.e., slightly below 1/2 in the binary case).
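To make the stagewise combination concrete, here is a minimal sketch of AdaBoost (named in the outline) using decision stumps, i.e., depth-1 trees, as the weak classifiers. The function names, the choice of stumps, the number of rounds, and the label convention y \in \{-1, +1\} are assumptions for illustration, not details taken from the lecture.

import numpy as np

def fit_stump(X, y, w):
    # Exhaustively pick the (feature, threshold, sign) stump with the
    # smallest weighted error; a stump is a one-split decision tree.
    n, d = X.shape
    best, best_err = None, np.inf
    for j in range(d):
        for t in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] <= t, sign, -sign)
                err = np.sum(w[pred != y])
                if err < best_err:
                    best, best_err = (j, t, sign), err
    return best, best_err

def stump_predict(stump, X):
    j, t, sign = stump
    return np.where(X[:, j] <= t, sign, -sign)

def adaboost_fit(X, y, n_rounds=10):
    # Stagewise combination: each round fits a weak classifier on the
    # reweighted data and adds it to the ensemble with coefficient alpha.
    n = len(y)
    w = np.full(n, 1.0 / n)                    # start with uniform example weights
    ensemble = []
    for _ in range(n_rounds):
        stump, err = fit_stump(X, y, w)
        err = np.clip(err, 1e-12, 1 - 1e-12)   # guard against division by zero / log(0)
        alpha = 0.5 * np.log((1 - err) / err)  # weight of this weak classifier
        pred = stump_predict(stump, X)
        w *= np.exp(-alpha * y * pred)         # up-weight the examples it got wrong
        w /= w.sum()
        ensemble.append((alpha, stump))
    return ensemble

def adaboost_predict(ensemble, X):
    # Weighted vote of the weak classifiers.
    score = sum(alpha * stump_predict(stump, X) for alpha, stump in ensemble)
    return np.sign(score)

With labels in {-1, +1}, adaboost_fit(X, y) returns a list of (alpha, stump) pairs, and adaboost_predict applies their weighted vote; each round concentrates the example weights on the points the current ensemble still gets wrong, which is the "stagewise" part of the definition above.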

