Classification and Regression by randomForest

by Andy Liaw and Matthew Wiener

Introduction

Recently there has been a lot of interest in "ensemble learning", that is, methods that generate many classifiers and aggregate their results. Two well-known methods are boosting (see, e.g., Schapire et al., 1998) and bagging (Breiman, 1996) of classification trees. In boosting, successive trees give extra weight to points incorrectly predicted by earlier predictors; in the end, a weighted vote is taken for prediction. In bagging, successive trees do not depend on earlier trees: each is constructed independently from a bootstrap sample of the data set, and in the end a simple majority vote is taken for prediction.

Breiman (2001) proposed random forests, which add an additional layer of randomness to bagging. In addition to constructing each tree using a different bootstrap sample of the data, random forests change how the classification or regression trees are constructed. In standard trees, each node is split using the best split among all variables. In a random forest, each node is split using the best among a subset of predictors randomly chosen at that node. This somewhat counterintuitive strategy turns out to perform very well compared to many other classifiers, including discriminant analysis, support vector machines and neural networks, and is robust against overfitting (Breiman, 2001). In addition, it is very user-friendly in the sense that it has only two parameters (the number of variables in the random subset at each node and the number of trees in the forest), and is usually not very sensitive to their values.

The randomForest package provides an R interface to the Fortran programs by Breiman and Cutler (available at http://www.stat.berkeley.edu/users/breiman/). This article provides a brief introduction to the usage and features of the R functions.

The algorithm

The random forests algorithm (for both classification and regression) is as follows:

1. Draw ntree bootstrap samples from the original data.

2. For each of the bootstrap samples, grow an unpruned classification or regression tree, with the following modification: at each node, rather than choosing the best split among all predictors, randomly sample mtry of the predictors and choose the best split from among those variables. (Bagging can be thought of as the special case of random forests obtained when mtry = p, the number of predictors.)

3. Predict new data by aggregating the predictions of the ntree trees (i.e., majority votes for classification, average for regression).

An estimate of the error rate can be obtained, based on the training data, as follows:

1. At each bootstrap iteration, predict the data not in the bootstrap sample (what Breiman calls "out-of-bag", or OOB, data) using the tree grown with the bootstrap sample.

2. Aggregate the OOB predictions. (On average, each data point is out-of-bag in about 36% of the trees, so each point receives a fair number of such predictions.)

3. Calculate the error rate, and call it the OOB estimate of the error rate.

Our experience has been that the OOB estimate of the error rate is quite accurate, given that enough trees have been grown (otherwise the OOB estimate can be biased upward; see Bylander (2002)).
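As a quick illustration of how the two tuning parameters enter the R interface (this sketch is not taken from the article; the iris data and the particular values of ntree and mtry are arbitrary choices):

> ## Illustrative sketch only; iris and the parameter values are
> ## arbitrary choices, not from the article.
> library(randomForest)
> set.seed(1)
> ## ntree = number of bootstrap samples / trees grown,
> ## mtry  = number of predictors tried at each split.
> rf <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)
> print(rf)   # reports the OOB estimate of the error rate
> ## With mtry equal to the number of predictors (4 here), the
> ## procedure reduces to bagging, as noted in step 2 above.
> bag <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 4)

The printed summary includes the OOB error estimate described in the procedure above, so no separate test set is needed for a first assessment.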
Extra information from Random Forests

The randomForest package optionally produces two additional pieces of information: a measure of the importance of the predictor variables, and a measure of the internal structure of the data (the proximity of different data points to one another).

Variable importance: This is a difficult concept to define in general, because the importance of a variable may be due to its (possibly complex) interaction with other variables. The random forest algorithm estimates the importance of a variable by looking at how much the prediction error increases when the (OOB) data for that variable are permuted while all others are left unchanged. The necessary calculations are carried out tree by tree as the random forest is constructed. (There are actually four different measures of variable importance implemented in the classification code; the reader is referred to Breiman (2002) for their definitions.)

Proximity measure: The (i, j) element of the proximity matrix produced by randomForest is the fraction of trees in which elements i and j fall in the same terminal node. The intuition is that "similar" observations should be in the same terminal nodes more often than dissimilar ones. The proximity matrix can be used to identify structure in the data (see Breiman, 2002) or for unsupervised learning with random forests (see below).

Usage in R

The user interface to randomForest is consistent with that of other classification functions such as nnet() (in the nnet package) and svm() (in the e1071 package). (We actually borrowed some of the interface code from those two functions.) There is a formula interface, and predictors can be specified as a matrix or data frame via the x argument, with responses as a vector via the y argument. If the response is a factor, randomForest performs classification; if the response is continuous (that is, not a factor), randomForest performs regression. If the response is unspecified, randomForest performs unsupervised learning (see below). Currently randomForest does not handle ordinal categorical responses. Note that categorical predictor variables must also be specified as factors (or else they will be wrongly treated as continuous).

The randomForest function returns an object of class "randomForest". Details on the components of such an object are provided in the online documentation. Methods provided for the class include predict and print. A compact sketch of the three calling modes appears after the classification example below.

A classification example

The Forensic Glass data set was used in Chapter 12 of MASS4 (Venables and Ripley, 2002) to illustrate various classification algorithms. We use it here to show how random forests work:

> library(randomForest)
> library(MASS)
> data(fgl)
> set.seed(17)
> fgl.rf <- randomForest(type ~ ., data = fgl,
+     mtry = 2, importance = TRUE,
+     do.trace = 100)
100: OOB error rate=20.56%
200: OOB error rate=21.03%
300: OOB error rate=19.63%
400: OOB error rate=19.63%
500: OOB error rate=19.16%
> print(fgl.rf)
Call:
 randomForest.formula(formula = type ~ .,
     data = fgl, mtry = 2, importance = TRUE,
     do.trace = 100)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of error rate: 19.16%
Confusion matrix:
      WinF WinNF Veh Con Tabl Head class.error
WinF    63     6   1   0    0    0   0.1000000
WinNF    9
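The extra information described above can be retrieved from the fitted object. The following is a minimal sketch, not part of the original example: the forest is refit with proximity = TRUE so that the proximity matrix is kept, and the helper functions named here (importance, varImpPlot, MDSplot) are those of the current randomForest package.

> ## Sketch only: refit the forensic-glass forest with proximity = TRUE
> ## so the proximity matrix is returned; otherwise as in the example.
> set.seed(17)
> fgl.rf2 <- randomForest(type ~ ., data = fgl, mtry = 2,
+     importance = TRUE, proximity = TRUE)
> ## Permutation-based variable importance; the exact layout of the
> ## returned matrix depends on the package version.
> round(importance(fgl.rf2), 2)
> varImpPlot(fgl.rf2)
> ## Proximity: fraction of trees in which observations i and j share
> ## a terminal node; a 214 x 214 matrix for the fgl data.
> dim(fgl.rf2$proximity)
> ## Visualise the structure via classical MDS on 1 - proximity.
> MDSplot(fgl.rf2, fgl$type)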

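Finally, here is the compact sketch of the three calling modes described under "Usage in R". The iris data set is used only as a placeholder and does not appear in the article.

> ## Sketch of the three calling modes; the data are placeholders.
> x <- iris[, 1:4]
> ## Factor response: classification.
> rf.cl <- randomForest(x = x, y = iris$Species)
> ## Numeric response: regression.
> rf.re <- randomForest(x = iris[, 2:4], y = iris$Sepal.Length)
> ## No response: unsupervised mode; proximities describe structure.
> rf.un <- randomForest(x = x, proximity = TRUE)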
