UT Knoxville STAT 201 - Decision Trees

This preview shows page 1-2-21-22 out of 22 pages.



1. Decision Trees

2. A World of Data
Companies have been collecting information on variables of interest for years, creating huge data sets. Baseball teams have data on players and prospective players. Grocery stores have data on the buying habits of their consumers. Colleges have data on their students. If you’re in business, you’re in the business of data.

3. Why Collect Data?
Oftentimes there is a question we wish to answer. Will a player be successful in the majors? How can we increase the average sale in our grocery store? Why do students fail out of college? The answers to these questions (and more) might be contained within data.

4. What is a Decision Tree?
A decision tree is a graphical display of data being segmented. After multiple segmentations are created, the graph begins to resemble a tree. Computers use complex algorithms to find the best splits in the data. Understanding and interpreting these splits is the job of the statistician.

5. Decision Trees (Continued)
Decision trees can handle both categorical and quantitative data. Identifier variables should not be used in decision trees because they have too many levels. Some categorical variables act like identifier variables if they have a lot of levels. These should also be excluded from the decision tree.

6. Sleuthing Through the Data
Statistical software contains powerful tools to mine through data and find possible relationships. The more data we collect, the more tests we can run. The more tests we run, the more likely we are to find results. The more results we find, the more decisions we will make. More decisions mean a higher chance of a Type I or Type II error.

7. Creating the Decision Tree
Our y variable is the variable of interest that we wish to explain (the response variable).
Our x variables are the explanatory variables. The decision tree allows us to enter multiple x variables to try to explain the one y variable. Variables are entered one by one as we create splits.

8. Decision Trees in JMP
The partitioning tool is used to make decision trees in JMP. This can be found under Analyze -> Modeling -> Partition.

9. Decision Trees in JMP
Next we need to add in our variables. The picture below shows a basic decision tree trying to describe gender.

10. Decision Trees in JMP
The decision tree creates partitions in the data, dividing it on the most significant explanatory variable. The data shows there are 730 people in our sample and slightly over half are female.

11. The First Split
The first split is always on the most significant explanatory variable. It will often explain the greatest percent of variation in our y variable. A split on height at 70 inches (5 feet 10 inches) explains 43.7% of the variation in gender.

12. The End of the Tree
There are two possible ways a decision tree ends: (1) after enough splits, we run out of the data required to create another split, or (2) we decide to stop creating splits. As mentioned before, the more splits we create, the less likely they are to involve significant explanatory variables. The picture to the right shows the 15th and 18th splits trying to predict gender. Do you think views on tailgating and Gangnam Style are truly significant?

13. Additional Splits
Each additional split usually explains less and less variation in our y variable. Two reasons account for this: (1) there is less total variation left to explain in y, and (2) the variables are usually less significant. Look at the R² value on the right as we create more splits.

14. Checking the R²
We have to make our own judgment about when to end.
When the increases in R² are tiny, we need to stop making splits.

15. Interpreting the Tree
The interpretation is very similar to our regression interpretation:
____R²____% of the variation in ____y____ is explained by ____the variables in the tree____.
R² describes the variation in y explained by the x's.
y is the response variable.
The x's are the explanatory variables.
This interpretation can get very long if we have a lot of variables in the model.

16. Interpreting the Splits
Under the red arrow, click “Leaf Report”. The report shows us the percentage and count within each split. Each split is labeled with the details of the split. For example, 79.98% of people who are between 67 and 70 inches tall and own jorts (jean shorts) are female.

17. Saving Predictions from Tree
We can use the decision tree to save predictions for the y variable. A quantitative y variable will save a predicted value. A categorical y variable will save a probability for each level of the y variable.

18. Things to Consider
Creating a good model starts with collecting the right data. New splits are contained within old splits. The original split creates a path for the decision tree to follow. Sometimes it is best to create multiple trees with different first splits to get an idea of what truly governs your variable of interest.

19. Example in Business
Many businesses have massive data sets. The data set to the left has 1152 variables! The data is financial data from a bank that has been coded. A decision tree allows us to find out which x variable explains the most variation in y.

20. What Leads to Deposits?
We want to explain variation in total deposits. In the past, statisticians had to run each test individually. In moments, we’ve found a variable that explains 60.5% of the variation in total deposits.

21. Examining the R²
Examining the increase in R² can help us decide how many splits we need. The increase after the third split is very small.
This seems like the split to stop on.

22. Using the Model
Output for our final model is displayed below. What does R² tell us? Why would this model be useful to a bank? How could we create a better
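
The ideas in "The First Split", "Additional Splits", and "Checking the R²" can be made concrete with a small sketch. The Python code below is only an illustration and is not JMP's Partition algorithm, whose exact splitting criterion is not described in these slides. It assumes a quantitative y and a single quantitative x, and it picks the cut point that leaves the least variation within the two groups, which is the same as maximizing R². The function names `sum_sq`, `best_split`, and `splits_to_keep`, the 0.01 cutoff, and all data values are hypothetical.

```python
from statistics import mean

def sum_sq(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_split(x, y):
    """Return (cut, r2): the cut point on x that explains the most variation in y.

    Rows with x < cut go to the left group, rows with x >= cut go to the right.
    R^2 = 1 - (within-group sum of squares) / (total sum of squares).
    """
    total = sum_sq(y)
    if total == 0:                      # y never varies, so there is nothing to explain
        return None, 0.0
    best_cut, best_r2 = None, 0.0
    for cut in sorted(set(x))[1:]:      # skip the smallest x: the left group would be empty
        left = [yi for xi, yi in zip(x, y) if xi < cut]
        right = [yi for xi, yi in zip(x, y) if xi >= cut]
        r2 = 1 - (sum_sq(left) + sum_sq(right)) / total
        if r2 > best_r2:
            best_cut, best_r2 = cut, r2
    return best_cut, best_r2

def splits_to_keep(r2_by_split, min_gain=0.01):
    """Count splits until the first one whose R^2 increase falls below min_gain.

    min_gain is an assumed cutoff; the slides leave this to the analyst's judgment.
    """
    kept, previous = 0, 0.0
    for r2 in r2_by_split:
        if r2 - previous < min_gain:
            break
        kept += 1
        previous = r2
    return kept

# Hypothetical data: x = years of experience, y = salary in $1000s
x = [1, 2, 3, 4, 10, 11, 12, 13]
y = [30, 32, 31, 33, 60, 62, 61, 63]
cut, r2 = best_split(x, y)
print(cut, round(r2, 3))                # 10 0.994

# Hypothetical cumulative R^2 after each of five splits
print(splits_to_keep([0.60, 0.68, 0.71, 0.712, 0.713]))   # 3
```

In the second example the fourth split adds only 0.002 to R², so the sketch keeps three splits, mirroring the "stop when the increases are tiny" guideline.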

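"Saving Predictions from Tree" says a quantitative y saves a prediction and a categorical y saves a probability for each level. The Python sketch below shows what those saved values could look like for the rows that land in a single leaf. It assumes the prediction is the leaf mean and the probabilities are the raw leaf proportions; the slides do not state how JMP computes them, and JMP's saved probabilities may be adjusted somewhat differently. The function names and data are hypothetical.

```python
from collections import Counter
from statistics import mean

def leaf_prediction(y_values):
    """Predicted value for a quantitative y: the mean of y among rows in the leaf."""
    return mean(y_values)

def leaf_probabilities(y_labels):
    """Estimated probability of each level of a categorical y: its share of the leaf."""
    counts = Counter(y_labels)
    n = len(y_labels)
    return {level: count / n for level, count in counts.items()}

# Hypothetical leaf containing five rows
print(leaf_prediction([10, 12, 14, 16, 18]))           # 14
print(leaf_probabilities(["F", "F", "F", "F", "M"]))   # {'F': 0.8, 'M': 0.2}
```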

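The warning in "Sleuthing Through the Data" that more tests lead to more errors can be quantified under simplifying assumptions. The Python sketch below assumes every test is independent, every null hypothesis is actually true, and each test uses the same significance level alpha; under those assumptions the chance of at least one Type I error across k tests is 1 - (1 - alpha)^k. The function name and the choice of alpha = 0.05 are illustrative, not taken from the slides.

```python
def chance_of_false_positive(num_tests, alpha=0.05):
    """P(at least one Type I error) across independent tests when every null is true."""
    return 1 - (1 - alpha) ** num_tests

for k in (1, 10, 50):
    print(k, round(chance_of_false_positive(k), 3))
```

With alpha = 0.05, ten such tests already carry roughly a 40% chance of at least one false positive, which is why splits found deep in a tree built from many candidate variables deserve skepticism.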