Guest written by Rebecca Njeri!
What is a Decision Tree?
Let’s start with a story. Suppose you have a business and you want to acquire some new customers. You also have a limited budget, and you want to ensure that, in advertising, you focus on customers who are the most likely to be converted.
How do you figure out who these people are? You need a classification algorithm that can identify these customers and one particular classification algorithm that could come in handy is the decision tree. A decision tree, after it is trained, gives a sequence of criteria to evaluate features of each new customer to determine whether they will likely be converted.
To start off, you can use data you already have on your existing customers to build a decision tree. Your data should include all the customers, their descriptive features, and a label that indicates whether they converted or not.
The idea of a decision tree is to divide the data set into smaller data sets based on the descriptive features until you reach a small enough set that contains data points that fall under one label.
Each feature of the data set becomes a root[parent] node, and the leaf[child] nodes represent the outcomes. The decision on which feature to split on is made based on resultant entropy reduction or information gain from the split.
Classification problems for decision trees are often binary-- True or False, Male or Female. However, decision trees can also be used to solve multi-class classification problems where the labels are [0, …, K-1], or for this example, [‘Converted customer’, ‘Would like more benefits’, ‘Converts when they see funny ads’, ‘Won’t ever buy our products’].
Using Continuous Variables to Split Nodes in a Decision Tree
Continuous features are turned to categorical variables (i.e. lesser than or greater than a certain value) before a split at the root node. Because there could be infinite boundaries for a continuous variable, the choice is made depending on which boundary will result in the most information gain.
For example if we wanted to classify quarterbacks versus defensive ends on the Seahawks team using weight, 230 pounds would probably be more appropriate as a boundary than 150 pounds. Trivial fact: the average weight of a quarterback is 225 pounds, while that of a defensive end is 255 pounds.
What is Entropy/Information Gain?
Shannon’s Entropy Model is a computational measure of the impurity of elements in the set. The goal of the decision tree is to result in a set that minimizes impurity. To go back to our story, we start with a set of the general population that may see our ad. The data set is then split on different variables until we arrive at a subset where everyone in that subset either buys the product or does not by the product. Ideally, after traversing our decision tree to the leaves, we should arrive at pure subset - every customer has the same label.
Advantages of Decision Trees
Disadvantages of Decision Trees
Pruning is a method of limiting tree depth to reduce overfitting in decision trees. There are two types of pruning: pre-pruning, and post-pruning.
Pre-pruning a decision tree involves setting the parameters of a decision tree before building it. There a few ways to do this:
To post-prune, validate the performance of the model on a test set. Afterwards, cut back splits that seem to result from overfitting noise in the training set. Pruning these splits dampens the noise in the data set.
*Post-pruning may result in overfitting the model
*Post-pruning is currently not available in Python’s scikit learn, but it’s available in R.
Creating ensembles involves aggregating the results of different models. Ensemble decision trees are used in bagging and random forests, while ensemble regression trees are used in boosting.
Bagging involves creating multiple decision trees each trained on a different bootstrap sample of the data. Because bootstrapping involves sampling with replacement, some of the data in the sample is left out of each tree.
Consequently, the decision trees created are made using different samples which solves the problem of overfitting to the training sample. Ensembling decision trees in this way helps reduce the total error because variance of the model continues to decrease with each new tree added without an increase in the bias of the ensemble.
A bag of decision trees that uses subspace sampling is referred to as a random forest. Only a selection of the features is considered at each node split which decorrelates the trees in the forest.
Another advantage of random forests is that they have an in-built validation mechanism. Because only a percentage of the data is used for each model, an out-of-bag error of the model’s performance can be calculated using the 37% of the sample left out of each model.
Boosting involves aggregating a collection of weak learners(regression trees) to form a strong predictor. A boosted model is built over time by adding a new tree into the model that minimizes the error by previous learners. This is done by fitting the new tree on the residuals of the previous trees.
If it isn’t clear thus far, for many real-world applications a single decision tree is not a preferable classification as it is likely to overfit and generalize very poorly to new examples. However, an ensemble of decision or regression trees minimizes the overfitting disadvantage and these models become stellar, state of the art classification and regression algorithms.
A Complete Tutorial on Tree Based Modeling from Scratch (in R & Python)
Fundamentals of Machine Learning for Predictive Data Analytics
A Guide To Designing A Data Science Project
Top 8 Python Programming Languages for Machine Learning
Basic Statistics For Data Scientists
We are a team of data scientists and network engineers who want to help your functional teams reach their full potential!