In 1854, John Snow, an English physician, created a map showing how cholera cases clustered along the streets served by a contaminated water pump in London. That map helped disprove the theory that miasma (bad air) was the cause of cholera, and it provided evidence for his alternative hypothesis: cholera was spread by microbes in the water. One of the more interesting data points he had to work with was the case of a woman who lived far from the clusters of the disease but had somehow contracted it. Snow discovered that she especially liked the water from that part of town and had it delivered to her.
Visualizing data through maps and graphs during early exploratory data analysis can give rise to hypotheses that can then be tested with further analysis and statistical work. It can reveal patterns that are not immediately obvious when looking at a data set. Fortunately, unlike in Snow’s time, tools now exist that automate and simplify exploratory data analysis. While some correlations have to be teased out through complex ML algorithms, others reveal themselves easily in the early stages of exploratory data analysis.
Below is a walkthrough of an exploratory data analysis of UN malnutrition data.
Data analysis begins with data. There are various ways to collect data, depending on your project. It may be as simple as downloading a CSV file from census or UN websites, or it may mean receiving access to an internal database from a partner organization or scraping the internet for data. A friend working on a skateboard trick identifier went skating with four friends and set up cameras at four different angles to get the training set of images he needed for his algorithm.
Depending on the amount of data you are working with, it may make sense to use a local hard drive or to use cloud storage. Even when you don’t have a lot of data, it may make sense to use cloud services so that you can learn the different stacks currently used in production.
Accessing the Data:
Exploratory Data Analysis
Use a notebook
Below is an EDA walkthrough of UNICEF undernutrition data. You may find it here.
Using Graphs for EDA, Global Malnutrition Case Study
What country has the highest malnutrition levels? What has been the malnutrition trend in this country?
The malnutrition graph above made me wonder what was happening in Georgia in 1991 and 1992, and I learnt that was when the Georgian Civil War occurred. This really piqued my interest because Kenya is in the middle of re-electing a president, a process that has led to ethnic conflict in the past. I plotted Kenya’s malnutrition graph and noticed that the peaks coincide with elections and post-election violence.
Although data sets vary in the number of columns and rows, the type of data they contain, the spread of the data, and more, basic EDA tools can provide an inroad into these data sets before more complex data analysis.
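As a small illustration of how little tooling the two questions above need, here is a minimal sketch that uses only the Python standard library. The column names (`country`, `year`, `underweight_pct`) and every value in it are made-up placeholders, not the real UNICEF data, whose layout will differ. In a notebook, pandas and a plotting library would be the more usual choice.

```python
import csv
import io
from collections import defaultdict

# Made-up placeholder rows; the real UNICEF file and its column names will differ.
RAW = """country,year,underweight_pct
Country A,1990,10.0
Country A,1995,12.5
Country A,2000,9.0
Country B,1990,20.0
Country B,1995,27.5
Country B,2000,22.0
Country C,1990,5.0
Country C,1995,4.5
Country C,2000,4.0
"""

# Group the (year, value) pairs by country.
series = defaultdict(list)
for row in csv.DictReader(io.StringIO(RAW)):
    series[row["country"]].append((int(row["year"]), float(row["underweight_pct"])))

# Question 1: which country has the highest recorded level?
peak_country = max(series, key=lambda c: max(value for _, value in series[c]))

# Question 2: what has the trend been in that country?
trend = sorted(series[peak_country])
print(peak_country)
for year, value in trend:
    # A crude text bar chart stands in for a real plot.
    print(year, value, "#" * round(value))
```

A peak in the printed trend is the kind of feature that prompts the follow-up question of what was happening in that country in that year.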
Python for Data Analysis
Guest written by Rebecca Njeri!
What is a Decision Tree?
Let’s start with a story. Suppose you have a business and you want to acquire some new customers. You also have a limited budget, and you want to ensure that, in advertising, you focus on customers who are the most likely to be converted.
How do you figure out who these people are? You need a classification algorithm that can identify these customers, and one that could come in handy is the decision tree. Once trained, a decision tree gives a sequence of criteria for evaluating the features of each new customer to determine whether they are likely to convert.
To start off, you can use data you already have on your existing customers to build a decision tree. Your data should include all the customers, their descriptive features, and a label that indicates whether they converted or not.
The idea of a decision tree is to divide the data set into smaller data sets based on the descriptive features until you reach a small enough set that contains data points that fall under one label.
Each internal node of the tree tests one feature of the data set: the first split is the root node, each split produces child nodes, and the leaf nodes at the bottom represent the outcomes. The feature to split on is chosen by the resulting reduction in entropy, also called the information gain of the split.
Classification problems for decision trees are often binary: True or False, Male or Female. However, decision trees can also be used to solve multi-class classification problems where the labels are [0, …, K-1], or for this example, [‘Converted customer’, ‘Would like more benefits’, ‘Converts when they see funny ads’, ‘Won’t ever buy our products’].
Using Continuous Variables to Split Nodes in a Decision Tree
Continuous features are turned into categorical variables (i.e. less than or greater than a certain value) before a split at a node. Because a continuous variable has infinitely many possible boundaries, the boundary chosen is the one that results in the most information gain.
For example, if we wanted to classify quarterbacks versus defensive ends on the Seahawks team using weight, 230 pounds would probably be a more appropriate boundary than 150 pounds. Trivia: the average weight of a quarterback is 225 pounds, while that of a defensive end is 255 pounds.
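The boundary search can be sketched in a few lines of Python. This is a minimal illustration, not how any particular library does it: the helper names `entropy` and `best_threshold` are made up for this example, the candidate boundaries are assumed to be the midpoints between consecutive distinct values, and the player weights are invented.

```python
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return -sum((c / n) * log2(c / n) for c in counts.values())

def best_threshold(values, labels):
    """Return (threshold, information_gain) for the best binary split."""
    parent = entropy(labels)
    n = len(values)
    distinct = sorted(set(values))
    best = (None, 0.0)
    # Candidate boundaries: midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        # Entropy after the split, weighted by the size of each side.
        weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        gain = parent - weighted
        if gain > best[1]:
            best = (t, gain)
    return best

# Hypothetical weights in pounds; invented values for illustration only.
weights = [215, 220, 225, 230, 250, 255, 260, 265]
positions = ["QB", "QB", "QB", "QB", "DE", "DE", "DE", "DE"]

threshold, gain = best_threshold(weights, positions)
print(threshold, gain)
```

On this toy data the search settles on the boundary that separates the two positions cleanly, while a boundary such as 150 pounds would leave every player on one side and yield no information gain at all.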
What is Entropy/Information Gain?
Shannon’s entropy is a measure of the impurity of the elements in a set. The goal of the decision tree is to produce subsets that minimize impurity. To go back to our story, we start with a set of the general population that may see our ad. The data set is then split on different variables until we arrive at a subset where everyone in that subset either buys the product or does not buy the product. Ideally, after traversing our decision tree to the leaves, we should arrive at a pure subset: every customer in it has the same label.
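For reference, the two quantities can be written as follows. This is the standard textbook form, with notation chosen here rather than taken from the post: $S$ is a set of examples, $p_k$ is the proportion of $S$ carrying label $k$, and $S_v$ is the subset of $S$ sent down branch $v$ by a split on feature $A$.

$$
H(S) = -\sum_{k} p_k \log_2 p_k
$$

$$
IG(S, A) = H(S) - \sum_{v} \frac{|S_v|}{|S|} H(S_v)
$$

A pure subset has $H(S) = 0$, and a two-label subset split evenly between its labels has $H(S) = 1$ bit.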
Advantages of Decision Trees
Disadvantages of Decision Trees
Pruning is a method of limiting tree depth to reduce overfitting in decision trees. There are two types of pruning: pre-pruning, and post-pruning.
Pre-pruning a decision tree involves setting the parameters of a decision tree before building it. There are a few ways to do this:
To post-prune, validate the performance of the model on a held-out validation set. Afterwards, cut back splits that seem to result from overfitting noise in the training set. Pruning these splits reduces the influence of that noise on the model.
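In scikit-learn, for example, such limits are passed to the tree’s constructor. The sketch below is only an illustration: the iris data set and the specific parameter values are arbitrary choices made for this example, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Unconstrained tree: grows until every leaf is pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Pre-pruned tree: growth limits are fixed before training.
pruned_tree = DecisionTreeClassifier(
    max_depth=3,           # cap the number of levels
    min_samples_split=10,  # do not split nodes holding fewer than 10 samples
    min_samples_leaf=5,    # every leaf must keep at least 5 samples
    random_state=0,
).fit(X, y)

print("full tree:  ", full_tree.get_depth(), "levels,", full_tree.get_n_leaves(), "leaves")
print("pruned tree:", pruned_tree.get_depth(), "levels,", pruned_tree.get_n_leaves(), "leaves")
```

The pruned tree gives up some fit to the training data in exchange for a simpler model, so suitable values are usually chosen by cross-validation rather than by hand.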
*Post-pruning may result in overfitting the model
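*Post-pruning was long unavailable in Python’s scikit-learn, but since version 0.22 it is offered as minimal cost-complexity pruning through the ccp_alpha parameter. It is also available in R.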
Creating ensembles involves aggregating the results of different models. Ensemble decision trees are used in bagging and random forests, while ensemble regression trees are used in boosting.
Bagging involves creating multiple decision trees, each trained on a different bootstrap sample of the data. Because bootstrapping samples with replacement, some rows are drawn more than once and others are left out of each tree’s sample.
Consequently, the trees are built from different samples, which reduces overfitting to any one training sample. Ensembling decision trees in this way helps reduce the total error because the variance of the model decreases as trees are added, with diminishing returns, without an increase in the bias of the ensemble.
A bag of decision trees that uses subspace sampling is referred to as a random forest. Only a selection of the features is considered at each node split, which decorrelates the trees in the forest.
Another advantage of random forests is that they have an in-built validation mechanism. Because each tree sees only part of the data, an out-of-bag error estimate of the model’s performance can be calculated using the roughly 37% of the rows left out of each tree’s bootstrap sample.
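The 37% figure comes from the bootstrap itself: when n rows are sampled with replacement n times, a given row is missed with probability (1 - 1/n)^n, which approaches 1/e (about 0.368) as n grows. The short simulation below checks this using only the standard library. The function name and the sample sizes are arbitrary choices for this sketch.

```python
import random

def out_of_bag_fraction(n_samples, rng):
    """Draw one bootstrap sample and return the share of rows it never picked."""
    drawn = {rng.randrange(n_samples) for _ in range(n_samples)}
    return 1 - len(drawn) / n_samples

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10_000

# Average the left-out share over 20 independent bootstrap samples.
fractions = [out_of_bag_fraction(n, rng) for _ in range(20)]
average = sum(fractions) / len(fractions)

theoretical = (1 - 1 / n) ** n
print("simulated:  ", round(average, 3))
print("theoretical:", round(theoretical, 3))
```

Those left-out rows act as a free validation set for each tree, which is what makes the out-of-bag estimate possible without a separate hold-out split.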
Boosting involves aggregating a collection of weak learners (regression trees) to form a strong predictor. A boosted model is built sequentially: each new tree is added to reduce the error left by the previous learners. This is done by fitting the new tree to the residuals of the previous trees.
If it isn’t clear thus far, for many real-world applications a single decision tree is not the preferred classifier, as it is likely to overfit and generalize very poorly to new examples. However, an ensemble of decision or regression trees minimizes the overfitting disadvantage, and these models become stellar, state-of-the-art classification and regression algorithms.
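The residual-fitting loop can be sketched from scratch in plain Python. This is a simplified illustration under stated assumptions, not a production algorithm: the weak learner is a one-split regression tree (a stump) on a single feature, the loss is squared error, and the names `fit_stump` and `boost`, the learning rate, and the synthetic data are all invented for this example.

```python
def fit_stump(xs, targets):
    """Fit a one-split regression tree (stump) that minimises squared error."""
    best = None
    distinct = sorted(set(xs))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean

def boost(xs, ys, n_trees=50, learning_rate=0.1):
    """Build an additive model by fitting each new stump to the residuals."""
    base = sum(ys) / len(ys)          # start from the mean prediction
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_trees):
        # Residuals: what the current ensemble still gets wrong.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Add a shrunken copy of the new stump to the running prediction.
        predictions = [p + learning_rate * stump(x) for p, x in zip(predictions, xs)]

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)
    return predict

def mean_squared_error(predict, xs, ys):
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic data for illustration: y = x squared on a small grid.
xs = [i / 10 for i in range(40)]
ys = [x * x for x in xs]

mean_y = sum(ys) / len(ys)
baseline_error = mean_squared_error(lambda x: mean_y, xs, ys)
model = boost(xs, ys)
boosted_error = mean_squared_error(model, xs, ys)
print(baseline_error, boosted_error)
```

Each stump on its own is a poor predictor, but because every new one targets what the ensemble still gets wrong, the training error keeps falling as stumps are added. The learning rate slows that process down so that no single tree dominates.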