
A Guide to Data Wrangling for Data Science Projects

9/18/2017


In 1854, John Snow, an English physician, created a geospatial map showing the spread of cholera along streets served by a contaminated water pump in London. That map helped disprove the theory that miasma (bad air) was the cause of cholera, and it provided evidence for his alternative hypothesis: cholera was spread by microbes in the water. One of the more interesting data points he had to work with was the case of a woman who lived farther away from the clusters of the disease but had somehow contracted it. Snow discovered that she especially liked the water from that part of town and would have it delivered to her.


[Figure: John Snow's cholera map]
Visualizing data through maps and graphs during early exploratory data analysis can give rise to hypotheses that can then be tested with further analysis and statistical work. It can reveal patterns that are not immediately obvious when looking at a data set. Fortunately, unlike in Snow's day, there now exist tools that automate and simplify exploratory data analysis. While some correlations have to be teased out through complex ML algorithms, others reveal themselves easily in the early stages of exploratory data analysis.

Below is a walkthrough of an exploratory data analysis of UN malnutrition data.

Data Collection:

Data analysis begins with data. There are various ways to collect data depending on your project. Collection may be as simple as downloading a CSV file from the census or UN websites, receiving access to an internal database from a partner organization, or scraping the internet for data. A friend working on a skateboard trick identifier went skating with four friends and set up cameras at four different angles to get the training set of images he needed for his algorithm.

Data Storage:

Depending on the amount of data you are working with, it may make sense to use a local hard drive or cloud storage. Even when you don't have a lot of data, it may make sense to use cloud services so you can learn the different stacks currently used in production.

Accessing the Data:
  • Write SQL queries to extract data from tables in a relational database (see the sketch after this list)
  • Load the data into a pandas dataframe using IPython or Jupyter notebooks
  • Wrangle the data with R
  • Query distributed data with SparkSQL
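A minimal sketch of the first two options, assuming a SQLite database file named customers.db containing a table called customers (both names are hypothetical):

    import sqlite3
    import pandas as pd

    # Connect to a local SQLite database (hypothetical file name)
    conn = sqlite3.connect("customers.db")

    # Write a SQL query to extract the rows of interest
    query = "SELECT * FROM customers WHERE signup_year >= 2015"

    # Load the query results directly into a pandas dataframe
    df = pd.read_sql(query, conn)
    conn.close()
    print(df.head())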

Exploratory Data Analysis:

Use a notebook. Below is an EDA walkthrough of UNICEF undernutrition data. You may find it here.
  • Load the data into a pandas dataframe
  • Develop an understanding of the structure of the data set. Get summary statistics of the data set. What is the mean, median, and mode of the data? What are the maximum and minimum values of a given variable? What is the spread of the data? (A sketch covering the steps in this list follows below.)
[Figure: summary statistics of the data set]
  • Are there any missing values in the data set? What percentage of data is missing for a given variable? Is this a data entry error? Might there be a correlation between these missing values and the dependent variable? Is there a value that could be used to accurately replace the missing values?
  • Are all your data types in the expected format? For example, are any of the numeric variables stored as strings instead of floats or integers? Are the data structures in a form that is compatible with your model? For example, are there string categorical variables that should be converted to numeric dummy variables before being fed into the model?
[Figure: checking and converting data types]
  • Use different kinds of plots to visualize the data. Python’s matplotlib library is great for data visualizations. Visualization will sometimes reveal insightful patterns/trends in the data. It will also help pinpoint any existing outliers in the data set.
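A minimal sketch of the steps in this list with pandas and matplotlib, assuming the UNICEF data has been saved locally as undernutrition.csv and that the column names used below exist (both are assumptions):

    import pandas as pd
    import matplotlib.pyplot as plt

    # Load the data into a pandas dataframe (hypothetical file name)
    df = pd.read_csv("undernutrition.csv")

    # Summary statistics: mean, spread, min/max of each numeric column
    print(df.describe())

    # Fraction of missing values per column
    print(df.isnull().mean())

    # Check that each column has the expected data type
    print(df.dtypes)

    # Plot a variable over time to look for trends and outliers
    df.groupby("year")["malnutrition_pct"].mean().plot()  # hypothetical columns
    plt.show()

    # Convert string categorical variables to numeric dummy variables
    df = pd.get_dummies(df, columns=["country"])  # hypothetical column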

Using Graphs for EDA, Global Malnutrition Case Study
  • Average malnutrition has been steadily decreasing over the last two decades.
[Figure: average malnutrition over the last two decades]
What country has the highest malnutrition levels? What has been the malnutrition trend in this country?
[Figure: countries with the highest malnutrition levels]
[Figure: Georgia's malnutrition trend]
The malnutrition graph above made me wonder what was happening in Georgia in 1991 and 1992, and I learned that this was when the Georgian Civil War occurred. That piqued my interest, because Kenya is in the middle of a presidential election, which in the past has led to ethnic conflict. I plotted Kenya's malnutrition graph, shown below with a sketch of the code, and noticed that the peaks coincide with elections and post-election violence.
[Figure: Kenya's malnutrition trend]
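A hedged sketch of how a plot like this could be produced (the file and column names are assumptions carried over from the earlier sketch):

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("undernutrition.csv")  # hypothetical file name

    # Filter to a single country and plot its trend over time
    kenya = df[df["country"] == "Kenya"].sort_values("year")
    plt.plot(kenya["year"], kenya["malnutrition_pct"])  # hypothetical columns
    plt.xlabel("Year")
    plt.ylabel("Malnutrition (%)")
    plt.title("Malnutrition trend in Kenya")
    plt.show()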
Although data sets will vary in the number of columns and rows, the type of data contained, and the spread of the data, among other things, basic EDA tools can provide an inroad into a data set before more complex analysis begins.

Additional Resources:
Python for Data Analysis
Great Future Data Science Reads!
A Guide To Designing Data Science Projects
How Machine Learning Algorithms Learn Bias
8 Great Python Libraries For Machine Learning
Basic Data Science And Statistics That Every Data Scientists Should Know
Why Use Data Science?


A Brilliant Explanation of Decision Tree Algorithms

9/2/2017


Guest written by Rebecca Njeri!

What is a Decision Tree?

Let’s start with a story. Suppose you have a business and you want to acquire some new customers. You also have a limited budget, and you want to ensure that, in advertising, you focus on customers who are the most likely to be converted.

How do you figure out who these people are? You need a classification algorithm that can identify these customers, and one classification algorithm that could come in handy is the decision tree. A decision tree, once trained, gives a sequence of criteria for evaluating the features of each new customer to determine how likely they are to convert.

To start off, you can use data you already have on your existing customers to build a decision tree. Your data should include all the customers, their descriptive features, and a label that indicates whether they converted or not.

The idea of a decision tree is to divide the data set into smaller data sets based on the descriptive features until you reach a small enough set that contains data points that fall under one label.

Each internal (parent) node of the tree tests a feature of the data set, and the leaf (child) nodes represent the outcomes. The decision on which feature to split on is made based on the resultant entropy reduction, or information gain, from the split.
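A minimal sketch of that training step with scikit-learn's DecisionTreeClassifier, using made-up customer features and a converted/not-converted label (all names and numbers here are hypothetical):

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    # Hypothetical historical customer data with a conversion label
    data = pd.DataFrame({
        "age": [25, 34, 45, 23, 52, 40],
        "visits_per_month": [10, 3, 8, 1, 6, 2],
        "converted": [1, 0, 1, 0, 1, 0],
    })

    X = data[["age", "visits_per_month"]].values
    y = data["converted"]

    # Fit a decision tree; splits are chosen by information gain ("entropy")
    tree = DecisionTreeClassifier(criterion="entropy")
    tree.fit(X, y)

    # Evaluate a new customer against the learned sequence of criteria
    print(tree.predict([[30, 7]]))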

[Figure: an example decision tree]

Classification problems for decision trees are often binary: True or False, Male or Female. However, decision trees can also be used to solve multi-class classification problems where the labels are [0, …, K-1], or, for this example, [‘Converted customer’, ‘Would like more benefits’, ‘Converts when they see funny ads’, ‘Won’t ever buy our products’].

Using Continuous Variables to Split Nodes in a Decision Tree

Continuous features are turned into binary categorical variables (i.e., less than or greater than a certain value) before a split at a node. Because there could be infinitely many candidate boundaries for a continuous variable, the choice is made based on which boundary will result in the most information gain.

For example, if we wanted to classify quarterbacks versus defensive ends on the Seahawks using weight, 230 pounds would probably be a more appropriate boundary than 150 pounds. Trivial fact: the average weight of a quarterback is 225 pounds, while that of a defensive end is 255 pounds. The sketch below makes this concrete with made-up numbers.
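Here is a hedged sketch of how that boundary could be chosen by information gain, using invented weights and labels rather than real roster data:

    import numpy as np

    # Shannon entropy of an array of class labels
    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    # Invented weights (lbs) and positions: QB = quarterback, DE = defensive end
    weights = np.array([215, 220, 225, 228, 250, 255, 260, 270])
    labels = np.array(["QB", "QB", "QB", "QB", "DE", "DE", "DE", "DE"])

    parent = entropy(labels)
    best_gain, best_boundary = 0.0, None
    for boundary in (150, 230):  # candidate boundaries
        left = labels[weights < boundary]
        right = labels[weights >= boundary]
        child = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if parent - child > best_gain:
            best_gain, best_boundary = parent - child, boundary
    print(best_boundary)  # 230 yields the larger information gain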

[Figure: weight boundary separating quarterbacks from defensive ends]

What is Entropy/Information Gain?


Shannon’s entropy model is a measure of the impurity of the elements in a set. The goal of the decision tree is to produce subsets that minimize impurity. To go back to our story, we start with a set of the general population that may see our ad. The data set is then split on different variables until we arrive at a subset where everyone either buys the product or does not buy the product. Ideally, after traversing our decision tree to the leaves, we should arrive at a pure subset: every customer has the same label.
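For reference, the standard formula: for a set S whose classes occur in proportions p_1, …, p_k, the entropy is

    H(S) = − Σ_i p_i · log2(p_i)

A pure subset (one class) has entropy 0, while a 50/50 split of two classes has entropy 1 bit. Information gain is the parent set’s entropy minus the weighted average entropy of the child subsets produced by a split.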

Advantages of Decision Trees

  • Decision trees are easy to interpret.
  • Building a decision tree requires little data preparation from the user; for example, there is no need to normalize the data.

Disadvantages of Decision Trees

  • Decision trees are likely to overfit noisy data. The probability of overfitting on noise increases as a tree gets deeper.

Pruning

Pruning is a method of limiting tree depth to reduce overfitting in decision trees. There are two types of pruning: pre-pruning, and post-pruning.

Pre-pruning

Pre-pruning a decision tree involves setting the parameters of a decision tree before building it. There are a few ways to do this (see the sketch after this list):
  • Set maximum tree depth
  • Set maximum number of terminal nodes
  • Set minimum samples for a node split:
    • Controls the size of the resultant terminal nodes
  • Set maximum number of features
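A sketch of pre-pruning with scikit-learn's DecisionTreeClassifier, which exposes a parameter for each of these settings (the specific values below are arbitrary, and X_train and y_train stand in for your own data):

    from sklearn.tree import DecisionTreeClassifier

    # Pre-pruning: constrain the tree before it is grown
    tree = DecisionTreeClassifier(
        max_depth=5,           # maximum tree depth
        max_leaf_nodes=20,     # maximum number of terminal nodes
        min_samples_split=50,  # minimum samples required to split a node
        max_features=4,        # maximum number of features considered per split
    )
    # tree.fit(X_train, y_train)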

Post-pruning
To post-prune, validate the performance of the model on a test set. Afterwards, cut back splits that seem to result from overfitting noise in the training set. Pruning these splits dampens the noise in the data set.
*Post-pruning against a single validation set can itself overfit that validation set.
*Post-pruning is currently not available in Python’s scikit-learn, but it is available in R.


Ensembles

Creating ensembles involves aggregating the results of different models. Ensembles of decision trees are used in bagging and random forests, while ensembles of regression trees are used in boosting.

Bagging/Bootstrap aggregating

Bagging involves creating multiple decision trees, each trained on a different bootstrap sample of the data. Because bootstrapping involves sampling with replacement, some of the data is left out of each tree’s sample.

Consequently, the decision trees are built from different samples, which mitigates the problem of overfitting to a single training sample. Ensembling decision trees in this way helps reduce the total error, because the variance of the model continues to decrease with each new tree without an increase in the bias of the ensemble.
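One hedged way to do this with scikit-learn is BaggingClassifier, which handles the bootstrap sampling itself (X_train and y_train are placeholders for your own data):

    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier

    # Train 100 trees, each on a different bootstrap sample of the data
    bagger = BaggingClassifier(
        DecisionTreeClassifier(),
        n_estimators=100,
        bootstrap=True,  # sample with replacement
    )
    # bagger.fit(X_train, y_train)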

Random Forest
A bag of decision trees that uses subspace sampling is referred to as a random forest. Only a subset of the features is considered at each node split, which decorrelates the trees in the forest.

Another advantage of random forests is that they have an in-built validation mechanism. Because only a portion of the data is used for each model, an out-of-bag estimate of the model’s performance can be calculated using the roughly 37% of the sample left out of each tree.
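A sketch with scikit-learn's RandomForestClassifier, where oob_score=True turns on that in-built validation (again, X_train and y_train are placeholders):

    from sklearn.ensemble import RandomForestClassifier

    forest = RandomForestClassifier(
        n_estimators=100,
        max_features="sqrt",  # subspace sampling: features considered per split
        oob_score=True,       # score each tree on the ~37% of data it never saw
    )
    # forest.fit(X_train, y_train)
    # print(forest.oob_score_)  # out-of-bag estimate of accuracy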

Boosting

Boosting involves aggregating a collection of weak learners (typically shallow regression trees) to form a strong predictor. A boosted model is built over time by adding new trees that minimize the error made by the previous learners. This is done by fitting each new tree on the residuals of the previous trees.
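A sketch using scikit-learn's GradientBoostingRegressor, which builds a model stagewise in this fashion (the parameter values are arbitrary and the data names are placeholders):

    from sklearn.ensemble import GradientBoostingRegressor

    # Each new shallow tree is fit on the residuals of the ensemble so far
    booster = GradientBoostingRegressor(
        n_estimators=200,   # number of trees added over time
        learning_rate=0.1,  # how much each new tree corrects the previous error
        max_depth=3,        # weak learners: shallow trees
    )
    # booster.fit(X_train, y_train)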

If it isn’t clear thus far: for many real-world applications, a single decision tree is not a preferable classifier, as it is likely to overfit and generalize poorly to new examples. However, an ensemble of decision or regression trees minimizes this disadvantage, and such ensembles are stellar, state-of-the-art classification and regression algorithms.
Additional Resources:
Learning Data Science, Our Favorite Books, Videos And Courses
Statistics And R Specialization With Coursera
A Complete Tutorial on Tree Based Modeling from Scratch (in R & Python)
Fundamentals of Machine Learning for Predictive Data Analytics
A Guide To Designing A Data Science Project
Top 8 Python Programming Languages for Machine Learning
Basic Statistics For Data Scientists

