Acheron Analytics
  • Home
  • Who We Are
  • Services
    • All Data Science Services
    • Fraud and Anomaly Detection
    • Data Engineering And Automation
    • Healthcare Policy/Program ROI Engine
    • Data Analytics As A Service
    • Data Science Trainings >
      • Python, SQL and R Trainings
      • ARIMA And Predictive Model Forecasting
  • Contact
  • Acheron Blog
  • Partners

A Guide to Data Wrangling for Data Science Projects

9/18/2017


More than 160 years ago, in 1854, the English physician John Snow drew a map showing cholera cases clustered along the streets served by a contaminated water pump in London. That map helped disprove the theory that miasma, or "bad air", was the cause of cholera, and it provided evidence for his alternative hypothesis: cholera was spread through contaminated water. One of the more interesting data points he had to work with was the case of a woman who lived far from the clusters of the disease but had somehow contracted it. Snow discovered that she especially liked the water from that part of town and had it delivered to her.


Visualizing data through maps and graphs during early exploratory data analysis (EDA) can give rise to hypotheses that can then be tested with further analysis and statistical work. It can reveal patterns that are not immediately obvious from looking at the raw data set. Fortunately, unlike in Snow's time, there are now tools that automate and simplify exploratory data analysis. While some correlations have to be teased out with complex machine learning algorithms, others reveal themselves easily in the early stages of exploratory data analysis.

Below is a walkthrough of an exploratory data analysis of UN malnutrition data.

Data Collection:

Data analysis begins with data. There are various ways to collect data, depending on your project. It may be as simple as downloading a CSV file from the census or UN websites, or it may involve getting access to an internal database from a partner organization or scraping the internet for data. Sometimes you have to create the data yourself: a friend working on a skateboard trick identifier went skating with four friends and set up cameras at four different angles to get the training set of images he needed for his algorithm.

Data Storage:

Depending on the amount of data you are working with, it may make sense to use a local hard drive or cloud storage. Even when you don't have a lot of data, using cloud services can be worthwhile because it lets you learn the stacks currently used in production.

Accessing the Data:
  • Write SQL queries to extract data from tables in a relational database
  • Load the data into a pandas DataFrame using IPython or Jupyter Notebooks
  • Wrangle the data with R
  • Query the data with Spark SQL
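
As a minimal sketch of the first two options combined, the snippet below runs a SQL query and loads the result straight into a pandas DataFrame. It uses an in-memory SQLite database so it can run anywhere; the table name `malnutrition`, its columns, and the values in it are made up for illustration and are not the real UN figures.

```python
import sqlite3

import pandas as pd

# Hypothetical table and values, for illustration only (not real UN data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE malnutrition (country TEXT, year INTEGER, underweight_pct REAL)"
)
conn.executemany(
    "INSERT INTO malnutrition VALUES (?, ?, ?)",
    [("A", 2000, 20.0), ("A", 2005, 18.0), ("B", 2000, 10.0), ("B", 2005, None)],
)
conn.commit()

# Extract only the rows and columns needed, then hand them to pandas.
query = """
    SELECT country, year, underweight_pct
    FROM malnutrition
    WHERE year >= ?
    ORDER BY country, year
"""
df = pd.read_sql_query(query, conn, params=(2000,))
conn.close()

print(df.head())
```

With a production database you would swap the SQLite connection for one to your own server; the query-then-DataFrame pattern stays the same.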

Exploratory Data Analysis: Use a Notebook
Below is an EDA walkthrough of UNICEF undernutrition data. You may find it here.
  • Load the data into a pandas dataframe
  • Develop an understanding of the structure of the data set. Get summary statistics: what are the mean, median, and mode of each variable? What are its maximum and minimum values? What is the spread of the data?
  • Are there any missing values in the data set? What percentage of data is missing for a given variable? Is this a data entry error? Might there be a correlation between these missing values and the dependent variable? Is there a value that could be used to accurately replace the missing values?
  • Are all your data types in the expected format? For example, are any of the numeric variables stored as strings instead of floats or integers? Are the data structures in a form that is compatible with your model? For example, are there string categorical variables that should be converted to numeric dummy variables before being fed into the model?
  • Use different kinds of plots to visualize the data. Python's matplotlib library is well suited to this. Visualization will sometimes reveal insightful patterns or trends in the data. It will also help pinpoint any outliers in the data set.
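
The checklist above can be sketched in pandas roughly as follows. This is only an illustration: the small DataFrame stands in for the real file, and its column names (`country`, `region`, `year`, `underweight_pct`) and values are assumptions, not the UNICEF data. Filling missing values with the median is one possible choice, not a recommendation for every data set.

```python
import pandas as pd

# Made-up sample standing in for the real file; column names are assumptions.
raw = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C"],
    "region": ["East", "East", "West", "West", "East"],
    "year": [2000, 2005, 2000, 2005, 2000],
    # Numeric values that arrived as strings, with one gap and one bad entry.
    "underweight_pct": ["20.0", "18.5", "10.0", None, "n/a"],
})

# Structure of the data set and summary statistics.
print(raw.shape)
print(raw.dtypes)
print(raw.describe(include="all"))

# Data types: convert strings to numbers; unparseable entries become NaN.
df = raw.copy()
df["underweight_pct"] = pd.to_numeric(df["underweight_pct"], errors="coerce")

# Missing values: percentage missing per column.
missing_pct = df.isna().mean() * 100
print(missing_pct)

# Spread of a single variable.
print(df["underweight_pct"].agg(["min", "max", "mean", "median", "std"]))

# One possible replacement value for the gaps: the column median.
df["underweight_filled"] = df["underweight_pct"].fillna(df["underweight_pct"].median())

# Categorical strings to numeric dummy variables for a model.
model_ready = pd.get_dummies(df, columns=["region"])
print(model_ready.columns.tolist())
```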

Using Graphs for EDA, Global Malnutrition Case Study
  • Average malnutrition has been steadily decreasing over the last two decades.
Picture
  • Which country has the highest malnutrition levels? What has the malnutrition trend been in this country?
Picture
Picture
The malnutrition graph above made me wonder what was happening in Georgia in 1991 and 1992, and I learnt that this was when the Georgian Civil War took place. This piqued my interest because Kenya is in the middle of a presidential election, and past elections there have led to ethnic conflict. I plotted Kenya's malnutrition graph and noticed that the peaks coincide with elections and post-election violence.
Picture
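
Plots like the ones in this case study can be produced with a few lines of pandas and matplotlib. The sketch below is not the notebook code behind the graphs above: the countries, years, and percentages are invented, and the column names are assumptions. It shows the general pattern of averaging by year, finding the country with the highest average, and plotting both trends.

```python
import matplotlib
matplotlib.use("Agg")  # render to a file; no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Invented values for two hypothetical countries, for illustration only.
df = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "year": [1990, 2000, 2010, 1990, 2000, 2010],
    "underweight_pct": [30.0, 25.0, 20.0, 12.0, 15.0, 9.0],
})

# Average across countries for each year.
yearly_mean = df.groupby("year")["underweight_pct"].mean()

# Country with the highest average level, and its trend over time.
country_mean = df.groupby("country")["underweight_pct"].mean()
highest = country_mean.idxmax()
trend = df[df["country"] == highest].set_index("year")["underweight_pct"].sort_index()

# Year in which that country's level peaked: a starting point for asking "what happened then?"
peak_year = trend.idxmax()
print(highest, peak_year)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
yearly_mean.plot(ax=ax1, marker="o", title="Average across countries")
trend.plot(ax=ax2, marker="o", title=f"Trend for country {highest}")
for ax in (ax1, ax2):
    ax.set_xlabel("Year")
    ax.set_ylabel("Underweight (%)")
fig.tight_layout()
fig.savefig("malnutrition_trends.png")
```

In a Jupyter notebook you can drop the `matplotlib.use("Agg")` line and the `savefig` call and let the figure render inline.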
Although data sets vary in the number of rows and columns, the type of data they contain, and the spread of that data, among other things, basic EDA tools can provide an inroad into them before more complex analysis begins.

Additional Resources:
Python for Data Analysis


