In 1854, John Snow, an English physician, created a geospatial map showing the spread of cholera along streets served by a contaminated water pump in London. That map helped disprove the theory that miasma, or "bad air," was the cause of cholera, and it provided evidence for his alternative hypothesis: cholera was spread by microbes in the water. One of the more interesting data points he had to work with was the case of a woman who lived far from the clusters of the disease but had somehow contracted it. Snow discovered that she especially liked the water from the pump in that part of town and had it delivered to her.
Visualizing data through maps and graphs during early exploratory data analysis can give rise to hypotheses that can then be tested with further analysis and statistical work. It can also reveal patterns that are not immediately obvious when looking at the raw data set. Fortunately, unlike in Snow's time, tools now exist that automate and simplify exploratory data analysis. While some correlations have to be teased out through complex ML algorithms, others reveal themselves easily in the early stages of exploratory data analysis.
Below is a walkthrough of an exploratory data analysis of UN malnutrition data.
Data analysis begins with data. There are various ways to collect data depending on your project. It may be as simple as downloading a CSV file from census or UN websites, getting access to an internal database from a partner organization, or scraping the internet for data. A friend working on a skateboard trick identifier went skating with four friends and set up cameras at four different angles to get the training set of images he needed for his algorithm.
Depending on the amount of data you are working with, it may make sense to use a local hard drive or cloud storage. Even when you don’t have a lot of data, it may make sense to use cloud services so that you can learn the different stacks currently used in production.
Accessing the Data:
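As a minimal sketch of this step, assume the data has been downloaded as a CSV file with the hypothetical columns `country`, `year`, and `underweight_pct` (the real UNICEF file may use different names). The rows can then be read with Python's standard `csv` module; in a notebook, `pandas.read_csv` would be a common alternative. The sample text below is a made-up placeholder, not real data.

```python
import csv
import io

def load_rows(source):
    """Read CSV rows into a list of dicts, converting numeric fields.

    `source` is any open text file object. The column names used here
    (country, year, underweight_pct) are assumptions for illustration.
    """
    rows = []
    for record in csv.DictReader(source):
        value = record["underweight_pct"].strip()
        if not value:
            # Skip rows with a missing measurement rather than guessing a value.
            continue
        rows.append({
            "country": record["country"].strip(),
            "year": int(record["year"]),
            "underweight_pct": float(value),
        })
    return rows

# Placeholder text standing in for a downloaded file; the values are made up.
SAMPLE_CSV = """country,year,underweight_pct
Country A,2000,10.5
Country A,2005,
Country B,2000,20.0
"""

rows = load_rows(io.StringIO(SAMPLE_CSV))
print(len(rows), "rows loaded")
```

For a file on disk, the same function could be called inside `with open(path, newline="") as f:`, where `path` is wherever the download was saved.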
Exploratory Data Analysis: Use a Notebook
Below is an EDA walkthrough of UNICEF undernutrition data. You may find it here.
Using Graphs for EDA, Global Malnutrition Case Study
What country has the highest malnutrition levels? What has been the malnutrition trend in this country?
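One way to approach these two questions, sketched with the standard library only: find the row with the highest reading, then pull that country's readings in year order so they can be plotted. The field names and the numbers are hypothetical placeholders, not UNICEF figures.

```python
def highest_country(rows, field="underweight_pct"):
    """Return the country that has the single largest reading in `rows`."""
    worst = max(rows, key=lambda r: r[field])
    return worst["country"]

def trend(rows, country, field="underweight_pct"):
    """Return (year, value) pairs for one country, sorted by year."""
    return sorted((r["year"], r[field]) for r in rows if r["country"] == country)

# Made-up rows for illustration only.
rows = [
    {"country": "Country A", "year": 2000, "underweight_pct": 10.5},
    {"country": "Country B", "year": 2005, "underweight_pct": 18.0},
    {"country": "Country B", "year": 2000, "underweight_pct": 20.0},
    {"country": "Country A", "year": 2005, "underweight_pct": 9.0},
]

country = highest_country(rows)
series = trend(rows, country)
print(country, series)
```

Here "highest" means the single largest reading. Ranking by the most recent survey or by an average per country are equally reasonable choices and may give a different answer.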
The malnutrition graph above made me wonder what was happening in Georgia in 1991 and 1992, and I learnt that this was when the Georgian Civil War occurred. This really piqued my interest because Kenya is in the middle of re-electing a president, a process that has led to ethnic conflict in the past. I plotted Kenya’s malnutrition graph and noticed that the peaks coincide with elections and post-election violence.
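This kind of visual comparison can also be checked a little more systematically. The sketch below is an illustration, not the analysis used for the graphs: it marks the local peaks in a yearly series and lists which of them fall in a given set of event years. The series values and the event years are made-up placeholders.

```python
def local_peaks(series):
    """Return the years whose value is strictly higher than both neighbours.

    `series` is a list of (year, value) pairs sorted by year.
    """
    peaks = []
    for i in range(1, len(series) - 1):
        prev_value = series[i - 1][1]
        year, value = series[i]
        next_value = series[i + 1][1]
        if value > prev_value and value > next_value:
            peaks.append(year)
    return peaks

# Hypothetical yearly readings and hypothetical event years.
series = [(1990, 5.0), (1991, 8.0), (1992, 6.0),
          (1993, 6.5), (1994, 9.0), (1995, 7.0)]
event_years = {1991, 1995}

peaks = local_peaks(series)
overlap = sorted(set(peaks) & event_years)
print("peaks:", peaks, "coinciding with events:", overlap)
```

An overlap like this is a prompt for a hypothesis to test with further analysis, not evidence of cause on its own.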
Although data sets vary in the number of rows and columns, the type of data they contain, and the spread of that data, among other things, basic EDA tools can provide a way into them before more complex data analysis.
Python for Data Analysis