One of the most common projects companies take on today is migration. Not just from one database system to another, like recent projects where we migrated Oracle databases to MS SQL Server, Oracle to PostgreSQL, or MS SQL Server to MySQL, but also migrating servers from local hardware to cloud-based systems like AWS and Azure.
The goal of these projects is simple: reduce costs and increase the ability to spin up servers and databases on a whim.
Databases are not cheap to license when you use products like Oracle or MS SQL Server. Add on top of that Oracle's and Microsoft's one-off costs for every tool and add-on a company can buy, and the price starts to become overwhelming.
Databases no longer have to cost an arm and a leg, and costs can be reduced further by not just converting the RDBMS (relational database management system) but also migrating to AWS.
In addition, the ease with which a general user can spin up and tear down a server on AWS allows for much more agile project development. This makes it easier to go from prototype to final product without dealing with as much bureaucracy. That is, if the company even had enough space on their servers.
Back when companies had to manage more of their own servers, running out of space on the current racks meant going through the process of buying a new server.
That meant getting approval, putting in a PO, waiting for the server, configuring it, securing it, and then putting it online. This was not only expensive; it could take weeks, months, maybe even a year or two depending on the pace of the company (let's not even get started on whether the server was needed at all; often there was plenty of space on a server somewhere, but no one knew it existed).
Now, if a new server is needed, depending on the internal processes, it could be one quick approval away from being spun up. A new database is just a statement away.
Both offer significant advantages in savings.
However, database migrations and conversions are technologically complicated and intense projects! They require experts in database management and security, as well as project managers, to ensure the end result is secure and behaves exactly the same as the previous set of objects.
So why would a company do it?
Over the next few articles we will be discussing how to convert various databases like Oracle and MS SQL to other options that can be free and just as effective.
We will also be listing out the benefits to switching to a cloud system.
At the end of the day, all of this will help reduce an IT department's cost substantially.
If you need help with any services such as data migrations or converting one RDBMS to another, our team would be happy to help! We have many members who have done all forms of data conversions and migrations.
If you want to read more about databases, data science and how to manage great data teams, then check out the articles below.
Should Our Team Invest In A Data Warehouse?
How To Survive Corporate Politics As A Data Scientist
8 Great Libraries For Machine Learning
Creating A Better Algorithm With Boosting and Bagging
Guest Written By Rebecca Njeri
Last Thursday, I attended the machinery.ai conference in Seattle, WA, and got to listen to talks by Machine Learning experts that ranged from Machine Thinking to Integrating Data Science into Legacy Products. After about 1.5 years of learning and practising data science, this conference reminded me of the things that intrigued me when I first started learning data science, and I thought that I should write a post explaining the three different groups of machine learning algorithms.
Machine Learning can be defined as the science of getting computers to act without being explicitly programmed. It can be further divided into three broad categories: supervised learning, unsupervised learning, and reinforcement learning. A machine learning model should be chosen depending on the nature of the data available as will be illustrated below.
Asish Bansal premised his talk, Machine Thinking, by stating that not all business problems need a machine learning or deep learning solution. He argued that most business problems have a software engineering solution, and later, if need be, a machine learning or deep learning solution can be developed. To illustrate his point, he used the “FizzBuzz in TensorFlow interview” example where Joel Grus codes, as a joke, a TensorFlow solution to the fizzbuzz problem.
Bansal’s talk reminded me of the importance of the business understanding and data understanding parts of the CRISP-DM process. Understanding the kind of data available: numbers, words, images, or voice data, labelled versus unlabeled, will determine what kind, if any, machine learning algorithm is the appropriate solution.
The main goal of supervised learning is to learn a model from labeled training data that allows us to make predictions about unseen or future data (Python Machine Learning, 3). Supervised learning can be divided into two categories depending on the outcome. If the outcome is a continuous value, we have a regression model, and if the outcome is discrete class labels, we have a classification model. There can be both binary classification models and multi-class classification models.
The simplest example of a regression problem is y = mx + c, where a univariate independent variable x is correlated with a dependent variable y, and an equation can be fit to known values and used to predict unknown values of y given x. Another example of a regression problem, to once more borrow from the machinery.ai talks, is how long a person's commute will take given a labeled training set that has weather information and time of day as the independent variables, and commute times as the associated response variable.
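To make the regression idea concrete, here is a minimal sketch that fits a line to a handful of known (x, y) points with scikit-learn and predicts an unseen value. The numbers are invented for illustration only:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Hypothetical training data: hour of day vs. commute time in minutes
    X = np.array([[7], [8], [9], [16], [17], [18]])   # independent variable
    y = np.array([35, 50, 45, 40, 55, 60])            # dependent (response) variable

    model = LinearRegression()
    model.fit(X, y)                     # learns the slope m and intercept c in y = mx + c

    print(model.coef_, model.intercept_)
    print(model.predict([[12]]))        # predicted commute time for an unseen hour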
Commonly occurring examples of binary classification problems in business analytics include: whether a customer will churn or not, whether a lead will convert or not, whether a transaction is fraudulent, and whether an email is spam or not, among others.
Multi-class classification problems are similar to binary classification problems except there are more than two class labels. An example of this can be a classification of the different demographics of people who frequent a bookstore where labels can include: children under five, teens, young adults, adults, etc. Clearly segregating the shoppers can facilitate more efficient marketing campaigns and help the store’s bottom line.
In reinforcement learning, the goal is to develop a system that improves its performance based on interactions with the environment. The term reinforcement learning is actually borrowed from psychology which refers to any “stimulus which strengthens or increases the probability of a specific response. For example, if you want your dog to sit on command you may give him a treat every time he sits for you.”
For a machine learning example, when a self driving car takes a sharp turn too fast and moves outside its lane, it learns to adjust its speed the next time it takes that turn to ensure it stays within its lane. A reinforcement learning model improves its performance because it learns as it interacts with its environment.
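As a rough illustration of how such a system updates itself from feedback, here is a minimal, hypothetical tabular Q-learning style update; the states, actions, and reward values are entirely invented for illustration:

    # Q(s, a) <- Q(s, a) + alpha * (reward + gamma * max over a' of Q(s', a') - Q(s, a))
    alpha, gamma = 0.1, 0.9                        # learning rate and discount factor
    Q = {("sharp_turn", "fast"): 0.0, ("sharp_turn", "slow"): 0.0}

    state, action = "sharp_turn", "fast"
    reward, next_state = -10.0, "off_lane"         # taking the turn too fast is penalized
    next_best = max(Q.get((next_state, a), 0.0) for a in ["fast", "slow"])
    Q[(state, action)] += alpha * (reward + gamma * next_best - Q[(state, action)])

    print(Q)   # the estimated value of taking the turn fast has decreased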
Unsupervised learning is machine learning where there is unlabeled data or data of unknown structure. Examples of unsupervised learning algorithms include clustering and dimensionality reduction such as Principal Component Analysis. The model tries to learn patterns and correlations within the data on its own. Without an associated response variable Y, the goal is to “discover interesting things about the measurements: is there an informative way to visualize the data? Can we discover subgroups among the variables or among the observations?”
If the bookstore problem was presented without the class labels of the shoppers, a clustering algorithm could be fit to the data to separate the shoppers into different groups.
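A minimal sketch of that idea with scikit-learn's KMeans follows; the two features and their values are invented for illustration:

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical unlabeled data: [age, visits_per_month] for each shopper
    shoppers = np.array([[4, 2], [6, 1], [15, 4], [17, 3], [35, 8], [40, 6]])

    kmeans = KMeans(n_clusters=3, random_state=0)
    labels = kmeans.fit_predict(shoppers)   # the algorithm discovers the groups on its own

    print(labels)                   # cluster assignment for each shopper
    print(kmeans.cluster_centers_)  # the center of each discovered group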
Almost every data science talk I have listened to underlines the fact that the majority of data science work is data mining and data cleaning before any machine learning models can be built. In fact, most supervised and unsupervised learning algorithms are available in Python's sklearn library, in RStudio, or in some other form of open source software. Ultimately, an intimate understanding of the data that is available, and of the implementation of the different machine learning algorithms, is necessary to leverage the power of supervised, unsupervised, and reinforcement learning.
Andrew Ng’s Machine Learning Class on Coursera
Just for gags: Alexa And Google Home Are Scheming Against Apple's HomePod
Read More Data Science and Machine Learning Blog Posts
Creating A Better Algorithm With Boosting and Bagging
How To Survive Corporate Politics As A Data Scientist
Statistics Review For Data Scientists
A Guide To Starting A New Data Science Project
How To Grow A Data Science Team
Web scraping and utilizing various APIs are great ways to collect data from websites and applications that can later be used in data analytics. There is a company called HiQ that is well known for web scraping. HiQ crawls various "public" websites to collect data and provide analytics for companies on their employees. They help companies find top talent using data from sites like LinkedIn and other public sources to gain the information needed for their algorithms.
However, they ran into legal issues when LinkedIn sent them a cease and desist and put in place technical measures to slow down HiQ's web crawlers. HiQ subsequently sued LinkedIn and won! The judge said that as long as the data was public, it was scrapable!
This was quite the win for scrapers in general.
So how can your company take advantage of online public data? Especially when your team might not have a programming background.
Web scraping typically requires a complex understanding of HTTP requests, faking headers, complex Regex statements, HTML parsers, and database management skills.
There are programming languages that make this much easier such as Python. This is because Python offers libraries like Scrapy and BeautifulSoup that make scraping and parsing HTML easier than old school web scrapers.
However, it still requires proper design and a decent understanding of programming and website architecture.
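For teams that do have a developer handy, here is a minimal sketch of what that route looks like using the requests and BeautifulSoup libraries; the URL and the tag being extracted are placeholders, not a working scraper for any particular site:

    import requests
    from bs4 import BeautifulSoup

    url = "http://www.example.com"               # placeholder page to scrape
    response = requests.get(url)                 # fetch the raw HTML
    soup = BeautifulSoup(response.text, "html.parser")

    # Pull every link on the page; real scrapers target specific ids and classes
    for link in soup.find_all("a"):
        print(link.get("href"), link.get_text(strip=True))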
Let's say your team does not have programming skills. That is OK! One of our team members recently gave a webinar at Loyola University to demonstrate how to scrape web pages without programming. Instead, Google Sheets offers several useful functions that can help scrape web data. If you would like to see the video of our webinar, it is below. If not, you can continue reading to learn how to use Google Sheets to scrape websites.
The functions you can use for web scraping with Google Sheets are ImportFeed, ImportXML, and ImportHTML.
All of these functions will scrape websites based on the different parameters provided to the function.
Web Scraping With ImportFeed
The ImportFeed Google Sheets function is one of the easier functions to use. It only requires access to Google Sheets and the URL for an RSS feed. This is a feed that is typically associated with a blog.
For instance, you could use our RSS feed "http://www.acheronanalytics.com/2/feed".
How do you use this function? For example, entering =IMPORTFEED("http://www.acheronanalytics.com/2/feed") into a cell will pull in the items from that feed.
That is all that is needed! There are some other tips and tricks that can help clean up the data feed as you will get more than just one column of information. For now, this is a great start at web scraping.
Do The Google Sheets Import Functions Update?
All of these import functions automatically update data every 2 hours. A trigger function can be set to increase the cadence of updates, but this requires more programming.
That is all there is to it in this case! From here, it is all about how your team uses it! Make sure you engineer a solid data scraping system.
Web Scraping With ImportXML
The ImportXML function in Google Sheets is used to pull out specific data points using HTML ids and classes. This requires some understanding of HTML and XPath queries, which can be a little frustrating, so we created a step-by-step guide for web scraping HTML.
Here are some examples from an EventBrite page. The general form is =IMPORTXML("https://www.example.com", "//h1"), where the second argument is an XPath query that selects the elements you want to extract (the URL and query here are placeholders).
The truth about using this function is that it requires a lot of time. Thus, it requires planning and designing a good Google Sheet to ensure you get the maximum benefit from utilizing it. Otherwise, your team will end up spending time maintaining it rather than working on new things.
Web Scraping With ImportHTML
Finally, we will discuss ImportHTML. This will import a table or list from a web page. For instance, what if you want to scrape data from a site that contains stock prices?
We will use http://www.nasdaq.com/symbol/snap/real-time. There is a table on this page that has the stock prices from the past few days.
Similar to the previous functions, you need to provide the URL. On top of the URL, you will have to specify whether you want a "table" or a "list" and which one on the page you want to grab, by its index.
An example would be =IMPORTHTML("http://www.nasdaq.com/symbol/snap/real-time", "table", 6). This will scrape the stock prices from the link above.
In our video above, we also show how we combined the stock data scraped above with news about the stock ticker on that day. This could be utilized in a much more complex manner. A team could create an algorithm that utilizes past stock prices, as well as news articles and Twitter information, to choose whether to buy or sell stocks.
Do you have any good ideas of what you could do with web scraping? Do you need help with your web scraping project? Let us know!
Other great read about data science:
What is A Decision Tree
How Algorithms Can Become Unethical and Biased
Intro To Data Analysis For Everyone Part 1
Why Invest In A Data Warehouse?
Over 160 years ago, John Snow, an English physician, created a geospatial map showing the spread of cholera along streets served by a contaminated water pump in London. That map helped disprove the theory that miasma (bad air) was the cause of cholera. The map provided evidence for his alternative hypothesis: cholera was spread by microbes in the water. One of the more interesting data points he had to work with was the case of a woman who lived further away from the clusters of the disease but who had somehow contracted it. John Snow discovered that the woman especially liked the water from that part of town and would have it delivered to her.
Visualizing data through maps and graphs during early exploratory data analysis can give rise to hypotheses that can be proved with further analysis and statistical work. It can reveal patterns that are not immediately obvious when looking at a data set. Fortunately, unlike in the 1850s, there now exist tools that automate and simplify exploratory data analysis. While some correlations have to be teased out through complex ML algorithms, some reveal themselves easily in the early stages of exploratory data analysis.
Below is a walk through an exploratory data analysis of UN malnutrition data.
Data analysis begins with data. There are various ways to collect data depending on your project. Some may be as simple as downloading a CSV file from census or UN websites, receiving access to an internal database from a partner organization, or scraping the internet for data. A friend working on a skateboard trick identifier went skating with four friends and set up cameras at four different angles to get the training set of images he needed for his algorithm.
Depending on the amount of data you are working with, it may make sense to use a local hard drive or to use cloud storage. Even when you don’t have a lot of data, it may make sense to use cloud services so you may learn the different stacks currently in production.
Accessing the Data:
Exploratory Data Analysis: Use a Notebook
Below is an EDA walkthrough of UNICEF undernutrition data. You may find it here.
Using Graphs for EDA, Global Malnutrition Case Study
What country has the highest malnutrition levels? What has been the malnutrition trend in this country?
The malnutrition graph above made me wonder what was happening in Georgia in 1991 and 1992, and I learnt that that was when the Georgian Civil War occurred. This really piqued my interest because Kenya is in the middle of re-electing a president, which in the past has led to ethnic conflicts. I plotted Kenya's malnutrition graph and noticed that the peaks coincide with elections and post-election violence.
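A minimal sketch of how such a country-level trend plot can be produced with pandas and matplotlib follows; the file name and column names are hypothetical, so match them to the actual UNICEF extract you are working with:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("unicef_malnutrition.csv")    # hypothetical export of the UN data

    kenya = df[df["Country"] == "Kenya"]
    (kenya.groupby("Year")["Wasting_Rate"]         # hypothetical malnutrition measure
          .mean()
          .plot(marker="o", title="Malnutrition in Kenya over time"))
    plt.ylabel("Wasting rate (%)")
    plt.show()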
Although data sets will vary in the number of columns and rows, type of data contained, spread of the data, among others, basic EDA tools can provide an inroad to these data sets before more complex data analysis.
Python for Data Analysis
Great Future Data Science Reads!
A Guide To Designing Data Science Projects
How Machine Learning Algorithms Learn Bias
8 Great Python Libraries For Machine Learning
Basic Data Science And Statistics That Every Data Scientist Should Know
Why Use Data Science?
Guest written by Rebecca Njeri!
What is a Decision Tree?
Let’s start with a story. Suppose you have a business and you want to acquire some new customers. You also have a limited budget, and you want to ensure that, in advertising, you focus on customers who are the most likely to be converted.
How do you figure out who these people are? You need a classification algorithm that can identify these customers and one particular classification algorithm that could come in handy is the decision tree. A decision tree, after it is trained, gives a sequence of criteria to evaluate features of each new customer to determine whether they will likely be converted.
To start off, you can use data you already have on your existing customers to build a decision tree. Your data should include all the customers, their descriptive features, and a label that indicates whether they converted or not.
The idea of a decision tree is to divide the data set into smaller data sets based on the descriptive features until you reach a small enough set that contains data points that fall under one label.
Each internal (parent) node splits on a feature of the data set, and the leaf (child) nodes represent the outcomes. The decision on which feature to split on is made based on the resulting entropy reduction, or information gain, from the split.
Classification problems for decision trees are often binary: True or False, Male or Female. However, decision trees can also be used to solve multi-class classification problems where the labels are [0, …, K-1], or for this example, [‘Converted customer’, ‘Would like more benefits’, ‘Converts when they see funny ads’, ‘Won’t ever buy our products’].
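As a minimal sketch of what training such a tree looks like with scikit-learn (the two customer features and the labels are invented for illustration):

    from sklearn.tree import DecisionTreeClassifier

    # Hypothetical customers: [ads_clicked, site_visits] and whether they converted (1) or not (0)
    X = [[0, 1], [1, 3], [5, 8], [6, 10], [2, 2], [7, 12]]
    y = [0, 0, 1, 1, 0, 1]

    tree = DecisionTreeClassifier(criterion="entropy")   # choose splits by information gain
    tree.fit(X, y)

    new_customer = [[4, 9]]
    print(tree.predict(new_customer))    # will this customer likely convert?

The trained tree can then be traversed as a sequence of yes/no questions about each new customer's features.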
Using Continuous Variables to Split Nodes in a Decision Tree
Continuous features are turned to categorical variables (i.e. lesser than or greater than a certain value) before a split at the root node. Because there could be infinite boundaries for a continuous variable, the choice is made depending on which boundary will result in the most information gain.
For example if we wanted to classify quarterbacks versus defensive ends on the Seahawks team using weight, 230 pounds would probably be more appropriate as a boundary than 150 pounds. Trivial fact: the average weight of a quarterback is 225 pounds, while that of a defensive end is 255 pounds.
What is Entropy/Information Gain?
Shannon's entropy model is a computational measure of the impurity of elements in a set. The goal of the decision tree is to produce subsets that minimize impurity. To go back to our story, we start with a set of the general population that may see our ad. The data set is then split on different variables until we arrive at a subset where everyone either buys the product or does not buy the product. Ideally, after traversing our decision tree to the leaves, we should arrive at a pure subset, where every customer has the same label.
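To make this concrete, here is a minimal sketch of Shannon entropy computed over a set of labels; the example labels are invented:

    import math
    from collections import Counter

    def entropy(labels):
        # Shannon entropy: -sum(p * log2(p)) over the label proportions
        total = len(labels)
        probs = [count / total for count in Counter(labels).values()]
        return sum(-p * math.log2(p) for p in probs)

    print(entropy(["buy", "buy", "no", "no"]))    # 1.0: maximally impure 50/50 split
    print(entropy(["buy", "buy", "buy", "buy"]))  # 0.0 (a pure subset; Python may display -0.0)

Information gain is then the parent set's entropy minus the weighted average entropy of the child subsets produced by a split.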
Advantages of Decision Trees
Disadvantages of Decision Trees
Pruning is a method of limiting tree depth to reduce overfitting in decision trees. There are two types of pruning: pre-pruning, and post-pruning.
Pre-pruning a decision tree involves setting the parameters of a decision tree before building it. There are a few ways to do this:
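Commonly used pre-pruning controls include capping tree depth, requiring a minimum number of samples per split or per leaf, and limiting the number of leaf nodes. A minimal sketch with scikit-learn's DecisionTreeClassifier (the parameter values are illustrative, not recommendations):

    from sklearn.tree import DecisionTreeClassifier

    tree = DecisionTreeClassifier(
        max_depth=5,            # cap how deep the tree can grow
        min_samples_split=20,   # require at least 20 samples before splitting a node
        min_samples_leaf=10,    # require at least 10 samples in each leaf
        max_leaf_nodes=25,      # cap the total number of leaves
    )
    # tree.fit(X_train, y_train)  # X_train / y_train stand for whatever training data you have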
To post-prune, validate the performance of the model on a test set. Afterwards, cut back splits that seem to result from overfitting noise in the training set. Pruning these splits dampens the noise in the data set.
*Post-pruning may result in overfitting the model
*Post-pruning is currently not available in Python’s scikit learn, but it’s available in R.
Creating ensembles involves aggregating the results of different models. Ensemble decision trees are used in bagging and random forests, while ensemble regression trees are used in boosting.
Bagging involves creating multiple decision trees each trained on a different bootstrap sample of the data. Because bootstrapping involves sampling with replacement, some of the data in the sample is left out of each tree.
Consequently, the decision trees created are made using different samples which solves the problem of overfitting to the training sample. Ensembling decision trees in this way helps reduce the total error because variance of the model continues to decrease with each new tree added without an increase in the bias of the ensemble.
A bag of decision trees that uses subspace sampling is referred to as a random forest. Only a selection of the features is considered at each node split which decorrelates the trees in the forest.
Another advantage of random forests is that they have an in-built validation mechanism. Because only a percentage of the data is used for each model, an out-of-bag error of the model’s performance can be calculated using the 37% of the sample left out of each model.
Boosting involves aggregating a collection of weak learners (regression trees) to form a strong predictor. A boosted model is built over time by adding new trees into the model that minimize the error made by previous learners. This is done by fitting each new tree on the residuals of the previous trees.
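A minimal sketch of these three ensemble approaches with scikit-learn (the hyperparameter values are illustrative only):

    from sklearn.ensemble import (BaggingClassifier, RandomForestClassifier,
                                  GradientBoostingRegressor)
    from sklearn.tree import DecisionTreeClassifier

    # Bagging: many trees, each trained on a different bootstrap sample of the data
    bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100)

    # Random forest: bagging plus subspace sampling of features at each split,
    # with the out-of-bag samples used as a built-in validation set
    forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", oob_score=True)

    # Boosting: regression trees added sequentially, each fit to the previous trees' residuals
    booster = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1)

Each of these estimators is then fit and evaluated the same way as a single tree.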
If it isn't clear thus far, for many real-world applications a single decision tree is not a preferable classifier, as it is likely to overfit and generalize very poorly to new examples. However, an ensemble of decision or regression trees minimizes the overfitting disadvantage, and these models become stellar, state-of-the-art classification and regression algorithms.
A Complete Tutorial on Tree Based Modeling from Scratch (in R & Python)
Fundamentals of Machine Learning for Predictive Data Analytics
A Guide To Designing A Data Science Project
Top 8 Python Programming Languages for Machine Learning
Basic Statistics For Data Scientists
Recently, our team of data consultants had an awesome opportunity to present to a class of future data scientists at Galvanize Seattle. One student who came to hear our talk was Rebecca Njeri. Below, she shares tips on how to design a Data Science project.
To Begin, Brainstorm Data Project Ideas
To begin your data science project, you will need an idea to work on. To get started, brainstorm possible ideas that might interest you. During this process, go as wide and as crazy as you can, don’t censor yourself. Once you have a few ideas, you can narrow down to the most feasible/interesting idea. You could brainstorm ideas around these prompts:
Questions To Help You Think Of Your Next Data Science Projects
Write a proposal:
Write a proposal along the Cross Industry Standard Process for Data Mining (CRISP-DM) standard, which has the following steps: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.
What are the business needs you are trying to address? What are the objectives of the Data Science project? For example, if you are at a telecommunications company, that needs to retain its customers, can you build a model that predicts churn? Maybe you are interested in using live data to help better predict what coupons to offer what customers at the grocery store.
What kind of data is available to you? Is it stored in a relational or NoSQL database? How large is your data? Can it be stored and processed on your hard drive, or will you need cloud services? Are there any confidentiality issues or NDAs involved if you are working in partnership with a company or organization? Can you find a new data set online that you could merge in to increase your insights?
This stage involves doing a little exploratory data analysis and thinking about how your data will fit into the model that you have. Is the data in data types that are compatible with the model? Are there missing values or outliers? Are these naturally occurring discrepancies or errors that should be corrected before fitting the data into a model? Do you need to create dummy variables for categorical variables? Will you need all the variables in the data set, or are some dependent on each other?
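A minimal sketch of this kind of data preparation with pandas; the file and column names are hypothetical:

    import pandas as pd

    df = pd.read_csv("customers.csv")                     # hypothetical raw data set

    df = df.drop_duplicates()
    df["age"] = df["age"].fillna(df["age"].median())      # impute missing numeric values
    df = df[df["age"] < 120]                              # drop obviously erroneous outliers

    # Turn a categorical column into dummy/indicator variables for modeling
    df = pd.get_dummies(df, columns=["plan_type"], drop_first=True)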
Choose a model and tune the parameters before fitting it to your training set of data. Python’s scikit learn library is a good place to get model algorithms. With larger data, consider using Spark ML.
Withhold a test set of data to evaluate the model's performance. Data Science Central has a great post on different metrics that can be used to measure model performance. The confusion matrix can help with considering the cost-benefit implications of the model's performance.
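A minimal sketch of the modeling and evaluation steps together; the data here is synthetic stand-in data, so substitute your own prepared features and labels:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix, classification_report

    # Stand-in data; replace with your prepared feature matrix and labels
    X, y = make_classification(n_samples=500, n_features=10, random_state=42)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    model = LogisticRegression()            # swap in whichever algorithm fits your problem
    model.fit(X_train, y_train)

    predictions = model.predict(X_test)     # evaluate only on the withheld test set
    print(confusion_matrix(y_test, predictions))
    print(classification_report(y_test, predictions))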
Deployment and implementation are some of the key components of any data driven project. You have to get past the theory and algorithms and actually integrate your data science solution into the larger environment.
Flask and bootstrap are great tools to help you deploy your data science project to the world.
Planning Your Data Science Projects
Keep a timeline with To Do, In Progress, Completed, and Parking sections. Have a self-scrum (lol) each morning to see what you accomplished the previous day and set a goal for the new day. It could also help to get a friend with whom to scrum and help you keep track of your metrics. Goals and metrics can help you hold yourself accountable and ensure that you actually follow through and get your project done.
Track your Progress
Create a GitHub repo for your project. Your proposal can be incorporated as the README. Commit your work at the frequency which makes you comfortable, and keep track of how much progress you are making on your metrics. A repo will also make it easier to show your code to friends/mentors for a code review.
Knowing When to Stop Your Project
It may be good to work on your project with a minimum viable product in mind. You may not get all the things on your To Do list accomplished, but having an MVP can help you know when to stop. When you have learned as much as you can from a project, even if you don’t have the perfect classification algorithm, it may be more worthwhile to invest in a new project.
Some Examples Of Data Driven Projects
Below are some links to Github repos of some Data Science Capstones:
Predicting Change in Rental Price Units in NYC
All the best with your new Data Science project! Feel free to reach out if you need someone to help you plan your new project.
Want to be further inspired on your next data driven project!
Check out some of our other data science and machine learning articles. You never know what might inspire you.
Practical Data Science Tips
Creatively Classify Your Data
25 Tips To Gain New Customers
How To Grow Your Data Science Or Analytics Practice
Come join our team of data scientists and machine learning experts as we discuss ethical machine learning at DAML (Data Analytics Machine Learning) at Redfin. Our presentation will be followed by Josh Poduska, a Senior Data Scientist in HPE's Big Data Software Group, who will be discussing machine learning on distributed systems.
We are very excited for the opportunity to present and can’t wait to see you guys there! It is 100% free and food is provided. Free data science and machine learning talks + free food? What more do you need!
Click Here To RSVP to DAMLs Machine Learning Talk on August 24th For Free
Ethical Machine Learning
Non-technical companies are slowly finding ways to increase their business value using the increased speed of computing and statistics. The problem is, business has always been more concerned with increasing the bottom line than with social impact. It is one thing when we joke about large e-commerce sites selling us that extra toaster. But what about when companies whose products have been proven harmful reach out to data scientists and attempt to have them develop systems that increase the profit of a product with a negative social impact, or when companies use data science to manipulate the customer rather than benefit them? Should we? Is it right to forget about the social impact just to make an extra dollar?
Machine Learning on Distributed Systems
Most real-world data science workflows require more than multiple cores on a single server to meet scale and speed demands, but there is a general lack of understanding of what machine learning on distributed systems looks like in practice. Gartner and Forrester do not consider distributed execution when they score advanced analytics software solutions. Much formal machine learning training occurs on single-node machines with non-distributed algorithms. In this talk we discuss why an understanding of distributed architectures is important for anyone in the analytical sciences. We will cover the current distributed machine learning ecosystem. We will review common pitfalls when performing machine learning at scale. We will discuss architectural considerations for a machine learning program, such as the role of storage and compute and under what circumstances they should be combined or separated.
Feel free to read some of our other blog posts as well!
Best Python Libraries for Machine Learning
Automating Your Data Science Workflow
Should We Start A Data Science Team?
Recently, our team of data consultants had an awesome opportunity to present to a class of future data scientists at Galvanize Seattle. It was a lot of fun and we met a lot of ex-software developers and IT specialists. One student who had come to hear our talk was named Rebecca Njeri. She did not have a background in software engineering. However, she was clearly well adapted to the new world. In fact, for one of her projects she used company data to create a recidivism prediction model for former inmates using supervised learning models.
How do Machine Learning Algorithms Learn Bias?
There are funny mishaps that result from imperfectly trained machine learning algorithms. Like my friend’s iPhone classifying his dog as a cat. Or these two guys stuck on a voice activated elevator that doesn’t understand their accent. Or maybe Amazon’s Alexa trying to order hundreds of dollhouses because it confuses the news anchor’s report for a request from its owner. There are also the memes on the Amazon Whole Foods purchase, which are truly in the spirit of defective algorithms.
“Bezos: "Alexa, buy me something from Whole Foods."
Alexa: "Buying Whole Foods."
Bezos: "Wait, what?"”
The Data Science Capstone
For my final capstone for the Galvanize Data Science Immersive, I spent a lot of time exploring the concept of algorithmic bias.
I had partnered with an organization that helps former inmates go back to school, and consequently lowers their probability of recidivating. The task I had was to help them figure out the total cost of incarceration, i.e. both the explicit and implicit costs of someone being incarcerated.
While researching this concept, I stumbled upon ProPublica's Machine Bias essay that discusses how risk assessment algorithms contain racial bias. I learnt that an algorithm that returns disproportionate false positives for African Americans is being used to sentence them to longer prison sentences and deny them parole, that tax dollars are being spent on incarcerating people who would otherwise be out in society being productive members of the community, and that children whose parents shouldn't be in prison are in the foster care system.
An algorithm that has disparate impact is causing people to lose jobs and their social networks, and ensuring the worst cold start problem once someone has been released from prison. At the same time, people likely to commit crimes in the future are let go free because the algorithm is blind to their criminality.
How do these false positives and negatives occur and does it matter? To begin with, let us define three concepts related to the Confusion Matrix: precision, recall, and accuracy.
Precision is the number of correctly classified true positives as a percentage of all positive predictions. High precision means that most of the cases you label as positive really are positive, i.e. the model returns few false positives. For example, if a security breach by one of your employees is suspected, you would like a precise model to predict who the culprit is, to ensure that a) you stop the breach, and b) you cause minimal interruption to staff members who are wrongly flagged while trying to find this person.
Recall, on the other hand, is the percentage of relevant elements that are actually returned, i.e. how many of the true positives the model catches. A medical diagnostic tool should have high recall, because failing to catch an illness (a false negative) can allow it to worsen; in such a time-sensitive situation, the goal is to minimize the number of false negatives returned. Another example: if you search for Harry Potter books on Google, recall will be the number of Harry Potter titles returned divided by seven.
Ideally we would have a recall of 1, but chasing recall alone can mean returning many irrelevant results, which is a nuisance and a terrible user experience to sift through. Additionally, if a user does not see relevant results, they will likely not make any purchases, which eventually could hurt the bottom line.
Accuracy is a measure of all the correct predictions as a percentage of the total predictions. Accuracy does poorly as a measure of model performance, especially where you have unbalanced classes. For example, if only 1% of transactions are fraudulent, a model that predicts "not fraud" every time is 99% accurate yet catches no fraud at all.
For precision, recall, accuracy, and confusion matrices to make sense to begin with, the training data should be representative of the population such that the model learns how to classify correctly.
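A minimal sketch of how these metrics diverge on an unbalanced toy example; the labels below are invented (18 negatives, 2 positives):

    from sklearn.metrics import accuracy_score, precision_score, recall_score

    y_true = [0] * 18 + [1] * 2          # imbalanced classes: only 2 true positives
    y_pred = [0] * 19 + [1] * 1          # the model finds just one of them

    print(accuracy_score(y_true, y_pred))    # high, because the negatives dominate
    print(precision_score(y_true, y_pred))   # of the predicted positives, how many were right
    print(recall_score(y_true, y_pred))      # of the true positives, how many were caught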
Confusion matrices are the basis of cost-benefit matrices, aka the bottom line. For a business, the bottom line is easy to understand through profit and loss analysis. I suppose it’s a lot more complex to determine the bottom line where discrimination against protected classes is involved.
And yet, perhaps it is more urgent and necessary to do this work. There is increased scrutiny on the products we are creating and the biases will be visible and have consequences for our companies.
Machine Learning Bias Caused By Source Data
The largest proportion of machine learning is collecting and cleaning the data that is fed to a model. Data munging is not fun, and thinking about sampling and outliers and population distributions of the training set can be boring, tedious work. Indeed, machines learn bias from the oversights that occur during data munging.
With 2.5 exabytes of data generated every day, there is no shortage of data on which to train our models. There are faces of different colors, with and without glasses, wide eyes and narrow eyes, brown eyes and green eyes.
There are male and female voices, and voices with different accents. Not being culturally aware of the structure of the data set can result in models that are blind or deaf to certain demographics, thus marginalizing part of our user groups. Like when Google mistakenly tagged black faces as an album of gorillas. Or when airbags meant to protect passengers put women at risk of death during an accident. These errors, i.e. concluding that someone will be safe when they will actually be at risk, cost people's lives.
Earlier this year, one of my friends, a software engineer, asked her career adviser if it would be better to use her gender-neutral middle name on her resume and LinkedIn to make her job search easier. Her fear isn't baseless; there are insurmountable conscious and unconscious gender biases at the workplace. There was even a case where a man and woman switched emails for a short period and saw drastic differences in the way they were being treated.
How to Reduce Machine Learning Bias
However, if we are to teach machines to crawl LinkedIn and resumes, we have the opportunity to scientifically remove the discrimination we humans are unable to overcome. Biased risk assessment algorithms result from models being trained on data that is historically biased. It is possible to intervene and address the historical biases contained in the data such that the model remains aware of gender, age and race without discriminating against or penalizing any protected classes.
The data that seeds a reinforcement learning model can lead to drastically excellent or terrible results. Exponential improvement, or exponential depreciation could lead to increasingly better performing self driving cars that improve with each new ride, or they could convince a D.C. man of the truth of a non-existent sex trafficking ring in D.C.
How do machines learn bias? We teach machines bias through biased training data.
If you enjoyed this piece on data science and machine learning. Feel free to check out some of our other works!
Why Data Science Projects Fail
When Data Science Implementation Goes Wrong
Data Science Consulting Process
Recently, our team of data science consultants had an awesome opportunity to present to a class of future data scientists at Galvanize Seattle. It was a lot of fun and we met a lot of ex-software developers and IT specialists. One student who had come to hear our talk was named Rebecca Njeri. She did not have a background in software engineering. However, she was clearly well adapted to the new world. In fact, for one of her projects she used company data to create a recidivism prediction model for former inmates using supervised learning models.
We love the fact that her project was not just technically challenging, but that it was geared towards a bigger purpose than selling toasters or keeping customers from quitting your telecommunications plan! She also brought up her experience interviewing for data science roles at Microsoft and other large corporations and how much it taught her. We wanted to share what she learned, so we asked if she would write us a guest post! And she said yes! So without further ado, here is
How to Prepare for a Data Science Interview:
If you are here, you probably already have a Data Science interview scheduled and are looking for tips on how to prepare so you can crush it. If that’s the case, congratulations on getting past the first two stages of the recruitment pipeline. You have submitted an application and your resume, and perhaps done a take home test. You’ve been offered an interview and you want to make sure you go in ready to blow the minds of your interviewers and walk away with a job offer. Below are tips to help you prepare for your technical phone screens and on-site interviews.
Read the Job Description for the Particular Position You are Interviewing for
Data Scientist roles are still pretty new and the responsibilities vary wildly across industries and across companies. Look at the skills required and the responsibilities for the particular position you are applying for. Make sure that the majority of these are skills that you have, or are willing to learn. For example, if you know Python, you could easily learn R if that's the language Data Scientists at Company X use. Do you care for web scraping and inspecting web pages to write web crawlers? Does analyzing text using different NLP modules excite you? Do you mostly want to write queries to pull data from SQL and NoSQL databases and analyse/build models based on this data? Set yourself up for success by leveraging your strengths and interests.
Review your Resume before each Stage of the Interviewing Process
Most interviews will start with questions about your background and how it qualifies you for the position. Having these things at the tip of your fingers will allow you to ease into the interview calmly, as you won't be fumbling for answers. Use this time to calm your nerves before the technical questions begin.
Additionally, review your projects and be prepared to talk about the data science process you used to design them. Think about why you chose the tools that you used, the challenges that you encountered, and the things that you learned along the way.
Look at GlassDoor for Past Interview Questions
If you are interviewing for a Data Scientist role at one of the bigger companies, chances are they’ve already interviewed other people before you, who may have shared these questions on GlassDoor. Read them, solve them, get a feel of the questions you will be asked. If you cannot find previous questions for a particular company, solve the data science questions from other companies. They are similar, or at the very least, correlated.
Moreover, even if there are no data science questions for that particular company, see what kind of behavioral questions are asked.
Ask the Recruiter about the Structure of the Interview
Recruiters are often your point of contact with the company you are interviewing at. Ask the recruiter questions about how your interview will be structured, what resources you should use when preparing for your interview, what you should wear to the interview, and even the names of your interviewers so you can look them up on LinkedIn and see their areas of specialization.
Do Mock Interviews
Interviewing can be nerve-racking, more so when you have to whiteboard technical questions. If possible, ask for mock interviews from people who have been through the process before so you know what to expect. If you cannot find someone to do this for you, solve questions on a white board or notebook so you get the feel of writing algorithms some place other than your code editor.
Practice asking questions to understand the scope and constraints of the problem you are solving. Once you are hired, you will not be a siloed data scientist. It is reasonable to bounce around ideas and see if you are on the right track. It is not always about getting the correct answer, which often does not exist, but about how you think through problems, and how you work with other people as well.
Practice the Skills that you Will be Tested On
Your preparation should be informed by the job description and the conversation with recruiters. Study the topics that you know will be on the interview. Look up questions for each area in books and online. Review your statistics, machine learning algorithms, and programming skills.
Additionally, Spring Board has compiled a list of 109 commonly asked Data Science Questions.
KDnuggets also has a list of 21 must know Data Science Interview Questions and Answers.
Follow Up with Thank You Emails
This is probably standard etiquette for any interview but remember to send a personalized thank you email within 24 hours of your interview. Also, if you have thought of the perfect answer to that question you couldn't solve during your interview, include it as well. Don’t forget to express your enthusiasm for the work that Company X does and your desire to work for them.
If you get an offer after your first round of data science interviews, Congratulations! Close this tab and grab a beer. If you are turned down, like most of us are, use the lessons you learned from your past interviews to prepare for your next interviews. Interviews are a good way to identify your areas of weakness, and consequently become a better candidate for future openings. It’s important to stay resilient, patient, and keep a learner’s mindset. Statistically, you probably won't get an offer for each position you apply for. Like the excellent data scientist you are, debug your interviewing process and up your future odds.
Other Great Data Science Blog Posts To Help Make You A Better Data Scientist!
How To Ensure Your Data Science Teams And Projects Succeed!
Why And How To Convince Your Executives To Invest in A Data Science Team?
Data science projects fail all the time! Why is that? Our team of data science consultants has seen many good intentions go wrong because of failure to empower data science teams, locking away access to data, focusing on the wrong problem, and many other problems that could be avoided! We have written up 32 of the reasons we have seen data science projects fail. We are sure there are more and would love to get comments on what your teams have seen! What makes a data science project team succeed?
1. The data scientists aren’t given a voice
Data science and strategy can play very nicely together when allowed! Data scientists are more than just overglorified analysts! They have access to possibly all the data a company owns! That means they know every movement the company has made along with every outcome (if the data was stored correctly). However, they are often left in the basement with the rest of the tech teams, forced to push out reports like any other report developer. There is a reason companies like Amazon and Google continue to do so well! It is because the people with the data have a voice!
2. Starting with the wrong questions.
Let's face it: most technology people focus more on how cool a project is than on how much money it will save the company. This can sometimes lead to the wrong business questions being answered! That will lead to a team quickly failing or losing value inside the company. The goal should be to do as much as possible to hit high-value business targets. That is what keeps data science projects from failing, or at least from going unnoticed.
3. Not addressing the root cause, just trying to improve the effects of a process
One of the most dubious problems, and one that is hard to spot until it is too late, is a data science team that was never looking at the actual cause of the problem. When our data science team comes in, one of the things we assess is how a data science team develops their hypotheses. How far do they dig into the data? How many alternative hypotheses do they consider? What other causes could produce a similar output? An outcome can have a very deep root.
4. Weak stakeholder buy-in
Any project, whether in data science, machine learning, construction, or any other department, will fail without stakeholder buy-in! There needs to be an executive who owns the project. This gives the team acknowledgement for their hard work, and it also ensures that there will be funding! Without funding, a project will come to a dead halt.
5. Lack of access to data
Slightly related to the previous point: locking data or tools away from data scientists is just a waste of time. If a data scientist is forced to spend all day begging DBAs for access, don't expect projects to finish any time soon!
6. Using Faulty/Bad Data
Any data specialist (data engineer, analyst, scientist, architect) will tell all managers the cliché saying: garbage in, garbage out! If the data science team trains a machine learning model on bad data, then it will get bad results. There is no way around it! Even if an algorithm works with 100% accuracy, if all of the data classification is incorrect, then so are the predictions. This will lead to a failed project and executives no longer trusting the data science team.
7. Relying on Excel as the main data storage…or Access
As data science consultants, our team members have come across plenty of analytics and data science projects. Often, because of lack of support, data scientists and analysts have to construct makeshift storage centers because they are not given a sandbox or server to work on. Excel and Access both have their purposes. Managing large sets of data for analytics purposes is not one of them. Don't do that to a data scientist. This will just get you poorly designed systems and high turnover!
8. Having a data scientist build their own ETLs
We have seen ETL systems built in R because, instead of getting an expert ETL developer, a company let the poor data scientists take a crack at it. Don't get us wrong, data scientists are smart people. However, you would much rather have them focus on algorithms and machine learning implementations than on spending all day engineering their own data warehouses.
9. Lack of diverse Subject Matter Experts
Data scientists are great with data, and often with the few subjects that revolve around the data they have worked with. However, data and businesses are very different. Sometimes this means a company needs to pair the data science experts with subject matter experts. Otherwise, they won't have the context to understand complex subjects like manufacturing, pharmaceuticals, and avionics.
10. Poorly assessing a team's skills and knowledge of data science tools
If a data science team doesn't have the skills to work with Hadoop, why would you set up a cluster? It is always good to be aware of a team's skill set first. Otherwise they won't be able to produce products and solutions at the highest level. Data science tools vary, so make sure you look around before you make any solid decisions.
11. Using technologies because they are cool and not because they are useful
Just because you can use certain tools for a problem doesn't mean they are always the best option. We wouldn't recommend R for every problem. It is great for research-type problems that don't need to be implemented in a larger system. If you want a project to get implemented into a larger system, then Python or even C++ might be better (depending on the system). The same goes for Hadoop, MySQL, Tableau, and Power BI. They all have a place. Don't let a team do something just because they can.
12. Lacking an experienced data science leader
Data science is still a new field. That doesn't mean you don't need a leader who has some experience working on a data science team. Without one who has a basic understanding of good data science practices, a data science team could struggle to bring projects to fruition. They won't have a roadmap for success, they will have bad processes, and this will just lead to a slew of other problems.
13. Hiring scientists with limited business understanding
Technology and business are two very different disciplines, and sometimes this leads to employees knowing one subject really well and failing to know the other at all. This is OK if only a small percentage of the data science team is made up of purely research-based employees. It is important to note that the rest should still be very knowledgeable about how to operate in a business. If you want to help them get up to speed quickly, check out this list of “How To Survive Corporate Politics as a Data Scientist”.
14. A boss read one of our blog posts and now thinks he can solve world hunger
Algorithms can't solve every problem, at least not easily! If they could, a lot more problems would be solved by now. Having a boss who simply went to a data science conference and now believes he or she can push the data science team to solve every business gap is not reasonable. Limited resources, complexity of subjects, and unstable processes can quickly destroy any project.
15. The solutions are too complex
One mistake executives and data scientists make is thinking their data science models should be complex. It makes sense, right? Data science is a complex, statistics-based subject. But this is not true all the time! The simpler you can build a model or integrate a machine learning solution, the easier a time the data team will have maintaining the algorithm in the future.
16. Failing to document
Most technology specialists dislike documentation. It takes time, and it isn't building new solutions. However, without good documentation, they will never remember what they did one month ago, let alone a year ago. This means tracking bugs, tracking how programs work, common fixes, playbooks, the whole nine yards. Just because data science teams aren't technically software engineering teams doesn't mean they can step away from documenting how their algorithms work and how they came to their conclusions.
17. The data science team went with every new request from stakeholders (scope creep)
As with any project, data science teams are susceptible to scope creep. Their stakeholders demand new features every week. They add new data points, and dashboard modules. Suddenly, the data science project seems like it can never be finished. You have half a team focused on a project that managers can’t make their minds up on. Then it will never succeed.
18. Poorly designed models that are not robust or maintainable:
Even well documented bad systems lead to quick failures. Data science projects have lots of moving pieces: data flowing through ETLs, dashboards, websites, automated reports, QA suites, and so on. Any of these pieces can take a while to develop, and if developed badly, even longer to fix! Nothing is worse than spending an entire FTE on maintaining systems that should be able to run automatically. So spend enough time planning up front that you are not stuck with terrible legacy code.
19. Disagreement on enterprise strategy.
When it comes down to it, data science offers a huge advantage for corporate strategy when implemented well. That also means the projects being done by some of the more experienced data scientists need to closely align with directors' and executives' strategy. Strategies change, so these projects need to come out fast and be focused on maximizing the decision making of executives. If you are producing a dashboard focused on growth, but the executive team is trying to focus on rebranding, you are wasting time and money!
20. Big data silos or vendor owned data!
You know what is terrible? When data is owned by a vendor. This makes it so hard for data science teams to actually analyze their company's data, especially if the vendor offers a bad API, or none at all, or worse, charges you just to use it. To get at your own company's data! Imagine a poor data science budget going to buy back the data! Similarly, if all the data is in silos, it is almost impossible for a data scientist to bring it all together. There are rarely crosswalks or data standards, so they are often stuck hopelessly staring at lots of manual work to make the data relate.
21. Problem avoidance (ignoring the elephant in the room!)
We have all done it! Even data scientists! We know the company has a major problem; it's the elephant in the room and it could be solved. However, it might be part of company culture, or a problem that no one discusses because it is like the emperor's new clothes. This is sometimes the best place for a data science team to focus.
22. The data science team hasn’t built trust with stakeholders
Let's be honest. Even if a team develops a 100% accurate algorithm with accurate data, if it has not been working to build executive trust the entire time, then the project will fail. Why? Because every actionable insight the project provides will be questioned and never implemented.
23. Failing to communicate the value of the data science project
One of the problems our data science consulting team has seen is teams failing to explain the value of a project. This requires...data! You have to use financial numbers, resources saved, competitive advantage gained, etc., to prove to the executives why the project is worth it! Data scientists, use data to help prove your own point!
24. Lack of a standardized data science process
No matter how good the data scientists are, without some form of standardization, a team will eventually fail. This may be because a team has to scale and can’t or because a team member leaves. All of this will cause a once working machine to fail.
25. If You Failed To Plan, Plan to Fail
When it comes down to it, there needs to be some amount of planning in data science projects. You can't just find some data sources, make assumptions, and attempt to implement some new piece of software without first analyzing the situation! This might take a few weeks, and the executives should give you this time if they really want a sustainable piece of software.
26. The data science team competes with other departments (rather than working together)
For some reason or another, office politics exist. Data scientists, because they are placed in a position to help develop strategies and dashboards for the entire company, can accidentally step on the toes of every other department. This might take away work from other analysts entirely. In turn, this might start fights. So make sure the data science team shares its work and shows how its projects are helping rather than hurting!
27. Allowing company bias to form conclusions before the data scientists start
Data bias does exist! As a data scientist you can make algorithms and data say whatever you want them to sometimes. However, that doesn’t make it true. Make sure you don’t go into the project with a biased hypothesis that will push you towards early conclusions that might be incorrect.
28. Trying to take on too large of a first project
Reading the news about what Google and Facebook are doing with their algorithms may tempt the data science team to take on too large of a project for their first projects. This will not lead to success. You might be lucky and succeed. However, you are taking a huge risk!
29. Manually classifying data
One part of data science that not everyone talks about is data classification. Not just using SVM and KNN algorithms; we mean actually labeling what the data represents. A human has to do that first. Otherwise, the computer will never know how to. If you don't have a plan for how to classify the data before it gets to the data science team, then someone will have to do it manually. That is one quick way to lose data scientists and have projects fail.
30. Failing to understand what went wrong
Data science projects don't always succeed. The data science team needs to be able to explain why. As long as it wasn't a huge drain on the capital budget, executives should understand. After all, projects do fail; it is natural. That doesn't give you an excuse not to know why.
31. Waiting to seek outside help until it is too late
Sometimes the data science team is short on staff; other times you just need new insight. Whatever it might be, the data science team needs to make sure it seeks outside help sooner rather than later. Putting off asking for help when you know you need it will just lead to awkward conversations with management. They might not want to spend the money, but they also want the project to succeed.
32. Fail to provide actionable insights and opinions
Finally, the data science team's project needs to provide actual insight, something actionable. Simply providing a correlation doesn't do any good. Executives need decisions, or data to make decisions. If you don't give them that, you might as well not have a data science team.
If you have any questions, please feel free to comment below! Let us know how we can help!
We are a team of data scientists and network engineers who want to help your functional teams reach their full potential!