More than 160 years ago, John Snow, an English physician, created a geospatial map showing the spread of cholera along streets served by a contaminated water pump in London. That map helped disprove the theory that miasma, or "bad air," was the cause of cholera, and it provided evidence for his alternative hypothesis: cholera was spread by microbes in the water. One of the more interesting data points he had to work with was the case of a woman who lived far from the clusters of the disease but who had somehow contracted it. Snow discovered that the woman especially liked the water from that part of town and would have it delivered to her.
Visualizing data through maps and graphs during early exploratory data analysis can give rise to hypotheses that can then be tested with further analysis and statistical work. It can reveal patterns that are not immediately obvious from looking at a raw data set. Fortunately, unlike in Snow's day, tools now exist that automate and simplify exploratory data analysis. While some correlations have to be teased out through complex ML algorithms, others reveal themselves easily in the early stages of exploratory data analysis.
Below is a walk-through of an exploratory data analysis of UN malnutrition data.
Data analysis begins with data, and there are various ways to collect it depending on your project. It may be as simple as downloading a CSV file from a census or UN website, receiving access to an internal database from a partner organization, or scraping the internet for data. A friend working on a skateboard trick identifier went skating with four friends and set up cameras at four different angles to get the training set of images he needed for his algorithm.
Depending on the amount of data you are working with, it may make sense to use a local hard drive or to use cloud storage. Even when you don’t have a lot of data, it may make sense to use cloud services so you may learn the different stacks currently in production.
Accessing the Data:
Exploratory Data Analysis: Use a Notebook
Below is an EDA walkthrough of UNICEF undernutrition data, which is publicly available from UNICEF.
Using Graphs for EDA, Global Malnutrition Case Study
What country has the highest malnutrition levels? What has been the malnutrition trend in this country?
The malnutrition graph above made me wonder what was happening in Georgia in 1991 and 1992, and I learnt that this was when the Georgian Civil War occurred. That really piqued my interest, because Kenya is in the middle of re-electing a president, which in the past has led to ethnic conflicts. I plotted Kenya's malnutrition graph and noticed that the peaks coincide with elections and post-election violence.
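The kind of question asked above (which country has the highest malnutrition levels, and how do they trend over time) reduces to a group-and-aggregate. Here is a minimal sketch with pandas, run on a tiny made-up stand-in for the UNICEF table; the column names and numbers are illustrative, not the real data:

```python
import pandas as pd

# Toy stand-in for the UNICEF undernutrition table (values are invented)
data = pd.DataFrame({
    "country": ["Georgia", "Georgia", "Kenya", "Kenya", "Norway"],
    "year":    [1991, 1992, 1992, 2008, 1992],
    "undernutrition_pct": [28.0, 31.0, 22.0, 20.0, 2.0],
})

# Average malnutrition level by country, then the worst offender
by_country = data.groupby("country")["undernutrition_pct"].mean()
worst = by_country.idxmax()
print(worst)  # the country with the highest mean in this toy sample

# The per-country trend is just a filtered, year-sorted slice,
# ready to hand to a plotting library
kenya_trend = data[data["country"] == "Kenya"].sort_values("year")
```

From here, plotting `year` against `undernutrition_pct` per country is what surfaces events like the ones discussed above.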
Although data sets vary in the number of columns and rows, the types of data they contain, the spread of the data, and so on, basic EDA tools can provide an inroad into a data set before more complex analysis begins.
Python for Data Analysis
Great Future Data Science Reads!
A Guide To Designing Data Science Projects
How Machine Learning Algorithms Learn Bias
8 Great Python Libraries For Machine Learning
Basic Data Science And Statistics That Every Data Scientists Should Know
Why Use Data Science?
Guest written by Rebecca Njeri!
What is a Decision Tree?
Let’s start with a story. Suppose you have a business and you want to acquire some new customers. You also have a limited budget, and you want to ensure that, in advertising, you focus on customers who are the most likely to be converted.
How do you figure out who these people are? You need a classification algorithm that can identify these customers and one particular classification algorithm that could come in handy is the decision tree. A decision tree, after it is trained, gives a sequence of criteria to evaluate features of each new customer to determine whether they will likely be converted.
To start off, you can use data you already have on your existing customers to build a decision tree. Your data should include all the customers, their descriptive features, and a label that indicates whether they converted or not.
The idea of a decision tree is to divide the data set into smaller data sets based on the descriptive features until you reach a small enough set that contains data points that fall under one label.
Each internal (parent) node of the tree tests a feature of the data set, and the leaf (child) nodes represent the outcomes. The decision on which feature to split on is based on the resulting entropy reduction, or information gain, from the split.
Classification problems for decision trees are often binary: True or False, Male or Female. However, decision trees can also be used to solve multi-class classification problems where the labels are [0, …, K-1], or, for this example, [‘Converted customer’, ‘Would like more benefits’, ‘Converts when they see funny ads’, ‘Won’t ever buy our products’].
Using Continuous Variables to Split Nodes in a Decision Tree
Continuous features are turned into categorical ones (i.e. less than or greater than a certain value) before a split at a node. Because there could be infinitely many boundaries for a continuous variable, the boundary is chosen that results in the most information gain.
For example, if we wanted to classify quarterbacks versus defensive ends on the Seahawks using weight, 230 pounds would probably be a more appropriate boundary than 150 pounds. Trivial fact: the average weight of a quarterback is about 225 pounds, while that of a defensive end is about 255 pounds.
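Picking that boundary can be automated: try candidate thresholds (say, the midpoints between adjacent sorted values) and keep the one with the highest information gain. A rough sketch, where the weights and labels below are invented for illustration, not real roster data:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Midpoint between each pair of adjacent sorted values,
    keeping the one with the highest information gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best, best_gain = None, -1.0
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        t = (v1 + v2) / 2
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        gain = base - (len(left) / len(pairs)) * entropy(left) \
                    - (len(right) / len(pairs)) * entropy(right)
        if gain > best_gain:
            best, best_gain = t, gain
    return best

# Illustrative weights (pounds): quarterbacks cluster near 225, ends near 255
weights = [210, 218, 225, 230, 248, 255, 260, 265]
labels  = ["QB", "QB", "QB", "QB", "DE", "DE", "DE", "DE"]
print(best_threshold(weights, labels))  # 239.0, between the two clusters
```

The chosen boundary lands between the two clusters, which is exactly why 230 pounds is a more sensible split than 150 pounds in the example above.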
What is Entropy/Information Gain?
Shannon’s entropy model is a computational measure of the impurity of the elements in a set. The goal of the decision tree is to produce subsets that minimize impurity. To go back to our story, we start with a set of the general population that may see our ad. The data set is then split on different variables until we arrive at subsets where everyone either buys the product or does not buy the product. Ideally, after traversing the decision tree to its leaves, we arrive at pure subsets: every customer in a leaf has the same label.
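Here is the entropy calculation from the story made concrete. The labels and the "clicked a previous ad" feature below are invented for illustration:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# Toy version of the story: did the customer convert after seeing the ad?
parent = ["buy", "buy", "buy", "no", "no", "no", "no", "no"]
print(round(entropy(parent), 3))          # 0.954 bits: an impure set

# Split on a descriptive feature, e.g. "clicked a previous ad"
clicked     = ["buy", "buy", "buy", "no"]
not_clicked = ["no", "no", "no", "no"]   # a pure subset: entropy 0

weighted = (len(clicked) / len(parent)) * entropy(clicked) \
         + (len(not_clicked) / len(parent)) * entropy(not_clicked)
gain = entropy(parent) - weighted
print(round(gain, 3))                     # 0.549 bits of information gain
```

The split with the largest drop in weighted entropy (the largest information gain) is the one the tree chooses at each node.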
Advantages of Decision Trees
Disadvantages of Decision Trees
Pruning is a method of limiting tree depth to reduce overfitting in decision trees. There are two types of pruning: pre-pruning, and post-pruning.
Pre-pruning a decision tree involves setting the parameters of the tree before building it. There are a few ways to do this:
To post-prune, validate the performance of the model on a test set. Afterwards, cut back splits that seem to result from overfitting noise in the training set. Pruning these splits dampens the noise in the data set.
*Post-pruning may itself result in overfitting the model to the held-out set used for pruning
*Post-pruning was not available in Python’s scikit-learn at the time of writing, but it is available in R.
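For reference, here is a minimal pre-pruning sketch, assuming scikit-learn is available. The parameters `max_depth`, `min_samples_split`, and `min_samples_leaf` are the knobs set before the tree is built; the toy data and labels are made up:

```python
from sklearn.tree import DecisionTreeClassifier

# Toy customer data: [age, clicked_previous_ad]; labels: converted or not
X = [[25, 0], [32, 1], [47, 1], [51, 0], [62, 1], [23, 0], [44, 1], [36, 0]]
y = [0, 1, 1, 0, 1, 0, 1, 0]

tree = DecisionTreeClassifier(
    max_depth=3,           # stop growing after 3 levels
    min_samples_split=2,   # a node needs at least 2 samples to split
    min_samples_leaf=1,    # every leaf keeps at least 1 sample
    random_state=0,
)
tree.fit(X, y)
print(tree.get_depth())  # never more than 3, by construction
```

Tightening these limits trades training accuracy for a simpler tree that generalizes better.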
Creating ensembles involves aggregating the results of different models. Ensembles of decision trees are used in bagging and random forests, while ensembles of regression trees are used in boosting.
Bagging involves creating multiple decision trees, each trained on a different bootstrap sample of the data. Because bootstrapping involves sampling with replacement, some of the data is left out of each tree's sample.
Consequently, the trees are built from different samples, which mitigates overfitting to any single training sample. Ensembling decision trees in this way helps reduce the total error, because the variance of the model continues to decrease with each new tree added without an increase in the bias of the ensemble.
A bag of decision trees that uses subspace sampling is referred to as a random forest: only a subset of the features is considered at each node split, which decorrelates the trees in the forest.
Another advantage of random forests is that they have a built-in validation mechanism. Because only a percentage of the data is used for each model, an out-of-bag estimate of the model's performance can be calculated using the roughly 37% of the sample left out of each model.
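Why roughly 37%? When sampling n items with replacement, each item is left out of the sample with probability (1 - 1/n)^n, which approaches 1/e ≈ 0.368 as n grows. A quick simulation of one bootstrap draw:

```python
import random

random.seed(0)
n = 100_000
# One bootstrap sample: n draws with replacement from n items
sample = {random.randrange(n) for _ in range(n)}
left_out = 1 - len(sample) / n
print(round(left_out, 3))  # close to 1/e ~ 0.368
```

Those left-out rows are exactly what the out-of-bag error is computed on.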
Boosting involves aggregating a collection of weak learners (regression trees) to form a strong predictor. A boosted model is built over time by adding new trees that minimize the error made by the previous learners. This is done by fitting each new tree on the residuals of the previous trees.
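Here is a bare-bones sketch of that residual-fitting loop, using one-split "stumps" as the weak learners. The data, learning rate, and round count are arbitrary illustrations, not a production recipe:

```python
def fit_stump(x, residuals):
    """Best single threshold on x, predicting the mean residual on each side."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - (lm if xi <= t else rm)) ** 2
                  for xi, r in zip(x, residuals))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

# Toy regression data with two obvious regimes
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 1.1, 3.0, 3.2, 3.1]

pred = [sum(y) / len(y)] * len(x)            # start from the global mean
for _ in range(10):                          # 10 boosting rounds
    residuals = [yi - pi for yi, pi in zip(y, pred)]
    stump = fit_stump(x, residuals)          # fit the NEXT learner on residuals
    pred = [pi + 0.5 * stump(xi)             # shrink its contribution
            for pi, xi in zip(pred, x)]      # (learning rate 0.5)

print([round(p, 2) for p in pred])           # close to y after a few rounds
```

Each round, the new stump corrects what the ensemble so far still gets wrong, which is the essence of boosting.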
If it isn’t clear by now: for many real-world applications, a single decision tree is not a preferable classifier, as it is likely to overfit and generalize poorly to new examples. However, an ensemble of decision or regression trees minimizes this disadvantage, and these models are stellar, state-of-the-art classification and regression algorithms.
A Complete Tutorial on Tree Based Modeling from Scratch (in R & Python)
Fundamentals of Machine Learning for Predictive Data Analytics
A Guide To Designing A Data Science Project
Top 8 Python Libraries for Machine Learning
Basic Statistics For Data Scientists
Recently, our team of data consultants had an awesome opportunity to present to a class of future data scientists at Galvanize Seattle. One student who came to hear our talk was Rebecca Njeri. Below, she shares tips on how to design a Data Science project.
To Begin, Brainstorm Data Project Ideas
To begin your data science project, you will need an idea to work on. Start by brainstorming possible ideas that interest you. During this process, go as wide and as crazy as you can; don’t censor yourself. Once you have a few ideas, narrow them down to the most feasible or interesting one. You could brainstorm ideas around these prompts:
Questions To Help You Think Of Your Next Data Science Projects
Write a proposal:
Write a proposal along the lines of the Cross Industry Standard Process for Data Mining (CRISP-DM), which has the following steps:
What are the business needs you are trying to address? What are the objectives of the data science project? For example, if you are at a telecommunications company that needs to retain its customers, can you build a model that predicts churn? Maybe you are interested in using live data to better predict which coupons to offer which customers at the grocery store.
What kind of data is available to you? Is it stored in a relational or NoSQL database? How large is it? Can it be stored and processed on your hard drive, or will you need cloud services? Are there any confidentiality issues or NDAs involved if you are working in partnership with a company or organization? Can you find a new data set online that you could merge in to increase your insights?
This stage involves doing a little exploratory data analysis and thinking about how your data will fit into your chosen model. Is the data stored in data types that are compatible with the model? Are there missing values or outliers? Are these naturally occurring discrepancies, or errors that should be corrected before fitting the data to a model? Do you need to create dummy variables for categorical variables? Will you need all the variables in the data set, or are some dependent on each other?
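A couple of the preparation steps above, imputing a missing value and dummy-coding a categorical variable, sketched with pandas; the column names and values are invented:

```python
import pandas as pd

df = pd.DataFrame({
    "age":  [34, None, 51],                  # one missing value
    "plan": ["basic", "premium", "basic"],   # a categorical variable
})

# Impute the missing age with the column median
df["age"] = df["age"].fillna(df["age"].median())

# Dummy-code the categorical column into one indicator column per level
df = pd.get_dummies(df, columns=["plan"])
print(sorted(df.columns))  # ['age', 'plan_basic', 'plan_premium']
```

Whether median imputation (versus dropping rows or a model-based fill) is appropriate depends on why the values are missing, which is exactly the judgment call this stage is about.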
Choose a model and tune its parameters before fitting it to your training set. Python’s scikit-learn library is a good place to find model algorithms. With larger data, consider using Spark ML.
Withhold a test set of data to evaluate the model's performance. Data Science Central has a great post on different metrics that can be used to measure model performance. The confusion matrix can help with considering the cost-benefit implications of the model's performance.
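One way to turn a confusion matrix into a cost-benefit read is to weigh each cell by the profit or cost of that outcome. All counts and dollar figures below are made up for illustration:

```python
# Confusion-matrix counts from evaluating a model on a held-out test set
confusion = {"tp": 50, "fp": 10, "fn": 5, "tn": 100}

# Hypothetical payoff (in dollars) of each kind of outcome
payoff = {"tp": 20.0,    # profit from a correctly targeted customer
          "fp": -4.0,    # wasted ad spend on a non-converter
          "fn": -20.0,   # conversion the model missed
          "tn": 0.0}     # correctly skipped, nothing spent

expected_value = sum(confusion[k] * payoff[k] for k in confusion)
print(expected_value)  # 50*20 - 10*4 - 5*20 = 860.0
```

Comparing this expected value across candidate models (or across classification thresholds) lets you pick the one that is best for the business, not just the one with the highest accuracy.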
Deployment and implementation are some of the key components of any data driven project. You have to get past the theory and algorithms and actually integrate your data science solution into the larger environment.
Flask and Bootstrap are great tools to help you deploy your data science project to the world.
Planning Your Data Science Projects
Keep a timeline with To Do, In Progress, Completed, and Parking sections. Have a self-scrum (lol) each morning to review what you accomplished the previous day and set a goal for the new day. It could also help to find a friend to scrum with and help you keep track of your metrics. Goals and metrics can help you hold yourself accountable and ensure that you actually follow through and get your project done.
Track your Progress
Create a GitHub repo for your project. Your proposal can be incorporated as the README. Commit your work at whatever frequency is comfortable, and keep track of how much progress you are making on your metrics. A repo will also make it easier to show your code to friends or mentors for a code review.
Knowing When to Stop Your Project
It may be good to work on your project with a minimum viable product in mind. You may not get all the things on your To Do list accomplished, but having an MVP can help you know when to stop. When you have learned as much as you can from a project, even if you don’t have the perfect classification algorithm, it may be more worthwhile to invest in a new project.
Some Examples Of Data Driven Projects
Below are some links to Github repos of some Data Science Capstones:
Predicting Change in Rental Price Units in NYC
All the best with your new Data Science project! Feel free to reach out if you need someone to help you plan your new project.
Want to be further inspired on your next data-driven project?
Check out some of our other data science and machine learning articles. You never know what might inspire you.
Practical Data Science Tips
Creatively Classify Your Data
25 Tips To Gain New Customers
How To Grow Your Data Science Or Analytics Practice
Come join our team of data scientists and machine learning experts as we discuss ethical machine learning at DAML (Data Analytics and Machine Learning) at Redfin. Our presentation will be followed by Josh Poduska, a Senior Data Scientist in HPE’s Big Data Software Group, who will be discussing machine learning on distributed systems.
We are very excited for the opportunity to present and can’t wait to see you there! It is 100% free and food is provided. Free data science and machine learning talks plus free food? What more do you need?
Click Here To RSVP to DAMLs Machine Learning Talk on August 24th For Free
Ethical Machine Learning
Non-technical companies are slowly finding ways to increase their business value using the increased speed of computing and statistics. The problem is that business has always been more concerned with the bottom line than with social impact. It is one thing when we joke about large e-commerce sites selling us that extra toaster. But what about when companies whose products have been proven harmful reach out to data scientists and attempt to have them develop systems that increase the profit for a product with a negative social impact, or when companies use data science to manipulate the customer rather than benefit them? Should we do it? Is it right to forget about the social impact just to make an extra dollar?
Machine Learning on Distributed Systems
Most real-world data science workflows require more than multiple cores on a single server to meet scale and speed demands, but there is a general lack of understanding of what machine learning on distributed systems looks like in practice. Gartner and Forrester do not consider distributed execution when they score advanced analytics software solutions, and much formal machine learning training occurs on single-node machines with non-distributed algorithms. In this talk we discuss why an understanding of distributed architectures is important for anyone in the analytical sciences. We will cover the current distributed machine learning ecosystem, review common pitfalls when performing machine learning at scale, and discuss architectural considerations for a machine learning program, such as the roles of storage and compute and the circumstances under which they should be combined or separated.
Feel free to read some of our other blog posts as well!
Best Python Libraries for Machine Learning
Automating Your Data Science Workflow
Should We Start A Data Science Team?
Recently, our team of data consultants had an awesome opportunity to present to a class of future data scientists at Galvanize Seattle. It was a lot of fun, and we met a lot of ex-software developers and IT specialists. One student who came to hear our talk was Rebecca Njeri. She did not have a background in software engineering. However, she was clearly well adapted to the new world. In fact, for one of her projects she used company data to create a recidivism prediction model for former inmates using supervised learning models.
How do Machine Learning Algorithms Learn Bias?
There are funny mishaps that result from imperfectly trained machine learning algorithms. Like my friend’s iPhone classifying his dog as a cat. Or the two guys stuck in a voice-activated elevator that doesn’t understand their accent. Or Amazon’s Alexa trying to order hundreds of dollhouses because it confused a news anchor’s report for a request from its owner. There are also the memes on the Amazon Whole Foods purchase, which are truly in the spirit of defective algorithms.
Bezos: "Alexa, buy me something from Whole Foods."
Alexa: "Buying Whole Foods."
Bezos: "Wait, what?"
The Data Science Capstone
For my final capstone for the Galvanize Data Science Immersive, I spent a lot of time exploring the concept of algorithmic bias.
I had partnered with an organization that helps former inmates go back to school, and consequently lowers their probability of recidivating. My task was to help them figure out the total cost of incarceration, i.e. both the explicit and the implicit costs of someone being incarcerated.
While researching this concept, I stumbled upon ProPublica’s Machine Bias essay, which discusses how risk assessment algorithms contain racial bias. I learnt that an algorithm that returns disproportionate false positives for African Americans is being used to hand down longer prison sentences and deny parole; that tax dollars are being spent on incarcerating people who would otherwise be out in society as productive members of the community; and that children whose parents shouldn’t be in prison are in the foster care system.
An algorithm with disparate impact is causing people to lose jobs and their social networks, and creating the worst possible cold-start problem once someone has been released from prison. At the same time, people likely to commit crimes in the future go free, because the algorithm is blind to their criminality.
How do these false positives and negatives occur and does it matter? To begin with, let us define three concepts related to the Confusion Matrix: precision, recall, and accuracy.
Precision is the percentage of a model's positive predictions that are actually correct: true positives divided by all predicted positives. High precision means few false positives. If you search for Harry Potter books on Google, precision is the share of the returned results that really are Harry Potter titles; when precision is low, sifting through irrelevant results is a nuisance and a terrible user experience, and a user who does not see relevant results will likely not make any purchases, which eventually hurts the bottom line.
Recall, on the other hand, is the percentage of actual positives the model manages to find: true positives divided by all real positive cases. In the same search example, recall would be the number of Harry Potter titles returned divided by seven. High recall means few false negatives. A medical diagnostic tool should have very high recall, because failing to catch an illness gives it time to worsen; in such a time-sensitive situation, the goal is to minimize the number of false negatives returned. Similarly, if a security breach from one of your employees is pending, you would want a high-recall model for predicting the culprit, to ensure that a) you stop the breach, and b) your staff suffers minimal interruptions while you find this person. Ideally recall is 1, but pushing recall up usually pulls precision down, so the two have to be traded off against each other.
Accuracy is a measure of all the correct predictions as a percentage of the total predictions. Accuracy does poorly as a measure of model performance especially where you have unbalanced classes.
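The three definitions written out from raw confusion-matrix counts, using an invented, unbalanced toy case:

```python
def metrics(tp, fp, fn, tn):
    """Precision, recall, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)       # of the alarms raised, how many were right
    recall = tp / (tp + fn)          # of the real positives, how many we found
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy

# Unbalanced toy case: 90 negatives, 10 positives.
# The model finds 6 of the positives and raises 2 false alarms.
p, r, a = metrics(tp=6, fp=2, fn=4, tn=88)
print(p, r, a)  # 0.75 0.6 0.94

# Note the unbalanced-class trap: a model that predicts "negative" for
# everyone would score 0.90 accuracy here while catching nothing at all.
```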
For precision, recall, accuracy, and confusion matrices to make sense to begin with, the training data should be representative of the population such that the model learns how to classify correctly.
Confusion matrices are the basis of cost-benefit matrices, aka the bottom line. For a business, the bottom line is easy to understand through profit and loss analysis. I suppose it’s a lot more complex to determine the bottom line where discrimination against protected classes is involved.
And yet, perhaps it is more urgent and necessary to do this work. There is increased scrutiny on the products we are creating and the biases will be visible and have consequences for our companies.
Machine Learning Bias Caused By Source Data
The largest proportion of machine learning work is collecting and cleaning the data that is fed to a model. Data munging is not fun, and thinking about sampling, outliers, and the population distribution of a training set can be boring, tedious work. Indeed, machines learn bias from the oversights that occur during data munging.
With 2.5 exabytes of data generated every day, there is no shortage of data on which to train our models. There are faces of different colors, with and without glasses, wide eyes and narrow eyes, brown eyes and green eyes.
There are male and female voices, and voices with different accents. Not being culturally aware of the structure of the data set can result in models that are blind or deaf to certain demographics, marginalizing parts of our user groups. Like when Google mistakenly tagged photos of black people as an album of gorillas. Or when air bags meant to protect passengers put women at risk of death during an accident. These false positives, i.e. concluding that someone will be safe when they are actually at risk, cost people’s lives.
Earlier this year, one of my friends, a software engineer, asked a career adviser whether it would be better to use her gender-neutral middle name on her resume and LinkedIn to make her job search easier. Her fear isn’t baseless; there are insurmountable conscious and unconscious gender biases in the workplace. There was even a case where a man and a woman switched email signatures for a short period and saw drastic differences in the way they were treated.
How to Reduce Machine Learning Bias
However, if we are to teach machines to crawl LinkedIn and resumes, we have the opportunity to scientifically remove the discrimination we humans are unable to overcome. Biased risk assessment algorithms result from models being trained on data that is historically biased. It is possible to intervene and address the historical biases contained in the data such that the model remains aware of gender, age and race without discriminating against or penalizing any protected classes.
The data that seeds a reinforcement learning model can lead to drastically excellent or terrible results. Exponential improvement could mean self-driving cars that get better with each new ride; exponential depreciation could convince a man of the truth of a non-existent sex trafficking ring in D.C.
How do machines learn bias? We teach machines bias through biased training data.
If you enjoyed this piece on data science and machine learning, feel free to check out some of our other work!
Why Data Science Projects Fail
When Data Science Implementation Goes Wrong
Data Science Consulting Process
Recently, our team of data science consultants had an awesome opportunity to present to a class of future data scientists at Galvanize Seattle. It was a lot of fun, and we met a lot of ex-software developers and IT specialists. One student who came to hear our talk was Rebecca Njeri. She did not have a background in software engineering. However, she was clearly well adapted to the new world. In fact, for one of her projects she used company data to create a recidivism prediction model for former inmates using supervised learning models.
We love the fact that her project was not just technically challenging, but that it was geared towards a bigger purpose than selling toasters or keeping customers from quitting your telecommunications plan! She also brought up her experience interviewing for data science roles at Microsoft and other large corporations and how much it taught her. We wanted to share what she learned, so we asked if she would write us a guest post, and she said yes! So without further ado, here is
How to Prepare for a Data Science Interview:
If you are here, you probably already have a Data Science interview scheduled and are looking for tips on how to prepare so you can crush it. If that’s the case, congratulations on getting past the first two stages of the recruitment pipeline. You have submitted an application and your resume, and perhaps done a take home test. You’ve been offered an interview and you want to make sure you go in ready to blow the minds of your interviewers and walk away with a job offer. Below are tips to help you prepare for your technical phone screens and on-site interviews.
Read the Job Description for the Particular Position You are Interviewing for
Data Scientist roles are still pretty new, and the responsibilities vary wildly across industries and companies. Look at the skills required and the responsibilities for the particular position you are applying for. Make sure that the majority of these are skills that you have, or are willing to learn. For example, if you know Python, you could easily learn R if that’s the language Data Scientists at Company X use. Do you care for web scraping and inspecting web pages to write web crawlers? Does analyzing text using different NLP modules excite you? Do you mostly want to write queries to pull data from SQL and NoSQL databases and analyze or build models based on this data? Set yourself up for success by leveraging your strengths and interests.
Review your Resume before each Stage of the Interviewing Process
Most interviews will start with questions about your background and how it qualifies you for the position. Having these things at your fingertips will allow you to ease into the interview calmly, as you won't be fumbling for answers. Use this time to calm your nerves before the technical questions begin.
Additionally, review your projects and be prepared to talk about the Data Science process you used to design your project. Think about why you chose the tools that you used, the challenges that you encountered along the way, and the things that you learned along the way.
Look at GlassDoor for Past Interview Questions
If you are interviewing for a Data Scientist role at one of the bigger companies, chances are they’ve already interviewed other people before you, who may have shared these questions on GlassDoor. Read them, solve them, get a feel of the questions you will be asked. If you cannot find previous questions for a particular company, solve the data science questions from other companies. They are similar, or at the very least, correlated.
Moreover, even if there are no data science questions for that particular company, see what kind of behavioral questions are asked.
Ask the Recruiter about the Structure of the Interview
Recruiters are often your point of contact with the company you are interviewing at. Ask the recruiter how your interview will be structured, what resources you should use when preparing, what you should wear, and even the names of your interviewers so you can look them up on LinkedIn and see their areas of specialization.
Do Mock Interviews
Interviewing can be nerve-racking, more so when you have to whiteboard technical questions. If possible, ask for mock interviews from people who have been through the process before so you know what to expect. If you cannot find someone to do this for you, solve questions on a white board or notebook so you get the feel of writing algorithms some place other than your code editor.
Practice asking questions to understand the scope and constraints of the problem you are solving. Once you are hired, you will not be a siloed data scientist, so it is reasonable to bounce ideas around and check whether you are on the right track. It is not always about getting the correct answer, which often does not exist, but about how you think through problems and how you work with other people.
Practice the Skills that you Will be Tested On
Your preparation should be informed by the job description and the conversation with recruiters. Study the topics that you know will be on the interview. Look up questions for each area in books and online. Review your statistics, machine learning algorithms, and programming skills.
Additionally, Spring Board has compiled a list of 109 commonly asked Data Science Questions.
KDnuggets also has a list of 21 must know Data Science Interview Questions and Answers.
Follow Up with Thank You Emails
This is probably standard etiquette for any interview but remember to send a personalized thank you email within 24 hours of your interview. Also, if you have thought of the perfect answer to that question you couldn't solve during your interview, include it as well. Don’t forget to express your enthusiasm for the work that Company X does and your desire to work for them.
If you get an offer after your first round of data science interviews, Congratulations! Close this tab and grab a beer. If you are turned down, like most of us are, use the lessons you learned from your past interviews to prepare for your next interviews. Interviews are a good way to identify your areas of weakness, and consequently become a better candidate for future openings. It’s important to stay resilient, patient, and keep a learner’s mindset. Statistically, you probably won't get an offer for each position you apply for. Like the excellent data scientist you are, debug your interviewing process and up your future odds.
Other Great Data Science Blog Posts To Help Make You A Better Data Scientist!
How To Ensure Your Data Science Teams And Projects Succeed!
Why And How To Convince Your Executives To Invest in A Data Science Team?
Data science projects fail all the time! Why is that? Our team of data science consultants has seen many good intentions go wrong because of failure to empower data science teams, locking away access to data, focusing on the wrong problem, and many other avoidable issues! We have written up 32 of the reasons we have seen data science projects fail. We are sure there are more, and we would love to get comments on what your teams have seen! What makes a data science project team succeed?
1. The data scientists aren’t given a voice
Data science and strategy can play very nicely together when allowed! Data scientists are more than just overglorified analysts! They have access to possibly all the data a company owns! That means they know every movement the company has made, along with every outcome (if the data was stored correctly). However, they are often left in the basement with the rest of the tech teams, forced to push out reports like any other report developer. There is a reason companies like Amazon and Google continue to do so well: the people with the data have a voice!
2. Starting with the wrong questions.
Let’s face it: most technology people focus more on how cool a project is than on how much money it will save the company. This can lead to the wrong business questions being answered, which in turn leads to a team quickly failing or losing value inside the company. The goal should be to do as much as possible to hit high-value business targets. That is what keeps data science projects from failing, or at least from going unnoticed.
3. Not addressing the root cause, just trying to improve the effect of a process
One of the most dubious failures, and hardest to spot until it is too late, is a data science team that was never looking at the actual cause of the problem. When our data science team comes in, one of the things we assess is how a data science team develops its hypotheses. How far do they dig into the data? How many false hypotheses do they consider? What about other causes that could produce a similar output? An outcome can have a very deep root.
4. Weak stakeholder buy-in
Any project, whether data science, machine learning, construction, or any other department's, will fail without stakeholder buy-in! There needs to be an executive who owns the project. This gives the team acknowledgement for their hard work, and it also ensures that there will be funding! Without funding, a project will come to a dead halt.
5. Lack of access to data
Slightly attached to the previous point. Locking data scientists away from their tools or data is just a waste of time. If data scientists are forced to spend all day begging DBAs for access, don’t expect projects to finish any time soon!
6. Using Faulty/Bad Data
Any data specialist (data engineer, analyst, scientist, architect) will tell managers the cliché: garbage in, garbage out! If the data science team trains a machine learning model on bad data, then it will get bad results. There is no way around it! Even if an algorithm works with 100% accuracy, if all of the data classification is incorrect, then so are the predictions. This will lead to a failed project and executives no longer trusting the data science team.
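The garbage-in, garbage-out point can be made concrete with a small, hypothetical scikit-learn sketch (the toy dataset and the inverted labels are our own invention, not from any real project): a model trained on backwards labels learns an exactly wrong boundary, even though the algorithm itself runs flawlessly.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy dataset: one feature that cleanly separates two classes.
X = rng.normal(size=(1000, 1))
y = (X[:, 0] > 0).astype(int)

# "Garbage in": every label was recorded backwards.
y_garbage = 1 - y

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
X_tr_g, _, y_tr_g, _ = train_test_split(X, y_garbage, random_state=0)

clean = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)
garbage = LogisticRegression().fit(X_tr_g, y_tr_g).score(X_te, y_te)
print(f"trained on clean labels:   {clean:.2f}")    # near 1.0
print(f"trained on garbage labels: {garbage:.2f}")  # near 0.0
```

The algorithm is "working" perfectly in both runs; only the labels changed, and so did every prediction.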
7. Relying on Excel as the main data storage…or Access
As data science consultants, our team members have come across plenty of analytics and data science projects. Oftentimes, because of a lack of support, data scientists and analysts have to construct makeshift storage centers because they are not given a sandbox or server to work on. Excel and Access both have their purposes. Managing large sets of data for analytics is not one of them. Don’t do that to your data scientists. It will just produce poorly designed systems and high turnover!
8. Having a data scientist build their own ETLs
We have seen ETL systems built in R because, instead of getting an expert ETL developer, a company let the poor data scientists take a crack at it. Don’t get us wrong, data scientists are smart people. However, you would much rather have them focus on algorithms and machine learning implementations than spend all day engineering their own data warehouses.
9. Lack of diverse Subject Matter Experts
Data scientists are great with data, and often with a few subjects that revolve around the data they have worked with. However, every data set and business is different. Sometimes this means a company needs to partner its data scientists with subject matter experts. Otherwise, they won’t have the context to understand complex subjects like manufacturing, pharmaceuticals, and avionics.
10. Poorly assessing a team's skills and knowledge of data science tools
If a data science team doesn’t have the skills to work with Hadoop, why would you set up a cluster? It is always good to be aware of a team's skill set first. Otherwise they won’t be able to produce products and solutions at the highest level. Data science tools vary, so make sure you look around before you make any solid decisions.
11. Using technologies because they are cool and not useful
Just because you can use certain tools for a problem doesn’t mean they are always the best option. We wouldn’t recommend R for every problem. It is great for research-type problems that don’t need to be implemented. If you want a project to get implemented into a larger system, then Python or even C++ might be better (depending on the system). The same goes for Hadoop, MySQL, Tableau, and Power BI. They all have a place. Don’t let a team do something just because they can.
12. Lacking an experienced data science leader
Data science is still a new field. That doesn’t mean you don’t need a leader who has some experience working on a data science team. Without one who has a basic understanding of good data science practices, a data science team could struggle to bring projects to fruition. They won’t have a roadmap for success, they will have bad processes, and this will just lead to a slew of other problems.
13. Hiring scientists with limited business understanding
Technology and business are two very different disciplines, and sometimes this leads to employees knowing one subject really well and failing to know the other at all. This is OK if only a small percentage of the data science team is made up of purely research-based employees. It is important that the rest are still very knowledgeable about how to act in a business. If you want to help them get up to speed quickly, check out this list of “How To Survive Corporate Politics as a Data Scientist”.
14. A boss read one of our blog posts and now thinks he can solve world hunger
Algorithms can’t solve every problem, at least not easily! If they could, a lot more problems would be solved by now. A boss who simply went to a data science conference and now believes he or she can push the data science team to close every business gap is not being reasonable. Limited resources, complex subjects, and unstable processes can quickly destroy any project.
15. The solutions are too complex
One mistake executives and data scientists make is thinking their data science models should be complex. It makes sense, right? Data science is a complex, statistics-based subject. But this is not always true! The simpler you can build a model or integrate a machine learning solution, the easier time the data team will have maintaining the algorithm in the future.
16. Failing to document
Most technology specialists dislike documentation. It takes time, and it isn’t building new solutions. However, without good documentation, they will never remember what they did a month ago, let alone a year ago. This means tracking bugs, tracking how programs work, common fixes, playbooks, the whole nine yards. Just because data science teams aren’t technically software engineering teams doesn’t mean they can step away from documenting how their algorithms work and how they came to their conclusions.
17. The data science team went along with every new request from stakeholders (scope creep)
As with any project, data science teams are susceptible to scope creep. Stakeholders demand new features every week. They add new data points and dashboard modules. Suddenly, the data science project seems like it can never be finished. Half the team is focused on a project that managers can’t make up their minds about. It will never succeed.
18. Poorly designed models that are not robust or maintainable
Even well-documented bad systems lead to quick failures. Data science projects have lots of moving pieces: data flowing through ETLs, dashboards, websites, automated reports, QA suites, and so on. Any one of these can take a while to develop, and if developed badly, even longer to fix! Nothing is worse than spending an entire FTE maintaining systems that should run automatically. So spend enough time planning up front that you are not stuck with terrible legacy code.
19. Disagreement on enterprise strategy.
When it comes down to it, data science offers a huge advantage for corporate strategy when implemented well. That also means the projects being done by the more experienced data scientists need to align closely with directors' and executives' strategy. Strategies change, so these projects need to come out fast and stay focused on maximizing executives' decision making. If you are producing a dashboard focused on growth while the executive team is trying to focus on rebranding, you are wasting time and money!
20. Big data silos or vendor owned data!
You know what is terrible? When data is owned by a vendor. This makes it very hard for data science teams to actually analyze their company's data, especially if the vendor offers a bad API, none at all, or worse, charges you just to use it. To get your own company's data! Imagine a poor data science budget going to buy back the data! Similarly, if all the data is in silos, it is almost impossible for data scientists to bring it together. There are rarely crosswalks or data standards, so they are often stuck staring hopelessly at lots of manual work just to make the data relate.
21. Problem avoidance (ignoring the elephant in the room!)
We have all done it! Even data scientists! We know the company has a major problem; it’s the elephant in the room, and it could be solved. However, it might be part of company culture, or a problem no one discusses, like the emperor’s new clothes. This is sometimes the best place for a data science team to focus.
22. The data science team hasn’t built trust with stakeholders
Let’s be honest. Even if a team develops a 100% accurate algorithm with accurate data, if it has not been working to build executive trust the entire time, the project will fail. Why? Because every actionable insight the project provides will be questioned, and never implemented.
23. Failing to communicate the value of the data science project
One of the problems our data science consulting team has seen is teams failing to explain the value of a project. This requires...data! You have to use financial numbers, resources saved, competitive advantage gained, etc., to prove to the executives why the project is worth it! Data scientists: use data to help prove your point!
24. Lack of a standardized data science process
No matter how good the data scientists are, without some form of standardization, a team will eventually fail. This may be because a team has to scale and can’t or because a team member leaves. All of this will cause a once working machine to fail.
25. If You Failed To Plan, Plan to Fail
When it comes down to it, there needs to be some amount of planning in data science projects. You can’t just grab some data sources, make assumptions, and attempt to implement a new piece of software without first analyzing the situation! This might take a few weeks, and executives should give you that time if they really want a sustainable piece of software.
26. The data science team competes with other departments(rather than working together)
For some reason or another, office politics exist. Data scientists can accidentally walk over every other department because they are placed in a position to help develop strategies and dashboards for the entire company. This might take work away from other analysts completely. In turn, this might start fights. So make sure the data science team shares its work and shows how its projects are helping rather than hurting!
27. Allowing company bias to form conclusions before the data scientists start
Data bias does exist! As a data scientist, you can sometimes make algorithms and data say whatever you want them to. However, that doesn’t make it true. Make sure you don’t go into the project with a biased hypothesis that will push you towards early conclusions that might be incorrect.
28. Taking on too large of a first project
Reading the news about what Google and Facebook are doing with their algorithms may tempt the data science team to take on too large a first project. This rarely leads to success. You might get lucky and succeed. However, you are taking a huge risk!
29. Manually classifying data
One part of data science that not everyone talks about is data classification. Not just using SVM and KNN algorithms; we mean actually labeling what the data represents. Some human has to do that first. Otherwise, the computer will never know how to. If you don’t have a plan for how to classify the data before it gets to the data science team, then someone will have to do it manually. That is one quick way to lose data scientists and have projects fail.
30. Failing to understand what went wrong
Data science projects don’t always succeed. The data science team needs to be able to explain why. As long as it wasn’t a huge drain on the capital budget, executives should understand. After all, projects do fail; it is natural. That doesn’t give you an excuse not to know why.
31. Waiting to seek outside help until it is too late
Sometimes the data science team is short on staff; other times you just need new insight. Whatever it might be, the data science team needs to seek outside help sooner rather than later. Putting off asking for help when you know you need it will just lead to awkward conversations with management. They might not want to spend the money, but they also want the project to succeed.
32. Failing to provide actionable insights and opinions
Finally, the data science team's project needs to provide actual insight, something actionable. Simply providing a correlation doesn’t do any good. Executives need decisions, or data to make decisions. If you don’t give them that, you might as well not have a data science team.
If you have any questions, please feel free to comment below! Let us know how we can help!
As more companies turn towards data science and data consulting to help gain a competitive advantage, we are getting more people asking us: why should we start a data science team? We wanted to take a moment and write down some reasons why your company needs to start one. It is not just a fad anymore; it is becoming a need!
Maintain Competitive Advantage
As more and more companies integrate data science and better analytics into their day-to-day strategy, making data-driven decisions will no longer be a competitive advantage. It will be a necessity. Executives will have to rely on accurate data and sustainable metrics to make consistently correct decisions. If your company moves based on the actual why, rather than speculation and surface-level problems, it can make a greater and more effective impact internally and externally.
Better Understanding of Current and Possible New Customers
Data science gives executives and managers a deeper understanding of why customers make the choices they do. Google has had the greatest opportunity here, as it has almost become a third lobe of some people’s brains. We tell it when we are happy, sad, looking for love, studying, etc. It knows what our questions are. However, Google is not alone; other companies are beginning to realize that their customers have been telling them their opinions on multiple social platforms and blogs. They just need to go look and learn how to better hear their customers through data.
This is a great opportunity for corporations to find out how their customers feel about their products and their company's image, and what else they are purchasing. Sometimes this may require purchasing third-party data. Even then, it might be worth it, depending on the projects being done.
Better Understanding of Internal Operations
Whether it is HR, finance, accounting or operations, data science is helping tie all these fields together and paint a better picture of what is happening in a company. Why are people leaving, why was there so much overtime, how can a company be better at utilizing resources, and so on. These questions can be answered by taking high quality data and blending it to find out the whys. This in turn will provide a better work place environment and increase resource efficiency.
Increases Performance Through Data Driven Decisions
Data science is still a relatively new field. Many companies are still figuring out how to properly implement data science solutions. However, those that do are seeing amazing results. Look no further than Amazon or Uber. These companies have changed, and are changing, the way we view certain industries. Why? Because they know what their customers want, they know what the industry is doing wrong, and they know how to charge the customer less but give them more.
Overall, increasing data science and analytics proficiency allows your executives to trust their data more and make clearer decisions. Consider looking into a team today!
Classifying Data Creatively
In data science we have plenty of classification algorithms: SVM, logistic regression, and random forest, to name a few. These all require that the data sets come pre-labeled.
Who does that pre-labeling? How are you supposed to get hundreds of thousands of rows labeled if they don’t naturally come that way?
Part of a data scientist's job is not just algorithms and models, but also figuring out how the data will be designed and gathered. If your initiative is new, then it is time to put your thinking cap on and come up with some creative ways to collect data. The four examples below are a great place to start!
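To make the dependency on labels concrete, here is a minimal scikit-learn sketch, using the bundled iris data purely for illustration: every classifier named above exposes the same fit(X, y) contract, and none of them can be trained without the y.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# iris ships pre-labeled; real-world data rarely does.
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = [SVC(), LogisticRegression(max_iter=1000),
          RandomForestClassifier(random_state=0)]
for model in models:
    model.fit(X_tr, y_tr)  # impossible without the labels y_tr
    print(type(model).__name__, round(model.score(X_te, y_te), 2))
```

The interesting question, and the subject of the rest of this post, is where that y column comes from when your data does not arrive with one.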
Calculate your Classification
In some cases, you are lucky. The label can be calculated from a metric. For instance, one project we worked on involved readmission of patients. The hospitals were trying to reduce their 30-day readmission rate. Here classification is easy: yes, this patient was readmitted before the 30-day threshold, or no, they were not.
These are the easiest and often the most straightforward labels. However, this relies on the data you have already containing the signal you are looking for, which is far from always true! This is why you must plan your data science project out ahead of time. You need to ensure your product is already logging the right data points so that you can gain insights. Otherwise, why track the data?
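As a sketch of the readmission label described above, here is one way it might be computed with pandas. The table and column names are hypothetical, not from the actual hospital project:

```python
import pandas as pd

# Hypothetical admissions table (column names are our own invention).
admissions = pd.DataFrame({
    "patient_id": [1, 1, 2, 3, 3],
    "admit_date": pd.to_datetime([
        "2023-01-01", "2023-01-20", "2023-02-01",
        "2023-03-01", "2023-05-15",
    ]),
}).sort_values(["patient_id", "admit_date"])

# Days until the same patient's next admission (NaN if there is none).
days_to_next = (
    admissions.groupby("patient_id")["admit_date"].shift(-1)
    - admissions["admit_date"]
).dt.days

# The label: was this stay followed by a readmission within 30 days?
admissions["readmitted_30d"] = days_to_next <= 30
print(admissions)
```

A few lines of group-by logic turn raw admission dates into a ready-made classification target, which is exactly why calculated labels are the easy case.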
Create A Service or Crowd Source
Google and Netflix are not only really good at machine learning and data science. They are also really good at having consumers do all their labeling and classification for them. Google's AutoDraw and Local Guides are great examples where Google gamified the process of data collection. We may think we are simply drawing silly stick figures, or helping others find good food. However, Google is able to take all that information and create even more services and tools with it.
Google learns even more from search. Perhaps you have a date coming up and you want to know where to take the person. You probably google “Best date places in city”. You just told Google you are dating someone and looking for a place to eat in a specific city. Then you probably click through a few places Google recommends, which lets Google guess what salary you might be making, and finally, you probably put the address into Google Maps. The amount of information Google can deduce and classify from this one interaction is pretty large, and you gave it freely. Of course, Google isn't alone in creating services that gather data.
For those of you who remember the old Netflix queue, back when you only got movies through the mail, there was little to no product recommendation. So how did Netflix get all the classification data? Simple: we were inclined to put the movies in the order we wanted them, to ensure we got the right movie at the right time. With all this data, Netflix was able to provide better recommendations to users, and now we can't stop watching. Who knows, it might not be long before they can stick scripts into an algorithm and produce award-winning movie scripts in minutes.
Why does this work? Because it is convenient for users and saves the company money.
Pay for it
There are plenty of companies out there from which you can actually purchase data. It might come in a little messy (we say that from experience), and it won't be cheap. However, it can be quite beneficial. We have worked in situations where companies bought data from credit card companies. This can help you tie people in real life to the data you have on them in your systems. Maybe you want to know what these people buy, or how much money your average user makes. This classification is almost impossible to get unless you send out a survey, and even then, people have to answer honestly. Thus, purchasing data is not a bad way to go.
The hardest part of this method is matching people on both sides of the data, unless of course you have clear keys like social security numbers, or maybe full names, birthdays, and sex. Those are usually the best ways to match two very different data sets of people.
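A hedged sketch of that matching step with pandas, using made-up tables and column names; any real purchase of third-party data will need far more cleanup than this:

```python
import pandas as pd

# Hypothetical internal users table and purchased third-party records.
users = pd.DataFrame({
    "full_name": ["Ada Lovelace", "Alan Turing"],
    "birthday": ["1815-12-10", "1912-06-23"],
    "sex": ["F", "M"],
    "user_id": [101, 102],
})
purchased = pd.DataFrame({
    "full_name": [" ada lovelace ", "Grace Hopper"],
    "birthday": ["1815-12-10", "1906-12-09"],
    "sex": ["F", "F"],
    "avg_monthly_spend": [120.0, 80.0],
})

# Normalize the join keys first -- purchased data is rarely clean.
for df in (users, purchased):
    df["full_name"] = df["full_name"].str.strip().str.lower()

# Inner join on the best available keys: name + birthday + sex.
matched = users.merge(purchased,
                      on=["full_name", "birthday", "sex"], how="inner")
print(matched[["user_id", "avg_monthly_spend"]])
```

Note that only one of the two users matches; everything the vendor knows about non-matching people is money spent for nothing, which is why key quality decides whether buying data pays off.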
There is one final method of classifying and labeling data. However, it is the most time consuming for your employees.
Have some poor analyst do it
This last example is real, it happens a lot, and you just feel bad for those poor souls. We were recently conversing with a data scientist from a local e-commerce company, and they told us how they had a group of analysts labeling specific HTML data for 36,000 web pages. It is a slow and error-prone process. Truthfully, this is the last way a company should do it. Sometimes, though, there are no other options.
Truthfully, part of the strategy when implementing a data science project should include how the data will be gathered. Sadly, not everyone is Google, Apple, or Facebook, which track not only our online interactions but where we go. We take our phones with us everywhere, and they just continue to gather and reap more data.
To Data Science Program Managers and Leadership
If you're a data science team program manager, make sure to think about how you're planning to classify your data ahead of time! Sometimes, you just get the data and have to use it. However, if you are lucky enough to be able to set up your own project, be creative! Think about scale and how you can use the power of people and users to classify your data for you. You might even be able to charge for your product or service.
If you need help devising a data gathering or integration strategy, we would love to work with you! Check out our data science services or just send us a question. We have data scientists and programmers with many different backgrounds and specialties. We would love to help.
Is your company trying to figure out who should become data scientists and how to start a team? You are not alone; even Amazon and Airbnb are starting internal universities to teach more of their teams the value of data science. Maybe your company needs help setting up some internal classes to increase your data science and machine learning skill sets. Acheron provides multiple forms of internal education programs. They can be for managers or analysts. One form is a quick guide to running a data science team! This is for managers and executives who are starting, or already have, a data science team and want to ensure they are getting the best return on investment from their team and that their team members all feel challenged!
We took one subsection out and wanted to share a common question we get when we talk to executives: who are data scientists, and who should become one? One such client told us they have loads of scientists but weren't sure how to turn them into data scientists, or who in their cohorts should really become one.
Below we will go over some of the top soft skills data scientists should have, and what type of personality someone should have before they enroll in some form of data science program, whether internal or external, like Galvanize or a university data science certificate. In the end, data science is a skill that companies will need to harness to keep up with competitors who are already successfully implementing data science into their upper-level strategy.
Who are Data Scientists?
Data scientists have to be driven individuals. They must not only be technically savvy, they also need to be proactively aware of their company's nuances. If they happen to see a correlation or pattern, they will seek out how to access the data required and will bring possible projects up to their manager.
Being driven is great, especially when combined with curiosity. Data scientists love to ask why, and they don't stop until they find the root cause. They are great at pinpointing the actual patterns in the noise. This is a necessary skill for peeling apart the complexity and relationships various data sets may have. Occasionally, an individual may have a curious mind but lack the drive to act upon their inquiries.
Tolerance of Failure
Data science has a lot of similarities to the sciences, in the sense that there might be 99 failed hypotheses that lead to one successful solution. Some data-driven companies only expect their machine learning engineers and data scientists to create new algorithms or correlations every year to year and a half, depending on the size of the task and the type of implementation required (e.g. process, technical, policy, etc.). This means a data scientist must be willing to fail fast and often, similar to the agile methodology. They have to constantly test, retest, and prove that their algorithms are correct.
Communication and Data Storytelling
The term data storyteller has become synonymous with data scientist. This skill fits under the general heading of communication. Data scientists have access to multiple data sources from various departments. This gives them the responsibility, and the need, to clearly explain what they are discovering to executives and SMEs in multiple fields. That means taking complex mathematical and technological concepts and creating clear and concise messages that executives can act upon. Not hiding behind jargon, but actually translating their complex ideas into business speak.
Creative and Abstract Thinking
Creativity and abstract thinking help data scientists better hypothesize the possible patterns and features they are seeing in their initial exploration phases. Combining logical thinking with minimal data points, data scientists can work their way to several possible solutions. However, this requires thinking outside the box.
Data scientists have to be able to take large problems, like which ad to show to which customer, and, based on hundreds of variables, effectively find the right solution. This means taking a larger problem and breaking it down into its smallest parts, getting rid of noise and variables that don't help form a clear pattern. This can sometimes be a messy process. Being able to stay focused on the bigger problem is key.
Who Should Become a Data Scientist
The skills required to be a data scientist are constantly evolving and many companies are trying to find out how to train new data scientists. In the end, the real question is, who should become a data scientist?
Data science requires constant learning. Not just of technology; it also requires constantly learning new fields, specialties, and situations, especially as data science solutions integrate further into more and more departments of corporations. Becoming familiar with only one set of vocabulary and processes is not an option. Without some bearing in each field, a data scientist is limited in the hypotheses and logical assumptions they need to make.
If you are searching for a data scientist inside your company, they are probably already attempting to push into the field. With all the online material, classes, and meet-ups available, such an individual will already have taken steps to get more involved. If someone merely talks about it but never acts, they will act similarly on a new project or idea.
There is some requirement for computational or technical ability. Excel is a great tool, but there is a need to use more powerful and customizable tools. This includes programming, data visualization, and data storage tools. There is no need to be a software engineer. However, data scientists should have a general idea of how to make sure code is maintainable, robust, and scalable.
Looking to start a data science team?
If you are looking to start a team of your own, feel free to comment or email us! We can do everything from pointing you in the right direction of readings, if you want to do it yourself, to coming and joining you on your journey! Also, feel free to follow our blog. We will keep it up to date as we do new projects and answer new questions about data science! If you email us a question, we will try to post about it!
We are a team of data scientists and network engineers who want to help your functional teams reach their full potential!