Recently, our team of data consultants had an awesome opportunity to present to a class of future data scientists at Galvanize Seattle. It was a lot of fun and we met a lot of ex-software developers and IT specialists. One student who had come to hear our talk was named Rebecca Njeri. She did not have a background in software engineering. However, she was clearly well adapted to the new world. In fact, for one of her projects she used company data to create a recidivism prediction model among former inmates using supervised learning models.
How do Machine Learning Algorithms Learn Bias?
There are funny mishaps that result from imperfectly trained machine learning algorithms. Like my friend’s iPhone classifying his dog as a cat. Or these two guys stuck on a voice activated elevator that doesn’t understand their accent. Or maybe Amazon’s Alexa trying to order hundreds of dollhouses because it confuses the news anchor’s report for a request from its owner. There are also the memes on the Amazon Whole Foods purchase, which are truly in the spirit of defective algorithms.
“Bezos: "Alexa, buy me something from Whole Foods."
Alexa: "Buying Whole Foods."
Bezos: "Wait, what?"”
The Data Science Capstone
For my final capstone for the Galvanize Data Science Immersive, I spent a lot of time exploring the concept of algorithmic bias.
I had partnered with an organization that helps former inmates go back to school, and consequently lowers their probability of recidivating. The task I had was to help them figure out the total cost of incarceration, i.e. both the explicit and implicit costs of someone being incarcerated.
While researching this concept, I stumbled upon ProPublica’s Machine Bias essay that discusses how risk assessment algorithms contain racial bias. I learnt that an algorithm that returns disproportionate false positives for African Americans is being used to sentence them to longer prison sentences and deny them parole, that tax dollars are being spent on incarcerating people who would otherwise be out in society as productive members of the community, and that children whose parents shouldn’t be in prison are in the foster care system.
An algorithm that has disparate impact causes people to lose their jobs and their social networks, and ensures the worst cold start problem once someone has been released from prison. At the same time, people likely to commit crimes in the future are allowed to go free because the algorithm is blind to their criminality.
How do these false positives and negatives occur and does it matter? To begin with, let us define three concepts related to the Confusion Matrix: precision, recall, and accuracy.
Precision is the number of true positives as a percentage of all positive predictions. High precision means that when the model labels something as positive, it is usually right, so few false positives are returned. For example, if a security breach from one of your employees is pending, you’d like a precise model to predict who the culprit will be to ensure that you a) stop the breach, and b) have minimal interruptions to your staff while trying to find this person.
Recall, on the other hand, is the percentage of relevant elements returned, i.e. the true positives as a percentage of all actual positives. For example, if you search for Harry Potter books on Google, recall will be the number of Harry Potter titles returned divided by seven. High recall means that you catch as many of the actual positives as possible. A medical diagnostic tool should have high recall because not catching an illness can cause the illness to worsen. In such a time-sensitive situation, the goal is to minimize the number of false negatives returned.
Ideally we will have a recall of 1 without giving up precision. In the search example, low precision is a nuisance and a terrible user experience, because the user has to sift through irrelevant search results. Low recall means a user does not see relevant results and will likely not make any purchases, which eventually could hurt the bottom line.
Accuracy is a measure of all the correct predictions as a percentage of the total predictions. Accuracy does poorly as a measure of model performance especially where you have unbalanced classes.
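To make the three definitions concrete, here is a minimal sketch in plain Python. The counts are made up for illustration and the names (ConfusionCounts, always_negative, screening_model) are hypothetical. It also shows why accuracy can look great on unbalanced classes while recall is zero.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """The four cells of a binary confusion matrix."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives
    tn: int  # true negatives


def precision(c: ConfusionCounts) -> float:
    # Share of positive predictions that were correct.
    predicted_positive = c.tp + c.fp
    return c.tp / predicted_positive if predicted_positive else 0.0


def recall(c: ConfusionCounts) -> float:
    # Share of actual positives that the model found.
    actual_positive = c.tp + c.fn
    return c.tp / actual_positive if actual_positive else 0.0


def accuracy(c: ConfusionCounts) -> float:
    # Share of all predictions that were correct.
    total = c.tp + c.fp + c.fn + c.tn
    return (c.tp + c.tn) / total if total else 0.0


# Hypothetical unbalanced data: 1,000 cases, only 10 actual positives.
# A "model" that always predicts negative still scores 99% accuracy.
always_negative = ConfusionCounts(tp=0, fp=0, fn=10, tn=990)

# A hypothetical model that finds 8 of the 10 positives at the cost of
# 12 false alarms.
screening_model = ConfusionCounts(tp=8, fp=12, fn=2, tn=978)

for name, c in [("always negative", always_negative),
                ("screening model", screening_model)]:
    print(f"{name}: precision={precision(c):.2f} "
          f"recall={recall(c):.2f} accuracy={accuracy(c):.3f}")
```

The always-negative model wins on accuracy but never catches a single positive case, which is why accuracy alone says little when the classes are unbalanced.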
For precision, recall, accuracy, and confusion matrices to make sense to begin with, the training data should be representative of the population such that the model learns how to classify correctly.
Confusion matrices are the basis of cost-benefit matrices, aka the bottom line. For a business, the bottom line is easy to understand through profit and loss analysis. I suppose it’s a lot more complex to determine the bottom line where discrimination against protected classes is involved.
And yet, perhaps it is more urgent and necessary to do this work. There is increased scrutiny on the products we are creating and the biases will be visible and have consequences for our companies.
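As a rough illustration of how a confusion matrix turns into a cost-benefit matrix, the sketch below multiplies each outcome count by a dollar value. Every number here is invented for the example, and the function name expected_value_per_case is hypothetical. Putting a number on the cost of a discriminatory false positive is the hard part that this simple arithmetic does not solve.

```python
# Hypothetical outcome counts from a confusion matrix.
counts = {"tp": 80, "fp": 20, "fn": 10, "tn": 890}

# Hypothetical dollar value of each outcome (benefits positive, costs negative).
value = {"tp": 50.0, "fp": -10.0, "fn": -40.0, "tn": 0.0}


def expected_value_per_case(counts: dict, value: dict) -> float:
    """Average value per prediction: sum(count * value) / total cases."""
    total = sum(counts.values())
    return sum(counts[k] * value[k] for k in counts) / total


print(f"Expected value per case: ${expected_value_per_case(counts, value):.2f}")
```

Changing the cost assigned to false positives or false negatives changes which model looks best, which is exactly where judgments about harm to protected classes enter the calculation.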
Machine Learning Bias Caused By Source Data
The largest proportion of machine learning is collecting and cleaning the data that is fed to a model. Data munging is not fun, and thinking about sampling and outliers and population distributions of the training set can be boring, tedious work. Indeed, machines learn bias from the oversights that occur during data munging.
With 2.5 exabytes of data generated every day, there is no shortage of data on which to train our models. There are faces of different colors, with and without glasses, wide eyes and narrow eyes, brown eyes and green eyes.
There are male and female voices, and voices with different accents. Not being culturally aware of the structure of the data set can result in models that are blind or deaf to certain demographics, thus marginalizing part of our user groups. Like when Google mistakenly tagged black faces as an album of gorillas. Or when air bags meant to protect passengers put women at risk of death during an accident. These false positives, i.e. the conclusion that you will be safe when you will actually be at risk, cost people’s lives.
Earlier this year, one of my friends, a software engineer, asked the career adviser if it would be better to use her gender neutral middle name for her resume and LinkedIn to make her job search easier. Her fear isn’t baseless; there are insurmountable conscious and unconscious gender biases at the workplace. There was even a case where a man and woman switched emails for a short period and saw drastic differences in the way they were being treated.
How to Reduce Machine Learning Bias
However, if we are to teach machines to crawl LinkedIn and resumes, we have the opportunity to scientifically remove the discrimination we humans are unable to overcome. Biased risk assessment algorithms result from models being trained on data that is historically biased. It is possible to intervene and address the historical biases contained in the data such that the model remains aware of gender, age and race without discriminating against or penalizing any protected classes.
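Before intervening, it helps to measure. The following is a minimal sketch, on made-up records, of one possible check: comparing the false positive rate across groups. The function name and the data are hypothetical, and this is not the method used in any particular study.

```python
from collections import defaultdict


def false_positive_rate_by_group(records):
    """records: iterable of (group, actual, predicted) with 0/1 labels.

    Returns {group: false positives / actual negatives} for each group
    that has at least one actual negative.
    """
    false_positives = defaultdict(int)
    actual_negatives = defaultdict(int)
    for group, actual, predicted in records:
        if actual == 0:
            actual_negatives[group] += 1
            if predicted == 1:
                false_positives[group] += 1
    return {g: false_positives[g] / n for g, n in actual_negatives.items()}


# Made-up records: (group, actually reoffended, predicted high risk).
records = [
    ("A", 0, 1), ("A", 0, 1), ("A", 0, 0), ("A", 0, 0), ("A", 1, 1),
    ("B", 0, 1), ("B", 0, 0), ("B", 0, 0), ("B", 0, 0), ("B", 1, 1),
]

rates = false_positive_rate_by_group(records)
for group, rate in sorted(rates.items()):
    print(f"group {group}: false positive rate = {rate:.2f}")
```

In this toy data, people in group A who did not reoffend are flagged as high risk twice as often as those in group B. A gap like that is the kind of signal that would prompt a closer look at the training data.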
The data that seeds a reinforcement learning model can lead to drastically excellent or terrible results. Exponential improvement could lead to increasingly better-performing self-driving cars that improve with each new ride, while exponential deterioration could convince a D.C. man of the truth of a non-existent sex trafficking ring in D.C.
How do machines learn bias? We teach machines bias through biased training data.
If you enjoyed this piece on data science and machine learning, feel free to check out some of our other works!
Why Data Science Projects Fail
When Data Science Implementation Goes Wrong
Data Science Consulting Process
Recently, our team of data science consultants had an awesome opportunity to present to a class of future data scientists at Galvanize Seattle. It was a lot of fun and we met a lot of ex-software developers and IT specialists. One student who had come to hear our talk was named Rebecca Njeri. She did not have a background in software engineering. However, she was clearly well adapted to the new world. In fact, for one of her projects she used company data to create a recidivism prediction model among former inmates using supervised learning models.
We love the fact that her project was not just technically challenging, but that it was geared towards a bigger purpose than selling toasters or keeping customers from quitting your telecommunication plan! She also brought up her experience interviewing for data science roles at Microsoft and other large corporations and how it taught her so much. We wanted to share what she learned so we asked if she would write us a guest post! And she said yes! So without further ado, here is
How to Prepare for a Data Science Interview:
If you are here, you probably already have a Data Science interview scheduled and are looking for tips on how to prepare so you can crush it. If that’s the case, congratulations on getting past the first two stages of the recruitment pipeline. You have submitted an application and your resume, and perhaps done a take home test. You’ve been offered an interview and you want to make sure you go in ready to blow the minds of your interviewers and walk away with a job offer. Below are tips to help you prepare for your technical phone screens and on-site interviews.
Read the Job Description for the Particular Position You are Interviewing for
Data Scientist roles are still pretty new and the responsibilities vary wildly across industries and across companies. Look at the skills required and the responsibilities for the particular position you are applying for. Make sure that the majority of these are skills that you have, or are willing to learn. For example, if you know Python, you could easily learn R if that’s the language Data Scientists at Company X use. Do you care for web-scraping and inspecting web pages to write web-crawlers? Does analyzing text using different NLP modules excite you? Do you mostly want to write queries to pull data from SQL and NoSQL databases and analyse/build models based on this data? Set yourself up for success by leveraging your strengths and interests.
Review your Resume before each Stage of the Interviewing Process
Most interviews will start with questions about your background and how that qualifies you for the position. Having these things at your fingertips will allow you to ease into the interview calmly as you won't be fumbling for answers. Use this time to calm your nerves before the technical questions begin.
Additionally, review your projects and be prepared to talk about the Data Science process you used to design your project. Think about why you chose the tools that you used, the challenges that you encountered, and the things that you learned along the way.
Look at GlassDoor for Past Interview Questions
If you are interviewing for a Data Scientist role at one of the bigger companies, chances are they’ve already interviewed other people before you, who may have shared these questions on GlassDoor. Read them, solve them, get a feel of the questions you will be asked. If you cannot find previous questions for a particular company, solve the data science questions from other companies. They are similar, or at the very least, correlated.
Moreover, even if there are no data science questions for that particular company, see what kind of behavioral questions are asked.
Ask the Recruiter about the Structure of the Interview
Recruiters are often your point of contact with the company you are interviewing at. Ask the recruiter questions about how your interview will be structured, what resources you should use when preparing for your interview, what you should wear to the interview, and even the names of your interviewers so you can look them up on LinkedIn and see their areas of specialization.
Do Mock Interviews
Interviewing can be nerve-racking, more so when you have to whiteboard technical questions. If possible, ask for mock interviews from people who have been through the process before so you know what to expect. If you cannot find someone to do this for you, solve questions on a white board or notebook so you get the feel of writing algorithms some place other than your code editor.
Practice asking questions to understand the scope and constraints of the problem you are solving. Once you are hired, you will not be a siloed data scientist. It is reasonable to bounce around ideas and see if you are on the right track. It is not always about getting the correct answer, which often does not exist, but about how you think through problems, and how you work with other people as well.
Practice the Skills that you Will be Tested On
Your preparation should be informed by the job description and the conversation with recruiters. Study the topics that you know will be on the interview. Look up questions for each area in books and online. Review your statistics, machine learning algorithms, and programming skills.
Additionally, Springboard has compiled a list of 109 commonly asked Data Science Questions.
KDnuggets also has a list of 21 must know Data Science Interview Questions and Answers.
Follow Up with Thank You Emails
This is probably standard etiquette for any interview but remember to send a personalized thank you email within 24 hours of your interview. Also, if you have thought of the perfect answer to that question you couldn't solve during your interview, include it as well. Don’t forget to express your enthusiasm for the work that Company X does and your desire to work for them.
If you get an offer after your first round of data science interviews, congratulations! Close this tab and grab a beer. If you are turned down, like most of us are, use the lessons you learned from your past interviews to prepare for your next interviews. Interviews are a good way to identify your areas of weakness, and consequently become a better candidate for future openings. It’s important to stay resilient and patient, and to keep a learner’s mindset. Statistically, you probably won't get an offer for each position you apply for. Like the excellent data scientist you are, debug your interviewing process and up your future odds.
Other Great Data Science Blog Posts To Help Make You A Better Data Scientist!
How To Ensure Your Data Science Teams And Projects Succeed!
Why And How To Convince Your Executives To Invest in A Data Science Team?
Data science projects fail all the time! Why is that? Our team of data science consultants have seen many good intentions go wrong because of failure to empower data science teams, locking away access to data, focusing on the wrong problem, and many other problems that could be avoided! We have written 32 of the reasons we have seen data science projects fail. We are sure there are more and would love to get comments on what your teams have seen! What makes a data science project team succeed?
1. The data scientists aren’t given a voice
Data science and strategy can play very nicely together when allowed! Data scientists are more than just glorified analysts! They have access to possibly all the data a company owns! That means they know every movement the company has made with every outcome (if the data was stored correctly). However, they are often left in the basement with the rest of the tech teams, forced to push out reports like any other report developer. There is a reason companies like Amazon and Google continue to do so well! It is because the people with the data have a voice!
2. Starting with the wrong questions.
Let’s face it. Most technology people often focus more on how cool a project is, not how much money it will save the company. This can sometimes lead to the wrong business questions being answered! This will lead to a team quickly either failing, or losing value inside of the company. The goal should be to do as much to hit high value business targets as possible. That is what keeps data science projects from failing or, at least, going unnoticed.
3. Not addressing the root cause, just trying to improve the effect of a process
One of the most insidious problems, and one that is hard to spot until it is too late, is a data science team not even looking at the actual cause of the problem. When our data science team comes in, one of the things we assess is how a data science team develops its hypotheses. How far do they dig into the data? How many false hypotheses do they think of? What about other causes that could produce a similar output? An outcome can have a very deep root.
4. Weak stakeholder buy-in
Any project, whether in data science, machine learning, construction, or any other department, will fail without stakeholder buy-in! There needs to be an executive to own the project. This gives a team acknowledgement for their hard work and it also ensures that there will be funding! Without funding, a project will come to a dead halt.
5. Lack of access to data
This is slightly attached to the previous point. Locking access away from data scientists, whether it be to tools or to data, is just a waste of time. If a data scientist is forced to spend all day begging DBAs for access, don’t expect projects to finish any time soon!
6. Using Faulty/Bad Data
Any data specialist (data engineer, analyst, scientist, architect) will tell all managers the cliche saying: garbage in, garbage out! If the data science team trains a machine learning model on bad data, then it will get bad results. There is no way around it! Even if an algorithm works with 100% accuracy, if all of the data classification is incorrect, then so are the predictions. This will lead to a failed project and executives no longer trusting the data science team.
7. Relying on Excel as the main data storage... or Access
As data science consultants, our team members have come across plenty of analytics and data science projects. Oftentimes, because of lack of support, data scientists and analysts have to construct makeshift storage centers because they are not given a sandbox or server to work on. Excel and Access both have their purposes. Managing large sets of data for analytics purposes is not one of them. Don’t do that to a data scientist. It will just get you poorly designed systems and high turnover!
8. Having a data scientist build their own ETLs
We have seen ETL systems built in R because, instead of getting an expert ETL developer, a company let the poor data scientists take a crack at it. Don’t get us wrong, data scientists are smart people. However, you would much rather have them focus on algorithms and machine learning program implementations instead of spending all day engineering their own data warehouses.
9. Lack of diverse Subject Matter Experts
Data scientists are great with data and often with a few subjects that revolve around the data they have worked with. However, data and businesses are so very different. Sometimes this means a company needs to partner the data science experts with subject matter experts. Otherwise, they won’t have the context to better understand complex subjects like manufacturing, pharmaceuticals and avionics.
10. Poorly assessing a team's skills and knowledge of data science tools
If a data science team doesn’t have the skills to work with Hadoop, why would you set up a cluster? It is always good to be aware of a team’s skill set first. Otherwise they won’t be able to produce products and solutions at the highest level. Data science tools vary, so make sure you look around before you make any solid decisions.
11. Using technologies because they are cool and not useful
Just because you can use certain tools for a problem doesn’t mean they are always the best option. We wouldn’t recommend R for every problem. It is great for research type problems that don’t need to be implemented. If you want a project to get implemented into a larger system, then Python or even C++ might be better (depending on the system). The same thing goes for Hadoop, or MySQL, or Tableau and Power BI. They all have a place. Don’t let a team do something just because they can.
12. Lacking an experienced data science leader
Data science is still a new field. That doesn’t mean you don’t need a leader who has some experience working on a data science team. Without a leader who has a basic understanding of good data science practices, a data science team could struggle to bring projects to fruition. They won’t have a roadmap for success, they will have bad processes, and this will just lead to a slew of other problems.
13. Hiring scientists with limited business understanding
Technology and business are two very different disciplines and sometimes this leads to employees knowing one subject really well and failing to know the other at all. This is ok if a small percentage of the data science team is made up of purely research-based employees. It is important to note that some of them should still be very knowledgeable about how to act in a business. If you want to help them get up to speed quickly, check out this list of “How To Survive Corporate Politics as a data scientist”.
14. A boss read one of our blog posts and now thinks he can solve world hunger
Algorithms can’t solve every problem, at least not easily! If this were true, a lot more problems would be solved by now. Having a boss who simply went to a data science conference and now believes he or she can push the data science team to solve every business gap is not reasonable. Limited resources, complexity of subjects, and unstable processes can quickly destroy any project.
15. The solutions are too complex
One mistake executives and data scientists make is thinking their data science models should be complex. It makes sense, right? Data science is a complex, statistics based subject. This is not true all the time! The simpler you can build a model or integrate a machine learning solution, the easier a time a data team will have maintaining the algorithm in the future.
16. Failing to document
Most technology specialists dislike documentation. It takes time, and it isn’t building new solutions. However, without good documentation, they will never remember what they did 1 month ago, let alone a year ago. This means tracking bugs, tracking how programs work, common fixes, play books, the whole nine yards. Just because data science teams aren’t technically software engineering teams, it doesn’t mean they can step away from documenting how their algorithms work and how they came to their conclusions.
17. The data science team went with every new request from stakeholders (scope creep)
As with any project, data science teams are susceptible to scope creep. Their stakeholders demand new features every week. They add new data points, and dashboard modules. Suddenly, the data science project seems like it can never be finished. If half a team is focused on a project that managers can’t make their minds up on, it will never succeed.
18. Poorly designed models that are not robust or maintainable
Even well documented bad systems lead to quick failures. Data science projects have lots of moving pieces: data flowing through ETLs, dashboards, websites, automated reports, QA suites, and so on. Any of these pieces can take a while to develop and, if developed badly, even longer to fix! Nothing is worse than spending an entire FTE on maintaining systems that should be able to run automatically. So spend enough time planning up front that you are not stuck with terrible legacy code.
19. Disagreement on enterprise strategy.
When it comes down to it, data science offers a huge advantage for corporate strategy when implemented well. That also means the projects being done by some of the more experienced data scientists need to closely align with the directors’ and executives’ strategy. Strategies change, so these projects need to come out fast and be focused on maximizing the decision making of executives. If you are producing a dashboard focused on growth, but an executive team is trying to focus on rebranding, you are wasting time and money!
20. Big data silos or vendor owned data!
You know what is terrible? When data is owned by a vendor. This makes it so hard for data science teams to actually analyze their company’s data, especially if the vendor offers a bad API, none at all, or worse, charges you just to use it. To get your own company's data! Imagine a poor data science budget going to buy back the data! Similarly, if all the data is in silos, it is almost impossible for a data scientist to bring it all together. There are rarely crosswalks or data standards, so they are often stuck hopelessly staring at lots of manual work to make data relate.
21. Problem avoidance (ignoring the elephant in the room!)
We have all done it! Even data scientists! We know the company has a major problem, it’s the elephant in the room and it could be solved. However, it might be part of company culture, or a problem that no one discusses because it is like the emperor’s new clothes. This is sometimes the best place for a data science team to focus.
22. The data science team hasn’t built trust with stakeholders
Let’s be honest. Even if a team develops a 100% accurate algorithm with accurate data, if the team has not been working to build executive trust the entire time, then the project will fail. Why? Because every actionable insight the project provides will be questioned, and never implemented.
23. Failing to communicate the value of the data science project
One of the problems our data science consultant team has seen is teams failing to explain the value of a project. This requires...data! You have to use financial numbers, resources saved, competitive advantage gained, etc. to prove to the executives why the project is worth it! You are data scientists, so use data to help prove your point!
24. Lack of a standardized data science process
No matter how good the data scientists are, without some form of standardization, a team will eventually fail. This may be because a team has to scale and can’t or because a team member leaves. All of this will cause a once working machine to fail.
25. If You Failed To Plan, Plan to Fail
When it comes down to it, there needs to be some amount of planning in data science projects. You can’t just attempt to find some data sources, make assumptions, and attempt to implement some new piece of software without first analyzing the situation! This might take a few weeks, and the executives should give you this time if they really want a sustainable piece of software.
26. The data science team competes with other departments (rather than working together)
For some reason or another, office politics exist. Data scientists can often accidentally walk over every other department because they are placed in a position to help develop strategies and dashboards for the entire company. This might take away jobs from other analysts completely. In turn, this might start fights. So make sure the data science team shares and shows how their projects are helping rather than hurting!
27. Allowing company bias to form conclusions before the data scientists start
Data bias does exist! As a data scientist you can make algorithms and data say whatever you want them to sometimes. However, that doesn’t make it true. Make sure you don’t go into the project with a biased hypothesis that will push you towards early conclusions that might be incorrect.
28. Trying to take on too large of a first project
Reading the news about what Google and Facebook are doing with their algorithms may tempt the data science team to take on too large of a project for its first project. This will rarely lead to success. You might be lucky and succeed. However, you are taking a huge risk!
29. Manually classifying data
One part of data science that not everyone talks about is data classification. Not just using SVM and KNN algorithms. Nope, we mean actually labeling what the data represents. Someone human has to do that first. Otherwise, the computer will never know how to. If you don’t have a plan on how to classify the data before it gets to the data science team, then someone will have to manually do that. That is one quick way to lose data scientists and have projects fail.
30. Failing to understand what went wrong
Data science projects don’t always succeed. The data science team needs to be able to explain why. As long as it wasn’t a huge drop in the capital budget executives should understand. After all, projects do fail, it is natural. That doesn’t give you an excuse to not know why.
31. Waiting to seek outside help until it is too late
Sometimes the data science team is short on staff, other times you just need new insight. Whatever it might be, the data science team needs to make sure it seeks outside help sooner rather than later. Putting off asking for help when you know you need it will just lead to awkward conversations with management. They might not want to spend the money, but they also want a project to succeed.
32. Failing to provide actionable insights and opinions
Finally, the data science team’s project needs to provide actual insight, something actionable. Simply providing a correlation doesn’t do any good. Executives need decisions, or data to make decisions. If you don’t give them that, you might as well not have a data science team.
If you have any questions, please feel free to comment below! Let us know how we can help!
As more companies turn towards data science and data consulting to help gain competitive advantage, we are getting more people asking us: why should we start a data science team? We wanted to take a moment and write down some reasons why your company needs to start a data science team. It is not just a fad anymore, it is becoming a need!
Maintain Competitive Advantage
As more and more companies start integrating data science and better analytics into their day to day strategy, it will no longer be a competitive advantage to make data driven decisions. It will be a necessity. Executives will have to rely on accurate data and sustainable metrics to make consistently correct decisions. If your company moves based on the actual why, rather than on speculation and surface level problems, then it can make a greater and more effective impact internally and externally.
Better Understanding of Current and Possible New Customers
Data science helps give executives and managers a deeper understanding of why customers make the choices they do. Google has had the greatest opportunity as it has almost become a third lobe of some people’s brain. We tell it when we are happy, sad, looking for love, studying, etc. It knows what our questions are. However, Google is not the only one. Other companies are beginning to realize that their customers have been telling them their opinions on multiple social platforms and blogs. They just need to go and look, and learn how to better hear from their customers’ data.
This is a great opportunity for corporations to seek out what their customers are feeling about their products and about their company’s image, and what other things they are purchasing. Sometimes this may require purchasing third party data. Even then, this might be worth it depending on the projects being done.
Better Understanding of Internal Operations
Whether it is HR, finance, accounting or operations, data science is helping tie all these fields together and paint a better picture of what is happening in a company. Why are people leaving? Why was there so much overtime? How can a company be better at utilizing resources? These questions can be answered by taking high quality data and blending it to find out the whys. This in turn will provide a better workplace environment and increase resource efficiency.
Increases Performance Through Data Driven Decisions
Data science is still a relatively new field. Many companies are still figuring out how to properly implement data science solutions. However, those that do are seeing amazing results. Look no further than Amazon or Uber. These companies are changing and have changed the way we view certain industries. Why? Because they know what their customers want, they know what the industry is doing wrong, and they know how to charge the customer less, but give them more.
Overall, increasing data science and analytics proficiency allows for your executives to trust their data more and make clearer decisions. Consider looking into a team today!
We are a team of data scientists and network engineers who want to help your functional teams reach their full potential!