Data science is a tool that has been applied to many problems in the modern workplace. Thanks to faster computing and cheaper storage, we can now predict and calculate outcomes that would once have taken many times more human hours to process. Insurance claims analysts can use algorithms to help detect fraudulent behavior, and retail salespeople can better tailor your experience both online and in store, all thanks to data science. We have combined a few examples of real-life projects we have worked on with a few other ideas we know other teams are working on to help inspire your team. Let us know if you need help figuring out your next data science project!
Predicting the Best Retail Location
One of the true factors of business success is “Location, Location, Location”. You have probably seen this to be true when you see a spot that always has a new restaurant or store. For some reason, nothing there ever succeeds. This forces businesses to think long and hard about the best location for their business. The answer is where your customers are when they think about your product. But where is that?
A few companies are already taking on this problem. One example is Buxtonco, which uses data to answer where you should open your next business. Their site claims:
“That any retailer can achieve greater success and growth by understanding their customer and that there is a science behind identifying who that customer is, where potential customers live, and which customers are the most valuable”
The concept is brilliant. Think Facebook geo-fencing in real life. By looking at where your customers spend their time, and what they might be doing in certain locations, the technology can help determine where it would be best to open your next business, whether that is a coffee shop or a dress store. Data science and machine learning can occasionally seem limited to the internet. However, information provides power both online and in real life.
Predicting why patients are being readmitted
Being able to predict patient readmission can help hospitals reduce their costs as well as improve population health. Knowing who is likely to be readmitted can also help data scientists find the “why” behind specific populations being readmitted. This is important not just for public health but also because the Affordable Care Act reduces Medicare payments to hospitals with excess readmissions within 30 days of discharge.
Hospitals around the country are melding multiple data sources beyond just typical claims data to get insight into what is causing readmission. One of the common approaches is researching ties between readmission and socioeconomic data points like income, addresses, crime rates, and air pollution.
This is similar to the way marketers target customers using machine learning and product recommendation systems that factor in socioeconomic data points to tell them how to sell to a customer. Hospitals are trying to better tailor their care based on how similar patients have responded in the past.
Even a phone call at the right time after an operation has been shown to reduce the number of readmissions. Sometimes the reason patients are readmitted has nothing to do with how the doctors treated them in the hospital. Instead, it could be that the patient didn’t understand how to take their medication, or didn’t have anyone at home to help take care of them. Thus, figuring out the why behind the readmission can in turn help fix it. Once policy makers understand the why, it is much easier to develop better practices for each patient.
Detecting insurance fraud
Insurance fraud costs companies and consumers (who are subjected to higher rates) tens of billions of dollars a year. To add to the problem, attempting to prove claims are fraudulent can in turn cost the companies more than the original claim itself.
This is why many companies have been turning to machine learning and predictive models to detect fraud. This helps pinpoint more claims that should be researched by human auditors. This method doesn’t just reduce the costs of human hours, it also increases the opportunity to reclaim stolen dollars from fraudulent claims.
Once you have a fine-tuned algorithm, the accuracy and rate at which your team processes fraudulent claims can increase dramatically.
Brick And Mortar Stores Predicting Product Needs and Prices As You Walk In
The concept of targeting a price for a specific customer is a tried and true method that many companies have implemented (even before we coined the term “data scientist”). If a salesman thought you were wearing an expensive suit, he might offer you the same car he sold earlier that day at a higher price. In the same way, a computer can now quantify the best price to encourage a customer to buy while also maximizing profits (like Orbitz did in 2012 for Mac users: “Oh, you like spending $1200 on your computer…well here is your plane ticket +$100”).
This isn’t even limited to e-commerce! Imagine if real-life retail stores actually started using previous purchase history as soon as a customer walked in the door. Perhaps it’s a Men’s Wearhouse or a Macy’s, pick your store. They could meld that data with other information like your LinkedIn profile and Glassdoor salary estimates. Now they will know how much money you make and your buying habits, maybe even some notes from the previous salesman or saleswoman. All of this combined would allow them to better tailor an experience for you and other customers like you.
For customers who enjoy buying clothes and other products in person, this could provide a major competitive advantage for Men’s Wearhouse or other similar companies that already tend to focus on the experience, not just the sale (who knows, maybe that is why their stock has doubled in the last 6 months…probably not). Plus, companies can then better plan which salesperson to pair with which customer. Maybe they can predict that a customer will respond better to the hard sell vs. the softer approach. All of this paired with a human could massively increase sales and customer satisfaction.
Managing IT service desks is a balance of having enough tech support professionals to minimize wait time and keep customer satisfaction at a high and keep costs low by not having too many people working at one time. This is a fine balance that is
Detecting Who To Call For Fundraisers
As someone who has managed a fund-raiser, I can say that automation only takes things so far. Certain donors may respond better to custom emails or slightly differently worded messages; others may respond better to a phone call. This is where data science and targeted messages and approaches can help.
Marketing departments are already implementing techniques like A/B testing to their websites and emails to help convince customers to buy a product. The concept of finding the right donors isn’t really different at all.
The key is to start collecting data and managing it efficiently. We have been talking to a few non-profits, and although this use case is a possibility, most of them don’t have the data stored in anything besides Excel or a basic database. This makes it difficult to pull out these insights. This is why step one is creating a data system that will provide insights in the future.
Mental health care
One third of the population suffering from physical ailments also suffer from an accompanying mental health condition exacerbating the illness, reducing quality of life, and increasing medical costs. Some companies like Quartet are finding that if they help improve the mental health along with the physical health of their customers, it helps improve their overall health and reduce costs for the patients. Quartet is working on a collaborative health ecosystem by curating effective care teams and combining their expertise with data-driven insights.
We have also worked with insurance providers on similar projects where we helped them calculate the overall ROI of their new behavioral health plan that they had implemented to help deal with a physical pathology. It not only opened their eyes to the effects of the program, it also found 300k of savings. We are glad to see that larger companies like Quartet are taking this problem on.
Data science is a tool that allows companies to better serve their customer and their bottom line. However, it all starts with making sure your company is asking the right questions. If a company doesn’t start with the right use cases and questions, it can cost thousands to millions of dollars. Most of this comes down to communication breakdowns. It can be very difficult to translate abstract business directives into concrete models and reports that provide the impact and influence on decision making that was required.
Our team wants to help equip your data scientists with the tools to increase their personal growth and your department's performance. If you want to start seeing growth in your team and your bottom line, then please feel free to contact us here!
Call To Action
Are you an executive or director who needs help on your next data science project? We want to help! Our team specializes in helping deliver custom data science solutions and helping decide which projects are best for your company's overall strategy. Contact Us Here Today!
Interested In Reading More About Being A Better Data Scientist?
How To Grow As A Data Scientist
Boosting Bagging And Building Better Algorithms
How To Survive Corporate Politics As A Data Scientist
8 Top Python Libraries For Machine Learning
What Is A Decision Tree
One of the most common projects for companies to take on today is a migration. This does not just mean moving from one database system to another, like recent projects where we migrated Oracle databases to MS SQL Server, Oracle to PostgreSQL, or MS SQL Server to MySQL. It also means migrating servers from local hardware to cloud-based systems like AWS and Azure.
The goal of these projects is simple:
Reduce costs, and increase the ability to spin up servers and databases on a whim.
Databases are not cheap to license when you use products like Oracle or MS SQL Server. Add on top of that all of Oracle's and Microsoft's one-off costs for every tool and doohickey a company can add on, and the price starts to become overwhelming.
Databases no longer have to cost an arm and a leg. Companies can reduce costs further not just by converting their RDBMS (relational database management system) but also by migrating to AWS.
In addition, the ease at which a general user can spin up and down a server on AWS allows for much more agile project development. This makes it easier to go from prototype to final product without dealing with as much bureaucracy. That is, if the company even had enough space on their servers.
Back when companies had to manage a lot more of their own servers, a company that ran out of space on its current server racks had to go through the process of buying a new server.
That means getting approval, putting in a PO, waiting for the server, configuring it, securing it, and then putting it online. This was not only expensive; it could take weeks, months, maybe even a year or two depending on the pace of the company. (Let’s not even get started on whether or not the server was needed. Oftentimes there was plenty of space on a server somewhere; no one just knew it existed.)
Now, if a new server is needed, depending on the internal processes, it could be a quick approval away from being spun up, and a new database just a statement away.
Both offer significant advantages in savings.
However, database migrations and conversions are technologically complicated and intense projects! They require experts in database management, security and project managers to ensure the end result is secure and acts 100% the same as the previous set of objects.
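One way to picture the "acts 100% the same" requirement is a post-migration check that compares each table in the old and new systems. The sketch below is only an illustration under stated assumptions: in-memory SQLite databases stand in for the real source and target (Oracle, PostgreSQL, and so on), and the `customers` table, its columns, and the helper names are all hypothetical.

```python
import sqlite3

def make_db(rows):
    """Build an in-memory SQLite database holding a small customers table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO customers VALUES (?, ?)", rows)
    return conn

def table_snapshot(conn, table, key):
    """Return every row of a table in a stable order for comparison."""
    # Table and key names are interpolated directly, so pass trusted values only.
    return conn.execute(f"SELECT * FROM {table} ORDER BY {key}").fetchall()

def tables_match(source, target, table, key):
    """True when both databases hold identical rows for the table."""
    return table_snapshot(source, table, key) == table_snapshot(target, table, key)

rows = [(1, "Ada"), (2, "Grace"), (3, "Edsger")]   # made-up sample rows
source = make_db(rows)
target_ok = make_db(rows)
target_bad = make_db(rows[:2])     # simulates a row lost during migration

print(tables_match(source, target_ok, "customers", "id"))   # True
print(tables_match(source, target_bad, "customers", "id"))  # False
```

On real tables you would likely compare row counts and checksums rather than pulling every row into memory, but the idea is the same.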
So why would a company do it?
Over the next few articles we will be discussing how to convert various databases like Oracle and MS SQL to other options that can be free and just as effective.
We will also be listing out the benefits to switching to a cloud system.
At the end of the day, all of this will help reduce an IT department's cost substantially.
If you need help with any services such as data migrations or converting one RDBMS to another, our team would be happy to help! We have many members who have done all forms of data conversions and migrations.
If you want to read more about databases, data science and how to manage great data teams, then check out the articles below.
Should Our Team Invest In A Data Warehouse?
How To Survive Corporate Politics As A Data Scientist
8 Great Libraries For Machine Learning
Creating A Better Algorithm With Boosting and Bagging
Guest Written By Rebecca Njeri
Last Thursday, I attended the machinery.ai conference in Seattle, WA, and got to listen to talks by Machine Learning experts that ranged from Machine Thinking to Integrating Data Science into Legacy Products. After about 1.5 years of learning and practising data science, this conference reminded me of the things that intrigued me when I first started learning data science, and I thought that I should write a post explaining the three different groups of machine learning algorithms.
Machine Learning can be defined as the science of getting computers to act without being explicitly programmed. It can be further divided into three broad categories: supervised learning, unsupervised learning, and reinforcement learning. A machine learning model should be chosen depending on the nature of the data available as will be illustrated below.
Asish Bansal premised his talk, Machine Thinking, by stating that not all business problems need a machine learning or deep learning solution. He argued that most business problems have a software engineering solution, and later, if need be, a machine learning or deep learning solution can be developed. To illustrate his point, he used the “FizzBuzz in TensorFlow interview” example where Joel Grus codes, as a joke, a TensorFlow solution to the fizzbuzz problem.
Bansal’s talk reminded me of the importance of the business understanding and data understanding parts of the CRISP-DM process. Understanding the kind of data available: numbers, words, images, or voice data, labelled versus unlabeled, will determine what kind, if any, machine learning algorithm is the appropriate solution.
The main goal of supervised learning is to learn a model from labeled training data that allows us to make predictions about unseen or future data (Python Machine Learning, 3). Supervised learning can be divided into two categories depending on the outcome. If the outcome is a continuous value, we have a regression model, and if the outcome is a discrete class label, we have a classification model. There can be both binary classification models and multi-class classification models.
The simplest example of a regression problem is y = mx + c, where a single independent variable x is correlated with a dependent variable y; an equation can be fit to known values and used to predict unknown values of y given x. Another example of a regression problem, to once more borrow from the machinery.ai talks, is predicting how long a person’s commute will take given a labelled training set that has weather information and time of day as the independent variables, and commute times as the associated response variable.
Commonly occurring examples of binary classification problems in business analytics include: whether a customer will churn or not, whether a lead will convert or not, whether a transaction is fraudulent or not, and whether an email is spam or not.
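To make the y = mx + c case concrete, here is a minimal sketch of fitting a line by ordinary least squares in plain Python. The data points are made up (generated from y = 2x + 1) and the function name `fit_line` is just an illustrative choice; in practice a library such as sklearn would do this for you.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = m*x + c for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is covariance(x, y) divided by variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c

# Made-up labelled training data that follows y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

m, c = fit_line(xs, ys)
predicted = m * 5.0 + c   # predict y for an unseen x
print(m, c, predicted)    # 2.0 1.0 11.0
```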
Multi-class classification problems are similar to binary classification problems except there are more than two class labels. An example of this can be a classification of the different demographics of people who frequent a bookstore where labels can include: children under five, teens, young adults, adults, etc. Clearly segregating the shoppers can facilitate more efficient marketing campaigns and help the store’s bottom line.
In reinforcement learning, the goal is to develop a system that improves its performance based on interactions with the environment. The term reinforcement is borrowed from psychology, where it refers to any “stimulus which strengthens or increases the probability of a specific response. For example, if you want your dog to sit on command you may give him a treat every time he sits for you.”
For a machine learning example, when a self driving car takes a sharp turn too fast and moves outside its lane, it learns to adjust its speed the next time it takes that turn to ensure it stays within its lane. A reinforcement learning model improves its performance because it learns as it interacts with its environment.
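The self-driving example can be boiled down to a toy sketch: an agent tries different speeds for one turn, gets a reward, and updates its estimate of how good each speed is. Everything here is an assumption made for illustration (the candidate speeds, the safe limit, the reward values, and the use of optimistic starting estimates to force exploration); a real driving system is far more involved.

```python
SPEEDS = [20, 30, 40, 50]   # hypothetical candidate speeds for one sharp turn
SAFE_LIMIT = 30             # assumed: above this speed the car leaves its lane

def reward(speed):
    """Faster is better, unless the car leaves its lane."""
    return speed if speed <= SAFE_LIMIT else -100

def learn_turn_speed(episodes=20, optimistic_start=100.0):
    # Optimistic starting estimates push the agent to try every speed once
    value = {s: optimistic_start for s in SPEEDS}
    tries = {s: 0 for s in SPEEDS}
    for _ in range(episodes):
        speed = max(SPEEDS, key=lambda s: value[s])   # act greedily
        tries[speed] += 1
        # Move the estimate toward the observed reward (running average)
        value[speed] += (reward(speed) - value[speed]) / tries[speed]
    return max(SPEEDS, key=lambda s: value[s]), tries

best_speed, tries = learn_turn_speed()
print(best_speed)   # 30
print(tries)        # {20: 1, 30: 17, 40: 1, 50: 1}
```

The agent leaves the lane once at each unsafe speed, learns those speeds are bad, and settles on the fastest speed that keeps it in its lane.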
Unsupervised learning is machine learning where there is unlabeled data or data of unknown structure. Examples of unsupervised learning algorithms include clustering and dimensionality reduction such as Principal Component Analysis. The model tries to learn patterns and correlations within the data on its own. Without an associated response variable Y, the goal is to “discover interesting things about the measurements: is there an informative way to visualize the data? Can we discover subgroups among the variables or among the observations?”
If the bookstore problem was presented without the class labels of the shoppers, a clustering algorithm could be fit to the data to separate the shoppers into different groups.
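As a rough sketch of that idea, the snippet below runs k-means (one common clustering algorithm) on shopper ages with no labels attached. The ages and the starting centers are made up, and the one-dimensional setup is a simplification chosen to keep the example short.

```python
def kmeans_1d(values, centers, iterations=10):
    """Lloyd's algorithm on one-dimensional data with given starting centers."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest center
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

ages = [4, 5, 6, 15, 16, 17, 34, 36, 38]   # made-up shopper ages, no labels
centers, clusters = kmeans_1d(ages, centers=[0, 20, 50])
print(clusters)   # [[4, 5, 6], [15, 16, 17], [34, 36, 38]]
print(centers)    # [5.0, 16.0, 36.0]
```

The algorithm never sees labels like "children" or "teens"; it only finds groups, and it is up to the analyst to interpret them.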
Almost every data science talk I have listened to underlines the fact that the majority of data science work is data mining and data cleaning before any machine learning models can be built. The algorithms themselves are rarely the bottleneck: most supervised and unsupervised learning algorithms are available in Python’s sklearn library, in R, or in some other form of open source software. Ultimately, an intimate understanding of the data that is available, and of the implementation of the different machine learning algorithms, is necessary to leverage the power of supervised, unsupervised, and reinforcement learning.
Andrew Ng’s Machine Learning Class on Coursera
Just for gags: Alexa And Google Home Are Scheming Against Apple's HomePod
Read More Data Science and Machine Learning Blog Posts
Creating A Better Algorithm With Boosting and Bagging
How To Survive Corporate Politics As A Data Scientist
Statistics Review For Data Scientists
A Guide To Starting A New Data Science Project
How To Grow A Data Science Team
Web scraping and utilizing various APIs are great ways to collect data from websites and applications that can later be used in data analytics. There is a company called HiQ that is well known for web scraping. HiQ crawls various "public" websites to collect data and provide analytics for companies on their employees. They help companies find top talent using data from sites like LinkedIn and other public sources to gain the information needed in their algorithms.
However, they ran into legal issues when LinkedIn sent them a cease and desist and put certain technical measures in place to slow down HiQ's web crawlers. HiQ subsequently sued LinkedIn and won! The judge said that as long as the data was public, it was scrapable!
This was quite the win for scrapers in general.
So how can your company take advantage of online public data? Especially when your team might not have a programming background.
Web scraping typically requires a complex understanding of HTTP requests, faking headers, complex Regex statements, HTML parsers, and database management skills.
There are programming languages that make this much easier such as Python. This is because Python offers libraries like Scrapy and BeautifulSoup that make scraping and parsing HTML easier than old school web scrapers.
However, it still requires proper design and a decent understanding of programming and website architecture.
Let's say your team does not have programming skills. That is ok! One of our team members recently gave a webinar at Loyola University to demonstrate how to scrape web pages without programming. Instead, Google Sheets offers several useful functions that can help scrape web data. If you would like to see the video of our webinar it is below. If not, you can continue to read and figure out how to use Google Sheets to scrape websites.
The functions you can use for web scraping with Google Sheets are ImportFeed, ImportXML, and ImportHTML.
All of these functions will scrape websites based off of different parameters provided to the function.
Web Scraping With ImportFeed
The ImportFeed Google Sheets function is one of the easier functions to use. It only requires access to Google Sheets and a URL for an RSS feed. This is a feed that is typically associated with a blog.
For instance, you could use our RSS feed "http://www.acheronanalytics.com/2/feed".
How do you use this function? Enter a formula such as =IMPORTFEED("http://www.acheronanalytics.com/2/feed") into a cell.
That is all that is needed! There are some other tips and tricks that can help clean up the data feed as you will get more than just one column of information. For now, this is a great start at web scraping.
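For teams that do have a little Python available, roughly the same thing can be done in code. The sketch below uses only the standard library and parses a small made-up RSS document stored in a string rather than fetching a live feed; the feed contents, URLs, and the `feed_items` helper are all hypothetical.

```python
import xml.etree.ElementTree as ET

# A tiny made-up RSS 2.0 document standing in for a real blog feed
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item><title>First post</title><link>http://example.com/1</link></item>
    <item><title>Second post</title><link>http://example.com/2</link></item>
  </channel>
</rss>"""

def feed_items(rss_text):
    """Return (title, link) pairs for every <item> in an RSS document."""
    root = ET.fromstring(rss_text)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]

items = feed_items(SAMPLE_RSS)
print(items)
# [('First post', 'http://example.com/1'), ('Second post', 'http://example.com/2')]
```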
Do The Google Sheet Import Functions Update?
All of these import functions automatically update data every 2 hours. A trigger function can be set to increase the cadence of updates. However, this requires more programming.
This is it in this case! From here, it is all about how your team uses it! Make sure you engineer a solid data scraping system.
Web Scraping With ImportXML
The ImportXML function in Google Sheets is used to pull out specific data points using HTML ids and classes. This requires some understanding of HTML and parsing XML, which can be a little frustrating, so we created a step-by-step guide to web scraping HTML.
Here are some examples from an EventBrite page.
The truth about using this function is that it requires a lot of time. Thus, it requires planning and designing a good Google Sheet to ensure you get the maximum benefit from using it. Otherwise, your team will end up spending time maintaining it rather than working on new things.
Web scraping With ImportHTML
Finally, we will discuss ImportHTML. This will import a table or list from a web page. For instance, what if you want to scrape data from a site that contains stock prices?
We will use http://www.nasdaq.com/symbol/snap/real-time. There is a table on this page that has the stock prices from the past few days.
Similar to the past functions, you need to use the URL. On top of the URL, you have to say whether you want a table or a list, and which one on the page you want to grab. You do this by giving its number, that is, its position among the tables (or lists) on the page.
An example would be =IMPORTHTML("http://www.nasdaq.com/symbol/snap/real-time", "table", 6). This will scrape the stock prices from the link above.
In our video above, we also show how we combined the scraped stock data with news about the stock ticker on that day. This could be utilized in a much more complex manner. A team could create an algorithm that utilizes past stock prices, as well as news articles and Twitter information, to choose whether to buy or sell stocks.
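If you are curious what "grab the Nth table" looks like in code, here is a minimal Python sketch using only the standard library's html.parser (not Scrapy or BeautifulSoup). It works on a made-up HTML snippet rather than the live page, the `TableExtractor` class is a hypothetical helper, and it does not handle nested tables.

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collects every <table> on a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None

# Made-up page with two tables; the second one holds the (fake) prices
SAMPLE_HTML = """
<html><body>
<table><tr><th>Menu</th></tr><tr><td>Home</td></tr></table>
<table>
  <tr><th>Date</th><th>Close</th></tr>
  <tr><td>Day 1</td><td>10.00</td></tr>
  <tr><td>Day 2</td><td>10.50</td></tr>
</table>
</body></html>
"""

parser = TableExtractor()
parser.feed(SAMPLE_HTML)
prices = parser.tables[1]   # Python counts from 0; the Sheets index counts from 1
print(prices)
# [['Date', 'Close'], ['Day 1', '10.00'], ['Day 2', '10.50']]
```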
Do you have any good ideas of what you could do with web scraping? Do you need help with your web scraping project? Let us know!
Other great reads about data science:
What is A Decision Tree
How Algorithms Can Become Unethical and Biased
Intro To Data Analysis For Everyone Part 1
Why Invest In A Data Warehouse?
In 1854, John Snow, an English physician, created a geospatial map showing the spread of cholera along streets served by a contaminated water pump in London. That map helped disprove the theory that miasma (bad air) was the cause of cholera. The map provided evidence for his alternative hypothesis: cholera was spread by microbes in the water. One of the more interesting data points he had to work with was the case of a woman who lived far from the clusters of the disease but who had somehow contracted it. John Snow discovered that the woman especially liked the water from that part of town and would have it delivered to her.
Visualizing data through maps and graphs during early exploratory data analysis can give rise to hypotheses that can be tested with further analysis and statistical work. It can reveal patterns that are not immediately obvious when looking at a data set. Fortunately, unlike in John Snow's day, there are now tools that automate and simplify exploratory data analysis. While some correlations have to be teased out through complex ML algorithms, some reveal themselves easily in the early stages of exploratory data analysis.
Below is a walk through an exploratory data analysis of UN malnutrition data.
Data analysis begins with data. There are various ways to collect data depending on your project. Some may be as simple as downloading a csv file from census or UN websites, receiving access to an internal database from a partner organization, or scraping the internet for data. A friend working on a skateboard trick identifier went skating with four friends and set up cameras at four different angles to get the training set of images he needed for his algorithm.
Depending on the amount of data you are working with, it may make sense to use a local hard drive or to use cloud storage. Even when you don’t have a lot of data, it may make sense to use cloud services so you may learn the different stacks currently in production.
Accessing the Data:
Exploratory Data Analysis: Use a Notebook
Below is an EDA walkthrough of UNICEF undernutrition data. You may find it here.
Using Graphs for EDA, Global Malnutrition Case Study
What country has the highest malnutrition levels? What has been the malnutrition trend in this country?
The malnutrition graph above made me wonder what was happening in Georgia in 1991 and 1992, and I learnt that was when the Georgian Civil War occurred. This really piqued my interest because Kenya is in the middle of re-electing a President which in the past has led to ethnic conflicts. I plotted Kenya’s malnutrition graph, and noticed that the peaks coincide with elections and post-election violence.
Although data sets will vary in the number of columns and rows, type of data contained, spread of the data, among others, basic EDA tools can provide an inroad to these data sets before more complex data analysis.
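As a rough sketch of that first pass, the snippet below answers the two questions from the case study (which country has the highest level, and what its trend looks like) using only Python's standard library. The rows are made up for illustration and are not UNICEF figures; the column names and country names are placeholders.

```python
import csv
import io
from collections import defaultdict

# Made-up rows for illustration only; these are NOT real UNICEF figures
SAMPLE_CSV = """country,year,underweight_pct
Country A,1990,10.0
Country A,1995,12.5
Country A,2000,9.0
Country B,1990,20.0
Country B,1995,26.0
Country B,2000,18.5
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Group (year, value) pairs by country
by_country = defaultdict(list)
for row in rows:
    by_country[row["country"]].append(
        (int(row["year"]), float(row["underweight_pct"]))
    )

# Which country has the highest recorded level?
peak_country = max(by_country, key=lambda c: max(v for _, v in by_country[c]))

# What has the trend been in that country, and when did it peak?
trend = sorted(by_country[peak_country])
peak_year = max(trend, key=lambda pair: pair[1])[0]

print(peak_country)   # Country B
print(trend)          # [(1990, 20.0), (1995, 26.0), (2000, 18.5)]
print(peak_year)      # 1995
```

In a notebook you would typically plot `trend` at this point; a peak like the one in 1995 is the kind of thing that prompts the question "what happened that year?".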
Python for Data Analysis
Great Future Data Science Reads!
A Guide To Designing Data Science Projects
How Machine Learning Algorithms Learn Bias
8 Great Python Libraries For Machine Learning
Basic Data Science And Statistics That Every Data Scientists Should Know
Why Use Data Science?
Guest written by Rebecca Njeri!
What is a Decision Tree?
Let’s start with a story. Suppose you have a business and you want to acquire some new customers. You also have a limited budget, and you want to ensure that, in advertising, you focus on customers who are the most likely to be converted.
How do you figure out who these people are? You need a classification algorithm that can identify these customers and one particular classification algorithm that could come in handy is the decision tree. A decision tree, after it is trained, gives a sequence of criteria to evaluate features of each new customer to determine whether they will likely be converted.
To start off, you can use data you already have on your existing customers to build a decision tree. Your data should include all the customers, their descriptive features, and a label that indicates whether they converted or not.
The idea of a decision tree is to divide the data set into smaller data sets based on the descriptive features until you reach a small enough set that contains data points that fall under one label.
Each feature used for a split becomes a decision (parent) node, with the first split at the root, and the leaf (child) nodes represent the outcomes. The decision on which feature to split on is made based on the resulting entropy reduction, or information gain, from the split.
Classification problems for decision trees are often binary-- True or False, Male or Female. However, decision trees can also be used to solve multi-class classification problems where the labels are [0, …, K-1], or for this example, [‘Converted customer’, ‘Would like more benefits’, ‘Converts when they see funny ads’, ‘Won’t ever buy our products’].
Using Continuous Variables to Split Nodes in a Decision Tree
Continuous features are turned into categorical variables (i.e. less than or greater than a certain value) before a split at a node. Because there could be infinitely many boundaries for a continuous variable, the choice is made depending on which boundary will result in the most information gain.
For example if we wanted to classify quarterbacks versus defensive ends on the Seahawks team using weight, 230 pounds would probably be more appropriate as a boundary than 150 pounds. Trivial fact: the average weight of a quarterback is 225 pounds, while that of a defensive end is 255 pounds.
What is Entropy/Information Gain?
Shannon’s entropy is a computational measure of the impurity of the elements in a set. The goal of the decision tree is to arrive at subsets that minimize impurity. To go back to our story, we start with a set of the general population that may see our ad. The data set is then split on different variables until we arrive at a subset where everyone in that subset either buys the product or does not buy the product. Ideally, after traversing our decision tree to the leaves, we should arrive at a pure subset, where every customer has the same label.
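To make entropy and information gain concrete, here is a small sketch in plain Python. Entropy is computed as the sum of p * log2(1/p) over the label proportions p, and information gain is the parent's entropy minus the size-weighted entropy of the children. The player weights are made up, and `best_threshold` is an illustrative helper for the continuous-feature split described earlier, not how any particular library implements it.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits: sum of p * log2(1/p) over label proportions."""
    total = len(labels)
    return sum((n / total) * math.log2(total / n)
               for n in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child subsets."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

def best_threshold(values, labels):
    """Try midpoints between sorted distinct values; return (threshold, gain)."""
    distinct = sorted(set(values))
    best = (None, 0.0)
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        gain = information_gain(labels, [left, right])
        if gain > best[1]:
            best = (threshold, gain)
    return best

print(entropy(["buys", "buys", "buys"]))           # 0.0 (pure subset)
print(entropy(["buys", "buys", "no", "no"]))       # 1.0 (evenly mixed)

weights = [215, 220, 230, 250, 255, 265]           # made-up weights in pounds
positions = ["QB", "QB", "QB", "DE", "DE", "DE"]
threshold, gain = best_threshold(weights, positions)
print(threshold, gain)                             # 240.0 1.0
```

A boundary of 240 pounds separates the two positions perfectly in this toy data, so it removes all of the impurity and earns the maximum possible gain of 1 bit.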
Advantages of Decision Trees
Disadvantages of Decision Trees
Pruning is a method of limiting tree depth to reduce overfitting in decision trees. There are two types of pruning: pre-pruning, and post-pruning.
Pre-pruning a decision tree involves setting the parameters of a decision tree before building it. There are a few ways to do this:
To post-prune, validate the performance of the model on a test set. Afterwards, cut back splits that seem to result from overfitting noise in the training set. Pruning these splits dampens the noise in the data set.
*Post-pruning may result in overfitting the model
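*Post-pruning was not available in Python’s scikit-learn when this was written, though it was available in R. scikit-learn 0.22 and later support minimal cost-complexity pruning through the ccp_alpha parameter.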
Creating ensembles involves aggregating the results of different models. Ensemble decision trees are used in bagging and random forests, while ensemble regression trees are used in boosting.
Bagging involves creating multiple decision trees, each trained on a different bootstrap sample of the data. Because bootstrapping involves sampling with replacement, some of the data is left out of each tree’s sample.
Consequently, the decision trees created are made using different samples which solves the problem of overfitting to the training sample. Ensembling decision trees in this way helps reduce the total error because variance of the model continues to decrease with each new tree added without an increase in the bias of the ensemble.
A bag of decision trees that uses subspace sampling is referred to as a random forest. Only a selection of the features is considered at each node split which decorrelates the trees in the forest.
Another advantage of random forests is that they have an in-built validation mechanism. Because only a percentage of the data is used for each model, an out-of-bag estimate of the model’s error can be calculated using the roughly 37% of the data left out of each model.
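The 37% figure comes from the bootstrap itself: when you draw n rows with replacement from n rows, the chance that a given row is never picked is (1 - 1/n)^n, which approaches 1/e, or about 0.368, as n grows. The sketch below checks this both with the formula and with a simulated bootstrap sample; the data set size and the seed are arbitrary choices.

```python
import random

def out_of_bag_fraction(n_rows, seed=0):
    """Draw one bootstrap sample and return the share of rows never picked."""
    rng = random.Random(seed)
    sampled = {rng.randrange(n_rows) for _ in range(n_rows)}
    return 1 - len(sampled) / n_rows

n = 10_000                              # arbitrary data set size
expected = (1 - 1 / n) ** n             # chance a given row is never drawn
observed = out_of_bag_fraction(n)       # share left out in one simulated sample

print(round(expected, 3))               # 0.368
print(observed)                         # close to 0.368; exact value depends on the seed
```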
Boosting involves aggregating a collection of weak learners(regression trees) to form a strong predictor. A boosted model is built over time by adding a new tree into the model that minimizes the error by previous learners. This is done by fitting the new tree on the residuals of the previous trees.
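Here is a minimal sketch of that residual-fitting loop, using one-split regression trees (stumps) as the weak learners. The data, the number of trees, and the learning rate are all made-up choices, and this is a bare-bones illustration of the idea rather than a stand-in for a real gradient boosting library.

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump (a weak learner) by squared error."""
    best = None
    distinct = sorted(set(xs))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_trees=20, learning_rate=0.5):
    """Add stumps one at a time, each fit to the residuals of the model so far."""
    base = sum(ys) / len(ys)            # start from the mean prediction
    trees = []
    predictions = [base] * len(xs)
    errors = []                         # mean squared error after each tree
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        predictions = [p + learning_rate * tree(x)
                       for x, p in zip(xs, predictions)]
        errors.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))

    def predict(x):
        return base + learning_rate * sum(tree(x) for tree in trees)

    return predict, errors

xs = [1, 2, 3, 4, 5, 6, 7, 8]           # made-up feature values
ys = [x * x for x in xs]                # made-up target: y = x squared
predict, errors = boost(xs, ys)
print(errors[0] > errors[-1])           # True: training error shrinks as trees are added
```

Each stump on its own is a poor predictor of y = x squared, but because every new stump targets what the previous ones got wrong, the combined model's training error keeps dropping.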
If it isn’t clear thus far, for many real-world applications a single decision tree is not a preferable classifier, as it is likely to overfit and generalize very poorly to new examples. However, an ensemble of decision or regression trees minimizes the overfitting disadvantage, and these models become stellar, state of the art classification and regression algorithms.
A Complete Tutorial on Tree Based Modeling from Scratch (in R & Python)
Fundamentals of Machine Learning for Predictive Data Analytics
A Guide To Designing A Data Science Project
Top 8 Python Programming Languages for Machine Learning
Basic Statistics For Data Scientists
Recently, our team of data consultants had an awesome opportunity to present to a class of future data scientists at Galvanize Seattle. One student who came to hear our talk was Rebecca Njeri. Below, she shares tips on how to design a Data Science project.
To Begin, Brainstorm Data Project Ideas
To begin your data science project, you will need an idea to work on. To get started, brainstorm possible ideas that might interest you. During this process, go as wide and as crazy as you can; don’t censor yourself. Once you have a few ideas, you can narrow down to the most feasible or interesting one. You could brainstorm ideas around these prompts:
Questions To Help You Think Of Your Next Data Science Projects
Write a proposal:
Write a proposal that follows the Cross Industry Standard Process for Data Mining (CRISP-DM), which has the following steps:
Business Understanding: What are the business needs you are trying to address? What are the objectives of the Data Science project? For example, if you are at a telecommunications company that needs to retain its customers, can you build a model that predicts churn? Maybe you are interested in using live data to better predict which coupons to offer to which customers at the grocery store.
Data Understanding: What kind of data is available to you? Is it stored in a relational or NoSQL database? How large is your data? Can it be stored and processed on your hard drive or will you need cloud services? Are there any confidentiality issues or NDAs involved if you are working in partnership with a company or organization? Can you find a new data set online that you could merge in to increase your insights?
Data Preparation: This stage involves doing a little Exploratory Data Analysis and thinking about how your data will fit into the model that you have. Is the data in data types that are compatible with the model? Are there missing values or outliers? Are these naturally occurring discrepancies or errors that should be corrected before fitting the data into a model? Do you need to create dummy variables for categorical variables? Will you need all the variables in the data set, or are some dependent on each other?
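Two of the preparation questions above, missing values and dummy variables, can be sketched as follows. This is a dependency-free illustration; the column names and rows are invented, and filling with the median is just one possible choice. In practice a library such as pandas (for example pandas.get_dummies) would typically do this work.

```python
from statistics import median

# Hypothetical customer rows for a churn project.
rows = [
    {"plan": "basic", "monthly_minutes": 120.0},
    {"plan": "premium", "monthly_minutes": None},  # missing value
    {"plan": "basic", "monthly_minutes": 95.0},
    {"plan": "family", "monthly_minutes": 300.0},
]

def impute_median(rows, column):
    """Replace missing values in a numeric column with the column median."""
    fill = median(r[column] for r in rows if r[column] is not None)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]

def one_hot(rows, column):
    """Replace a categorical column with one 0/1 dummy column per category."""
    categories = sorted({r[column] for r in rows})
    encoded = []
    for r in rows:
        new_row = {k: v for k, v in r.items() if k != column}
        for category in categories:
            new_row[f"{column}_{category}"] = int(r[column] == category)
        encoded.append(new_row)
    return encoded

prepared = one_hot(impute_median(rows, "monthly_minutes"), "plan")
for row in prepared:
    print(row)
```

Note that the dummy columns for one variable always sum to 1, so they depend on each other; for linear models it is common to drop one of them.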
Modeling: Choose a model and tune the parameters before fitting it to your training set of data. Python’s scikit-learn library is a good place to get model algorithms. With larger data, consider using Spark ML.
Evaluation: Withhold a test set of data to evaluate the model performance. Data Science Central has a great post on different metrics that can be used to measure model performance. The Confusion Matrix can help with considering the cost-benefit implications of the model’s performance.
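As a rough sketch of this evaluation step, the code below withholds a test set, counts the four cells of a confusion matrix, and turns them into an average payoff per prediction. The helper names, the labels, and the payoff figures are all hypothetical and chosen only to make the cost-benefit idea concrete.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and withhold a fraction of them as a test set."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def confusion_matrix(actual, predicted):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if p == 1:
            counts["tp" if a == 1 else "fp"] += 1
        else:
            counts["fn" if a == 1 else "tn"] += 1
    return counts

def expected_value(counts, payoffs):
    """Average payoff per prediction, given a payoff for each matrix cell."""
    total = sum(counts.values())
    return sum(counts[cell] * payoffs[cell] for cell in counts) / total

train, test = train_test_split(list(range(20)))

# Made-up churn labels for ten test customers (1 = churned).
actual = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]
predicted = [1, 0, 0, 1, 1, 0, 0, 1, 0, 0]
counts = confusion_matrix(actual, predicted)

# Made-up payoffs in dollars: a retained customer, a wasted offer, a lost customer.
payoffs = {"tp": 50, "fp": -10, "fn": -100, "tn": 0}
print(counts, expected_value(counts, payoffs))
```

Changing the payoffs changes which kind of error matters most, which is the cost-benefit reasoning the Confusion Matrix supports.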
Deployment: Deployment and implementation are some of the key components of any data-driven project. You have to get past the theory and algorithms and actually integrate your data science solution into the larger environment.
Flask and Bootstrap are great tools to help you deploy your data science project to the world.
Planning Your Data Science Projects
Keep a timeline with To Do, In Progress, Completed and Parking sections. Have a self-scrum (lol) each morning to see what you accomplished the previous day and set a goal for the new day. It could also help to find a friend to scrum with and help you keep track of your metrics. Goals and metrics can help you hold yourself accountable and ensure that you actually follow through and get your project done.
Track your Progress
Create a GitHub repo for your project. Your proposal can be incorporated as the README. Commit your work at whatever frequency makes you comfortable, and keep track of how much progress you are making on your metrics. A repo will also make it easier to show your code to friends or mentors for a code review.
Knowing When to Stop Your Project
It may be good to work on your project with a minimum viable product in mind. You may not get all the things on your To Do list accomplished, but having an MVP can help you know when to stop. When you have learned as much as you can from a project, even if you don’t have the perfect classification algorithm, it may be more worthwhile to invest in a new project.
Some Examples Of Data Driven Projects
Below are some links to GitHub repos of some Data Science Capstones:
Predicting Change in Rental Price Units in NYC
All the best with your new Data Science project! Feel free to reach out if you need someone to help you plan your new project.
Want to be further inspired for your next data-driven project?
Check out some of our other data science and machine learning articles. You never know what might inspire you.
Practical Data Science Tips
Creatively Classify Your Data
25 Tips To Gain New Customer
How To Grow Your Data Science Or Analytics Practice
Come join our team of data scientists and machine learning experts as we discuss ethical machine learning at DAML (Data Analytics Machine Learning) at Redfin. Our presentation will be followed by Josh Poduska, a Senior Data Scientist in HPE’s Big Data Software Group, who will be discussing Machine Learning on Distributed Systems.
We are very excited for the opportunity to present and can’t wait to see you guys there! It is 100% free and food is provided. Free data science and machine learning talks + free food? What more do you need!
Click Here To RSVP to DAML’s Machine Learning Talk on August 24th For Free
Ethical Machine Learning
Non-technical companies are slowly finding ways to increase their business value using the increased speed of computing and statistics. The problem is that business has always been more concerned with increasing the bottom line than with social impact. It is one thing when we joke about large e-commerce sites selling us that extra toaster. But what about when companies whose products have been proven harmful ask data scientists to develop systems that increase the profit from a product with a negative social impact, or when companies use data science to manipulate customers rather than benefit them? Should we? Is it right to forget about the social impact just to make an extra dollar?
Machine Learning on Distributed Systems
Most real-world data science workflows require more than multiple cores on a single server to meet scale and speed demands, but there is a general lack of understanding when it comes to what machine learning on distributed systems looks like in practice. Gartner and Forrester do not consider distributed execution when they score advanced analytics software solutions. Much formal machine learning training occurs on single-node machines with non-distributed algorithms. In this talk we discuss why an understanding of distributed architectures is important for anyone in the analytical sciences. We will cover the current distributed machine learning ecosystem. We will review common pitfalls when performing machine learning at scale. We will discuss architectural considerations for a machine learning program, such as the role of storage and compute and under what circumstances they should be combined or separated.
Feel free to read some of our other blog posts as well!
Best Python Libraries for Machine Learning
Automating Your Data Science Workflow
Should We Start A Data Science Team?
Recently, our team of data consultants had an awesome opportunity to present to a class of future data scientists at Galvanize Seattle. It was a lot of fun and we met a lot of ex-software developers and IT specialists. One student who had come to hear our talk was named Rebecca Njeri. She did not have a background in software engineering. However, she was clearly well adapted to the new world. In fact, for one of her projects she used company data to create a recidivism prediction model among former inmates using supervised learning models.
How do Machine Learning Algorithms Learn Bias?
There are funny mishaps that result from imperfectly trained machine learning algorithms. Like my friend’s iPhone classifying his dog as a cat. Or these two guys stuck on a voice activated elevator that doesn’t understand their accent. Or maybe Amazon’s Alexa trying to order hundreds of dollhouses because it confuses the news anchor’s report for a request from its owner. There are also the memes on the Amazon Whole Foods purchase, which are truly in the spirit of defective algorithms.
“Bezos: "Alexa, buy me something from Whole Foods."
Alexa: "Buying Whole Foods."
Bezos: "Wait, what?"”
The Data Science Capstone
For my final capstone for the Galvanize Data Science Immersive, I spent a lot of time exploring the concept of algorithmic bias.
I had partnered with an organization that helps former inmates go back to school, and consequently lowers their probability to recidivate. The task I had was to help them figure out the total cost of incarceration, i.e. both the explicit and implicit costs of someone being incarcerated.
While researching this concept, I stumbled upon ProPublica’s Machine Bias essay, which discusses how risk assessment algorithms contain racial bias. I learnt that an algorithm that returns disproportionate false positives for African Americans is being used to sentence them to longer prison sentences and deny them parole, that tax dollars are being spent on incarcerating people who could be out in society as productive members of the community, and that children whose parents shouldn’t be in prison are in the foster care system.
An algorithm that has disparate impact is causing people to lose jobs and their social networks, and is ensuring the worst cold start problem once someone has been released from prison. At the same time, people likely to commit crimes in the future are allowed to go free because the algorithm is blind to their criminality.
How do these false positives and negatives occur and does it matter? To begin with, let us define three concepts related to the Confusion Matrix: precision, recall, and accuracy.
Precision is the number of true positives as a percentage of all positive predictions. High precision means that few of the cases you flag turn out to be false positives. For example, if a security breach from one of your employees is pending, you’d like a precise model to predict who the culprit will be, so that you a) stop the breach, and b) cause minimal interruption to your staff while trying to find this person.
Recall, on the other hand, is the percentage of relevant elements returned, i.e. the true positives as a percentage of all actual positives. For example, if you search for Harry Potter books on Google, recall will be the number of Harry Potter titles returned divided by seven.
High recall matters most when missing a positive is costly. For example, a medical diagnostic tool should have very high recall because not catching an illness can cause the illness to worsen. In such a time-sensitive situation, the goal is to minimize the number of false negatives returned.
Ideally we will have a recall of 1. However, if high recall comes with low precision, it can be a nuisance and a terrible user experience to sift through irrelevant search results. Additionally, if a user does not see relevant results, they will likely not make any purchases, which eventually could hurt the bottom line.
Accuracy is a measure of all the correct predictions as a percentage of the total predictions. Accuracy does poorly as a measure of model performance, especially where you have unbalanced classes, because a model that always predicts the majority class can still score highly.
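The three definitions can be written directly from the cells of a confusion matrix. The sketch below uses made-up counts for an unbalanced data set (10 positives among 1,000 cases) to show how accuracy can look excellent while recall is zero. Returning 0.0 when a denominator is zero is a convention assumed here for the example.

```python
def precision(tp, fp):
    """Share of positive predictions that were actually positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Share of actual positives that the model found."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def accuracy(tp, fp, fn, tn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + fp + fn + tn)

# Hypothetical model A: always predicts the majority (negative) class.
lazy = {"tp": 0, "fp": 0, "fn": 10, "tn": 990}
# Hypothetical model B: finds most positives at the cost of some false alarms.
useful = {"tp": 8, "fp": 40, "fn": 2, "tn": 950}

for name, m in (("always negative", lazy), ("useful", useful)):
    print(
        name,
        f"accuracy={accuracy(**m):.3f}",
        f"precision={precision(m['tp'], m['fp']):.3f}",
        f"recall={recall(m['tp'], m['fn']):.3f}",
    )
```

Model A wins on accuracy (0.990 against 0.958) even though it never identifies a single positive case, which is why precision and recall are reported alongside accuracy.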
For precision, recall, accuracy, and confusion matrices to make sense to begin with, the training data should be representative of the population such that the model learns how to classify correctly.
Confusion matrices are the basis of cost-benefit matrices, aka the bottom line. For a business, the bottom line is easy to understand through profit and loss analysis. I suppose it’s a lot more complex to determine the bottom line where discrimination against protected classes is involved.
And yet, perhaps it is more urgent and necessary to do this work. There is increased scrutiny on the products we are creating and the biases will be visible and have consequences for our companies.
Machine Learning Bias Caused By Source Data
The largest proportion of machine learning is collecting and cleaning the data that is fed to a model. Data munging is not fun, and thinking about sampling and outliers and population distributions of the training set can be boring, tedious work. Indeed, machines learn bias from the oversights that occur during data munging.
With 2.5 exabytes of data generated every day, there is no shortage of data on which to train our models. There are faces of different colors, with and without glasses, wide eyes and narrow eyes, brown eyes and green eyes.
There are male and female voices, and voices with different accents. Not being culturally aware of the structure of the data set can result in models that are blind or deaf to certain demographics, thus marginalizing part of our user groups. Like when Google mistakenly tagged black faces as an album of gorillas. Or when air bags meant to protect passengers put women at risk of death during an accident. These false positives, i.e. the conclusion that you will be safe when you will actually be at risk, cost people’s lives.
Earlier this year, one of my friends, a software engineer, asked the career adviser if it would be better to use her gender-neutral middle name for her resume and LinkedIn to make her job search easier. Her fear isn’t baseless; there are seemingly insurmountable conscious and unconscious gender biases in the workplace. There was even a case where a man and a woman switched emails for a short period and saw drastic differences in the way they were being treated.
How to Reduce Machine Learning Bias
However, if we are to teach machines to crawl LinkedIn and resumes, we have the opportunity to scientifically remove the discrimination we humans are unable to overcome. Biased risk assessment algorithms result from models being trained on data that is historically biased. It is possible to intervene and address the historical biases contained in the data such that the model remains aware of gender, age and race without discriminating against or penalizing any protected classes.
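Before intervening, it helps to measure whether a model treats groups differently. One simple check, sketched below, is to compare false positive rates across groups, since disproportionate false positives are the kind of disparity described earlier. The function name, the group labels, and the records are all invented for illustration; this is one possible audit, not a complete fairness analysis.

```python
from collections import defaultdict

def false_positive_rate_by_group(records):
    """Compute FP / (FP + TN) per group.

    records: iterable of (group, actual, predicted) with binary labels,
    where 1 means the model flagged (or the person actually was) high risk.
    """
    false_positives = defaultdict(int)
    negatives = defaultdict(int)
    for group, actual, predicted in records:
        if actual == 0:  # only actual negatives can become false positives
            negatives[group] += 1
            if predicted == 1:
                false_positives[group] += 1
    return {group: false_positives[group] / n for group, n in negatives.items()}

# Toy records, not real data: (group, actual outcome, model prediction).
records = [
    ("A", 0, 1), ("A", 0, 1), ("A", 0, 0), ("A", 0, 0), ("A", 1, 1),
    ("B", 0, 1), ("B", 0, 0), ("B", 0, 0), ("B", 0, 0), ("B", 1, 0),
]
rates = false_positive_rate_by_group(records)
print(rates)
```

In this toy example people in group A who did not reoffend are flagged twice as often as those in group B, the kind of gap that would call for a closer look at the training data.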
The data that seeds a reinforcement learning model can lead to drastically excellent or terrible results. Exponential improvement, or exponential depreciation could lead to increasingly better performing self driving cars that improve with each new ride, or they could convince a D.C. man of the truth of a non-existent sex trafficking ring in D.C.
How do machines learn bias? We teach machines bias through biased training data.
If you enjoyed this piece on data science and machine learning, feel free to check out some of our other works!
Why Data Science Projects Fail
When Data Science Implementation Goes Wrong
Data Science Consulting Process
Recently, our team of data science consultants had an awesome opportunity to present to a class of future data scientists at Galvanize Seattle. It was a lot of fun and we met a lot of ex-software developers and IT specialists. One student who had come to hear our talk was named Rebecca Njeri. She did not have a background in software engineering. However, she was clearly well adapted to the new world. In fact, for one of her projects she used company data to create a recidivism prediction model among former inmates using supervised learning models.
We love the fact that her project was not just technically challenging, but that it was geared towards a bigger purpose than selling toasters or keeping customers from quitting your telecommunication plan! She also brought up her experience interviewing for data science roles at Microsoft and other large corporations and how it taught her so much. We wanted to share what she learned, so we asked if she would write us a guest post! And she said yes! So without further ado, here is
How to Prepare for a Data Science Interview:
If you are here, you probably already have a Data Science interview scheduled and are looking for tips on how to prepare so you can crush it. If that’s the case, congratulations on getting past the first two stages of the recruitment pipeline. You have submitted an application and your resume, and perhaps done a take home test. You’ve been offered an interview and you want to make sure you go in ready to blow the minds of your interviewers and walk away with a job offer. Below are tips to help you prepare for your technical phone screens and on-site interviews.
Read the Job Description for the Particular Position You are Interviewing for
Data Scientist roles are still pretty new and the responsibilities vary wildly across industries and across companies. Look at the skills required and the responsibilities for the particular position you are applying for. Make sure that the majority of these are skills that you have, or are willing to learn. For example, if you know Python, you could easily learn R if that’s the language Data Scientists at Company X use. Do you care for web-scraping and inspecting web pages to write web-crawlers? Does analyzing text using different NLP modules excite you? Do you mostly want to write queries to pull data from SQL and NoSQL databases and analyse/build models based on this data? Set yourself up for success by leveraging your strengths and interests.
Review your Resume before each Stage of the Interviewing Process
Most interviews will start with questions about your background and how that qualifies you for the position. Having these things at your fingertips will allow you to ease into the interview calmly, as you won't be fumbling for answers. Use this time to calm your nerves before the technical questions begin.
Additionally, review your projects and be prepared to talk about the Data Science process you used to design your project. Think about why you chose the tools that you used, the challenges that you encountered along the way, and the things that you learned along the way.
Look at GlassDoor for Past Interview Questions
If you are interviewing for a Data Scientist role at one of the bigger companies, chances are they’ve already interviewed other people before you, who may have shared these questions on GlassDoor. Read them, solve them, and get a feel for the questions you will be asked. If you cannot find previous questions for a particular company, solve the data science questions from other companies. They are similar, or at the very least, correlated.
Moreover, even if there are no data science questions for that particular company, see what kind of behavioral questions are asked.
Ask the Recruiter about the Structure of the Interview
Recruiters are often your point of contact with the company you are interviewing at. Ask the recruiter questions about how your interview will be structured, what resources you should use when preparing for your interview, what you should wear to the interview, and even the names of your interviewers so you can look them up on LinkedIn and see their areas of specialization.
Do Mock Interviews
Interviewing can be nerve-racking, more so when you have to whiteboard technical questions. If possible, ask for mock interviews from people who have been through the process before so you know what to expect. If you cannot find someone to do this for you, solve questions on a white board or notebook so you get the feel of writing algorithms some place other than your code editor.
Practice asking questions to understand the scope and constraints of the problem you are solving. Once you are hired, you will not be a siloed data scientist. It is reasonable to bounce around ideas and see if you are on the right track. It is not always about getting the correct answer, which often does not exist, but about how you think through problems, and how you work with other people as well.
Practice the Skills that you Will be Tested On
Your preparation should be informed by the job description and the conversation with recruiters. Study the topics that you know will be on the interview. Look up questions for each area in books and online. Review your statistics, machine learning algorithms, and programming skills.
Additionally, Springboard has compiled a list of 109 commonly asked Data Science Questions.
KDnuggets also has a list of 21 must know Data Science Interview Questions and Answers.
Follow Up with Thank You Emails
This is probably standard etiquette for any interview but remember to send a personalized thank you email within 24 hours of your interview. Also, if you have thought of the perfect answer to that question you couldn't solve during your interview, include it as well. Don’t forget to express your enthusiasm for the work that Company X does and your desire to work for them.
If you get an offer after your first round of data science interviews, Congratulations! Close this tab and grab a beer. If you are turned down, like most of us are, use the lessons you learned from your past interviews to prepare for your next interviews. Interviews are a good way to identify your areas of weakness, and consequently become a better candidate for future openings. It’s important to stay resilient, patient, and keep a learner’s mindset. Statistically, you probably won't get an offer for each position you apply for. Like the excellent data scientist you are, debug your interviewing process and up your future odds.
Other Great Data Science Blog Posts To Help Make You A Better Data Scientist!
How To Ensure Your Data Science Teams And Projects Succeed!
Why And How To Convince Your Executives To Invest in A Data Science Team?
We are a team of data scientists and network engineers who want to help your functional teams reach their full potential!