Photo by Tabea Damm on Unsplash
Creating an effective data strategy is not as simple as hiring a few data scientists and data engineers and purchasing a tableau license. Nor is it just about using data to make decisions.
Creating an effective data strategy is about creating an ecosystem where getting to the right data, metrics and resources is easy. It’s about developing a culture that learns to question data, and look at a business problem from multiple angles before making the final conclusion.
Our data consulting team has worked with companies from billion dollar tech companies, to healthcare and just about every type of company in between. We have seen the good, the bad and the ugly of data being utilized for strategy. We wanted to share some of the simple changes that can help improve your companies approach to data.
Find A Balance Between Centralized And Decentralized Practices
Standards and over-centralization inevitably slow teams down. Making small changes to tables, databases and schemas might be forced to go through some overly complex process that keep teams from being productive.
On the other hand, centralization can make it easier to implement new changes in strategy without having to go to each team and then force them to take on a new process.
In our opinion, one of the largest advantages companies can gain is developing tools and strategies that help find a happy medium between centralized and decentralized. This usually involves creating standards to simplify development decisions while improving the ability to manage common tasks that every data team needs to perform like documentation and data visualization. While at the same time decentralizing decisions that are often department and domain specific.
Here are some examples where there are opportunities to provide standardized tools and processes for unstandardized topics.
Creating UDFs and Libraries For Similar Metrics
After working in several industries including healthcare, banking and marketing one thing you realize is that many teams are using the same metrics.
This could be across industries or at the very least across internal teams. The problem is every team will inevitably create different methods for calculating the exact same number.
This can lead to duplicate work, code and executives making conflicting decisions because of top-line metrics that vary.
Instead of relying on each team to be responsible for creating a process to calculate the various metrics you could create centralized libraries that uses the same fields to calculate the correct metrics. This standardizes the process while still providing enough flexibility for end-users to develop their reports based off their specific needs.
This only works if the metrics are used consistently. For example in the healthcare industry metrics such as per patient per month costs (PMPM), readmission rates, or bed turn over rates are used consistently. These sometimes are calculated by EMR like EPIC, but might still be calculated by analysts again for more specific cases. It also might be calculated by external consultants.
Creating functions or libraries that do this work easily can help improve consistency and save time. Instead having each team develop their own method you can simply provide a framework that makes it easy to implement the same metrics.
Automate Mundane But Necessary Tasks
Creating an effective data strategy is about making the usage and management of data easy.
A part of this process requires taking mundane tasks that all data teams need to do and automating them.
An example of this is creating documentation. Documentation is an important factor in helping analysts understand the tables and processes they are working with. Having good documentation allows for analysts to perform better analysis. However, documentation is often put off until the last minute or never done at all.
Instead of forcing engineers to document every new table, a great idea is creating a system that automatically scrapes the available databases on a regular interval and keeps track of what tables exist, who created them, what columns they have, and if they have relationships to other tables.
This would be a project for the devops team to take on, or you could also look into a third party system such as dbForge documentation for SQL Server. Now this doesn’t cover everything, and this tool in particular only works for SQL Server. But a similar tool can help simplify a lot of peoples lives.
Teams will still need to describe what the table and columns are. But, the initial work of actually going through and setting up the basic information can all be automatically tracked.
This can help reduce necessary but repetitive work that can help make everyones life a little easier.
Provide Easier Methods To Share And Track Analysis
This is very specifically geared towards data scientist.
Data scientists will often do their work in Jupyter notebooks and Excel that they only have access to. In addition, many companies don’t enforce the need to use some form of repository like git so that data scientists can version control their work.
This limits the ability to share files as well as keep track of changes that can occur in one’s analysis over time.
In this situation, collaboration becomes difficult because co-workers are often stuck passing files back and forth and self version controlling. Typically that looks like files with suffixes like _20190101_final, _20190101_finalfile…
For those of you who don’t get it, you hopefully never will have to.
On top of this, since many of these python scripts utilize multiple libraries it can be a pain to ensure that as you pip install all the correct versions onto your environment.
All of these small difficulties can honestly can cause the loss of a day or two due to trouble shoot depending on how complex the analysis is that you are trying to run.
However, there are plenty of solutions!
There are actually a lot of great tools out there that can help your data science teams collaborate. This includes companies like Domino Data Lab.
Now, you can always use git and virtual environments as well, but this also demands that your data scientist be very proficient with said technologies. This is not always the case.
Again, this allows your teams to work independently but also share their work easily.
Data Cultural Shift
Adding in new libraries and tools is not the only change that needs to happen when you are trying to create a company that is more data driven. A more important and much more difficult shift is cultural.
Changing how people look and treat data is a key aspect that is very challenging. Here are a couple of reasons why.
For those who haven’t read the book, How To Lie With Statistics, spoiler alert, it is really easy to make numbers to tell the story you want.
There are a lot of ways you can do this.
A team can cherry pick the statistics they want to help their agenda triumph. Or perhaps a research team ignores confounding factors and reports on some statistic that seems to be shocking if you don’t consider all the other variables.
Being data driven as a company means that you need to develop a culture that attempts to look at statistics and metrics and ensures there isn’t anything interfering with the number. This is far from easy. When it comes to data science and analytics.
Most metrics and statistics often have some stipulations that could negate whatever message they are trying to say. That is why creating a culture that looks at a metric and asks why is part of the process. If it were as simple as just getting outputs and p-values. Then data scientists would be out of a job because there are plenty of third-party companies that have products that find the best algorithm and do feature selection for you.
But that is not the only job of a data scientist. They are there to question every p-value and really dig into the why of the number they are seeing.
Data Is Still Messy
Truth be told, data is still very messy. Even with todays modern ERPs and applications, data is messy and sometimes bad data gets through that can mislead managers an analysts.
This can be due to a lot of reasons. How the applications manage data, how system admins of those applications modified said system, etc. Even changes that seem insignificant from a business process side can majorly impact how data is stored.
In turn, when data engineers are pulling data they might not accurately be representing data because of bad assumptions and limited knowledge.
This is why just having numbers is not good enough. Teams also need to have a good sense of the business and the process that create said data to ensure they don’t allow data that is messy into the tables which analysts use directly.
Our perspective is that data analysts need confidence that the data they are looking at correctly represents their corresponding businesses processes. If analysts have to remove any data, or consistently perform joins and where clauses to accurately represent the business, then the data is not “self-service”. This is why, whenever data engineers create new data models, they need to work closely with the business to make sure the correct business logic is collected and represented in the base layer of tables.
That way, analysts can have near 100% trust in their data.
At the end of the day, creating an effective data culture requires a both top down and bottom up shift in thinking. From the executive level, decisions need to be made in what key areas they can help make access to data easier. Then teams can start working at becoming more proficient at actually using data to make decisions. We often find most teams spend too much time working on data tasks that need to get done but could be automated. Improving your companies approach to data can provide a large competitive advantage and allow your analysts and data scientists the ability to work on projects they both enjoy and help your bottom line!
If you team needs data consulting help feel free to contact us! If you would like to read more posts about data science and data engineering, Check out the links below!
Using Python to Scrape the Meet-Up API
The Advantages Healthcare Providers Have In Healthcare Analytics
142 Resources for Mastering Coding Interviews
Learning Data Science: Our Top 25 Data Science Courses
The Best And Only Python Tutorial You Will Ever Need To Watch
Dynamically Bulk Inserting CSV Data Into A SQL Server
4 Must Have Skills For Data Scientists
What Is A Data Scientist
Recently, our team of data consultants had an awesome opportunity to present to a class of future data scientists at Galvanize Seattle. One student who came to hear our talk was Rebecca Njeri. Below, she shares tips on how to design a Data Science project.
To Begin, Brainstorm Data Project Ideas
To begin your data science project, you will need an idea to work on. To get started, brainstorm possible ideas that might interest you. During this process, go as wide and as crazy as you can, don’t censor yourself. Once you have a few ideas, you can narrow down to the most feasible/interesting idea. You could brainstorm ideas around these prompts:
Questions To Help You Think Of Your Next Data Science Projects
Write a proposal:
Write a proposal along the Cross Industry Standard Process for Data Mining (CRISP DM standards) which has the following steps:
What are the business needs you are trying to address? What are the objectives of the Data Science project? For example, if you are at a telecommunications company, that needs to retain its customers, can you build a model that predicts churn? Maybe you are interested in using live data to help better predict what coupons to offer what customers at the grocery store.
What kind of data is available to you? Is it stored in a relational or NoSQL database? How large is your data? Can it be stored and processed on your hard drive or will you need cloud services? Are the any confidentiality issues or NDAs involved if you are working in partnership with a company or organization? Can you find a new data set online that you could merge and increase your insights.
This stage involves doing a little Exploratory Data Analysis and thinking about how your data will fit into the model that you have. Is the data in data types that are compatible with the model? Are there missing values or outliers? Are these naturally occurring discrepancies or errors that should be corrected before fitting the data into a model? Do you need to create dummy variables for categorical variables? Will you need all the variables in the data set are some dependent on each other?
Choose a model and tune the parameters before fitting it to your training set of data. Python’s scikit learn library is a good place to get model algorithms. With larger data, consider using Spark ML.
Withhold a test set of data to evaluate the model performance. Data Science Central has a great post on different metrics that can be used to measure mode performance. The Confusion Matrix can help with considering the cost-benefit implications of the model’s performance.
Deployment and implementation are some of the key components of any data driven project. You have to get past the theory and algorithms and actually integrate your data science solution into the larger environment.
Flask and bootstrap are great tools to help you deploy your data science project to the world.
Planning Your Data Science Projects
Keep a timeline with a To Do, In Progress, Completed and Parking section. Have a self-scrum(lol) each morning to see what you accomplished the previous day and set a goal for the new day. It could also help to get a friend with whom to scrum and help you keep track of your metrics. Goals and metrics can help you hold yourself accountable and ensure that you actually follow through and get your project done.
Track your Progress
Create a github repo for your project. Your proposal can be incorporated as the read me. Commit your work at the frequency which makes you comfortable, and keep track of how much progress you are making on your metrics. A repo will also make it easier to show your code to friends/mentors for a code review.
Knowing When to Stop Your Project
It may be good to work on your project with a minimum viable product in mind. You may not get all the things on your To Do list accomplished, but having an MVP can help you know when to stop. When you have learned as much as you can from a project, even if you don’t have the perfect classification algorithm, it may be more worthwhile to invest in a new project.
Some Examples Of Data Driven Projects
Below are some links to Github repos of some Data Science Capstones:
Predicting Change in Rental Price Units in NYC
All the best with your new Data Science project! Feel free to reach out if you need someone to help you plan your new project.
Want to be further inspired on your next data driven project!
Check out some of our other data science and machine learning articles. You never know what might inspire you.
Practical Data Science Tips
Creatively Classify Your Data
25 Tips To Gain New Customer
How To Grow Your Data Science Or Analytics Practice
Recently, our team of data consultants had an awesome opportunity to present to a class of future data scientist at Galvanize Seattle. It was a lot of fun and we met a lot of ex-software developers and IT specialists. One student who had come to hear our talk was named Rebecca Njeri. She did not have a background in software engineering. However, she was clearly well adapted to the new world. In fact, for one of her projects she used company data to create a recidivism prediction model among former inmates using supervised learning models.
How do Machine Learning Algorithms Learn Bias?
There are funny mishaps that result from imperfectly trained machine learning algorithms. Like my friend’s iPhone classifying his dog as a cat. Or these two guys stuck on a voice activated elevator that doesn’t understand their accent. Or maybe Amazon’s Alexa trying to order hundreds of dollhouses because it confuses the news anchor’s report for a request from its owner. There are also the memes on the Amazon Whole Foods purchase, which are truly in the spirit of defective algorithms.
“Bezos: "Alexa, buy me something from Whole Foods."
Alexa: "Buying Whole Foods."
Bezos: "Wait, what?"”
The Data Science Capstone
For my final capstone for the Galvanize Data Science Immersive, I spent a lot of time exploring the concept of algorithmic bias.
I had partnered with an organization that helps former inmates go back to school, and consequently lowers their probability to recidivate. The task I had was to help them figure out the total cost of incarceration, i.e. both the explicit and implicit costs of someone being incarcerated.
While researching this concept, I stumbled upon Propublica’s Machine Bias essay that discusses how risk assessment algorithms contain racial bias. I learnt that an algorithm that returns disproportionate false positives for African Americans is being used to sentence them to longer prison sentences and deny them parole, that tax dollars are being spent on incarcerating people who would be out in the society being productive members of the community, and that children whose parents shouldn’t be in prison are in the foster care system.
An algorithm that has disparate impact is causing people to lose jobs, their social networks, and ensuring the worst cold start problem once someone has been released from prison. At the same time, people likely to commit crimes in the future are let to go free because the algorithm is blind to their criminality.
How do these false positives and negatives occur and does it matter? To begin with, let us define three concepts related to the Confusion Matrix: precision, recall, and accuracy.
Precision is the percentage of correctly classified true positives as a percentage of the positive predictions. High precision means that you correctly label as many of the true positives as possible. For example, a medical diagnostic tool should be very precise because not catching an illness can cause an illness to worsen.
In such a time sensitive situation, the goal is to minimize the number of false negatives returned. Similarly, if a security breach from one of your employees is pending, you’d like a precise model to predict who the culprit will be to ensure that a) You stop the breach, and b) have the minimal interruptions to your staff trying to find this person.
Recall on the other hand is the percentage of relevant elements returned. For example, if you search for Harry Potter books on Google, recall will be the number of Harry Potter titles returned divided by seven.
Ideally we will have a recall of 1. In this case, it might be a nuisance, and a terrible user experience to sift through irrelevant search results. Additionally, if a user does not see relevant results, they will likely not make any purchases, which eventually could hurt the bottom line.
Accuracy is a measure of all the correct predictions as a percentage of the total predictions. Accuracy does poorly as a measure of model performance especially where you have unbalanced classes.
For precision, recall, accuracy, and confusion matrices to make sense to begin with, the training data should be representative of the population such that the model learns how to classify correctly.
Confusion matrices are the basis of cost-benefit matrices, aka the bottom line. For a business, the bottom line is easy to understand through profit and loss analysis. I suppose it’s a lot more complex to determine the bottom line where discrimination against protected classes is involved.
And yet, perhaps it is more urgent and necessary to do this work. There is increased scrutiny on the products we are creating and the biases will be visible and have consequences for our companies.
Machine Learning Bias Caused By Source Data
The largest proportion of machine learning is collecting and cleaning the data that is fed to a model. Data munging is not fun, and thinking about sampling and outliers and population distributions of the training set can be boring, tedious work. Indeed, machines learn bias from the oversights that occur during data munging.
With 2.5 exabytes of data generated every day, there is no shortage of data on which to train our models. There are faces of different colors, with and without glasses, wide eyes and narrow eyes, brown eyes and green eyes.
There are male and female voices, and voices with different accents. Not being culturally aware of the structure of the data set can result in models that are blind or deaf to certain demographics thus marginalizing part of our use groups. Like when Google mistakenly tagged black faces as an album of gorillas. Or when air bags meant to protect passengers put women at risk of death during an accident. These false positives, i.e. the conclusion that you will be safe when you will actually be at risk cost people’s lives.
Earlier this year, one of my friends, a software engineer asked the career adviser if it would be better to use her gender neutral middle name for her resume and LinkedIn to make her job search easier. Her fear isn’t baseless; there are unsurmountable conscious and unconscious gender biases at the workplace. There was even a case where a man and woman switched emails for a short period and saw drastic differences in the way they were being treated.
How to Reduce Machine Learning Bias
However, if we are to teach machines to crawl LinkedIn and resumes, we have the opportunity to scientifically remove the discrimination we humans are unable to overcome. Biased risk assessment algorithms result from models being trained on data that is historically biased. It is possible to intervene and address the historical biases contained in the data such that the model remains aware of gender, age and race without discriminating against or penalizing any protected classes.
The data that seeds a reinforcement learning model can lead to drastically excellent or terrible results. Exponential improvement, or exponential depreciation could lead to increasingly better performing self driving cars that improve with each new ride, or they could convince a D.C. man of the truth of a non-existent sex trafficking ring in D.C.
How do machines learn bias? We teach machines bias through biased training data.
If you enjoyed this piece on data science and machine learning. Feel free to check out some of our other works!
Why Data Science Projects Fail
When Data Science Implementation Goes Wrong
Data Science Consulting Process
Recently, our team of data science consultants had an awesome opportunity to present to a class of future data scientist at Galvanize Seattle. It was a lot of fun and we met a lot of ex-software developers and IT specialists. One student who had come to hear our talk was named Rebecca Njeri. She did not have a background in software engineering. However, she was clearly well adapted to the new world. In fact, for one of her projects she used company data to create a recidivism prediction model among former inmates using supervised learning models.
We love the fact that that her project was not just technically challenging, but that it was geared towards a bigger purpose than selling toasters or keeping customers from quitting your telecommunication plan! She also brought up her experience interviewing for data science roles at Microsoft and other large corporations and how it taught her so much. We wanted to share what she learned so we asked if she would write us a guest post! And she said yes! So without further ado, here is
How to Prepare for a Data Science Interview:
If you are here, you probably already have a Data Science interview scheduled and are looking for tips on how to prepare so you can crush it. If that’s the case, congratulations on getting past the first two stages of the recruitment pipeline. You have submitted an application and your resume, and perhaps done a take home test. You’ve been offered an interview and you want to make sure you go in ready to blow the minds of your interviewers and walk away with a job offer. Below are tips to help you prepare for your technical phone screens and on-site interviews.
Read the Job Description for the Particular Position You are Interviewing for
Data Scientist roles are still pretty new and the responsibilities vary wildly across industries and across companies. Look at the skills required and the responsibilities for the particular position you are applying for. Make sure that the majority of these are skills that you have, or are willing to learn. For example, if you know Python, you could easily learn R if that’s the language Data Scientists at Company X use. Do you care for web-scraping and inspecting web pages to write web-crawlers? Does analyzing text using different nlp modules excite you? Do you mostly want to write queries to pull dataca from SQL and NoSQL databases and analyse/build models based on this data? Set yourself up for success by leveraging your strengths and interests.
Review your Resume before each Stage of the Interviewing Process
Most interviews will start with questions about your background and how that qualifies you for the position. Having these things at the tip of your fingers will allow you allow you to ease into the interview calmly as you won't be fumbling for answers. Use this time to calm your nerves before the technical questions begin.
Additionally, review your projects and be prepared to talk about the Data Science process you used to design your project. Think about why you chose the tools that you used, the challenges that you encountered along the way, and the things that you learned along the way.
Look at GlassDoor for Past Interview Questions
If you are interviewing for a Data Scientist role at one of the bigger companies, chances are they’ve already interviewed other people before you, who may have shared these questions on GlassDoor. Read them, solve them, get a feel of the questions you will be asked. If you cannot find previous questions for a particular company, solve the data science questions from other companies. They are similar, or at the very least, correlated.
Moreover, even if there are no data science questions for that particular company, see what kind of behavioral questions are asked.
Ask the Recruiter about the Structure of the Interview
Recruiters are often your point of contact with the company you are interviewing at. Ask the recruiter questions about how your interview will be structured, what resources you should use when preparing for your interview, what you should wear to the interview, and even the names of your interviewers so you can stalk look them up on LinkedIn and see their areas of specialization.
Do Mock Interviews
Interviewing can be nerve-racking, more so when you have to whiteboard technical questions. If possible, ask for mock interviews from people who have been through the process before so you know what to expect. If you cannot find someone to do this for you, solve questions on a white board or notebook so you get the feel of writing algorithms some place other than your code editor.
Practice asking questions to understand the scope and constraints of the problem you are solving. Once you are hired, you will not be a siloed data scientist. It is reasonable to bounce around ideas and see if you are on the right track. It is not always about getting the correct answer, which often does not exist, but about how you think through problems, and how you work with other people as well.
Practice the Skills that you Will be Tested On
Your preparation should be informed by the job description and the conversation with recruiters. Study the topics that you know will be on the interview. Look up questions for each area in books and online. Review your statistics, machine learning algorithms, and programming skills.
Additionally, Spring Board has compiled a list of 109 commonly asked Data Science Questions.
KDnuggets also has a list of 21 must know Data Science Interview Questions and Answers.
Follow Up with Thank You Emails
This is probably standard etiquette for any interview but remember to send a personalized thank you email within 24 hours of your interview. Also, if you have thought of the perfect answer to that question you couldn't solve during your interview, include it as well. Don’t forget to express your enthusiasm for the work that Company X does and your desire to work for them.
If you get an offer after your first round of data science interviews, Congratulations! Close this tab and grab a beer. If you are turned down, like most of us are, use the lessons you learned from your past interviews to prepare for your next interviews. Interviews are a good way to identify your areas of weakness, and consequently become a better candidate for future openings. It’s important to stay resilient, patient, and keep a learner’s mindset. Statistically, you probably won't get an offer for each position you apply for. Like the excellent data scientist you are, debug your interviewing process and up your future odds.
Other Great Data Science Blog Posts To Help Make You A Better Data Scientist!
How To Ensure Your Data Science Teams And Projects Succeed!
Why And How To Convince Your Executives To Invest in A Data Science Team?
Data science projects fail all the time! Why is that? Our team of data science consultants have seen many good intentions go wrong because of failure to empower data science teams, locking away access to data, focusing on the wrong problem, and many other problems that could be avoided! We have written 32 of the reasons we have seen data science projects fail. We are sure there are more and would love to get comments on what your teams have seen! What makes a data science project team succeed?
1. The data scientists aren’t given a voice
Data science and strategy can play very nicely together when allowed! Data scientists are more than just over glorified analysts! They have access to possibly all the data a company owns! That means they know every movement the company has made with every outcome (if the data was stored correctly). However, they are often left in the basement with the rest of the tech teams forced to push out reports like any other report developer. There is a reason companies like Amazon, and Google continue to do so well! It is because the people with the Data have a voice!
2. Starting with the wrong questions.
Let’s face it. Most technology people often focus more on how cool a project is, not how much money it will save the company. This can sometimes lead to the wrong business questions being answered! This will lead to a team quickly either failing, or losing value inside of the company. The goal should be to do as much to hit high value business targets as possible. That is what keeps data science projects from failing or at least, being unnoticed.
3.Not addressing the root cause just trying to improve the effect of a process
One of the most dubious and hard to spot until it is too late is not realizing a data science team wasn’t even looking at the actual cause of the problem. When our data science team comes in, one of the things we assess is how a data science team develops their hypotheses. How far do they dig in the data, how many false hypotheses do they think of. How about other causations that could cause a similar output. An outcome can have a very deep root.
4. Weak stakeholder buy-in
Any project, data science, machine learning, construction, or any other department will fail without stakeholder buy in! There needs to be an executives to own the project. This gives a team acknowledgement for their hard work and it also ensures that there will be funding! Without funding, a project will come to a dead halt.
5.Lack of access to data
Slightly attached to the previous point. Locking access away from data scientists, whether it be tools or data is just a waste of time. If a data scientists is forced to spend all day begging a DBAs for access, don’t expect projects to finish any time soon!
6. Using Faulty/Bad Data
Any data specialist (data engineer, analyst, scientist, architect) will tell all managers the cliche saying. Garbage in, garbage out! If the data science team trains a machine learning model on bad data, then it will get bad results. There is no way around it! Even if an algorithm works with 100% accuracy, if all of the data classification is incorrect, then so are the predictions. This will lead to a failed project and executives no longer trusting the data science team.
7.Relying on Excel as the main data storage….or Access
As data science consultants, our team members have come across plenty of analytics and data science projects. Often times, because of lack of support, data scientists and analyst have to construct make shift storage centers because they are not given a sandbox or server to work on. Excel and Access both have their purposes. One of them is not managing large sets of data for analytics purposes. Don’t do that to a data scientists. This will just get poorly designed systems and high turn over!
8. Having a data scientist build their own ETLs
We have seen ETL systems built from R because instead of getting an expert ETL developer a company was allowing the poor data scientists a crack at it. Don’t get us wrong, data scientists are smart people. However, you would much rather have them focus on algorithms and machine learning program implementations instead of spending all day engineering their own data warehouses.
9. Lack of diverse Subject Matter Experts
Data scientists are great with data and often a few subjects that revolve around the data they have worked with. However, data, and businesses are so very different. Sometimes this means a company needs to partner the data science experts with experts. Otherwise, they won’t have the context to better understand complex subjects like manufacturing, pharmaceuticals and avionics.
10.Poorly assessing a team's skills and knowledge of data science tools
If a data science team doesn’t have the skills to work with Hadoop, why would you set up a cluster? It is always good to be aware of a teams skill set first. Otherwise they won’t be able to produce products and solutions at the highest level. Data science tools vary, so make sure you look round before you make any solid decisions.
11.Using technologies because they are cool and not useful
Just because you can use certain tools for a problem. Doesn’t mean it is always the best option. We wouldn’t recommend R for every problem. It is great for research type problems that don’t need to be implemented. If you want a project to get implemented into a larger system, than python or even C++ might be better(depending on the system). Same things goes for Hadoop, or MySQL, or Tableau and Power BI. They all have a place. Don’t let a team do something, just because they can.
12. Lacking an experienced data science leader
Data science is still a new field. That doesn’t mean you don’t need a leader who has some experience working on a data science team. Without one that has a basic understanding of good data science practices. A data science team could struggle to bring projects to fruition. They won’t have a roadmap for success, they will have bad processes and this will just lead to a slew of other problems.
13. Hiring a scientists with limited business understanding
Technology and business are two very different disciplines and sometimes this leads to employees knowing one subject really well and failing to know the other at all. This is ok if a small percentage of the data science team are built up of purely research based employees. It is important to note that some of them should still be very knowledgable of how to act in a business. If you want to help them get up to speed quickly. Check out this list of “How To Survive Corporate Politics as a data scientist”.
14. A boss read one of our blog posts and now thinks he can solve world hunger
Algorithms can’t solve every problem, at least not easily! If this were true, a lot more problems would be solved by now. Having a boss who simply went to a data science conference and now believes he or she can push the data science team to solve every business gap is not reasonable. Limited resources, complexity of subjects, and unstable processes can quickly destroy any project.
15. The solutions are too complex
One mistake executives and data scientists make is thinking their data science models should be complex. It makes sense right, data science is a complex, statistics based subject. This is not true all the time! The simpler you can build a model, or integrate a machine learning solution means a data team will have an easier time maintaining the algorithm in the future.
16. Failing to document
Most technology specialist dislike documentation. It takes time, and it isn’t building new solutions. However, without good documentation, they will never remember what they did 1 month ago, let alone a year ago. This means tracking bugs, tracking how programs work, common fixes, play books, the whole nine yards. Just because data science teams aren’t technically software engineering teams, it doesn’t mean they can step away from documenting how their algorithms work and how they can to their conclusions.
17. The Data science team went with every new request from stakeholders(scope creep).
As with any project, data science teams are susceptible to scope creep. Their stakeholders demand new features every week. They add new data points, and dashboard modules. Suddenly, the data science project seems like it can never be finished. You have half a team focused on a project that managers can’t make their minds up on. Then it will never succeed.
18. Poorly designed models that are not robust or maintainable :
Even well documented bad systems lead to quick failures. Data science projects have lots of moving pieces. Data flowing through ETLs, dashboards, websites, automated report, QA suites, and so one. Any piece of these can take a while to develop, and if developed badly even longer to fix! Nothing is worse then spending an entire FTE on maintaining systems that should be able to run automatically. So spend enough time planning up front that you are not stuck with terrible legacy code.
19. Disagreement on enterprise strategy.
When it comes down to it. Data science offers a huge advantage when implemented well for corporate strategy. That also means the projects being done by some of the more experienced data scientists need to closely align with a directors and executives strategy. Strategies change, so these projects need to come out fast and be focused on maximizing the decisions making of executives. If you are producing a dashboard focused on growth, but an executive team is trying to focus on rebranding, you are wasting time and money!
20. Big data silos or vendor owned data!
You know what is terrible. When data is owned by a vendor. This makes it so hard for data science teams to actually analyze their companies data. Especially if the vendor offers a bad API, none at all or worse, they charge you just to use it. To get a company's data! Imagine, a poor data science budget going to buy back the data! Similarly, if all the data is in silos. It is almost impossible for a data scientists to bring it all together. There are rarely crosswalks or data standards so they are often stuck hopelessly starring at lots of manual work to make data relate.
21 . Problem avoidance(Ignoring the elephant in the room!)
We have all done it! Even data scientists! We know the company has a major problem, it’s the elephant in the room and it could be solved. However, it might be part of company culture, or a problem that no one discusses because it is like the emperor with new clothes. This is sometimes the best place for a data science team to focus.
22. The data science team hasn’t built trust with stakeholders
Let’s be honest. Even if a team develops a 100% accurate algorithm with accurate data, if a team has not been working to build executive trust the entire time, then the project will fail. Why, because every actionable insight a project provides will be questioned, and never implemented.
23. Failing to communicate the value of the data science project
One of the problems our data science consultant team has seen is teams failing to explain the value of a project. This requires...data! You have to use financial numbers, resources saved, competitive advantage gained, etc. To prove to the executives why the project is worth it! The data scientists, use that to help prove their point!
24. Lack of a standardized data science process
No matter how good the data scientists are, without some form of standardization, a team will eventually fail. This may be because a team has to scale and can’t or because a team member leaves. All of this will cause a once working machine to fail.
25. If You Failed To Plan, Plan to Fail
When it comes down to it. There needs to be some amount of planning in the data science projects. You can’t just attempt to find some data sources, make assumptions, attempt to implement some new piece of software without first analyzing the situation! This might take a few weeks and the executives should give you this. If they really want a sustainable piece of software.
26. The data science team competes with other departments(rather than working together)
For some reason or another, office politics exist. Data scientists can often accidently walk over every other department because they are placed in position to help develop strategies and dashboards for the entire company. This might take away jobs from other analysts completely. In turn, this might start fights. So make sure the data science team shares and shows how their projects are helping rather than hurting!
27. Allowing company bias to form conclusions before the data scientists start
Data bias does exist! As a data scientist you can make algorithms and data say whatever you want them to sometimes. However, that doesn’t make it true. Make sure you don’t go into the project with a biased hypothesis that will push you towards early conclusions that might be incorrect.
28. Try to take on to large of a first project
Reading the news about what Google and Facebook are doing with their algorithms may tempt the data science team to take on too large of a project for their first projects. This will not lead to success. You might be lucky and succeed. However, you are taking a huge risk!
29. Manually classifying data
One part of data science that not everyone talks about is data classification. Not just using SVM and KNN algorithms. Nope, we mean actually labeling what the data represents. Someone human has to do that first. Otherwise, the computer will never know how to. If you don’t have a plan on how to classify the data before it gets to the data science team, then someone will have to manually do that. That is one quick way to lose data scientists and have projects fail.
30. Failing to understand what went wrong
Data science projects don’t always succeed. The data science team needs to be able to explain why. As long as it wasn’t a huge drop in the capital budget executives should understand. After all, projects do fail, it is natural. That doesn’t give you an excuse to not know why.
31. Wait to seek out outside help until it is too late
Sometimes the data science team is short on staff, other times you just need new insight. Whatever it might be. The data science team needs to make sure it seeks outside help sooner rather than later. Putting off for help when you know you need it will just lead to awkward conversation with management. They might not want to spend the money, but they also want a project to succeed.
32. Fail to provide actionable insights and opinions
Finally, the data science teams data science project needs to provide actual insight, something actionable. Simply providing a correlation, doesn’t do any good. Executives need decisions, or data to make decisions. If you don’t give them that, you might as well not have a data science team.
If you have any questions, please feel free to comment below! Let us know how we can help!
As more companies turn towards data science and data consulting to help gain competitive advantage. We are getting more people asking us - Why should we start a data science team? We wanted to take a moment and write down some reasons why your company needs to start a data science team. It is not just a fad anymore, it is becoming a need!
Maintain Competitive Advantage
As more and more companies start integrating data science and better analytics into their day to day strategy. It will no longer be a competitive advantage to make data driven decisions. It will be a necessity. Executives will have to rely on accurate data and sustainable metrics to make constantly correct decisions. If your company moves based on the actual why, vs. the speculations and surface level problems, then they can make a greater and more effective impact internally and externally.
Better Understanding of Current and Possible New Customers
Data science helps give executives and managers a deeper understanding into why customers make the choices they do. Google has had the greatest opportunity as it has almost become a third lobe of some people’s brain. We tell it when we are happy, sad, looking for love, studying, etc. It knows what our questions are. However, Google is not the only one, other companies are beginning to realize that their customers have been telling them their opinions on multiple social platforms and blogs. They just need to go and look and see how to better hear from their customers data.
This is a great opportunity for corporations to seek out what their customers are feeling, about their products, about their companies image and what other stuff they are purchasing. Sometimes this may require purchasing third party data. Even then, this might be worth it depending on the projects being done.
Better Understanding of Internal Operations
Whether it is HR, finance, accounting or operations, data science is helping tie all these fields together and paint a better picture of what is happening in a company. Why are people leaving, why was there so much overtime, how can a company be better at utilizing resources, and so on. These questions can be answered by taking high quality data and blending it to find out the whys. This in turn will provide a better work place environment and increase resource efficiency.
Increases Performance Through Data Driven Decisions
Data science is still a relatively new field. Many companies are still figuring out how to properly implement data science solutions. However, those that do are seeing amazing results. Look no further than Amazon, or Uber. These companies are changing and have changed the way we view certain industries. Why, because they know what their customers want, they know what the industry is doing wrong, and they know how to charge the customer less, but give them more.
Overall, increasing data science and analytics proficiency allows for your executives to trust their data more and make clearer decisions. Consider looking into a team today!
Is your company looking to figure out who should become data scientists and how to start a team? You are not alone, even Amazon and Airbnb are starting internal universities to teach more of their teams the values of data science. Maybe your company needs help setting up some internal classes to help increase your data science an machine learning skill sets. Acheron provides multiple forms of internal education programs. They can be for managers, or analysts. One form is a quick guide to how to run a data science team! This a for managers and executives who are starting, or already have a data science team and want to ensure they are getting the best return on investment from their team and that their team members all feel challenged!
We took one sub section out and wanted to share a common question we get when we talk to executives. Who are data scientists, and who should become one! One such client told us they have loads of scientists, but wasn't sure how to turn them into data scientists, and who in their cohorts should really become one.
Below we will go over some of the top soft skills data scientists should have, and what type of personality should someone have before they enroll in some form of data science program. Whether this be an internal program, or external, like Galvanize, or a university data science certificate. In the end, data science is a skill that companies will need to harness to make sure they can keep up with the rest of their competitors who are already successfully implementing data science into their upper level strategy.
Who are Data Scientists?
Data scientist have to be driven individuals. They not only must be technically savvy, they also need to be proactively aware of their company’s nuances. If they happen to see a correlation or pattern, they will seek out how to access the data required and will bring possible projects up to their manager.
Being driven is great, especially when combined with curiosity. Data scientists love to ask why, and not stop until they find out the root cause. They are great at pinpointing that actual patterns in the noise. This is a necessary skill in order to peel apart the complexity and relationships various data sets may have. Occasionally, an individual may have a curious mind, but may lack the drive to act upon their inquiries.
Tolerance of Failure
Data science has a lot of similarities to the science field. In the sense that there might be 99 failed hypotheses that lead to 1 successful solution. Some data driven companies only expect their machine learning engineers and data scientists to create new algorithms, or correlations every year to year and a half. This depends on the size of the task and the type of implementation required (e.g. process implementation, technical, policy, etc). This means a data scientists must be willing to fail fast and often. Similar to using the agile methodology. They have to constantly test, retest, and prove that their algorithms are correct.
The term data storyteller has become correlated with data scientist. This skill-subset fits in the general skill of communication. Data scientists have access to multiple data sources from various departments. This gives them the responsibility and need to be able to clearly explain what they are discovering to executives and SMEs in multiple fields. This requires taking complex mathematical and technological concepts and creating clear and concise messages that executives can act upon. Not just hiding behind their jargon, but actually transcribing their complex ideas into business speak.
Creative and Abstract Thinking
Creativity and abstract thinking helps data scientists better hypothesize possible patterns and features they are seeing in their initial exploration phases. Combining logical thinking with minimal data points, data scientists can lead themselves to several possible solutions. However, this requires thinking outside of the box.
Data scientists have to be able to take large problems, like what ad to show to which customer, then based off of hundreds of variables effectively find the right solution. This means taking a larger problem and breaking it down to its smallest parts. Getting rid of noise, and variables that don’t help create a clear pattern. This can sometimes be a messy process. Being able to keep focused on the bigger problem is key.
Who Should Become a Data Scientist
The skills required to be a data scientist are constantly evolving and many companies are trying to find out how to train new data scientists. In the end, the real question is, who should become a data scientist?
Data science requires constant learning. Not just technology, but it also requires constant learning of new fields, specialties and situations. Especially as data science solutions further integrates into more and more departments of corporations. Becoming familiar with one set of vocabulary, and processes is not an option. Without having some bearing in each field limits the hypothesis and logical assumptions required to be made by a good data scientist.
If you are searching for a data scientist inside your company. They are probably already attempting to push into the field. With all the online material, classes, and meet-ups, an individual would have already taken steps to get more involved. If they merely talk about it, but never act upon it, they will act similarly on a new project or idea.
There is some requirement for computational or technical abilities. Excel is a great tool, but there is a need to be able to use more powerful and customizable tools. This includes programming, data visualization and data storage tools. There is no need to be a software engineer. However, data scientists have a general idea of how to make sure code is maintainable, robust and scalable.
Looking to start a data science team?
If you are looking to start a team of your own. Feel free to comment, or email us! We can do everything from point you in the right direction of readings if you want to do it yourself, to come and join you on your journey! Also, feel free to follow our blog. We will keep it up to date as we do new projects, and new questions about data science! If you email us a question, we will try to post about it!
Unstructured Data, and How to Analyze it!
Content creation and promotion can play a huge role in a company's success on getting their product out there. Think about Star Wars and Marvel. Both of these franchises are just as much commercials for their merchandise, as they are just plain high quality content.
Companies post blogs, make movies, even run pinterest accounts. All of this produces customer responses and network reactions that can be analyzed, melded with current data sets and run through various predictive models to help a company better target users, produce promotional content, and alter products and services to be more in tune with the customer.
Developing a machine learning model can be done by finding value and relationships in all the different forms of data your content produces, segmenting your users and responders, and melding all your data together. In turn, your company can gain a lot more information, besides the standard balance sheet data(see picture above).
Change Words to Numbers
Machine learning has created a host of libraries that can simplify the way your team performs data analysis. In fact, python has several libraries that allow programmers with high level knowledge of data science and machine learning application design and implementation the opportunity to produce fast and meaningful analysis.
One great Python library that can take content data like blogs posts, news articles, and social media posts is TextBlob. TextBlob has some great functions like
“Scary Monsters love to eat tasty, sweet apples”
You can use the lines below to pull out the nouns and what was used to describe said nouns.
How to use TextBlob to Analyze Text Data
This takes data that is very unstructured and hard to analyze, and begins to create a more analysis friendly data sets. Other great uses of this library are projects such as chat bots
From here, you can combine polarity, positivity, shares, topic focus to see what type of social media posts, blog posts, etc, become the most viral.
Another library worth checking out are word2vec which exists in Python, R, Java, etc. For instance, check out deeplearning4j.
Marketing Segmentation with Data Science
Social media allows for once hard to get data such as, people's opinions on products, their likes, dislikes, gender, location, and job to be much more accessible. Sometimes you may have to purchase it, other times, some sites are kind enough to allow you to take it freely.
In either case, this allows companies an open door to segmenting markets with much finer detail. This isn’t based off of small surveys that only have 1000 people, we are talking about millions, and billions of people. Yes, there is a lot more data scrubbing required. But there is an opportunity to segment individuals, and use their networks to support your company's products.
One example is a tweet we once passed off to SQL Server. They quickly responded. Now, based off the fact that we interacted with SQL Server and talk so much about data science and data. You probably can assume we are into technology, databases, etc. This is basically what twitter, facebook, Google, etc do to place the right ads in front of you. They also combine cookies, and other data sources like geolocation.
If you worked for Oracle, perhaps you would want me to see some posts about the benefits of switching to Oracle, or ask for my opinion on why someone prefers(we personally have very little preference, as we have used both, and find both useful) using SQL Server over Oracle. Whatever it may be, there are opportunities to swing customers. Now what if your content was already placed in front of the right people. Maybe you tag a user, or ask them to help you out or join your campaign! Involve them, see how you can help them.
For instance, bloggers are always looking for ways to get their content out their. If your company involves them, or partners with them in a transparent way. Your product now has access to a specific network. Again, another great place where data science and basics statistics come into play.
If you haven’t tried tools like NodeXL, it is a great example of developing a model to find strong influencers in specific networks. This tool is pretty nifty. However, it is limited. So you might want to make some of your own.
Utilizing the data gathered from various sites, and algorithms like K nearest neighbor, PCA, etc. You can find the words used in profiles, posts and shares, the company's customers interact with, etc. Then:
The lists goes, on. It may be better to start with NodeXL, just to see what you are looking for.
Now what is the value of doing all this analysis, data melding, and analytics?
ROI Of Content:
At the end of the day, you have plenty of questions to answer.
These aren’t the easiest question to answer. However, here is where you can help turn the data from your social presence into value for your company:
Typical predictive analytics utilize standard business data(balance sheet, payroll, CRM, and operational data). This limits companies to the “what” happened, and not the why. Managers will ask why did the company see that spike in Q2? Or dip or Q3? It is difficult to paint a picture when you are only looking at the data that has very little insight into the why. Simply doing a running average isn’t always great and putting in seasonal factors is limited to domain knowledge.
However, data has grown, and now, having access to the “Why” is much more plausible. Everything from social media, to CRMs to online news provide much better insight into why your customers are coming or going!
This data has a lot of noise, and it wouldn’t really be worth it for humans to go through it.. This is where having an automated exploratory system developed will help out a lot.
Finding correlations between content, historical news, and company internal data would take analyst's years. By the time they found any value, the moment would have passed.
Instead, having a correlation discovery system that is automated will save your company time, and be much better at finding value. You can use this system to find those small correlating factors that play a big effect. Maybe your customers are telling you what is wrong with your product, and you just aren’t listening. Maybe, you find a new product idea.
In the Acheron Analytics process, this would be part of our second and third phase. We always look for as many possible correlations, and then develop hypotheses and prototypes that leads to company value.
This process allows companies to have data help define their next steps. This provides their managers with data defended plans. Ones that they can go confidently to their managers with.
When it comes to analyzing your company's content and marketing investments, utilizing techniques like machine learning, sentiment analysis, segmentation which can help develop data driven marketing strategies.
We hope this inspired some ideas how to meld your company’s data! Let us know if you have any questions.
Python is a great language for developers and scripters alike. It allows for some large scale design and OOP concepts. However, it was also developed to be very easy to read and design quick scripts! This is great, because data scientists don’t have all day to spend debugging. They do need to spend some time picking out which python languages will work best for their current projects. We at Acheron Analytics have written up a quick list of the 8 most used libraries that can help your next machine learning projects.
P.s....we had a busy week and couldn't get to an actual code example this week as we promised in our last post. However, we are working on that post! We will shortly have an example in R for a from scratch algorithm.
Theano, according to Opensource.com is one of the most heavily used machine learning libraries to date. The great things about Theano, is it is written leaning on mathematical concepts and computer algebra. When the code is compiled it has the ability to to match C level code.
This is due to the fact that it is written to take advantage of how computer compilers work. This in short is how a computer parses and converts tokens into parse trees, how it optimizes and merges similar sub-graphs, using GPU for computations and several other optimizations. For the full list, check out the Theano main page.
For those who used math based languages like Mathamatic and Matlab, the coding structure won’t seem to strange.
What is great, is that Nvidia fully supports Theano and has a few helpful videos on how to use Theano and their GPUs.
When it comes down to it. Machine learning and data science must have good data. How do you handle that data? Well, one great python library is Pandas. It was one of the first data languages many of us were exposed to at Acheron and still has a great following. If you are an R programmer, you will enjoy this language. It allows you to use data frames, which makes thinking about the data you are using much more natural.
Also, if you are a SQL or RDBMS person, this language naturally fits with your tabular view of data. Even if you are more of a Hadoop or MongoDB follower, Pandas just makes life easier.
It doesn’t stop there, it handles missing data, time series, IO and data transformations incredibly well. Thus, if you are trying to prepare your data for analysis, this python language is a must.
We also wanted to share this great python cheat sheet we found, however, we would feel wrong just stick it on our blog. Instead, here is a link to the best python cheat we have found yet! This even beats Datacamp's cheat sheets!
NumPy is another data managing library. Typically you see it paired with Tensorflow, SciPy, matplotlib and so many other python libraries geared towards deep learning and data science. This is because it is built to manage and treat data like matrices. Again, going back to Matlab and R. The purpose is to provide the ability to do complex matrix operations that are required by neural networks and complex statistics easily.
Trying to handle those kind of operations in multi-dimensional arrays or lists is not the most efficient.
Let's say you want to set up an identity matrix? That is one line of code in numpy. Everything about it is geared towards matrices and quick mathematical operations that are done in just a few lines. Coursea has a great course that you can use to further your knowledge about this library.
How to code for an Identity Matrix:
array([[ 1., 0., 0.],
[ 0., 1., 0.],
[ 0., 0., 1.]])
This is an odd one out. Scrapy is not a mathematical language, it doesn’t perform data analysis, or deep learning. It does nothing you would think you would want to do in machine learning. However, it does one thing really well. That is, crawl the web. Scrapy is built to be an easy language to develop safe web crawlers (side note, make sure you read all the documentation, it is built to be a safe web crawling library if you configure it right and that is something you have to research).
The web is a great source of unstructured, structured, and visual data. As long as a sight approves of you crawling and doesn’t mind you using their content(which we are not responsible for figuring out) you can gain a lot of insight into topics. You can use libraries that take words and put them into vectors to help perform analysis, or sentiment analysis, etc. It is much more difficult than using straightforward numbers. It is also much richer. There is alot to be gained fom pictures, words, and unstructured data. With that comes the task of getting that information how of the complex data.
That being said Pattern is another specialized web mining scraper. It has tools for Natural Language Processing(NLP), and Machine Learning. It has several built in algorithms and really makes your life as developer much easier!
We have discussed several libraries such as matplotlib, numPy and Pandas and how great they are for machine learning and data science. Now, imagine if you built and easy to use library on top of all of those, as well as several other easy to use libraries. Well, that is what scikit-learn is. It is a compilation of these libraries to create easy access to complex data science algorithms, data visualization techniques.It can be used for clustering, transforming data, dimensional reduction (reducing the number of features that exist), ensemble methods, feature selection and a lot of other classic data science techniques and they are all basically done in a few lines!
The hardest part is making sure you have a virtual python library when you pip install!
matplotlib and ggplot
Now you have done all this analysis, and run all your algorithms. What now? How do you actually turn around value from all this data you have. How do you inspire your executives and tell them “Stories” full of “Insight” etc. If you don’t want to mess around with D3.js, python has you covered! Using Libraries like matplotlib and ggplot. Both are really built to mimic matlab and R functionality. Matplotlib has some great 3D graphs that will help you visualize your knn and PCA algorithms and clusters.
When you are in your data exploration phase, hypothesis, and final product phase of a product. Using these three languages makes life much easier. You can visualize your data, its quirks and your final results!
We have discussed Tensorflow before on this blog when we talked about some common libraries used by data science professionals. It doesn't hurt to talk about it again though! The fact is, if you are in the world of machine learning, you have probably heard, tried, or implemented some form of deep learning algorithm. Are they necessary, not all the time. Are they cool when done right, yes.
Tensorflow and Theano are very similar. The interesting thing about Tensorflow, is that when you are writing in python, you are really only design a graph for the compiler to compile into C++ code and then run on either your CPU or GPU. This is what makes this language so effective and easy to work with. Instead of having to write at the C++ or CUDA level, you can code it all in python first.
The difficulty comes in actually understanding how to properly set up a neural network, convolutional network, etc. A lot of questions come into play, which type of model, what type of data regularization do you think is best, what level of data dropout or robustness do you want and are you going to purchase GPUs from Nvidia or try to make it work on CPUs?(Pending on your data size, you will most likely have to purchase, or pay for AI as a service tech from Google).
These are just a few of the most commonly mentioned python libraries that are utilized by academics and professionals. Do you agree? Feel free to share what languages, libraries and tools you use, even if they aren’t python!
During our last post, we discussed a key step in preparing your team for implementing a new data science solution(How to Engineer Your Data). The step following preparing your data is automation. Automation is key to AI and Machine learning. You don’t want to be filling in fields, copy and pasting from Excel, or babying ETLs. Each time data is processed, you want to have some form of automated process that gets kicked off at a regular interval that helps analyze, transform and check your data as it moves from point a to point b.
Before we can go off and discuss analysis, engineering and QA. We must first assess what tools your company uses. Now, the tools you choose to work with for automation are all up to what you are comfortable with.
If you are a linux lover, you will probably pick Crontab and Watch. Windows users will lean towards task scheduler, the end result is the same. You could choose other tools
Once you know what tool will be running your automation, you need to pick some form of scripting language. This could be python, bash, even powershell. Just because it is a scripting language, we still would recommend creating some form of file structure that acts as an organizer. For instance:
This makes it easier on developers past, present and future to follow code when they have to maintain it. Of course, you might have a different file structure, which is great! Just be consistent.
The Set up:
To describe a very basic set up. We would recommend starting out with some form of file landing zone. Whether this is an FTP or a shared drive. Some location where the scripts have access to needs to be set up.
From there, it would be best to have some RDBMS (Mysql, MSSQL, Oracle, etc) that acts as a file tracking system. This will track when new files get placed into your file storage area, what type of file they are, when it was read, etc. Consider this some form of meta table. At the beginning, it can be very basic.
Just have the layout below:
The key for automation is the final column. Having a flag column that distinguishes whether a file has been read or not. There are also other tables you might want around this. For instance, an error table, a dimension table that could contain customers attached to files info, etc.
How does that info get there? An automation script of course! Have some script whose job is to place new file metadata into the system.
Following this, you will have a few other scripts for analysis, data movement and QA that are all separate. This way, if one side fails, you don’t lose all functionality. If you can’t load, you just can’t load and if you can’t process data, you just can’t process it.
When starting any form of data science or machine learning project. The engineers may have limited knowledge of the data they are working with. They might not know what biases exist, missing data, or other quirks of the data. This all needs to be sorted out quickly. If your data science team is manually creating scripts to do this work for each individual data set. They are losing valuable time. Once data sets are assigned, they should be processed by an automated set of scripts that can either be called using a command line prompt, or even better, automatically.
These basic scripts often contain histograms, correlation matrixes, clustering algorithms, and some straight forward algorithms that require 'N' amount of variables and have a specified list of outputs. This could be logistic regression, knn, and Principle Component Analysis(PCA) for starters. In addition, following each model a summary function of some kind can be run. If using R, this is simply summary().
A function example that we have used as part of previous exploration automation:
Basic Correlation Matrix
Data Engineering Phase
Once you have finished exploring your data, it is important to plan how that data will then be stored and what form of analytics can be done on the front end. Can you analyze sentiment, topic focus and value ratios? Do you need to restructure and normalize the data(not the same as statistical normalization).
Guess what! All of this can be automated. Following the explore phase, you can start to design how the system will ingest the data. This will require some manual processing up front to ensure the solution can scale. However, even this should be built in a way that allows for an easy transition to an automated system. Thus, it should be robust, and systemized from the start! That is one of our key driving factors whenever we design a system at Acheron Analytics. It might start being run from command line, but it should easily integrate to being run by task scheduler or cron. This means thinking about the entire process, the variables that will be shared between databases and scripts, the try/catch mechanisms, and possible hiccups along the way.
The system needs to be able to handle failure well. It will allow your team more time to focus on the actual advantages data science, machine learning and even standard analytics provide. Tie this together with a solid logging system, and your team won't have to spend hours or days trouble shooting simple big data errors.
This is one of the most crucial phases for data management and automation. Qing data is a rare skill. Most QAs specialize in software engineering and less in how to test data accuracy. We have had experience watching companies as they try to find a QA with the right skills that match their data processes, or data engineers who are also very good at QAing their own work. It isn’t easy.
Having a test suite built with multiple test cases that run on every new set of data introduced is vital! And if you happen to make it dymaic when new approved data sets are inserted for upper and lower bounds tests...who are we to disagree!
Ensuring all the data that goes into your system automatically can save anywhere several FTE positions. Depending on how large and complex your data is. A good QA system can manage several data applications with a single person.
The question is, what are you checking? If you don’t have a full fledged Data QA on board, this might not be straightforward. So we have a few bullet points to help you get your team thinking about how to set up their data test suites.
What you and your team need to think about when you create test Suites:
Overall, automation helps save your data science and machine learning projects from getting bogged down with basic ETL, and data checking work. This way, your data science teams can make some major insights efficiently, without being limited because of maintenance and reconfiguring tasks. We have seen many teams, both in analytics and data science lose time because of poorly designed processes from the get go. Once a system is plugged into the organization, it is much harder to modify. So make sure to plan automation early!
We are a team of data scientists and network engineers who want to help your functional teams reach their full potential!