In our last post, we discussed a key step in preparing your team to implement a new data science solution (How to Engineer Your Data). The step that follows preparing your data is automation. Automation is key to AI and machine learning. You don’t want to be filling in fields by hand, copying and pasting from Excel, or babysitting ETLs. Each time data is processed, some automated process should kick off at a regular interval to analyze, transform, and check your data as it moves from point A to point B.
Before we can discuss analysis, engineering, and QA, we must first assess what tools your company uses. The tools you choose for automation come down to what you are comfortable with.
If you are a Linux lover, you will probably pick crontab and watch. Windows users will lean toward Task Scheduler. Either way, the end result is the same, and you could choose other tools as well.
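For example, a cron entry like the following would kick off a processing script every hour (the script and log paths here are purely illustrative):

```
# m h dom mon dow  command
# Run the file-processing script at the top of every hour;
# append stdout/stderr to a log so failures can be reviewed later.
0 * * * * /usr/bin/python3 /opt/etl/process_files.py >> /var/log/etl/process_files.log 2>&1
```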
Once you know what tool will be running your automation, you need to pick some form of scripting language. This could be Python, Bash, or even PowerShell. Even though it is just a scripting language, we would still recommend creating some form of file structure that acts as an organizer. For instance:
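One possible layout looks something like this (purely illustrative; the folder names are our own example, not a prescription):

```
automation/
├── scripts/   # entry-point scripts run by cron or Task Scheduler
├── lib/       # shared helper modules used by multiple scripts
├── config/    # connection strings, schedules, environment settings
├── sql/       # table definitions and reusable queries
└── logs/      # output logs for troubleshooting failed runs
```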
This makes it easier for developers, past, present, and future, to follow the code when they have to maintain it. Of course, you might have a different file structure, which is great! Just be consistent.
The Setup:
To describe a very basic setup: we recommend starting with some form of file landing zone, whether that is an FTP server or a shared drive. You need to set up some location that the scripts have access to.
From there, it is best to have some RDBMS (MySQL, MS SQL Server, Oracle, etc.) that acts as a file tracking system. It will track when new files get placed into your file storage area, what type of file they are, when each one was read, and so on. Consider this a kind of meta table. At the beginning, it can be very basic.
Just have the layout below:
The key column for automation is the final one: a flag that distinguishes whether a file has been read or not. There are also other tables you might want around this, for instance an error table, or a dimension table containing the customers attached to each file.
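As a sketch of what such a meta table might look like (the column names are our own illustration, and SQLite stands in here for whatever RDBMS you actually use):

```python
import sqlite3

# In-memory database for illustration; point this at your real RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE file_tracking (
        file_id     INTEGER PRIMARY KEY,
        file_name   TEXT NOT NULL,
        file_type   TEXT,
        received_at TEXT,              -- when the file landed in storage
        read_at     TEXT,              -- when a script last read it
        is_read     INTEGER DEFAULT 0  -- the automation flag: 0 = unread, 1 = read
    )
    """
)
conn.execute(
    "INSERT INTO file_tracking (file_name, file_type, received_at) "
    "VALUES ('sales_june.csv', 'csv', '2017-06-01 04:00:00')"
)

# The automation loop simply asks: which files have not been read yet?
unread = conn.execute(
    "SELECT file_name FROM file_tracking WHERE is_read = 0"
).fetchall()
print(unread)  # [('sales_june.csv',)]
```

Once a script finishes processing a file, it flips `is_read` to 1 and stamps `read_at`, so the next scheduled run skips it.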
How does that info get there? An automation script, of course! Have a script whose sole job is to place new file metadata into the system.
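A minimal sketch of such a script, assuming a local folder as the landing zone and a simple tracking table (all names here are illustrative):

```python
import os
import sqlite3
import tempfile
import time

def register_new_files(landing_zone, conn):
    """Insert metadata for any file in the landing zone not yet tracked."""
    known = {row[0] for row in conn.execute("SELECT file_name FROM file_tracking")}
    added = []
    for entry in os.scandir(landing_zone):
        if entry.is_file() and entry.name not in known:
            ext = os.path.splitext(entry.name)[1].lstrip(".")
            conn.execute(
                "INSERT INTO file_tracking (file_name, file_type, received_at, is_read) "
                "VALUES (?, ?, ?, 0)",
                (entry.name, ext, time.strftime("%Y-%m-%d %H:%M:%S")),
            )
            added.append(entry.name)
    conn.commit()
    return added

# Demo: a temporary folder stands in for the FTP server / shared drive.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file_tracking ("
    "file_name TEXT, file_type TEXT, received_at TEXT, is_read INTEGER)"
)
landing = tempfile.mkdtemp()
open(os.path.join(landing, "orders.csv"), "w").close()

new_files = register_new_files(landing, conn)
print(new_files)  # ['orders.csv']
```

Schedule this with cron or Task Scheduler and new files get registered without anyone touching a keyboard; running it twice is harmless because already-tracked files are skipped.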
Following this, you will have a few other scripts for analysis, data movement, and QA that are all separate. This way, if one side fails, you don’t lose all functionality. If you can’t load, you just can’t load; if you can’t process data, you just can’t process it.
When starting any data science or machine learning project, the engineers may have limited knowledge of the data they are working with. They might not know what biases exist, what data is missing, or the data's other quirks. This all needs to be sorted out quickly. If your data science team is manually creating scripts to do this work for each individual data set, they are losing valuable time. Once data sets are assigned, they should be processed by an automated set of scripts that can either be called from the command line or, even better, run automatically.
These basic scripts often produce histograms, correlation matrices, clustering results, and some straightforward algorithms that take 'N' variables and produce a specified list of outputs. This could be logistic regression, k-nearest neighbors (k-NN), and Principal Component Analysis (PCA) for starters. In addition, a summary function of some kind can be run after each model. If using R, this is simply summary().
A function example that we have used as part of previous exploration automation:
Basic Correlation Matrix
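A minimal Python/pandas sketch of such a helper (the data set here is made up purely for illustration):

```python
import pandas as pd

def basic_correlation_matrix(df, method="pearson"):
    """Return the pairwise correlation matrix of the numeric columns only."""
    return df.select_dtypes("number").corr(method=method)

# Illustrative data: two perfectly related columns and one label column.
df = pd.DataFrame({
    "units_sold": [1, 2, 3, 4, 5],
    "revenue":    [10, 20, 30, 40, 50],
    "region":     ["N", "S", "E", "W", "N"],  # non-numeric, ignored by corr
})

corr = basic_correlation_matrix(df)
print(corr)
```

Dropping a call like this into the automated exploration suite means every new data set arrives with its correlation structure already computed.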
Data Engineering Phase
Once you have finished exploring your data, it is important to plan how that data will be stored and what kind of analytics can be done on the front end. Can you analyze sentiment, topic focus, and value ratios? Do you need to restructure and normalize the data (not the same as statistical normalization)?
Guess what! All of this can be automated. Following the explore phase, you can start to design how the system will ingest the data. This will require some manual processing up front to ensure the solution can scale. However, even this should be built in a way that allows for an easy transition to an automated system. Thus, it should be robust and systemized from the start! That is one of our key driving factors whenever we design a system at Acheron Analytics. It might start out being run from the command line, but it should easily transition to being run by Task Scheduler or cron. This means thinking through the entire process: the variables that will be shared between databases and scripts, the try/catch mechanisms, and the possible hiccups along the way.
The system needs to handle failure well. That frees your team to focus on the actual advantages data science, machine learning, and even standard analytics provide. Tie this together with a solid logging system, and your team won't have to spend hours or days troubleshooting simple big data errors.
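A sketch of the kind of skeleton we mean: the same entry point works from the command line today and from cron or Task Scheduler later, with try/except and logging wrapped around the risky step (the processing step itself is just a placeholder):

```python
import logging
import sys

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("etl")

def process_file(path):
    # Placeholder for the real transform/load step.
    if not path.endswith(".csv"):
        raise ValueError("unsupported file type: %s" % path)
    return "loaded " + path

def run(paths):
    """Process each file independently; one failure never stops the batch."""
    results = {}
    for path in paths:
        try:
            results[path] = process_file(path)
            log.info("processed %s", path)
        except Exception as exc:
            results[path] = None
            log.error("failed on %s: %s", path, exc)  # logged, not fatal
    return results

if __name__ == "__main__":
    # Works identically whether a human or the scheduler invokes it.
    run(sys.argv[1:])
```

Because failures are logged and isolated per file, a bad input shows up in the log with a timestamp instead of silently killing the whole run.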
Data QA Phase
This is one of the most crucial phases for data management and automation. QAing data is a rare skill. Most QAs specialize in software engineering and less in how to test data accuracy. We have watched companies try to find a QA whose skills match their data processes, or data engineers who are also very good at QAing their own work. It isn’t easy.
Having a test suite with multiple test cases that run on every new set of data introduced is vital! And if you happen to make it dynamic, so that newly approved data sets update the upper and lower bounds tests...who are we to disagree!
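Those dynamic bounds tests can be as simple as deriving limits from an approved reference set and checking each new batch against them. A sketch (the tolerance is an assumption you would tune for your data):

```python
def learn_bounds(approved_values, tolerance=0.1):
    """Derive upper/lower bounds from an approved data set, widened by a tolerance."""
    lo, hi = min(approved_values), max(approved_values)
    pad = (hi - lo) * tolerance
    return lo - pad, hi + pad

def out_of_bounds(new_values, bounds):
    """Return the values in the new batch that fall outside the learned bounds."""
    lo, hi = bounds
    return [v for v in new_values if not (lo <= v <= hi)]

# An approved historical batch defines what "normal" looks like.
bounds = learn_bounds([10, 12, 15, 11, 14])
print(bounds)                                # (9.5, 15.5)
print(out_of_bounds([11, 13, 42], bounds))   # [42]
```

Each time a new data set is approved, rerun `learn_bounds` over it and the limits update themselves, with no hand-edited thresholds.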
Automatically checking all the data that goes into your system can save several FTEs' worth of work, depending on how large and complex your data is. A good QA system can let a single person manage several data applications.
The question is, what are you checking? If you don’t have a full-fledged data QA specialist on board, this might not be straightforward, so we have a few bullet points to help get your team thinking about how to set up their data test suites.
What you and your team need to think about when you create test suites:
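To make the brainstorming concrete, here is a sketch of a few checks that commonly appear in such suites; the thresholds and field names are our own illustrations, not a standard:

```python
def check_required_columns(rows, required):
    """Every row must contain every required field."""
    return all(required <= set(row) for row in rows)

def check_null_rate(rows, column, max_rate=0.05):
    """The share of missing values in a column must stay under a threshold."""
    missing = sum(1 for row in rows if row.get(column) in (None, ""))
    return (missing / len(rows)) <= max_rate

def check_row_count(rows, expected_min):
    """A suspiciously small batch often means an upstream extract failed."""
    return len(rows) >= expected_min

batch = [
    {"id": 1, "amount": 25.0},
    {"id": 2, "amount": 30.0},
    {"id": 3, "amount": None},
]
print(check_required_columns(batch, {"id", "amount"}))  # True
print(check_null_rate(batch, "amount"))                 # False: 1/3 missing
print(check_row_count(batch, 2))                        # True
```

Run checks like these on every incoming batch, route the failures to your error table, and nothing reaches the warehouse unvetted.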
Overall, automation saves your data science and machine learning projects from getting bogged down with basic ETL and data-checking work. This way, your data science teams can deliver major insights efficiently, without being limited by maintenance and reconfiguration tasks. We have seen many teams, in both analytics and data science, lose time because of processes that were poorly designed from the get-go. Once a system is plugged into the organization, it is much harder to modify. So make sure to plan automation early!
We are a team of data scientists and network engineers who want to help your functional teams reach their full potential!