Python is a great language for developers and scripters alike. It supports large-scale design and OOP concepts, yet it was also designed to be easy to read and quick to script in. This is great, because data scientists don’t have all day to spend debugging. They do, however, need to spend some time picking out which Python libraries will work best for their current projects. We at Acheron Analytics have written up a quick list of the 8 most used libraries that can help your next machine learning project.
P.S. We had a busy week and couldn't get to an actual code example this week, as we promised in our last post. However, we are working on that post! We will shortly have an example in R of a from-scratch algorithm.
Theano, according to Opensource.com, is one of the most heavily used machine learning libraries to date. The great thing about Theano is that it is written leaning on mathematical concepts and computer algebra. When the code is compiled, it has the ability to match C-level speed.
This is because it is written to take advantage of how compilers work: how a computer parses and converts tokens into parse trees, how it optimizes and merges similar sub-graphs, how it uses the GPU for computations, and several other optimizations. For the full list, check out the Theano main page.
For those who have used math-based languages like Mathematica and Matlab, the coding structure won’t seem too strange.
What is great is that Nvidia fully supports Theano and has a few helpful videos on how to use Theano with their GPUs.
When it comes down to it, machine learning and data science must have good data. How do you handle that data? Well, one great Python library is Pandas. It was one of the first data libraries many of us were exposed to at Acheron, and it still has a great following. If you are an R programmer, you will enjoy this library. It lets you use data frames, which makes thinking about the data you are using much more natural.
Also, if you are a SQL or RDBMS person, this library naturally fits with your tabular view of data. Even if you are more of a Hadoop or MongoDB follower, Pandas just makes life easier.
It doesn’t stop there. It handles missing data, time series, IO and data transformations incredibly well. Thus, if you are trying to prepare your data for analysis, this Python library is a must.
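To make the missing-data and time-series claims a little more concrete, here is a minimal sketch. The column name and the values are made up for illustration, and linear interpolation is just one of several ways Pandas can fill a gap.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with one gap; the values are illustrative only.
dates = pd.date_range("2017-01-01", periods=6, freq="D")
readings = pd.DataFrame(
    {"sales": [10.0, 12.0, np.nan, 15.0, 11.0, 14.0]},
    index=dates,
)

# Fill the missing value by linear interpolation between its neighbors.
filled = readings.interpolate(method="linear")

# Resample the daily series into 3-day buckets and total each bucket.
three_day_totals = filled["sales"].resample("3D").sum()
```

Whether interpolation, a group average, or dropping the row is the right choice depends on your data, so treat this as a starting point rather than a recipe.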
We also wanted to share this great Python cheat sheet we found. However, we would feel wrong just sticking it on our blog. Instead, here is a link to the best Python cheat sheet we have found yet! This even beats Datacamp's cheat sheets!
NumPy is another data-managing library. Typically you see it paired with Tensorflow, SciPy, matplotlib and many other Python libraries geared towards deep learning and data science. This is because it is built to manage and treat data like matrices, again going back to Matlab and R. Its purpose is to make the complex matrix operations required by neural networks and complex statistics easy.
Trying to handle those kinds of operations with plain multi-dimensional Python lists is not the most efficient.
Let's say you want to set up an identity matrix. That is one line of code in NumPy. Everything about it is geared towards matrices and quick mathematical operations that are done in just a few lines. Coursera has a great course that you can use to further your knowledge of this library.
How to code for an Identity Matrix:
array([[ 1., 0., 0.],
[ 0., 1., 0.],
[ 0., 0., 1.]])
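The call that produces an array like the one above is not shown here. One way to get it, assuming NumPy is imported as np, is np.identity (np.eye gives the same result for a square matrix). The matrix m is just a made-up example to show what the identity does.

```python
import numpy as np

# Build a 3x3 identity matrix: ones on the diagonal, zeros elsewhere.
identity = np.identity(3)

# Multiplying any 3x3 matrix by the identity leaves it unchanged.
m = np.arange(9.0).reshape(3, 3)
unchanged = m @ identity
```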
This is the odd one out. Scrapy is not a mathematical library; it doesn’t perform data analysis or deep learning. It does nothing you would think you would want to do in machine learning. However, it does one thing really well: crawl the web. Scrapy is built to make it easy to develop safe web crawlers (side note: make sure you read all the documentation. It is a safe web crawling library only if you configure it right, and that is something you have to research).
The web is a great source of unstructured, structured, and visual data. As long as a site approves of you crawling and doesn’t mind you using their content (which we are not responsible for figuring out), you can gain a lot of insight into topics. You can use libraries that turn words into vectors to help perform sentiment analysis and other kinds of analysis. It is much more difficult than using straightforward numbers. It is also much richer. There is a lot to be gained from pictures, words, and unstructured data. With that comes the task of getting that information out of the complex data.
That being said, Pattern is another specialized web mining library. It has tools for Natural Language Processing (NLP) and machine learning. It has several built-in algorithms and really makes your life as a developer much easier!
We have discussed several libraries such as matplotlib, NumPy and Pandas and how great they are for machine learning and data science. Now, imagine if you built an easy-to-use library on top of those, as well as several other easy-to-use libraries. Well, that is what scikit-learn is. It builds on these libraries to give easy access to complex data science algorithms and data visualization techniques. It can be used for clustering, transforming data, dimensionality reduction (reducing the number of features that exist), ensemble methods, feature selection and a lot of other classic data science techniques, and they are all basically done in a few lines!
The hardest part is making sure you have a virtual environment set up when you pip install!
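To give a feel for the "few lines" claim, here is a rough sketch that runs dimensionality reduction and clustering back to back. The data is synthetic (two made-up blobs of points), and the parameter choices are ours for illustration, not a recommendation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic data for illustration: two well-separated blobs in 5 dimensions.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 5))
blob_b = rng.normal(loc=5.0, scale=0.5, size=(50, 5))
X = np.vstack([blob_a, blob_b])

# Dimensionality reduction: project the 5 features down to 2 components.
X_2d = PCA(n_components=2).fit_transform(X)

# Clustering: group the projected points into 2 clusters.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_2d)
```

Real data is rarely this clean, so expect to spend more time choosing the number of components and clusters than writing the code.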
matplotlib and ggplot
Now you have done all this analysis and run all your algorithms. What now? How do you actually turn around value from all this data you have? How do you inspire your executives and tell them “Stories” full of “Insight”? If you don’t want to mess around with D3.js, Python has you covered with libraries like matplotlib and ggplot. Both are really built to mimic Matlab and R functionality. Matplotlib has some great 3D graphs that will help you visualize your knn and PCA algorithms and clusters.
Whether you are in the data exploration, hypothesis, or final product phase of a project, using these libraries makes life much easier. You can visualize your data, its quirks and your final results!
We have discussed Tensorflow before on this blog when we talked about some common libraries used by data science professionals. It doesn't hurt to talk about it again though! The fact is, if you are in the world of machine learning, you have probably heard of, tried, or implemented some form of deep learning algorithm. Are they necessary? Not all the time. Are they cool when done right? Yes.
Tensorflow and Theano are very similar. The interesting thing about Tensorflow is that when you are writing in Python, you are really only designing a graph, which is then run by compiled C++ code on either your CPU or GPU. This is what makes this library so effective and easy to work with. Instead of having to write at the C++ or CUDA level, you can code it all in Python first.
The difficulty comes in actually understanding how to properly set up a neural network, convolutional network, etc. A lot of questions come into play: which type of model, what type of regularization you think is best, what level of dropout or robustness you want, and whether you are going to purchase GPUs from Nvidia or try to make it work on CPUs. (Depending on your data size, you will most likely have to purchase GPUs, or pay for AI-as-a-service tech from Google.)
These are just a few of the most commonly mentioned python libraries that are utilized by academics and professionals. Do you agree? Feel free to share what languages, libraries and tools you use, even if they aren’t python!
In our last post, we discussed a key step in preparing your team for implementing a new data science solution (How to Engineer Your Data). The step following preparing your data is automation. Automation is key to AI and machine learning. You don’t want to be filling in fields, copying and pasting from Excel, or babying ETLs. Each time data is processed, you want some form of automated process that gets kicked off at a regular interval and helps analyze, transform and check your data as it moves from point A to point B.
Before we can go off and discuss analysis, engineering and QA, we must first assess what tools your company uses. The tools you choose to work with for automation are up to what you are comfortable with.
If you are a Linux lover, you will probably pick Crontab and Watch. Windows users will lean towards Task Scheduler. The end result is the same. You could also choose other tools.
Once you know what tool will be running your automation, you need to pick some form of scripting language. This could be Python, bash, or even PowerShell. Even though it is a scripting language, we would still recommend creating some form of file structure that acts as an organizer. For instance:
This makes it easier on developers past, present and future to follow code when they have to maintain it. Of course, you might have a different file structure, which is great! Just be consistent.
The Setup:
To describe a very basic setup, we would recommend starting out with some form of file landing zone, whether this is an FTP server or a shared drive. Some location that the scripts have access to needs to be set up.
From there, it would be best to have some RDBMS (MySQL, MSSQL, Oracle, etc.) that acts as a file tracking system. This will track when new files get placed into your file storage area, what type of file they are, when they were read, etc. Consider this some form of meta table. At the beginning, it can be very basic.
Just have the layout below:
The key for automation is the final column: a flag column that distinguishes whether a file has been read or not. There are also other tables you might want around this, for instance an error table, or a dimension table that could contain customer information attached to files.
How does that info get there? An automation script of course! Have some script whose job is to place new file metadata into the system.
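As a rough sketch of such a script, the example below uses an in-memory SQLite database as a stand-in for your RDBMS. The table name, column names and file names are all hypothetical; the only thing taken from the description above is that the read flag sits in the final column.

```python
import sqlite3
from datetime import datetime, timezone

# An in-memory SQLite database stands in for the RDBMS here.
tracking_db = sqlite3.connect(":memory:")

# Hypothetical layout; the flag that marks a file as read is the final column.
tracking_db.execute(
    """
    CREATE TABLE file_tracking (
        file_id     INTEGER PRIMARY KEY,
        file_name   TEXT NOT NULL UNIQUE,
        file_type   TEXT NOT NULL,
        received_at TEXT NOT NULL,
        is_read     INTEGER NOT NULL DEFAULT 0
    )
    """
)


def register_file(file_name):
    """Record metadata for a newly landed file; skip names already tracked."""
    file_type = file_name.rsplit(".", 1)[-1].lower() if "." in file_name else "unknown"
    received_at = datetime.now(timezone.utc).isoformat()
    tracking_db.execute(
        "INSERT OR IGNORE INTO file_tracking (file_name, file_type, received_at) "
        "VALUES (?, ?, ?)",
        (file_name, file_type, received_at),
    )
    tracking_db.commit()


def unread_files():
    """Return the names of files that no script has processed yet."""
    rows = tracking_db.execute(
        "SELECT file_name FROM file_tracking WHERE is_read = 0 ORDER BY file_id"
    )
    return [name for (name,) in rows]


def mark_read(file_name):
    """Flip the flag once a downstream script has consumed the file."""
    tracking_db.execute(
        "UPDATE file_tracking SET is_read = 1 WHERE file_name = ?", (file_name,)
    )
    tracking_db.commit()


# Example run with made-up file names; the repeated name is ignored.
for name in ["invoices_01.csv", "events_01.json", "invoices_01.csv"]:
    register_file(name)
mark_read("invoices_01.csv")
```

In a real setup, the list of names would come from scanning the landing zone on a schedule, and the downstream scripts would ask unread_files() what is waiting for them.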
Following this, you will have a few other scripts for analysis, data movement and QA that are all separate. This way, if one side fails, you don’t lose all functionality. If you can’t load, you just can’t load and if you can’t process data, you just can’t process it.
When starting any form of data science or machine learning project, the engineers may have limited knowledge of the data they are working with. They might not know what biases, missing data, or other quirks exist in the data. This all needs to be sorted out quickly. If your data science team is manually creating scripts to do this work for each individual data set, they are losing valuable time. Once data sets are assigned, they should be processed by an automated set of scripts that can either be called from the command line or, even better, run automatically.
These basic scripts often produce histograms, correlation matrices, clustering results, and the output of some straightforward algorithms that take 'N' variables and have a specified list of outputs. This could be logistic regression, knn, and Principal Component Analysis (PCA) for starters. In addition, following each model, a summary function of some kind can be run. If using R, this is simply summary().
A function example that we have used as part of previous exploration automation:
Basic Correlation Matrix
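The original function is not reproduced here. As a stand-in, a basic correlation matrix helper might look like the following sketch in Python with Pandas. The function name and the sample columns are our own illustration.

```python
import numpy as np
import pandas as pd


def correlation_matrix(frame):
    """Return pairwise Pearson correlations for the numeric columns only."""
    numeric = frame.select_dtypes(include="number")
    return numeric.corr()


# Made-up data: "spend" is an exact linear function of "income".
data = pd.DataFrame(
    {
        "income": [30.0, 45.0, 60.0, 75.0, 90.0],
        "spend": [3.0, 4.5, 6.0, 7.5, 9.0],
        "age": [25, 52, 31, 47, 38],
        "segment": ["a", "b", "a", "c", "b"],
    }
)
corr = correlation_matrix(data)
```

Dropping the non-numeric columns up front means the same helper can be pointed at any new data set without hand-editing, which is the point of automating the explore phase.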
Data Engineering Phase
Once you have finished exploring your data, it is important to plan how that data will then be stored and what form of analytics can be done on the front end. Can you analyze sentiment, topic focus and value ratios? Do you need to restructure and normalize the data (not the same as statistical normalization)?
Guess what! All of this can be automated. Following the explore phase, you can start to design how the system will ingest the data. This will require some manual processing up front to ensure the solution can scale. However, even this should be built in a way that allows for an easy transition to an automated system. Thus, it should be robust, and systemized from the start! That is one of our key driving factors whenever we design a system at Acheron Analytics. It might start being run from command line, but it should easily integrate to being run by task scheduler or cron. This means thinking about the entire process, the variables that will be shared between databases and scripts, the try/catch mechanisms, and possible hiccups along the way.
The system needs to be able to handle failure well. This will allow your team more time to focus on the actual advantages data science, machine learning and even standard analytics provide. Tie this together with a solid logging system, and your team won't have to spend hours or days troubleshooting simple big data errors.
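One way to picture failure handling plus logging is the sketch below. The parse_amount step is a hypothetical placeholder for whatever your pipeline does to each record. The idea is simply that a bad record gets logged and set aside instead of stopping the whole run.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("pipeline")


def parse_amount(raw):
    """Hypothetical per-record step: convert a raw string to a float."""
    return float(raw)


def process_batch(raw_values):
    """Process every record, logging failures instead of stopping the run."""
    loaded, failed = [], []
    for position, raw in enumerate(raw_values):
        try:
            loaded.append(parse_amount(raw))
        except (TypeError, ValueError) as exc:
            # Keep going: one bad record should not take down the whole batch.
            logger.error("record %d rejected (%r): %s", position, raw, exc)
            failed.append((position, raw))
    logger.info("batch done: %d loaded, %d failed", len(loaded), len(failed))
    return loaded, failed


loaded, failed = process_batch(["10.5", "abc", None, "7"])
```

The rejected records could then be written to an error table so someone can review them later, rather than digging through logs.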
This is one of the most crucial phases for data management and automation. QAing data is a rare skill. Most QAs specialize in software engineering and less in how to test data accuracy. We have watched companies try to find a QA with the right skills to match their data processes, or data engineers who are also very good at QAing their own work. It isn’t easy.
Having a test suite built with multiple test cases that run on every new set of data introduced is vital! And if you happen to make it dynamic, so that newly approved data sets update the upper and lower bounds tests... who are we to disagree!
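As a small illustration of what such test cases could look like, the sketch below derives upper and lower bounds from an approved data set and applies them to new data. The function names, the "age" column and the sample rows are all made up for the example.

```python
def check_bounds(rows, column, lower, upper):
    """Return the rows whose non-missing value falls outside [lower, upper]."""
    return [
        row for row in rows
        if row.get(column) is not None and not (lower <= row[column] <= upper)
    ]


def check_not_null(rows, column):
    """Return the rows where the column is missing."""
    return [row for row in rows if row.get(column) is None]


def bounds_from_approved(approved_rows, column):
    """Derive lower and upper bounds from a previously approved data set."""
    values = [row[column] for row in approved_rows]
    return min(values), max(values)


# Made-up approved and incoming data sets.
approved = [{"age": 21}, {"age": 35}, {"age": 64}]
incoming = [{"age": 30}, {"age": 140}, {"age": None}, {"age": 18}]

lower, upper = bounds_from_approved(approved, "age")
out_of_bounds = check_bounds(incoming, "age", lower, upper)
missing = check_not_null(incoming, "age")
```

Using the raw min and max of approved data is a deliberately simple choice. In practice you might widen the bounds or use percentiles so that legitimate new values are not flagged.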
Automatically checking all the data that goes into your system can save several FTE positions, depending on how large and complex your data is. With a good QA system, a single person can manage several data applications.
The question is, what are you checking? If you don’t have a full fledged Data QA on board, this might not be straightforward. So we have a few bullet points to help you get your team thinking about how to set up their data test suites.
What you and your team need to think about when you create test suites:
Overall, automation helps save your data science and machine learning projects from getting bogged down with basic ETL, and data checking work. This way, your data science teams can make some major insights efficiently, without being limited because of maintenance and reconfiguring tasks. We have seen many teams, both in analytics and data science lose time because of poorly designed processes from the get go. Once a system is plugged into the organization, it is much harder to modify. So make sure to plan automation early!
In the era of data science and AI, it is easy to skip over some crucial steps such as data cleansing. However, this can cause major problems in your applications further down the data pipeline. The promise of seemingly magical data science solutions can overshadow the necessary steps required to get to the best final product. One such step is cleaning and engineering your data before it even gets placed into your system. Truthfully, this is not limited to data science. Whether you are doing data analytics, data science, machine learning, or just old-fashioned statistics, data is never whole and pure before refining. Just like putting unprocessed petroleum into your car, putting unprocessed data into your company's systems will either immediately or eventually wreak havoc (here are some examples). Whether that means actually causing software to fail or giving executives bad information, both are unacceptable.
We at Acheron Analytics wanted to share a few tips to ensure that whatever data science/analytics projects you are taking on, you and your team are successful. This post will go over some brief examples in R, Python and SQL. Feel free to reach out with any questions.
Duplicate data is the scourge of any analyst, whether you are using Excel, MySQL, or Hadoop. Making sure your systems don’t produce duplicate data is key.
There are several sources of duplicate data. The first is when the data is input into your company's data storage system. There is a chance that the same data may try to sneak its way in. This could be due to end-user error, a glitch in the system, a bad ETL, etc. All of this should be managed by your data system. Most people still use an RDBMS, and thus a unique key will prevent duplicates from being inserted. Sometimes this may require a combination of fields to check whether the data being input is a duplicate. For instance, if you are looking at a vendor invoice line item, you probably shouldn’t have the same line item number and header id twice. This can become more complicated when line items change (but even that can be accounted for). If you are analyzing social media post data, each snapshot you take may have the same post id but have altered social interaction data (likes, retweets, shares, etc.). This relates to slowly changing dimensions, which is another great topic for another time. Feel free to read up more on the topic here.
In both cases, your systems should be calibrated to safely throw out the duplicate data and store the errors in some error table. All of this will save your team time and confusion later.
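Here is a minimal sketch of that idea, using an in-memory SQLite database in place of your RDBMS. The table and column names are hypothetical. The composite unique key on header id and line number follows the invoice example above, and rejected rows land in an error table instead of being silently dropped.

```python
import sqlite3

invoice_db = sqlite3.connect(":memory:")
invoice_db.executescript(
    """
    CREATE TABLE invoice_line (
        header_id   INTEGER NOT NULL,
        line_number INTEGER NOT NULL,
        amount      REAL NOT NULL,
        UNIQUE (header_id, line_number)
    );
    CREATE TABLE load_error (
        header_id   INTEGER,
        line_number INTEGER,
        reason      TEXT
    );
    """
)


def load_line(header_id, line_number, amount):
    """Insert one line item; route duplicates to the error table."""
    try:
        with invoice_db:
            invoice_db.execute(
                "INSERT INTO invoice_line VALUES (?, ?, ?)",
                (header_id, line_number, amount),
            )
    except sqlite3.IntegrityError as exc:
        # The unique key rejected the row, so record why for later review.
        with invoice_db:
            invoice_db.execute(
                "INSERT INTO load_error VALUES (?, ?, ?)",
                (header_id, line_number, str(exc)),
            )


# Made-up line items; the third repeats the first.
for item in [(100, 1, 25.0), (100, 2, 40.0), (100, 1, 25.0)]:
    load_line(*item)

line_count = invoice_db.execute("SELECT COUNT(*) FROM invoice_line").fetchone()[0]
error_count = invoice_db.execute("SELECT COUNT(*) FROM load_error").fetchone()[0]
```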
Besides the source data itself having duplicates, the other common kind of duplicate comes from an analyst's query. If they accidentally don’t have a 1:1 or 1:many relationship on the key they are joining on, they may find themselves with several times the amount of data they started with. The fix could be as simple as restructuring your team's query to make sure it properly creates 1:1 relationships, or... you may have to completely restructure your database. The former is more likely.
How to Get Rid of Duplicate Data in SQL
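One possible approach, sketched here with SQLite run from Python, is to keep a single row from each group of identical rows and delete the rest. The table and its contents are made up, and rowid is SQLite-specific; on other databases you would typically lean on a primary key or a window function such as ROW_NUMBER() instead.

```python
import sqlite3

snapshot_db = sqlite3.connect(":memory:")
snapshot_db.executescript(
    """
    CREATE TABLE post_snapshot (post_id INTEGER, likes INTEGER);
    INSERT INTO post_snapshot VALUES
        (1, 10), (1, 10), (2, 5), (2, 5), (2, 5), (3, 7);
    """
)

# Keep the lowest rowid in each group of identical rows and delete the rest.
snapshot_db.execute(
    """
    DELETE FROM post_snapshot
    WHERE rowid NOT IN (
        SELECT MIN(rowid) FROM post_snapshot GROUP BY post_id, likes
    )
    """
)
snapshot_db.commit()

remaining = snapshot_db.execute(
    "SELECT post_id, likes FROM post_snapshot ORDER BY post_id"
).fetchall()
```

If you only need a duplicate-free result and not a cleaned table, a plain SELECT DISTINCT is usually enough.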
Has your company ever purchased data from a data aggregator and found it filled with holes? Missing data is common across every industry. Sometimes it is just due to system upgrades and new features being added, sometimes just bad data gathering. Whatever the cause, this can really skew a data science project's results. What are your options then? You could ignore rows with missing data, but this might cost your company valuable insight, while including the gaps will produce incorrect conclusions. So, how do you win?
There are a few different thoughts on this. One is to simply put a random but reasonable number in place of nothing. This doesn’t really make sense, as it makes it difficult to tell what is being driven by which feature. A more common and reasonable practice is using the data set average. However, even this is a little misleading. For instance, on one project we were involved with, we were analyzing a large population of users and their socioeconomic data (income, neighborhood trends, shopping habits). About 15% of the data, which was purchased from a credit card carrier, was missing. So throwing it away was not in our best interest.
Instead, because we had each person's zip code, we were able to aggregate at a local level. This was a judgement call, and a good one in this case. We compared this to averaging the entire data set, and we got a much clearer picture of our population's features. The problem with a general average over several hundred thousand people is that you will eventually have some odd sways. Take income: if your data set has a good distribution, your average income will be, well, average. Then people who live in richer neighborhoods may suddenly form their own classification. The difference between 400k and 50k (even when normalized) can drastically alter the rest of the features. Does it really make sense for someone who is making 50k a year to be purchasing over 100k of products a year? In the end, we would get a strange cluster of large spenders who made average income. When your focus is socio-economic factors, this can cause some major discrepancies.
How to Handle Missing Data with SQL
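The zip code approach described above could be sketched roughly as follows, again using SQLite run from Python. The table, the column names and the income figures are invented for illustration. The overall average is computed first only so you can compare it with the local fills.

```python
import sqlite3

people_db = sqlite3.connect(":memory:")
people_db.executescript(
    """
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,
        zip_code  TEXT NOT NULL,
        income    REAL
    );
    INSERT INTO person (zip_code, income) VALUES
        ('98101', 50000), ('98101', 60000), ('98101', NULL),
        ('98039', 400000), ('98039', 380000), ('98039', NULL);
    """
)

# For comparison: the single data-set-wide average (AVG skips NULLs).
overall_avg = people_db.execute("SELECT AVG(income) FROM person").fetchone()[0]

# Fill each gap with the average of the known incomes in the same zip code.
people_db.execute(
    """
    UPDATE person
    SET income = (
        SELECT AVG(known.income)
        FROM person AS known
        WHERE known.zip_code = person.zip_code
    )
    WHERE income IS NULL
    """
)
people_db.commit()

filled_incomes = dict(
    people_db.execute(
        "SELECT person_id, income FROM person WHERE person_id IN (3, 6)"
    ).fetchall()
)
```

With these made-up numbers, the overall average would have assigned the same income to both missing rows, while the local averages keep the two neighborhoods apart.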
Data normalization is one of the first critical steps to making sure your data is sensible to run through most algorithms. Simply trying to feed in variables that could be anything from age to income to computer usage time creates the hassle of comparing apples to oranges. Trying to compare an income of 400k to an age of 40 years will create bad outputs. The numbers just aren’t on the same scale. Normalization makes your data more comparable. It takes the min and max of a data set and maps them to 0 and 1. The rest of the numbers are then scaled proportionally in between. Utilizing 0-1 allows your data science teams to meld the data more smoothly. They are no longer trying to compare scales that don't match. This is a necessary step in most cases to ensure success.
R Programming Normalization
Python (this can also depend on whether you are using NumPy, Pandas, etc.)
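A minimal NumPy sketch of the min-max scaling described above might look like this. The function name and the sample incomes and ages are our own, and mapping a constant column to zeros is just one reasonable way to avoid dividing by zero.

```python
import numpy as np


def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    arr = np.asarray(values, dtype=float)
    span = arr.max() - arr.min()
    if span == 0:
        # A constant column has no spread; return zeros to avoid dividing by zero.
        return np.zeros_like(arr)
    return (arr - arr.min()) / span


# Made-up values on very different scales.
incomes = [50_000, 120_000, 400_000]
ages = [25, 40, 65]
scaled_incomes = min_max_normalize(incomes)
scaled_ages = min_max_normalize(ages)
```

After scaling, both columns live on the same 0 to 1 range, so an algorithm no longer sees income as thousands of times more important than age simply because of its units.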
Data preparation can be one of the longer steps when preparing your team's data science project. However, once the data is cleaned, checked, and properly shaped, it is much easier to pull out features and create accurate insights. Preparation is half the battle. Once the data is organized, it becomes several times easier to mold. Good luck with your future data science projects, and feel free to give us a ring here in Seattle if you have more questions about your data science projects.
Future Learning! And Other Data Transformations
We wanted to supply some more tools to help you learn how to transform and engineer your data. Here is a great video that covers several data transforms. This particular video relies on the R programming language.
We are a team of data scientists and network engineers who want to help your functional teams reach their full potential!