Is your company trying to figure out who should become a data scientist and how to start a team? You are not alone; even Amazon and Airbnb are starting internal universities to teach more of their teams the value of data science. Maybe your company needs help setting up internal classes to build up its data science and machine learning skill sets. Acheron provides multiple forms of internal education programs, whether for managers or analysts. One of them is a quick guide to running a data science team. It is for managers and executives who are starting, or already have, a data science team and want to make sure they are getting the best return on investment from it and that their team members all feel challenged.
We pulled one subsection out of that guide because it answers a common question we get when we talk to executives: who are data scientists, and who should become one? One client told us they had loads of scientists, but they weren't sure how to turn them into data scientists, or who in their cohorts really should become one.
Below we go over some of the top soft skills data scientists should have, and what type of personality someone should have before they enroll in a data science program, whether that program is internal or external, like Galvanize or a university data science certificate. In the end, data science is a skill companies will need to harness to keep up with competitors who are already successfully building data science into their upper-level strategy.
Who are Data Scientists?
Data scientists have to be driven individuals. They not only must be technically savvy, they also need to be proactively aware of their company’s nuances. If they happen to see a correlation or pattern, they will seek out how to access the data required and bring possible projects up to their manager.
Being driven is great, especially when combined with curiosity. Data scientists love to ask why, and they don't stop until they find the root cause. They are great at pinpointing the actual patterns in the noise. This is a necessary skill for peeling apart the complexity and relationships various data sets may have. Occasionally, an individual may have a curious mind but lack the drive to act upon their inquiries.
Tolerance of Failure
Data science has a lot of similarities to science in general, in the sense that there might be 99 failed hypotheses that lead to 1 successful solution. Some data driven companies only expect their machine learning engineers and data scientists to produce a new algorithm or correlation every year to year and a half, depending on the size of the task and the type of implementation required (e.g. process, technical, policy, etc.). This means a data scientist must be willing to fail fast and often, similar to the agile methodology. They have to constantly test, retest, and prove that their algorithms are correct.
The term data storyteller has become closely associated with data scientist. This skill fits under the general heading of communication. Data scientists have access to multiple data sources from various departments, which gives them both the responsibility and the need to clearly explain what they are discovering to executives and SMEs in multiple fields. That requires taking complex mathematical and technological concepts and turning them into clear, concise messages that executives can act upon: not hiding behind jargon, but actually translating complex ideas into business terms.
Creative and Abstract Thinking
Creativity and abstract thinking help data scientists better hypothesize about the possible patterns and features they are seeing in their initial exploration phases. By combining logical thinking with minimal data points, data scientists can work their way toward several possible solutions. However, this requires thinking outside of the box.
Data scientists have to be able to take large problems, like which ad to show to which customer, and, based off of hundreds of variables, effectively find the right solution. This means taking a larger problem and breaking it down into its smallest parts, getting rid of noise and variables that don’t help create a clear pattern. This can sometimes be a messy process, and being able to stay focused on the bigger problem is key.
Who Should Become a Data Scientist
The skills required to be a data scientist are constantly evolving and many companies are trying to find out how to train new data scientists. In the end, the real question is, who should become a data scientist?
Data science requires constant learning. Not just about technology; it also requires constantly learning new fields, specialties, and situations, especially as data science solutions integrate into more and more departments of corporations. Becoming familiar with only one set of vocabulary and processes is not an option. Without some bearing in each field, a data scientist is limited in the hypotheses and logical assumptions a good one needs to make.
If you are searching for a data scientist inside your company, they are probably already attempting to push into the field. With all the online material, classes, and meet-ups available, such an individual would have already taken steps to get more involved. If someone merely talks about it but never acts upon it, they will act similarly on a new project or idea.
There is some requirement for computational or technical ability. Excel is a great tool, but data scientists need to be able to use more powerful and customizable tools, including programming, data visualization, and data storage tools. There is no need to be a software engineer. However, data scientists should have a general idea of how to make sure code is maintainable, robust, and scalable.
Looking to start a data science team?
If you are looking to start a team of your own, feel free to comment or email us! We can do everything from pointing you toward the right readings if you want to do it yourself, to joining you on your journey. Also, feel free to follow our blog. We will keep it up to date as we take on new projects and new questions about data science. If you email us a question, we will try to post about it!
Unstructured Data, and How to Analyze it!
Content creation and promotion can play a huge role in a company's success at getting its product out there. Think about Star Wars and Marvel. Both of these franchises are as much commercials for their merchandise as they are high quality content.
Companies post blogs, make movies, even run Pinterest accounts. All of this produces customer responses and network reactions that can be analyzed, melded with current data sets, and run through various predictive models to help a company better target users, produce promotional content, and alter products and services to be more in tune with the customer.
Developing a machine learning model starts with finding value and relationships in all the different forms of data your content produces, segmenting your users and responders, and melding all your data together. In turn, your company can gain a lot more information beyond standard balance sheet data.
Change Words to Numbers
The machine learning community has created a host of libraries that can simplify the way your team performs data analysis. In fact, Python has several libraries that give programmers with a high level knowledge of data science and machine learning application design and implementation the opportunity to produce fast and meaningful analysis.
One great Python library for working with content data like blog posts, news articles, and social media posts is TextBlob. TextBlob has some great functions for tasks like noun phrase extraction, part-of-speech tagging, and sentiment analysis. Take the sentence below:
“Scary Monsters love to eat tasty, sweet apples”
You can use the lines below to pull out the nouns and what was used to describe said nouns.
How to use TextBlob to Analyze Text Data
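As a minimal sketch (assuming TextBlob and its NLTK corpora are installed; the exact calls your team uses may differ), something like the following pulls out the noun phrases, the nouns and the adjectives describing them, and a sentiment score:

from textblob import TextBlob  # pip install textblob; corpora via: python -m textblob.download_corpora

blob = TextBlob("Scary Monsters love to eat tasty, sweet apples")

print(blob.noun_phrases)                                          # noun phrases, e.g. 'scary monsters'
print([word for word, tag in blob.tags if tag.startswith("NN")])  # the nouns
print([word for word, tag in blob.tags if tag.startswith("JJ")])  # the adjectives describing them
print(blob.sentiment)                                             # polarity and subjectivity

The polarity score is the piece that feeds into the analysis described below.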
This takes data that is very unstructured and hard to analyze and begins to turn it into a more analysis-friendly data set. Other great uses of this library include projects such as chat bots.
From here, you can combine polarity, positivity, shares, and topic focus to see what types of social media posts, blog posts, etc., become the most viral.
Another library worth checking out is word2vec, which has implementations in Python, R, Java, and more. For instance, check out deeplearning4j.
Marketing Segmentation with Data Science
Social media makes once hard-to-get data, such as people's opinions on products, their likes, dislikes, gender, location, and job, much more accessible. Sometimes you may have to purchase it; other times, some sites are kind enough to let you take it freely.
In either case, this gives companies an open door to segmenting markets in much finer detail. This isn’t based off of small surveys of only 1,000 people; we are talking about millions, even billions of people. Yes, there is a lot more data scrubbing required, but there is an opportunity to segment individuals and use their networks to support your company's products.
One example is a tweet we once sent SQL Server's way; they quickly responded. Based on the fact that we interacted with SQL Server and talk so much about data science and data, you can probably assume we are into technology, databases, and the like. This is basically what Twitter, Facebook, Google, etc. do to place the right ads in front of you. They also combine cookies and other data sources like geolocation.
If you worked for Oracle, perhaps you would want us to see some posts about the benefits of switching to Oracle, or you might ask for our opinion on why someone prefers SQL Server over Oracle (we personally have very little preference, as we have used both and find both useful). Whatever it may be, there are opportunities to swing customers. Now what if your content was already placed in front of the right people? Maybe you tag a user, or ask them to help you out or join your campaign. Involve them, and see how you can help them.
For instance, bloggers are always looking for ways to get their content out there. If your company involves them, or partners with them in a transparent way, your product now has access to a specific network. Again, this is another great place where data science and basic statistics come into play.
If you haven’t tried tools like NodeXL, it is a great example of developing a model to find strong influencers in specific networks. This tool is pretty nifty. However, it is limited. So you might want to make some of your own.
Utilizing data gathered from various sites, and algorithms like k-nearest neighbors, PCA, etc., you can find the words used in profiles, posts, and shares, the companies your customers interact with, and so on. The list goes on. It may be better to start with NodeXL, just to see what you are looking for.
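As a rough illustration of that kind of segmentation (using TF-IDF and k-means here rather than the exact algorithms named above, with made-up profile text), a sketch might look like:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical text pulled from user profiles and posts
profiles = [
    "loves sql server, databases and data science",
    "posts about marvel movies and star wars merchandise",
    "tweets about python, analytics and machine learning",
    "shares disney trailers and comic book news",
]

vectors = TfidfVectorizer(stop_words="english").fit_transform(profiles)   # words -> numeric features
segments = KMeans(n_clusters=2, n_init=10).fit_predict(vectors)           # group similar users
print(segments)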
Now what is the value of doing all this analysis, data melding, and analytics?
ROI Of Content:
At the end of the day, you have plenty of questions to answer.
These aren’t the easiest questions to answer. However, here is where you can help turn the data from your social presence into value for your company:
Typical predictive analytics utilize standard business data (balance sheet, payroll, CRM, and operational data). This limits companies to the “what” happened, not the why. Managers will ask why the company saw that spike in Q2, or that dip in Q3. It is difficult to paint a picture when you are only looking at data that has very little insight into the why. Simply doing a running average isn’t always great, and putting in seasonal factors is limited by domain knowledge.
However, data has grown, and now having access to the “why” is much more plausible. Everything from social media, to CRMs, to online news provides much better insight into why your customers are coming or going!
This data has a lot of noise, and it wouldn’t really be worth it for humans to go through it all by hand. This is where having an automated exploratory system developed will help out a lot.
Finding correlations between content, historical news, and company internal data would take analysts years. By the time they found any value, the moment would have passed.
Instead, having a correlation discovery system that is automated will save your company time, and be much better at finding value. You can use this system to find those small correlating factors that play a big effect. Maybe your customers are telling you what is wrong with your product, and you just aren’t listening. Maybe, you find a new product idea.
In the Acheron Analytics process, this would be part of our second and third phases. We always look for as many possible correlations as we can, and then develop hypotheses and prototypes that lead to company value.
This process lets data help define a company's next steps. It gives managers data-defended plans, ones they can take to their own leadership with confidence.
When it comes to analyzing your company's content and marketing investments, techniques like machine learning, sentiment analysis, and segmentation can help develop data driven marketing strategies.
We hope this inspired some ideas on how to meld your company’s data! Let us know if you have any questions.
Python is a great language for developers and scripters alike. It allows for large scale design and OOP concepts, yet it was also developed to be very easy to read and great for quick scripts. This matters, because data scientists don’t have all day to spend debugging. They do need to spend some time picking out which Python libraries will work best for their current projects. We at Acheron Analytics have written up a quick list of the 8 most used libraries that can help your next machine learning project.
P.S. We had a busy week and couldn't get to an actual code example this week as we promised in our last post. However, we are working on that post! We will shortly have an example in R of a from-scratch algorithm.
Theano, according to Opensource.com, is one of the most heavily used machine learning libraries to date. The great thing about Theano is that it is written leaning on mathematical concepts and computer algebra. When the code is compiled, it has the ability to match C-level performance.
This is because it is written to take advantage of how compilers work: how a computer parses and converts tokens into parse trees, how it optimizes and merges similar sub-graphs, how it uses the GPU for computations, and several other optimizations. For the full list, check out the Theano main page.
For those who have used math based languages like Mathematica and MATLAB, the coding structure won’t seem too strange.
What is great, is that Nvidia fully supports Theano and has a few helpful videos on how to use Theano and their GPUs.
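A minimal sketch of what that looks like in practice (a symbolic graph that Theano compiles when you build the function):

import theano
import theano.tensor as T

x = T.dmatrix("x")
y = T.dmatrix("y")
z = T.dot(x, y)                        # symbolic expression only; nothing is computed yet

matmul = theano.function([x, y], z)    # compiled here, optionally for the GPU
print(matmul([[1.0, 2.0]], [[3.0], [4.0]]))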
When it comes down to it, machine learning and data science must have good data. How do you handle that data? Well, one great Python library is Pandas. It was one of the first data libraries many of us at Acheron were exposed to, and it still has a great following. If you are an R programmer, you will enjoy this library. It allows you to use data frames, which makes thinking about the data you are using much more natural.
Also, if you are a SQL or RDBMS person, this library naturally fits your tabular view of data. Even if you are more of a Hadoop or MongoDB follower, Pandas just makes life easier.
It doesn’t stop there: it handles missing data, time series, IO, and data transformations incredibly well. Thus, if you are trying to prepare your data for analysis, this Python library is a must.
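For example (a small, made-up time series; the column name is ours):

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"revenue": [120.0, np.nan, 98.5, 143.2]},
    index=pd.date_range("2017-01-01", periods=4, freq="D"),
)

df["revenue"] = df["revenue"].fillna(df["revenue"].mean())   # one way to handle missing data
print(df.resample("2D").sum())                               # time series resampling
df.to_csv("revenue.csv")                                     # IO in one line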
We also wanted to share a great Python cheat sheet we found; however, we would feel wrong just sticking it on our blog. Instead, here is a link to the best Python cheat sheet we have found yet! This even beats Datacamp's cheat sheets.
NumPy is another data managing library. You typically see it paired with Tensorflow, SciPy, matplotlib, and many other Python libraries geared toward deep learning and data science. This is because it is built to manage and treat data like matrices, again harking back to MATLAB and R. The purpose is to make it easy to do the complex matrix operations required by neural networks and complex statistics.
Trying to handle those kinds of operations with plain multi-dimensional arrays or lists is not the most efficient.
Let's say you want to set up an identity matrix. That is one line of code in NumPy. Everything about it is geared toward matrices and quick mathematical operations done in just a few lines. Coursera has a great course you can use to further your knowledge of this library.
How to code for an Identity Matrix:
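In NumPy, that is just (output shown below):

import numpy as np

np.eye(3)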
array([[ 1., 0., 0.],
[ 0., 1., 0.],
[ 0., 0., 1.]])
This one is the odd one out. Scrapy is not a mathematical library; it doesn’t perform data analysis or deep learning. It does nothing you would think you would want to do in machine learning. However, it does one thing really well: crawl the web. Scrapy makes it easy to develop web crawlers, and it can crawl safely if you configure it correctly (side note: make sure you read all the documentation; that part is on you to research).
The web is a great source of unstructured, structured, and visual data. As long as a site approves of you crawling and doesn’t mind you using its content (which we are not responsible for figuring out), you can gain a lot of insight into topics. You can use libraries that turn words into vectors to help perform analysis, sentiment analysis, and so on. It is much more difficult than working with straightforward numbers, but it is also much richer. There is a lot to be gained from pictures, words, and unstructured data. With that comes the task of getting that information out of the complex data.
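A minimal spider sketch (recent Scrapy versions; the site and selectors here are placeholders, and you are responsible for checking that a site allows crawling):

import scrapy

class HeadlineSpider(scrapy.Spider):
    name = "headlines"
    start_urls = ["https://example.com"]          # placeholder; crawl only sites that allow it
    custom_settings = {"ROBOTSTXT_OBEY": True}    # part of configuring Scrapy to crawl politely

    def parse(self, response):
        for heading in response.css("h1::text").getall():
            yield {"heading": heading}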
That being said, Pattern is another specialized web mining library. It has tools for Natural Language Processing (NLP) and machine learning, includes several built-in algorithms, and really makes your life as a developer much easier!
We have discussed several libraries such as matplotlib, NumPy, and Pandas and how great they are for machine learning and data science. Now, imagine building an easy to use library on top of all of those, as well as several other easy to use libraries. Well, that is what scikit-learn is. It pulls these libraries together to create easy access to complex data science algorithms and data visualization techniques. It can be used for clustering, transforming data, dimensionality reduction (reducing the number of features), ensemble methods, feature selection, and a lot of other classic data science techniques, all basically done in a few lines!
The hardest part is making sure you have a virtual environment active when you pip install!
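For instance, a minimal sketch combining dimensionality reduction and clustering on a bundled data set:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X = load_iris().data
X_reduced = PCA(n_components=2).fit_transform(X)                    # dimensionality reduction
clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X_reduced)   # clustering
print(clusters[:10])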
matplotlib and ggplot
Now you have done all this analysis and run all your algorithms. What now? How do you actually turn all this data into value? How do you inspire your executives and tell them “stories” full of “insight”? If you don’t want to mess around with D3.js, Python has you covered with libraries like matplotlib and ggplot. Both are really built to mimic MATLAB and R functionality. Matplotlib has some great 3D graphs that will help you visualize your knn and PCA algorithms and clusters.
When you are in the data exploration, hypothesis, and final product phases of a project, using these libraries makes life much easier. You can visualize your data, its quirks, and your final results!
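A quick sketch of the kind of exploratory plot we mean (random data here, just to show the calls):

import numpy as np
import matplotlib.pyplot as plt

x = np.random.randn(200)
y = 2 * x + np.random.randn(200)

plt.scatter(x, y, alpha=0.5)
plt.title("Exploration: feature vs. response")
plt.xlabel("feature")
plt.ylabel("response")
plt.show()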
We have discussed Tensorflow before on this blog when we talked about some common libraries used by data science professionals, but it doesn't hurt to talk about it again. The fact is, if you are in the world of machine learning, you have probably heard of, tried, or implemented some form of deep learning algorithm. Are they necessary? Not all the time. Are they cool when done right? Yes.
Tensorflow and Theano are very similar. The interesting thing about Tensorflow is that when you are writing in Python, you are really only designing a graph for the compiler to turn into C++ code and then run on either your CPU or GPU. This is what makes the library so effective and easy to work with. Instead of having to write at the C++ or CUDA level, you can code it all in Python first.
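A tiny sketch of that graph-then-run pattern (TensorFlow 1.x style API):

import tensorflow as tf

a = tf.placeholder(tf.float32)
b = tf.placeholder(tf.float32)
product = a * b                      # only extends the graph; nothing runs yet

with tf.Session() as sess:           # execution happens here, on CPU or GPU
    print(sess.run(product, feed_dict={a: 3.0, b: 4.0}))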
The difficulty comes in actually understanding how to properly set up a neural network, convolutional network, etc. A lot of questions come into play: which type of model, what type of data regularization do you think is best, what level of dropout or robustness do you want, and are you going to purchase GPUs from Nvidia or try to make it work on CPUs? (Depending on your data size, you will most likely have to purchase hardware, or pay for AI-as-a-service tech from Google.)
These are just a few of the most commonly mentioned python libraries that are utilized by academics and professionals. Do you agree? Feel free to share what languages, libraries and tools you use, even if they aren’t python!
During our last post, we discussed a key step in preparing your team for implementing a new data science solution (How to Engineer Your Data). The step following preparing your data is automation. Automation is key to AI and machine learning. You don’t want to be filling in fields, copying and pasting from Excel, or babying ETLs. Each time data is processed, you want some form of automated process that gets kicked off at a regular interval and helps analyze, transform, and check your data as it moves from point A to point B.
Before we can go off and discuss analysis, engineering, and QA, we must first assess what tools your company uses. The tools you choose to work with for automation come down to what you are comfortable with.
If you are a Linux lover, you will probably pick crontab and watch. Windows users will lean toward Task Scheduler. The end result is the same, and you could choose other tools as well.
Once you know what tool will be running your automation, you need to pick some form of scripting language. This could be Python, Bash, or even PowerShell. Even though it is just scripting, we would still recommend creating some form of file structure that acts as an organizer, for instance something like the layout below.
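One possible layout (just an example; the folder names are ours):

automation/
    config/      # connection strings, schedules, environment settings
    scripts/     # the entry points cron or Task Scheduler actually calls
    sql/         # queries the scripts run
    logs/        # one log file per run
    archive/     # files that have already been processed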
This makes it easier on developers past, present and future to follow code when they have to maintain it. Of course, you might have a different file structure, which is great! Just be consistent.
The Set up:
To describe a very basic setup: we would recommend starting out with some form of file landing zone, whether that is an FTP site or a shared drive. Some location the scripts have access to needs to be set up.
From there, it is best to have some RDBMS (MySQL, MSSQL, Oracle, etc.) that acts as a file tracking system. This will track when new files get placed into your file storage area, what type of file they are, when they were read, and so on. Consider it a form of meta table. At the beginning, it can be very basic.
Just have the layout below:
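For example, something along these lines (the column names are just an illustration):

file_id | file_name | file_type | date_received | date_read | processed_flag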
The key for automation is the final column: a flag that distinguishes whether a file has been read or not. There are also other tables you might want around this, for instance an error table, or a dimension table containing the customers attached to each file.
How does that info get there? An automation script of course! Have some script whose job is to place new file metadata into the system.
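A rough Python sketch of such a script (sqlite3 stands in for your RDBMS here, and it assumes a meta table like the one above already exists; the paths and names are placeholders):

import os
import sqlite3                      # stand-in for your actual RDBMS connection
from datetime import datetime

LANDING_ZONE = "/data/landing"      # hypothetical FTP drop or shared drive

conn = sqlite3.connect("file_tracking.db")
for name in os.listdir(LANDING_ZONE):
    seen = conn.execute("SELECT 1 FROM file_meta WHERE file_name = ?", (name,)).fetchone()
    if seen is None:
        conn.execute(
            "INSERT INTO file_meta (file_name, file_type, date_received, processed_flag) "
            "VALUES (?, ?, ?, 0)",
            (name, os.path.splitext(name)[1], datetime.now().isoformat()),
        )
conn.commit()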
Following this, you will have a few other scripts for analysis, data movement and QA that are all separate. This way, if one side fails, you don’t lose all functionality. If you can’t load, you just can’t load and if you can’t process data, you just can’t process it.
When starting any form of data science or machine learning project, the engineers may have limited knowledge of the data they are working with. They might not know what biases exist, what data is missing, or other quirks of the data. This all needs to be sorted out quickly. If your data science team is manually creating scripts to do this work for each individual data set, they are losing valuable time. Once data sets are assigned, they should be processed by an automated set of scripts that can either be called from a command line prompt or, even better, run automatically.
These basic scripts often contain histograms, correlation matrices, clustering algorithms, and some straightforward algorithms that take 'N' variables and have a specified list of outputs. This could be logistic regression, knn, and Principal Component Analysis (PCA) for starters. In addition, a summary function of some kind can be run after each model. If using R, this is simply summary().
A function example that we have used as part of previous exploration automation:
Basic Correlation Matrix
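A minimal pandas sketch of that kind of exploration step (the file name here is a placeholder):

import pandas as pd

df = pd.read_csv("new_dataset.csv")            # placeholder for the incoming data set
numeric = df.select_dtypes(include="number")

print(numeric.describe())                      # quick summary, similar in spirit to R's summary()
print(numeric.corr())                          # the basic correlation matrix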
Data Engineering Phase
Once you have finished exploring your data, it is important to plan how that data will be stored and what form of analytics can be done on the front end. Can you analyze sentiment, topic focus, and value ratios? Do you need to restructure and normalize the data (not the same as statistical normalization)?
Guess what! All of this can be automated. Following the explore phase, you can start to design how the system will ingest the data. This will require some manual processing up front to ensure the solution can scale. However, even this should be built in a way that allows for an easy transition to an automated system. Thus, it should be robust, and systemized from the start! That is one of our key driving factors whenever we design a system at Acheron Analytics. It might start being run from command line, but it should easily integrate to being run by task scheduler or cron. This means thinking about the entire process, the variables that will be shared between databases and scripts, the try/catch mechanisms, and possible hiccups along the way.
The system needs to handle failure well. That will allow your team more time to focus on the actual advantages data science, machine learning, and even standard analytics provide. Tie this together with a solid logging system, and your team won't have to spend hours or days troubleshooting simple big data errors.
This is one of the most crucial phases for data management and automation. QAing data is a rare skill. Most QA engineers specialize in software and less in how to test data accuracy. We have watched companies struggle to find a QA with the right skills to match their data processes, or data engineers who are also very good at QAing their own work. It isn’t easy.
Having a test suite built with multiple test cases that run on every new set of data introduced is vital! And if you happen to make it dynamic, so that the upper and lower bounds tests refresh as newly approved data sets are inserted...who are we to disagree!
Automatically checking all the data that goes into your system can save anywhere from one to several FTE positions, depending on how large and complex your data is. A good QA system can let a single person manage several data applications.
The question is, what are you checking? If you don’t have a full fledged Data QA on board, this might not be straightforward. So we have a few bullet points to help you get your team thinking about how to set up their data test suites.
What you and your team need to think about when you create test suites:
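At a minimum: duplicates, missing values, schema changes, and whether values sit inside expected upper and lower bounds. A small pandas sketch of those kinds of checks (the column names, file names, and bounds are made up):

import pandas as pd

def run_basic_checks(df, bounds):
    """A few of the checks a data test suite might run on every new file."""
    failures = []
    if df.duplicated().any():
        failures.append("duplicate rows found")
    if df.isnull().any().any():
        failures.append("missing values found")
    for column, (low, high) in bounds.items():        # upper/lower bound tests per column
        if not df[column].between(low, high).all():
            failures.append(f"{column} outside expected range")
    return failures

print(run_basic_checks(pd.read_csv("new_file.csv"), {"amount": (0, 1_000_000)}))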
Overall, automation helps keep your data science and machine learning projects from getting bogged down with basic ETL and data checking work. This way, your data science teams can produce major insights efficiently, without being held back by maintenance and reconfiguring tasks. We have seen many teams, both in analytics and data science, lose time because of poorly designed processes from the get-go. Once a system is plugged into the organization, it is much harder to modify. So make sure to plan automation early!
In the era of data science and AI, it is easy to skip over crucial steps such as data cleansing. However, this can cause major problems later down the data pipeline. The promise of seemingly magic data science solutions can overshadow the necessary steps required to get to the best final product. One such step is cleaning and engineering your data before it even gets placed into your system. Truthfully, this is not limited to data science. Whether you are doing data analytics, data science, machine learning, or just old fashioned statistics, data is never whole and pure before refining. Just like putting bad, unprocessed petroleum into your car, putting unprocessed data into your company's systems will, either immediately or eventually, wreak havoc (here are some examples). Whether that means actually causing software to fail or giving executives bad information, both are unacceptable.
We at Acheron Analytics wanted to share a few tips to ensure that whatever data science or analytics projects you are taking on, you and your team are successful. This post will go over some brief examples in R, Python, and SQL; feel free to reach out with any questions.
Duplicate data is the scourge of any analyst, whether you are using Excel, MySQL, or Hadoop. Making sure your systems don’t produce duplicate data is key.
There are several sources of duplicate data. The first is when data is input into your company's data storage system: there is a chance the same data may try to sneak its way in. This could be due to end-user error, a glitch in the system, a bad ETL, etc. All of this should be managed by your data system. Most people still use an RDBMS, and thus a unique key will prevent duplicates from being inserted. Sometimes this requires a combination of fields to check whether the data being input is a duplicate. For instance, if you are looking at a vendor invoice line item, you probably shouldn’t have the same line item number and header id twice. This can become more complicated when line items change (but even that can be accounted for). If you are analyzing social media post data, each snapshot you take may have the same post id but altered social interaction data (likes, retweets, shares, etc.). This touches on slowly changing dimensions, which is another great topic for another time. Feel free to read up more on the topic here.
In both cases, your systems should be calibrated to safely throw out the duplicate data and store the errors in some error table. All of this will save your team time and confusion later.
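As a small pandas sketch of that kind of calibration, using the invoice example above (the file and column names are ours):

import pandas as pd

invoices = pd.read_csv("invoice_lines.csv")

# Treat header id + line item number as the unique key: keep the first copy,
# and route the rest to an error table for review.
key = ["header_id", "line_item_number"]
duplicates = invoices[invoices.duplicated(subset=key, keep="first")]
duplicates.to_csv("duplicate_errors.csv", index=False)
clean = invoices.drop_duplicates(subset=key, keep="first")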
Besides the source data itself having duplicates, the other common duplicate comes from an analyst's query. If, by chance, they accidentally don’t have a 1:1 or 1:many relationship on the key they are joining on, they may find themselves with several times the amount of data they started with. The fix could be as simple as restructuring your team's query to make sure it properly creates 1:1 relationships, or...you may have to completely restructure your database. It is more likely the former.
How to Get Rid of Duplicate Data in SQL
Has your company ever purchased data from a data aggregator and found it filled with holes? Missing data is common across every industry; sometimes it is just due to system upgrades and new features being added, sometimes just bad data gathering. Whatever it might be, this can really skew a data science project's results. What are your options then? You could ignore rows with missing data, but this might cost your company valuable insight, and including the gaps will produce incorrect conclusions. So, how do you win?
There are a few different schools of thought on this. One is to simply put a random but reasonable number in place of nothing. This doesn’t really make sense, as it becomes difficult to tell what is being driven by which feature. A more common and reasonable practice is using the data set average. However, even this can be misleading. For instance, on one project we were analyzing a large population of users and their sociometric data (income, neighborhood trends, shopping habits). About 15% of the data, which had been purchased from a credit card carrier, was missing, so throwing it away was not in our best interest.
Instead, because we had each person's zipcode, we were able to aggregate at a local level. This was a judgement call, and a good one in this case. We compared it to averaging the entire data set, and we got a much clearer picture of our population's features. The problem with a general average over several hundred thousand people is that you will eventually get some odd sways. Take income: if your data set is a good distribution, the average income will be, well, average. Then, suddenly, people who live in richer neighborhoods may create their own classification. The difference between 400k and 50k (even when normalized) can drastically alter the rest of the features. Does it really make sense for someone making 50k a year to be purchasing over 100k of products a year? In the end, we would get a strange cluster of large spenders who made average income. When your focus is socio-economic factors, this can cause some major discrepancies.
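In pandas, that zipcode-level fill looks roughly like this (the column and file names are illustrative):

import pandas as pd

people = pd.read_csv("population.csv")          # hypothetical columns: zipcode, income, ...

# Fill missing income with the average for that person's zipcode,
# falling back to the overall average where a zipcode has no data at all.
zip_mean = people.groupby("zipcode")["income"].transform("mean")
people["income"] = people["income"].fillna(zip_mean).fillna(people["income"].mean())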
How to Handle Missing Data with SQL
Data normalization is one of the first critical steps to making sure your data is sensible to run through most algorithms. Simply feeding in variables that could be anything from age, to income, to computer usage time creates the hassle of comparing apples to oranges. Trying to compare 400k in income to 40 years of age will create bad outputs; the numbers just don’t scale. Instead, normalization makes your data more comparable. It takes the max and min of a data set and sets them to the 1 and 0 of a scale, and the rest of the numbers are scaled in between. Using 0-1 lets your data science teams meld the data more smoothly; they are no longer trying to compare scales that don't match. In most cases, this is a necessary step to ensure success.
R Programming Normalization
Python (this can also depend on whether you are using NumPy, Pandas, etc.)
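A minimal pandas version of min-max normalization, (x - min) / (max - min), assuming the frame only holds numeric columns (the file and column names are placeholders):

import pandas as pd

df = pd.read_csv("customers.csv")                        # hypothetical numeric columns: age, income, usage_hours
normalized = (df - df.min()) / (df.max() - df.min())     # every column now sits on a 0-1 scale
print(normalized.head())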
Data preparation can be one of the longer steps in a data science project. However, once the data is cleaned, checked, and properly shaped, it is much easier to pull out features and create accurate insights. Preparation is half the battle. Once the data is organized, it becomes several times easier to mold. Good luck with your future data science projects, and feel free to give us a ring here in Seattle if you have more questions.
Future Learning! And Other Data Transformations
We wanted to supply some more tools to help you learn how to transform and engineer your data. Here is a great video that covers several data transforms. This particular video relies on the R programming language.
Start-ups focused on data science, machine learning, and deep learning have blown up in cities like Seattle, San Francisco, New York, and just about everywhere else. They are hitting every field: finance, health, education, customer service, and beyond. Yet some companies are still working out the kinks. Some are still trying to find out how to turn their data and machine learning talent into fiscal gain.
Is machine learning just another fad? Isn’t business intelligence enough? Is there any need to learn languages and tools like R, Tensorflow, and Hadoop?
The truth is, not every problem can be solved with a deep learning algorithm, or automated chat bot. So how do you know if you are facing a problem where a data science solution is necessary?
Whether you are a service based industry or a widget factory, your company naturally produces data as a byproduct of operations. You may even take the time to track this data in a very organized set of Excel sheets or databases. That data is worth looking back at! Here are some projects and techniques, based on basic data science and predictive analytics, that could lead to reduced costs and increased revenue without huge changes or investment in new hardware and expensive talent.
Predictive Analytics(A.K.A. Forecasting)
One basic thing a small company can do is set up a decent Excel sheet that predicts future demand and optimizes pricing. This isn’t data science per se. Forecasting has been done since the beginning of business. Some may call it ‘predictive analytics’ to give it a fancier sounding term, but it is just basic forecasting. If you aren’t doing this already, it is a valuable place to start. Utilizing your day to day data, you can create a basic application that helps spot trends, for example which days are the best days to be open for business. One real life example involves a small restaurant that found it could save money by closing down on specific days of the year, because it had been consistently losing money on those days. When they discovered this through forecasting, they began giving those days off to their staff. The owners increased their profits, and the staff were generally happier (they would otherwise put in 45+ hour weeks).
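A sketch of that restaurant-style analysis (made-up file and column names), grouping daily profit by weekday to see which days consistently lose money:

import pandas as pd

daily = pd.read_csv("daily_sales.csv", parse_dates=["date"])   # hypothetical columns: date, revenue, cost
daily["profit"] = daily["revenue"] - daily["cost"]
daily["weekday"] = daily["date"].dt.day_name()

# Days with a consistently negative average are candidates for closing.
print(daily.groupby("weekday")["profit"].mean().sort_values())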
If your skill lies outside of this subject matter, it would be best to hire a team specialized in predictive analytics and forecasting. When set up correctly, the application should be easy enough to maintain with minimal technical skills.
Multi Armed Bandit
For those of you familiar with A/B testing, the concept of the multi-armed bandit may sound similar. However, there are distinct differences. The multi-armed bandit refers to a theoretical problem in which a gambler walks into a casino and has to decide the best way of playing multiple slot machines (i.e. one-armed bandits) in order to get the best outcome, based on the winning probability of each machine. The same idea has been used in website testing, ad-placement optimization, story recommendations, and so on: basically, any time a computer can test which type of content an end-user is most likely to click on. This requires tracking your end-users' actions and setting up either a dynamic or time delayed algorithm that figures out the most effective piece of content to show an end-user to get them to interact. From here, you could even go as far as testing more complex neural networks, once enough data is gathered, to see what features in the content itself make it more attractive.
If you do A/B testing, there are some caveats if you plan to switch to using the Multi-armed Bandit method. Here is a great read about some of the drawbacks from VWO.com. It is good to know the pros and cons!
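One simple strategy for the bandit problem is epsilon-greedy: mostly show whatever content is performing best, but reserve a small slice of traffic for exploration. A toy simulation (the click-through rates here are invented):

import random

true_rates = [0.05, 0.11, 0.08]      # hidden click-through rates for three pieces of content
clicks = [0, 0, 0]
shows = [0, 0, 0]
epsilon = 0.1                        # fraction of traffic reserved for exploration

for _ in range(10_000):
    if random.random() < epsilon or sum(shows) == 0:
        arm = random.randrange(len(true_rates))                                        # explore
    else:
        arm = max(range(len(true_rates)), key=lambda i: clicks[i] / max(shows[i], 1))  # exploit
    shows[arm] += 1
    clicks[arm] += random.random() < true_rates[arm]

print(shows)    # impressions should drift toward the best-performing content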
More complex statistics can be pulled by melding multiple data sources. Not just taking your everyday operations data, but melding in customer data from multiple areas like social media, credit card data, etc., could lead to a treasure trove of new revenue streams. It could help your company find new verticals to expand into, or decide which neighborhood you should place your next store in.
If you have an online store, you have even more possible data sources. The use cases expand as you increase the types of data. This will require some pruning and feature selection, depending on the style of machine learning/data science you are looking into applying. If you have enough data, with the right formats, you might be able to get into deep learning. However, realistically, if your company is small to medium sized, solid data science techniques will probably work best.
Let’s say you are a charity, and you have a backlog of donors who consistently donate every year. What if you could find more people like them? Some companies try to call thousands of people to find the few hundred who will donate. However, if you have their emails, or better yet their social media tags, you can utilize social media services that help find those few hundred people more effectively. Companies like Architech Social and Facebook both provide services that help you grow your business by finding ‘look-alikes’. Basically, they help you advertise to plausible future customers who act like your current customers.
How, you may ask? They are using a combination of clustering, classification, and neural networks. Clustering and classification both utilize complex statistical techniques to help figure out the probability that one customer is similar to another. This could be based off of end-user patterns, product preferences, social factors, purchase history, and even comments and posts on social media. At the end of the day, this requires a combination of most of the techniques mentioned above. This, of course, is only one example of using complex statistical analysis to find value in data.
These services don’t come cheap. Getting a solution in place that is a one time deal can help reduce the monthly bill Facebook and other digital advertisers charge.
These are just a few examples where data science, machine learning, and analytics could be useful. If you are not sure whether or not you have a project worth pursuing, feel free to give us a ping. We would be happy to help you figure out if a project is even worth spending time on.
Are you looking for great tools for machine learning, data science or data visualization?
Currently, there is an overwhelming number of options. Do you pick Tensorflow or Theano? Tableau or Qlik? MongoDB or Hadoop...oh dear. How do you know which data tool to use? Many of them are good; some, not so good. In the end, you might need an algorithm just to pick which data science tools are best for your team.
We just wanted to go over a few technologies we personally love to work with. This is in no way all of them, but these are definitely some of the best options out there. However, every use case is different, so give us a call if you are trying to decide what tools would work best for your company or project! We would love to help.
Libraries and Languages
Tensorflow itself is not a language. It is built off of C++ and Nvidia’s CUDA. The library is typically used from Python. However, it is not actually executed in Python. Python just allows the end-user to design the data flow graph that will be used by the much faster lower level languages. Some people actually go down to the raw C++ level to further optimize their run times.
Even the overall design style of Tensorflow can feel a little wonky if you are a Python programmer. Compared to plain Python, you might feel like you are writing in a more model or math based language like Mathematica or MATLAB. You will declare several sets of variables before actually setting an actual value. This can be a little jarring for Python purists. Overall, though, the basics are pretty easy to pick up.
The difficulty comes in actually understanding how to properly set up a neural network, convolutional network, etc. A lot of questions come into play: which type of model, what type of data regularization do you think is best, what level of dropout or robustness do you want, and are you going to purchase GPUs from Nvidia or try to make it work on CPUs? (Depending on your data size, you will most likely have to purchase hardware, or pay for AI-as-a-service tech from Google.)
You can’t get too far in the data science world without finding a few programmers who enjoy R. It is a very well developed language that is great for statisticians and CS majors alike. It doesn’t offer the cool factor of Python and other more ‘modern’ machine learning languages, but it is a great workhorse and it is tried and true.
It can give some newbies a false understanding of data science. Libraries for ensembling and boosting algorithms require minimal knowledge of the algorithms themselves. This is great if you know why you are picking each algorithm. However, the illusion can lead to a false sense of understanding.
As a side note, SQL Server actually just added the ability to run some R functions in a query. We are not 100% sure about the performance, but it would be pretty cool if you could run a lot of data analysis even before getting the data out of the system.
The first time I was exposed to Caffe for deep learning was back when I took a computational neuroscience class. We were programming in Matlab, but one of the other students happened to show me the work he was doing in his lab, all of it using the Caffe framework. Most of it involves modifying configuration files: you can alter what type of network layers you are using, how many neurons, dropout, etc. For those more accustomed to running things from the command line, it works pretty smoothly, and it works on both GPUs and CPUs.
Data Visualization Tools
Tableau is arguably one of the most popular data visualization tools for analysts, no matter their proficiency with technology. Whether you are an engineer who develops complex neural networks or a business analyst who prefers to model in Excel, Tableau is a friendly and easy to use data visualization tool. It allows the end-user to develop visually appealing, interactive reports that help executives make decisions quickly.
On top of that, if your company has its own Tableau server, it also allows for quick and easy data sharing through beautiful and effective reports. If you just want the data itself, Tableau also lets you download CSVs, screenshots, PDFs, etc. You can even have reports emailed to you on a specific cadence.
This tool was built with the end-user in mind. One of my favorite features is Tableau Public. It allows you to share your reports publicly. Obviously, you can’t do this with company data. However, there are plenty of fun open data sets that can be used to make some beautiful and effective data reports. Check it out!
One of our employees was first introduced to D3.js in college, when Professor Jeff Heer came and gave his class a one-hour lecture on the library. He was instantly sold. The power D3 had to display data and let the end user drill into specific facts was amazing. This was his first exposure to data visualization done in this manner.
Sure, before this he had seen Matlab charts and Excel graphs, but nothing like this. He found he could use D3 to create graphs that were appealing, informative, and interactive. There were so many benefits for an end user. Plus, unlike Tableau and other data viz tools, D3.js allowed for almost unlimited customization.
We use it, along with some other JS libraries, when customers are trying to avoid the steep costs of Tableau. There are some limits. For instance, D3 runs on the client side, which means it cannot handle the sheer magnitude of data that Tableau can.
Domo has some similarities to Tableau. It gives the end-user very pretty graphs and is used for KPI reports. It quickly integrates with 1000+ data sources and manages large amounts of data quite easily. From there, it quickly melds data and creates pre-formatted KPI reports that can be shared across its internal platform. This is great, especially if your team doesn’t have enough resources to develop highly effective reports. Within minutes, your team can have standardized reports from tools like Salesforce, Concur, etc. In addition, if integrated properly, your company may be able to reduce its reporting tools, and thus reduce maintenance, development, and design costs.
There is some ability for modification. However, it is limited compared to most other data visualization tools, which will drive typical developers crazy. We love being able to get down to the system level and actually modify what each small component does, not be limited by buttons. However, if your team can't afford an extra person to create the reports, then this tool will save a large amount of resources.
Data Storage Tools
You can’t say big data without at least one person bringing up Hadoop. Hadoop is not for the faint of heart. It does an amazing job of distributing the storage of very large data sets across computer clusters. However, unless you are comfortable with Java and command line environments, it is not an easy beast to wrangle. It requires a heavy amount of configuration and manipulation to ensure it works optimally on your company's systems. The reward at the end, though, is more than worth it: the ability to access data quickly, even with hardware failure, is pretty hard to beat. For small companies, however, the cost of maintenance and of employing a Hadoop specialist would probably be too large.
At one point in time, data storage was very expensive and databases had to be finely tuned to perfectly manage every byte and bit. Thus, the relational model became one of the standard options for data storage.
Now it is 2017, and thanks to both hardware and software advances, data storage has become much cheaper. Suddenly, storing large masses of unstructured data is feasible and can be beneficial when designed well. MongoDB is a document store database. Instead of storing rows, it stores an entire document in one instance. This means you no longer have to query across the entire database just to get two related data points; they all remain in the same document (which used to be considered bad because it meant a lot of duplicated data, but now it is great). It allows speed to increase dramatically. There are still plenty of pitfalls with MongoDB, including security, storage, and the fact that it is not ACID compliant (click here to read about ACID).
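A minimal sketch of that document model using pymongo (assuming a MongoDB instance is running locally; the database and field names are ours):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
posts = client.marketing.posts

# The post, its comments, and its engagement numbers live in one document,
# so pulling related data points back does not require joins across tables.
posts.insert_one({"title": "Launch day", "likes": 42, "comments": ["Nice!", "Congrats"]})
print(posts.find_one({"title": "Launch day"}))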
Oracle in itself could represent most other RDBMSs here. However, we find Oracle one of the best databases for managing big data, for many reasons. Everything from the underlying architecture of the DB itself to the ability to manage and manipulate the configuration of the objects inside is much better tuned in Oracle. SQL Server is great for beginners, but it just doesn’t do what Oracle can.
There is a plethora of data tools out there for professionals to use, each with its specific use case and strategy. Tableau is one of several highly utilized data visualization tools. It is easy to use and can be connected to multiple data sources, including flat files, databases, CRMs, etc. Honestly, for a non-technical person, it is probably one of the best tools.
However, for all the good Tableau has, it has a dark, hidden secret….
What you may ask? It is quite simple. It is too easy to use!
Wait, what? How does that even make sense? Aren’t tools supposed to be easy? Shouldn’t we just be able to click twice and have a beautiful report? Yes, you should be able to do that; we are in total support of simplifying technology.
However, there is a risk of failing to plan in these situations…
Now, if your team is accustomed to proper software development procedures and has good SDLC practices, you are in good hands. If, however, your team is a bunch of bright and intelligent business analysts who are eager to impress you, you might have problems.
There is a fine line between constant prototyping and documenting literally every change request (including font changes). Somewhere in the middle is where you need your BI and data science teams to be. They need to build you reports quickly, but they also need to build them in a maintainable way, with proper QA steps, analysis stages, and risk assessments.
Where does the ease of Tableau fall into all this? If you can make a report in a few clicks, the temptation is to not actually think about what you are creating, what data sources it relies on, and how to QA it. How do you know the report will always work and that it is always accurate? Do you know if your data is good? Who manages that data? What happens if the data source disconnects or changes? Did you contact the owner to make sure they know you are using their data? And most importantly, did you check whether you already have this report, and whether it needs a revamp rather than a new build?
We brought up just a few of the questions that need to come up when planning any data driven project. We were mostly using them as examples of why Tableau can cause inexperienced teams problems.
Tableau is a great tool. In fact, we love it for prototyping and mock-ups. We can demo an example report for a client or executive in anywhere from 30 minutes to a few hours, and the report might even be usable at that stage (depending on how well we know the data source). That doesn’t mean it should be. Tableau is created to be robust, but it has a lot of issues that need to be considered.
Besides being too easy to use (yes, we know that sounds weird), Tableau isn’t always as smooth at integrating with other technologies as you might think. For instance, we once created a pretty cool Tableau report where you could track all your invoices and click a link through to each one. Pretty snazzy, right? All the executives were raving about how they could actually track their invoices. Before, they were literally just told they owed money.
The invoices were connected to an internal app that only worked in IE, so the end-users had to use IE to get to them. This was fine for about 6-7 months. Then, one day, the company updated Tableau. Things should have been hunky-dory, right?
Of course, there was one problem. This version of Tableau worked terribly in IE. It would throw errors, freeze up, and just cause constant issues. We couldn’t do anything to fix it, and because our internal tool only worked in IE, there was no winning. Obviously, we can’t go into the source code of either program and try to get them to work together. Now a dashboard that our entire executive team relied on became unusable.
Some of this was due to the company's haphazard approach to third party applications. However, it is also a good example of how easy it is to miss mapping out risks, which makes it even more important that you analyze each component of your future solution.
Tableau is really a great tool, but like any tool, it requires a decent understanding of how to implement it. Tableau can be very dangerous. As the great saying goes, with great power comes great responsibility (to learn Tableau best practices).
Business and technology continue to struggle to find a harmonious unity. With new buzzwords floating around like 'Big Data', ‘Data Science', and 'Machine Learning', it has become even more difficult for business and tech teams to work together coherently.
In the book Analyzing the Analyzers: An Introspective Survey of Data Scientists and Their Work, the authors point out that “Despite the excitement around ‘data science’, ‘big data’, and ‘analytics’, the ambiguity of these terms has led to poor communication between data scientists and those who seek their help”. There remains a chasm between technology and business. In the end, how do we get these two estranged business entities to find a common language?
We recognize that technology has always been a source of buzzwords as well as inspiring innovation. Data science has been the source of a lot of the excitement for about a decade. When people experience and read about what companies like Google, Amazon, and Facebook have been able to do with their data and the value they have gained, of course managers want to jump on the opportunity.
Wouldn’t you want to be the one that brought your company to the level of all these titans of the tech industry? That would be pretty sweet!
However, in the rush to get to their final destination, these management teams try to cut corners and end up skipping the journey. Giant corporations like Facebook and Google have had years to perfect their craft. They didn’t jump on the bandwagon; they built it. They realized that good data science and machine learning start with good data, and with proper processes to get the end user (whether the data scientist or the customer) the data they really need.
Most of these companies have been doing it for over a decade. Now we have lots of copycats trying to imitate their success. Managers are hiring data science teams, BI teams, and applied mathematicians left and right. Then they try incorporating them into their companies' current tangled ecosystems of specialized teams.
When this occurs, most managers run into several problems.
Your Data Is Locked up
A common problem most data scientists and BI analysts experience is getting to their company's data. This is typically caused by people with great intentions. DBAs are often the sole guardians between operation-critical data and the slew of internal threats like devs, BI analysts, and fresh-out-of-school grads. If allowed, these intelligent but unaware miscreants would drop an entire database, delete tables, and cause all sorts of mayhem without even knowing it.
Guess who gets blamed for all of this? And who has to fix it?
Yup, the DBA.
They have a lot of good reasons for not wanting to freely let anyone have access to their data stashes. However, this causes data scientists to be bogged down with process. Even when they are allowed access to data, they often have to construct their own pipelines/ETLs and structure new databases.
All of this eats into a data scientist’s ability to be productive at what they do best. Suddenly, a simple 4-week project of analyzing sales data becomes an eight-month-long mission to create a new data warehouse, build ETLs, and QA the data before they can even start the first bit of analytics work.
When you are paying an employee upwards of 100k, this isn't practical. You need them to be fully functional, and fast!
Your Data is Not Reliable
For all the hubbub about the value of data, very few data evangelists warn companies about dirty data. Most Hadoop infrastructure salesmen and Tableau specialists are just trying to sell the next “it” product. They convince companies and managers that no matter what shape their data is in, they have the tools to fix it.
If they’re anything like that, it should tell you that they have never worked with data beyond the 3 months of intro their start up gave them. Of course, the less the salesman knows the better. If he doesn’t realize the limitations of his own product, it is easier for him to tell the truth without knowing he is lying.
In some cases, data is just wrong. Over the years, systems may never have been QAed, and the gold mine you think you are sitting on may just be a garbage dump.
First things first: before you go off and buy yourself new toys to utilize your data, make sure your room is clean. We don’t care how this is done, whether you hire someone internally or look for consultants. Get someone to analyze your data sources every few years.
Some form of audit can be very valuable: something that guarantees your data is good and can trace when it might have gone bad. Otherwise, when you spend several hundred thousand dollars of capital budget on a data science tool or new machine learning algorithm, you might just end up with an expensive paperweight, so to speak.
Your Data Team Is New To Business
Data scientists are a rare breed. While every business analyst who has used SQL or R for three months is suddenly throwing the title on their resume, there are truly very few data scientists who meet the qualifications businesses are looking for.
Larger companies have gone so far as to pluck professors out of colleges with three PhDs, twenty research papers, and a Nobel Peace Prize, because they prove to be the most qualified for the role of data scientist. Of course, once you have captured a few of these illustrious creatures, what are you going to do with them?
A lot of businesses just throw them at data, and expect value to be found. Don’t get me wrong, ambiguity is great and all but how do you expect them to know what is valuable to you, or the company? Especially if you just hired the team?
You can’t just expect them to know what you want.
There needs to be an open dialogue that lets them see what the business needs, to reduce the tech gap. There need to be conversations about what the business sees as strategic initiatives, expected data science team outputs, and external threats that can be mitigated using the company's new potential competitive advantage. Even machine learning engineers are given a year and a half to come up with each valuable new algorithm or concept; data science teams won’t always be successful right away. It will take even longer if your company doesn’t include an executive to lead these teams who is also part of high level strategic meetings. That way, they can help give the team insight into where the company is going.
This gap between business and technology isn’t new. We have dealt with it for generations, and only a few tech-centric companies really seem to have found the solution. Others remain in the dark and fumble when it comes to finding value from their tech and data teams.
This gap is caused by the fact that management can’t properly qualify what they want, and data scientists don’t ask the right clarifying questions. Yes, requirements should be somewhat vague to avoid boxing in your results. However, managers are still responsible for defining what is valuable. Without that, your team might learn some cool things and create awesome dashboards and websites, and still provide your company a utility of zero. Zilch. Nada!
Your department might have just spent hundreds of thousands, maybe even millions, of dollars to start this team, and it might produce some cool deliverables, none of which benefit you, or which your strategy team can’t implement. Then what was the point? If you spent that much time and money on a project that yielded no benefit besides a cool dashboard...why?
Data scientists have a lot of good energy, and they are brilliant in multiple fields. Only some are brilliant at creating value for the business without being told what that means (in all fairness, this trait is not specific to data scientists; accountants, analysts, and even managers sometimes fail here). When you are an individual contributor, it can sometimes be hard to see the big picture. This is what we believe business managers are for.
Overall, data science and analytics provide companies the opportunity for a competitive advantage. It allows managers to make better decisions, and connect better with the customer. However, it is important to take a close look at what resources you have to work with, before attempting to bring projects in front of your data science teams.