Recently, our team of data science consultants had an awesome opportunity to present to a class of future data scientists at Galvanize Seattle. It was a lot of fun and we met a lot of ex-software developers and IT specialists. One student who had come to hear our talk was named Rebecca Njeri. She did not have a background in software engineering. However, she was clearly well adapted to the new world. In fact, for one of her projects she used company data to build a model that predicts recidivism among former inmates using supervised learning.
We love the fact that her project was not just technically challenging, but that it was geared towards a bigger purpose than selling toasters or keeping customers from quitting your telecommunication plan! She also brought up her experience interviewing for data science roles at Microsoft and other large corporations and how it taught her so much. We wanted to share what she learned so we asked if she would write us a guest post! And she said yes! So without further ado, here is
How to Prepare for a Data Science Interview:
If you are here, you probably already have a Data Science interview scheduled and are looking for tips on how to prepare so you can crush it. If that’s the case, congratulations on getting past the first two stages of the recruitment pipeline. You have submitted an application and your resume, and perhaps done a take home test. You’ve been offered an interview and you want to make sure you go in ready to blow the minds of your interviewers and walk away with a job offer. Below are tips to help you prepare for your technical phone screens and on-site interviews.
Read the Job Description for the Particular Position You are Interviewing for
Data Scientist roles are still pretty new and the responsibilities vary wildly across industries and across companies. Look at the skills required and the responsibilities for the particular position you are applying for. Make sure that the majority of these are skills that you have, or are willing to learn. For example, if you know Python, you could easily learn R if that’s the language Data Scientists at Company X use. Do you care for web-scraping and inspecting web pages to write web-crawlers? Does analyzing text using different NLP modules excite you? Do you mostly want to write queries to pull data from SQL and NoSQL databases and analyze or build models based on this data? Set yourself up for success by leveraging your strengths and interests.
Review your Resume before each Stage of the Interviewing Process
Most interviews will start with questions about your background and how that qualifies you for the position. Having these things at the tip of your fingers will allow you to ease into the interview calmly as you won't be fumbling for answers. Use this time to calm your nerves before the technical questions begin.
Additionally, review your projects and be prepared to talk about the Data Science process you used to design your project. Think about why you chose the tools that you used, the challenges that you encountered along the way, and the things that you learned along the way.
Look at GlassDoor for Past Interview Questions
If you are interviewing for a Data Scientist role at one of the bigger companies, chances are they’ve already interviewed other people before you, who may have shared these questions on GlassDoor. Read them, solve them, get a feel of the questions you will be asked. If you cannot find previous questions for a particular company, solve the data science questions from other companies. They are similar, or at the very least, correlated.
Moreover, even if there are no data science questions for that particular company, see what kind of behavioral questions are asked.
Ask the Recruiter about the Structure of the Interview
Recruiters are often your point of contact with the company you are interviewing at. Ask the recruiter questions about how your interview will be structured, what resources you should use when preparing for your interview, what you should wear to the interview, and even the names of your interviewers so you can look them up on LinkedIn and see their areas of specialization.
Do Mock Interviews
Interviewing can be nerve-racking, more so when you have to whiteboard technical questions. If possible, ask for mock interviews from people who have been through the process before so you know what to expect. If you cannot find someone to do this for you, solve questions on a white board or notebook so you get the feel of writing algorithms some place other than your code editor.
Practice asking questions to understand the scope and constraints of the problem you are solving. Once you are hired, you will not be a siloed data scientist. It is reasonable to bounce around ideas and see if you are on the right track. It is not always about getting the correct answer, which often does not exist, but about how you think through problems, and how you work with other people as well.
Practice the Skills that you Will be Tested On
Your preparation should be informed by the job description and the conversation with recruiters. Study the topics that you know will be on the interview. Look up questions for each area in books and online. Review your statistics, machine learning algorithms, and programming skills.
Additionally, Spring Board has compiled a list of 109 commonly asked Data Science Questions.
KDnuggets also has a list of 21 must know Data Science Interview Questions and Answers.
Follow Up with Thank You Emails
This is probably standard etiquette for any interview but remember to send a personalized thank you email within 24 hours of your interview. Also, if you have thought of the perfect answer to that question you couldn't solve during your interview, include it as well. Don’t forget to express your enthusiasm for the work that Company X does and your desire to work for them.
If you get an offer after your first round of data science interviews, Congratulations! Close this tab and grab a beer. If you are turned down, like most of us are, use the lessons you learned from your past interviews to prepare for your next interviews. Interviews are a good way to identify your areas of weakness, and consequently become a better candidate for future openings. It’s important to stay resilient, patient, and keep a learner’s mindset. Statistically, you probably won't get an offer for each position you apply for. Like the excellent data scientist you are, debug your interviewing process and up your future odds.
Other Great Data Science Blog Posts To Help Make You A Better Data Scientist!
How To Ensure Your Data Science Teams And Projects Succeed!
Why And How To Convince Your Executives To Invest in A Data Science Team?
As more companies turn towards data science and data consulting to help gain competitive advantage, we are getting more people asking us: why should we start a data science team? We wanted to take a moment and write down some reasons why your company needs to start a data science team. It is not just a fad anymore, it is becoming a need!
Maintain Competitive Advantage
As more and more companies start integrating data science and better analytics into their day-to-day strategy, it will no longer be a competitive advantage to make data-driven decisions. It will be a necessity. Executives will have to rely on accurate data and sustainable metrics to make consistently correct decisions. If your company moves based on the actual why, rather than on speculation and surface-level problems, then it can make a greater and more effective impact internally and externally.
Better Understanding of Current and Possible New Customers
Data science helps give executives and managers a deeper understanding of why customers make the choices they do. Google has had the greatest opportunity as it has almost become a third lobe of some people’s brain. We tell it when we are happy, sad, looking for love, studying, etc. It knows what our questions are. However, Google is not the only one. Other companies are beginning to realize that their customers have been telling them their opinions on multiple social platforms and blogs. They just need to go and look, and learn how to better hear from their customers' data.
This is a great opportunity for corporations to seek out what their customers are feeling about their products and their company's image, and what else they are purchasing. Sometimes this may require purchasing third party data. Even then, this might be worth it depending on the projects being done.
Better Understanding of Internal Operations
Whether it is HR, finance, accounting or operations, data science is helping tie all these fields together and paint a better picture of what is happening in a company. Why are people leaving, why was there so much overtime, how can a company be better at utilizing resources, and so on. These questions can be answered by taking high quality data and blending it to find out the whys. This in turn will provide a better work place environment and increase resource efficiency.
Increases Performance Through Data Driven Decisions
Data science is still a relatively new field. Many companies are still figuring out how to properly implement data science solutions. However, those that do are seeing amazing results. Look no further than Amazon, or Uber. These companies are changing and have changed the way we view certain industries. Why, because they know what their customers want, they know what the industry is doing wrong, and they know how to charge the customer less, but give them more.
Overall, increasing data science and analytics proficiency allows for your executives to trust their data more and make clearer decisions. Consider looking into a team today!
Is your company looking to figure out who should become data scientists and how to start a team? You are not alone. Even Amazon and Airbnb are starting internal universities to teach more of their teams the value of data science. Maybe your company needs help setting up some internal classes to help increase your data science and machine learning skill sets. Acheron provides multiple forms of internal education programs. They can be for managers or analysts. One form is a quick guide on how to run a data science team! This is for managers and executives who are starting, or already have, a data science team and want to ensure they are getting the best return on investment from their team and that their team members all feel challenged!
We took one subsection out and wanted to share a common question we get when we talk to executives: who are data scientists, and who should become one? One such client told us they have loads of scientists, but weren't sure how to turn them into data scientists, or who in their cohorts should really become one.
Below we will go over some of the top soft skills data scientists should have, and what type of personality someone should have before they enroll in some form of data science program, whether this be an internal program or an external one, like Galvanize or a university data science certificate. In the end, data science is a skill that companies will need to harness to make sure they can keep up with the rest of their competitors who are already successfully implementing data science into their upper level strategy.
Who are Data Scientists?
Data scientists have to be driven individuals. They not only must be technically savvy, they also need to be proactively aware of their company’s nuances. If they happen to see a correlation or pattern, they will seek out how to access the data required and will bring possible projects up to their manager.
Being driven is great, especially when combined with curiosity. Data scientists love to ask why, and not stop until they find out the root cause. They are great at pinpointing the actual patterns in the noise. This is a necessary skill in order to peel apart the complexity and relationships various data sets may have. Occasionally, an individual may have a curious mind, but may lack the drive to act upon their inquiries.
Tolerance of Failure
Data science has a lot of similarities to the science field, in the sense that there might be 99 failed hypotheses that lead to 1 successful solution. Some data driven companies only expect their machine learning engineers and data scientists to create new algorithms or correlations every year to year and a half. This depends on the size of the task and the type of implementation required (e.g. process implementation, technical, policy, etc.). This means data scientists must be willing to fail fast and often, similar to using the agile methodology. They have to constantly test, retest, and prove that their algorithms are correct.
The term data storyteller has become correlated with data scientist. This skill-subset fits in the general skill of communication. Data scientists have access to multiple data sources from various departments. This gives them the responsibility and need to be able to clearly explain what they are discovering to executives and SMEs in multiple fields. This requires taking complex mathematical and technological concepts and creating clear and concise messages that executives can act upon. Not just hiding behind their jargon, but actually transcribing their complex ideas into business speak.
Creative and Abstract Thinking
Creativity and abstract thinking help data scientists better hypothesize possible patterns and features they are seeing in their initial exploration phases. Combining logical thinking with minimal data points, data scientists can lead themselves to several possible solutions. However, this requires thinking outside of the box.
Data scientists have to be able to take large problems, like what ad to show to which customer, then based off of hundreds of variables effectively find the right solution. This means taking a larger problem and breaking it down to its smallest parts. Getting rid of noise, and variables that don’t help create a clear pattern. This can sometimes be a messy process. Being able to keep focused on the bigger problem is key.
Who Should Become a Data Scientist
The skills required to be a data scientist are constantly evolving and many companies are trying to find out how to train new data scientists. In the end, the real question is, who should become a data scientist?
Data science requires constant learning. This is not just about technology; it also requires constant learning of new fields, specialties and situations, especially as data science solutions integrate further into more and more departments of corporations. Becoming familiar with only one set of vocabulary and processes is not an option. Not having some bearing in each field limits the hypotheses and logical assumptions that a good data scientist needs to make.
If you are searching for a data scientist inside your company, they are probably already attempting to push into the field. With all the online material, classes, and meet-ups, an individual would have already taken steps to get more involved. If they merely talk about it, but never act upon it, they will act similarly on a new project or idea.
There is some requirement for computational or technical abilities. Excel is a great tool, but there is a need to be able to use more powerful and customizable tools. This includes programming, data visualization and data storage tools. There is no need to be a software engineer. However, data scientists have a general idea of how to make sure code is maintainable, robust and scalable.
Looking to start a data science team?
If you are looking to start a team of your own, feel free to comment, or email us! We can do everything from pointing you in the right direction of readings if you want to do it yourself, to coming and joining you on your journey! Also, feel free to follow our blog. We will keep it up to date as we do new projects and get new questions about data science! If you email us a question, we will try to post about it!
Unstructured Data, and How to Analyze it!
Content creation and promotion can play a huge role in a company's success in getting their product out there. Think about Star Wars and Marvel. Both of these franchises are just as much commercials for their merchandise as they are just plain high quality content.
Companies post blogs, make movies, even run pinterest accounts. All of this produces customer responses and network reactions that can be analyzed, melded with current data sets and run through various predictive models to help a company better target users, produce promotional content, and alter products and services to be more in tune with the customer.
Developing a machine learning model can be done by finding value and relationships in all the different forms of data your content produces, segmenting your users and responders, and melding all your data together. In turn, your company can gain a lot more information besides the standard balance sheet data (see picture above).
Change Words to Numbers
Machine learning has created a host of libraries that can simplify the way your team performs data analysis. In fact, Python has several libraries that give programmers with high-level knowledge of data science and machine learning application design and implementation the opportunity to produce fast and meaningful analysis.
One great Python library that can take content data like blog posts, news articles, and social media posts is TextBlob. TextBlob has some great functions. Take a sentence like:
“Scary Monsters love to eat tasty, sweet apples”
You can use the lines below to pull out the nouns and what was used to describe said nouns.
How to use TextBlob to Analyze Text Data
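A minimal sketch of what those lines might look like, assuming TextBlob's part-of-speech tagger (its `tags` property returns (word, tag) pairs using Penn Treebank tags, where JJ marks adjectives and NN/NNS mark nouns). The helper name `describe_nouns` and the hand-tagged sample are our own illustration, not part of TextBlob, and the sample tags are a plausible guess rather than verified tagger output.

```python
def describe_nouns(tagged_words):
    """Pair each noun with the adjectives that directly precede it.

    tagged_words: list of (word, tag) pairs, the shape returned by
    TextBlob(text).tags. Tags starting with "JJ" are adjectives and
    tags starting with "NN" are nouns (Penn Treebank convention).
    """
    descriptions = {}
    pending_adjectives = []
    for word, tag in tagged_words:
        if tag.startswith("JJ"):
            pending_adjectives.append(word.lower())
        elif tag.startswith("NN"):
            descriptions.setdefault(word.lower(), []).extend(pending_adjectives)
            pending_adjectives = []
        else:
            # Any other word breaks the adjective-noun chain.
            pending_adjectives = []
    return descriptions


# Hand-tagged stand-in for TextBlob(sentence).tags (an assumption, not real tagger output).
sample_tags = [
    ("Scary", "JJ"), ("Monsters", "NNS"), ("love", "VBP"), ("to", "TO"),
    ("eat", "VB"), ("tasty", "JJ"), ("sweet", "JJ"), ("apples", "NNS"),
]
noun_descriptions = describe_nouns(sample_tags)
print(noun_descriptions)
```

With TextBlob installed, `from textblob import TextBlob` followed by `describe_nouns(TextBlob(sentence).tags)` would feed real tagger output into the same helper.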
This takes data that is very unstructured and hard to analyze, and begins to create more analysis-friendly data sets. Other great uses of this library are projects such as chat bots.
From here, you can combine polarity, positivity, shares, topic focus to see what type of social media posts, blog posts, etc, become the most viral.
Another library worth checking out is word2vec, which exists in Python, R, Java, etc. For instance, check out deeplearning4j.
Marketing Segmentation with Data Science
Social media allows once hard to get data, such as people's opinions on products, their likes, dislikes, gender, location, and job, to be much more accessible. Sometimes you may have to purchase it; other times, some sites are kind enough to allow you to take it freely.
In either case, this allows companies an open door to segmenting markets with much finer detail. This isn’t based off of small surveys that only have 1000 people, we are talking about millions, and billions of people. Yes, there is a lot more data scrubbing required. But there is an opportunity to segment individuals, and use their networks to support your company's products.
One example is a tweet we once sent to SQL Server. They quickly responded. Now, based on the fact that we interacted with SQL Server and talk so much about data science and data, you can probably assume we are into technology, databases, etc. This is basically what Twitter, Facebook, Google, etc. do to place the right ads in front of you. They also combine cookies and other data sources like geolocation.
If you worked for Oracle, perhaps you would want us to see some posts about the benefits of switching to Oracle, or ask for our opinion on why someone prefers using SQL Server over Oracle (we personally have very little preference, as we have used both and find both useful). Whatever it may be, there are opportunities to swing customers. Now what if your content was already placed in front of the right people? Maybe you tag a user, or ask them to help you out or join your campaign! Involve them, see how you can help them.
For instance, bloggers are always looking for ways to get their content out there. If your company involves them, or partners with them in a transparent way, your product now has access to a specific network. Again, this is another great place where data science and basic statistics come into play.
If you haven’t tried tools like NodeXL, it is a great example of developing a model to find strong influencers in specific networks. This tool is pretty nifty. However, it is limited. So you might want to make some of your own.
Utilizing the data gathered from various sites, and algorithms like K nearest neighbor, PCA, etc., you can find the words used in profiles, posts and shares, the companies customers interact with, etc. Then:
The list goes on. It may be better to start with NodeXL, just to see what you are looking for.
Now what is the value of doing all this analysis, data melding, and analytics?
ROI Of Content:
At the end of the day, you have plenty of questions to answer.
These aren’t the easiest questions to answer. However, here is where you can help turn the data from your social presence into value for your company:
Typical predictive analytics utilize standard business data (balance sheet, payroll, CRM, and operational data). This limits companies to the “what” happened, and not the why. Managers will ask: why did the company see that spike in Q2? Or that dip in Q3? It is difficult to paint a picture when you are only looking at data that has very little insight into the why. Simply doing a running average isn’t always great, and putting in seasonal factors is limited to domain knowledge.
However, data has grown, and now, having access to the “Why” is much more plausible. Everything from social media, to CRMs to online news provide much better insight into why your customers are coming or going!
This data has a lot of noise, and it wouldn’t really be worth it for humans to go through it. This is where having an automated exploratory system developed will help out a lot.
Finding correlations between content, historical news, and company internal data would take analysts years. By the time they found any value, the moment would have passed.
Instead, having a correlation discovery system that is automated will save your company time, and be much better at finding value. You can use this system to find those small correlating factors that play a big effect. Maybe your customers are telling you what is wrong with your product, and you just aren’t listening. Maybe, you find a new product idea.
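As a rough sketch of what the core of such a correlation discovery step could look like, the snippet below checks every pair of columns and keeps the pairs whose Pearson correlation passes a threshold. The function names, the 0.8 threshold, and the weekly numbers are all made up for illustration; a real system would also have to deal with time lags, missing values, and the fact that a correlation is only a lead to investigate, not a cause.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation signal
    return cov / sqrt(var_x * var_y)


def find_correlations(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every column pair with |r| >= threshold."""
    hits = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            hits.append((name_a, name_b, round(r, 3)))
    # Strongest relationships first.
    return sorted(hits, key=lambda hit: -abs(hit[2]))


# Made-up weekly numbers, purely for illustration.
weekly = {
    "blog_shares": [10, 20, 30, 40, 50],
    "sales": [105, 198, 310, 405, 490],
    "support_tickets": [7, 3, 8, 2, 6],
}
strong_pairs = find_correlations(weekly)
print(strong_pairs)
```

Scheduling something like this to run over every new data set is what turns it from a one-off analysis into an automated exploratory system.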
In the Acheron Analytics process, this would be part of our second and third phases. We always look for as many possible correlations as we can, and then develop hypotheses and prototypes that lead to company value.
This process allows companies to have data help define their next steps. This provides their managers with data defended plans. Ones that they can go confidently to their managers with.
When it comes to analyzing your company's content and marketing investments, techniques like machine learning, sentiment analysis, and segmentation can help develop data-driven marketing strategies.
We hope this inspired some ideas how to meld your company’s data! Let us know if you have any questions.
Python is a great language for developers and scripters alike. It allows for some large scale design and OOP concepts. However, it was also developed to be very easy to read and to design quick scripts in! This is great, because data scientists don’t have all day to spend debugging. They do need to spend some time picking out which Python libraries will work best for their current projects. We at Acheron Analytics have written up a quick list of the 8 most used libraries that can help your next machine learning projects.
P.s....we had a busy week and couldn't get to an actual code example this week as we promised in our last post. However, we are working on that post! We will shortly have an example in R for a from scratch algorithm.
Theano, according to Opensource.com, is one of the most heavily used machine learning libraries to date. The great thing about Theano is that it is written leaning on mathematical concepts and computer algebra. When the code is compiled, it has the ability to match C-level performance.
This is due to the fact that it is written to take advantage of how computer compilers work. This in short is how a computer parses and converts tokens into parse trees, how it optimizes and merges similar sub-graphs, using GPU for computations and several other optimizations. For the full list, check out the Theano main page.
For those who have used math-based languages like Mathematica and MATLAB, the coding structure won’t seem too strange.
What is great, is that Nvidia fully supports Theano and has a few helpful videos on how to use Theano and their GPUs.
When it comes down to it, machine learning and data science must have good data. How do you handle that data? Well, one great Python library is Pandas. It was one of the first data libraries many of us were exposed to at Acheron and still has a great following. If you are an R programmer, you will enjoy this library. It allows you to use data frames, which makes thinking about the data you are using much more natural.
Also, if you are a SQL or RDBMS person, this library naturally fits with your tabular view of data. Even if you are more of a Hadoop or MongoDB follower, Pandas just makes life easier.
It doesn’t stop there. It handles missing data, time series, IO and data transformations incredibly well. Thus, if you are trying to prepare your data for analysis, this Python library is a must.
We also wanted to share this great Python cheat sheet we found. However, we would feel wrong just sticking it on our blog. Instead, here is a link to the best Python cheat sheet we have found yet! This even beats Datacamp's cheat sheets!
NumPy is another data managing library. Typically you see it paired with Tensorflow, SciPy, matplotlib and so many other Python libraries geared towards deep learning and data science. This is because it is built to manage and treat data like matrices. Again, this goes back to MATLAB and R. The purpose is to provide the ability to easily do the complex matrix operations that are required by neural networks and complex statistics.
Trying to handle those kinds of operations in plain multi-dimensional arrays or lists is not the most efficient.
Let's say you want to set up an identity matrix. That is one line of code in NumPy. Everything about it is geared towards matrices and quick mathematical operations that are done in just a few lines. Coursera has a great course that you can use to further your knowledge about this library.
How to code for an Identity Matrix:
array([[ 1., 0., 0.],
[ 0., 1., 0.],
[ 0., 0., 1.]])
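The output above is a 3x3 identity matrix. A small sketch of the one-liner that produces such a matrix, assuming NumPy is installed and imported under its usual alias `np` (the second matrix `m` is made up just to show the identity in action):

```python
import numpy as np

# One call builds a 3x3 identity matrix of floats.
identity = np.identity(3)
print(identity)

# Multiplying by the identity leaves a matrix unchanged.
m = np.array([[2.0, 5.0, 1.0], [0.0, 3.0, 4.0], [7.0, 1.0, 6.0]])
unchanged = identity @ m
```

`np.eye(3)` gives the same result here, and also accepts extra arguments for non-square or shifted diagonals.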
This is an odd one out. Scrapy is not a mathematical library; it doesn’t perform data analysis or deep learning. It does nothing you would think you would want to do in machine learning. However, it does one thing really well: crawl the web. Scrapy is built to be an easy library for developing safe web crawlers (side note: make sure you read all the documentation; it is built to be a safe web crawling library if you configure it right, and that is something you have to research).
The web is a great source of unstructured, structured, and visual data. As long as a site approves of you crawling and doesn’t mind you using their content (which we are not responsible for figuring out), you can gain a lot of insight into topics. You can use libraries that take words and put them into vectors to help perform analysis, sentiment analysis, etc. It is much more difficult than using straightforward numbers. It is also much richer. There is a lot to be gained from pictures, words, and unstructured data. With that comes the task of getting that information out of the complex data.
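One concrete way to check whether a site approves of crawling is its robots.txt file. The sketch below uses Python's standard `urllib.robotparser` rather than Scrapy itself, and the robots.txt content, crawler name, and URLs are hypothetical. It only covers the robots.txt side of things; terms of service and content licensing still have to be checked separately.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt, user_agent, urls):
    """Return the subset of urls that robots_txt lets user_agent fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Hypothetical robots.txt content and URLs, for illustration only.
example_robots = """\
User-agent: *
Disallow: /private/
"""
candidates = [
    "https://example.com/blog/post-1",
    "https://example.com/private/report",
]
crawlable = allowed_urls(example_robots, "my-crawler", candidates)
print(crawlable)
```

In a live crawler you would load the real file with `set_url()` and `read()` instead of a string. Scrapy can do this check for you when the `ROBOTSTXT_OBEY` setting is turned on.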
That being said, Pattern is another specialized web mining scraper. It has tools for Natural Language Processing (NLP) and Machine Learning. It has several built in algorithms and really makes your life as a developer much easier!
We have discussed several libraries such as matplotlib, NumPy and Pandas and how great they are for machine learning and data science. Now, imagine if you built an easy to use library on top of all of those, as well as several other easy to use libraries. Well, that is what scikit-learn is. It is a compilation of these libraries to create easy access to complex data science algorithms and data visualization techniques. It can be used for clustering, transforming data, dimensionality reduction (reducing the number of features that exist), ensemble methods, feature selection and a lot of other classic data science techniques, and they are all basically done in a few lines!
The hardest part is making sure you have a virtual Python environment when you pip install!
matplotlib and ggplot
Now you have done all this analysis, and run all your algorithms. What now? How do you actually turn around value from all this data you have? How do you inspire your executives and tell them “Stories” full of “Insight”, etc.? If you don’t want to mess around with D3.js, Python has you covered with libraries like matplotlib and ggplot. Both are really built to mimic MATLAB and R functionality. Matplotlib has some great 3D graphs that will help you visualize your kNN and PCA algorithms and clusters.
When you are in the data exploration, hypothesis, and final product phases of a project, using these libraries makes life much easier. You can visualize your data, its quirks and your final results!
We have discussed Tensorflow before on this blog when we talked about some common libraries used by data science professionals. It doesn't hurt to talk about it again though! The fact is, if you are in the world of machine learning, you have probably heard, tried, or implemented some form of deep learning algorithm. Are they necessary, not all the time. Are they cool when done right, yes.
Tensorflow and Theano are very similar. The interesting thing about Tensorflow is that when you are writing in Python, you are really only designing a graph that is then handed to a C++ backend to run on either your CPU or GPU. This is what makes this library so effective and easy to work with. Instead of having to write at the C++ or CUDA level, you can code it all in Python first.
The difficulty comes in actually understanding how to properly set up a neural network, convolutional network, etc. A lot of questions come into play: which type of model, what type of data regularization do you think is best, what level of data dropout or robustness do you want, and are you going to purchase GPUs from Nvidia or try to make it work on CPUs? (Depending on your data size, you will most likely have to purchase GPUs, or pay for AI as a service tech from Google.)
These are just a few of the most commonly mentioned python libraries that are utilized by academics and professionals. Do you agree? Feel free to share what languages, libraries and tools you use, even if they aren’t python!
During our last post, we discussed a key step in preparing your team for implementing a new data science solution (How to Engineer Your Data). The step following preparing your data is automation. Automation is key to AI and machine learning. You don’t want to be filling in fields, copying and pasting from Excel, or babying ETLs. Each time data is processed, you want to have some form of automated process that gets kicked off at a regular interval and helps analyze, transform and check your data as it moves from point A to point B.
Before we can go off and discuss analysis, engineering and QA, we must first assess what tools your company uses. Now, the tools you choose to work with for automation are all up to what you are comfortable with.
If you are a Linux lover, you will probably pick crontab and watch. Windows users will lean towards Task Scheduler. The end result is the same. You could choose other tools.
Once you know what tool will be running your automation, you need to pick some form of scripting language. This could be Python, Bash, even PowerShell. Even though it is a scripting language, we would still recommend creating some form of file structure that acts as an organizer. For instance:
This makes it easier on developers past, present and future to follow code when they have to maintain it. Of course, you might have a different file structure, which is great! Just be consistent.
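As one illustration of a consistent file structure, here is a minimal Python sketch that creates a project skeleton. The folder names are hypothetical placeholders chosen for this example, not a standard and not necessarily the layout the original post showed.

```python
from pathlib import Path

# Hypothetical layout: these folder names are placeholders, not a standard.
LAYOUT = [
    "config",            # connection settings, schedules, thresholds
    "scripts/load",      # scripts that register and load incoming files
    "scripts/analysis",  # exploration and modelling scripts
    "scripts/qa",        # data checks
    "logs",
    "tests",
]


def create_layout(root):
    """Create the skeleton folders under `root` and return the created paths."""
    root = Path(root)
    created = []
    for relative in LAYOUT:
        folder = root / relative
        # exist_ok=True makes the script safe to re-run on an existing project
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

Calling `create_layout("my_project")` once per new project keeps every automation repository shaped the same way, which is the point of being consistent.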
The Setup:
To describe a very basic setup: we would recommend starting out with some form of file landing zone, whether this is an FTP server or a shared drive. Some location that the scripts have access to needs to be set up.
From there, it would be best to have some RDBMS (MySQL, MSSQL, Oracle, etc.) that acts as a file tracking system. This will track when new files get placed into your file storage area, what type of file they are, when they were read, etc. Consider this some form of meta table. At the beginning, it can be very basic.
Just have the layout below:
The key for automation is the final column: a flag column that distinguishes whether a file has been read or not. There are also other tables you might want around this, for instance an error table, or a dimension table that could contain information about the customers attached to files.
How does that info get there? An automation script of course! Have some script whose job is to place new file metadata into the system.
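A minimal sketch of such a script is below. It uses SQLite from the Python standard library as a stand-in for whichever RDBMS you run. The table name `file_tracker`, its columns, and the function names are all assumptions made for illustration. Only the idea of a read/unread flag in the final column comes from the description above.

```python
import sqlite3
from datetime import datetime, timezone
from pathlib import Path


def init_tracker(conn):
    """Create the (hypothetical) meta table; the read flag is the last column."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS file_tracker (
            file_name   TEXT PRIMARY KEY,
            file_type   TEXT,
            received_at TEXT,
            read_at     TEXT,
            is_read     INTEGER NOT NULL DEFAULT 0
        )
        """
    )
    conn.commit()


def register_new_files(conn, landing_zone):
    """Add a row for every file not yet tracked; return the new file names."""
    new_files = []
    for path in sorted(Path(landing_zone).iterdir()):
        if not path.is_file():
            continue
        known = conn.execute(
            "SELECT 1 FROM file_tracker WHERE file_name = ?", (path.name,)
        ).fetchone()
        if known:
            continue
        conn.execute(
            "INSERT INTO file_tracker (file_name, file_type, received_at) "
            "VALUES (?, ?, ?)",
            (
                path.name,
                path.suffix.lstrip(".").lower(),
                datetime.now(timezone.utc).isoformat(),
            ),
        )
        new_files.append(path.name)
    conn.commit()
    return new_files


def unread_files(conn):
    """Return the names of files whose flag says they have not been read."""
    rows = conn.execute(
        "SELECT file_name FROM file_tracker WHERE is_read = 0 ORDER BY file_name"
    )
    return [row[0] for row in rows]


def mark_read(conn, file_name):
    """Flip the flag once a downstream script has processed the file."""
    conn.execute(
        "UPDATE file_tracker SET is_read = 1, read_at = ? WHERE file_name = ?",
        (datetime.now(timezone.utc).isoformat(), file_name),
    )
    conn.commit()
```

A scheduler can call `register_new_files` at a regular interval, while the loading scripts work through `unread_files` and call `mark_read` as they finish each file.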
Following this, you will have a few other scripts for analysis, data movement and QA that are all separate. This way, if one side fails, you don’t lose all functionality. If you can’t load, you just can’t load and if you can’t process data, you just can’t process it.
When starting any form of data science or machine learning project, the engineers may have limited knowledge of the data they are working with. They might not know what biases exist, what data is missing, or what other quirks the data has. This all needs to be sorted out quickly. If your data science team is manually creating scripts to do this work for each individual data set, they are losing valuable time. Once data sets are assigned, they should be processed by an automated set of scripts that can either be called from a command line prompt or, even better, run automatically.
These basic scripts often contain histograms, correlation matrices, clustering algorithms, and some straightforward algorithms that take N variables and have a specified list of outputs. This could be logistic regression, k-nearest neighbors (KNN), and Principal Component Analysis (PCA) for starters. In addition, a summary function of some kind can be run after each model. If using R, this is simply summary().
A function example that we have used as part of previous exploration automation:
Basic Correlation Matrix
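The original function is not reproduced here. As a stand-in, the following is a minimal Python sketch of a Pearson correlation matrix using only the standard library. The function names and the dict-of-columns input format are assumptions for illustration. In practice, pandas users would more likely call `DataFrame.corr()`.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Build a nested dict: matrix[a][b] is the correlation of columns a and b.

    `columns` maps a column name to its list of numeric values.
    """
    names = list(columns)
    return {
        a: {b: pearson(columns[a], columns[b]) for b in names}
        for a in names
    }
```

Because it takes any mapping of column names to values, the same function can be pointed at every new data set without per-data-set changes.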
Data Engineering Phase
Once you have finished exploring your data, it is important to plan how that data will then be stored and what form of analytics can be done on the front end. Can you analyze sentiment, topic focus and value ratios? Do you need to restructure and normalize the data (not the same as statistical normalization)?
Guess what! All of this can be automated. Following the explore phase, you can start to design how the system will ingest the data. This will require some manual processing up front to ensure the solution can scale. However, even this should be built in a way that allows for an easy transition to an automated system. Thus, it should be robust and systematized from the start! That is one of our key driving factors whenever we design a system at Acheron Analytics. It might start out being run from the command line, but it should easily transition to being run by Task Scheduler or cron. This means thinking about the entire process: the variables that will be shared between databases and scripts, the try/catch mechanisms, and possible hiccups along the way.
The system needs to be able to handle failure well. This will allow your team more time to focus on the actual advantages that data science, machine learning and even standard analytics provide. Tie this together with a solid logging system, and your team won't have to spend hours or days troubleshooting simple big data errors.
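One way to combine failure handling with logging is sketched below in Python. The runner, the logger name `pipeline`, and the step names in the usage note are hypothetical. The idea is simply that each step runs inside its own try/except, so that a failure is logged with its traceback and the remaining steps still run.

```python
import logging

logger = logging.getLogger("pipeline")  # hypothetical logger name


def run_steps(steps):
    """Run each (name, callable) pair in order and return a status per step.

    A failing step is logged with its traceback and does not stop the others.
    """
    results = {}
    for name, step in steps:
        try:
            step()
        except Exception:
            # logger.exception records the message plus the full traceback
            logger.exception("step %s failed", name)
            results[name] = "failed"
        else:
            logger.info("step %s succeeded", name)
            results[name] = "ok"
    return results
```

For example, `run_steps([("load", load_files), ("qa", run_checks)])` would report a failed load while still running the QA step, assuming `load_files` and `run_checks` are functions you have defined. Whether later steps should really run after an earlier failure depends on how your steps depend on each other.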
This is one of the most crucial phases for data management and automation. QAing data is a rare skill. Most QAs specialize in software engineering and less in how to test data accuracy. We have watched companies struggle to find a QA whose skills match their data processes, or data engineers who are also very good at QAing their own work. It isn’t easy.
Having a test suite built with multiple test cases that run on every new set of data introduced is vital! And if you happen to make it dynamic, so that the upper and lower bounds tests update as new approved data sets are inserted...who are we to disagree!
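As a sketch of what a dynamic upper and lower bounds test could look like, the Python below derives bounds from previously approved values and flags incoming rows that fall outside them. The function names, the 10% margin, and the list-of-dicts row format are assumptions for illustration only.

```python
def bounds_from_approved(values, margin=0.1):
    """Derive (lower, upper) from approved values, widened by `margin`.

    The 10% default margin is an arbitrary choice for this sketch.
    """
    low, high = min(values), max(values)
    pad = (high - low) * margin
    return low - pad, high + pad


def check_bounds(rows, column, lower, upper):
    """Return the rows whose `column` is missing or outside [lower, upper]."""
    failures = []
    for row in rows:
        value = row.get(column)
        if value is None or not (lower <= value <= upper):
            failures.append(row)
    return failures
```

Each time a data set is approved, its values can be fed back into `bounds_from_approved`, so the test tightens or loosens with the data instead of relying on hard-coded limits.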
Automatically checking all the data that goes into your system can save several FTE positions, depending on how large and complex your data is. With a good QA system, a single person can manage several data applications.
The question is, what are you checking? If you don’t have a full-fledged data QA on board, this might not be straightforward. So we have a few bullet points to help you get your team thinking about how to set up their data test suites.
What you and your team need to think about when you create test suites:
Overall, automation helps save your data science and machine learning projects from getting bogged down with basic ETL and data checking work. This way, your data science teams can produce major insights efficiently, without being limited by maintenance and reconfiguration tasks. We have seen many teams, in both analytics and data science, lose time because of processes that were poorly designed from the get-go. Once a system is plugged into the organization, it is much harder to modify. So make sure to plan automation early!
In the era of data science and AI, it is easy to skip over some crucial steps such as data cleansing. However, this can cause major problems in your applications later down the data pipeline. The promise of seemingly magical data science solutions can overshadow the necessary steps required to get to the best final product. One such step is cleaning and engineering your data before it even gets placed into your system. Truthfully, this is not limited to data science. Whether you are doing data analytics, data science, machine learning, or just old-fashioned statistics, data is never whole and pure before refining. Just like putting bad unprocessed petroleum into your car, putting unprocessed data into your company's systems will either immediately or eventually wreak havoc (here are some examples). Whether that means actually causing software to fail or giving executives bad information, both are unacceptable.
We at Acheron Analytics wanted to share a few tips to ensure that whatever data science/analytics projects you are taking on, you and your team are successful. This post will go over some brief examples in R, Python and SQL. Feel free to reach out with any questions.
Duplicate data is the scourge of any analyst. Whether you are using Excel, MySQL, or Hadoop, making sure your systems don’t produce duplicate data is key.
There are several sources of duplicate data. The first arises when the data is input into your company's data storage system. There is a chance that the same data may try to sneak its way in. This could be due to end-user error, a glitch in the system, a bad ETL, etc. All of this should be managed by your data system. Most people still use an RDBMS, and thus a unique key will prevent duplicates from being inserted. Sometimes, this may require a combination of fields to check whether the data being input is a duplicate. For instance, if you are looking at a vendor invoice line item, you probably shouldn’t have the same line item number and header id twice. This can become more complicated when line items change (but even that can be accounted for). If you are analyzing social media post data, each snapshot you take may have the same post id but altered social interaction data (likes, retweets, shares, etc.). This touches on slowly changing dimensions, which is another great topic for another time. Feel free to read up more on the topic here.
In both cases, your systems should be calibrated to safely throw out the duplicate data and store the errors in some error table. All of this will save your team time and confusion later.
Besides the source data itself having duplicates, the other common source of duplicates is an analyst's query. If they accidentally don’t have a 1:1 or 1:many relationship on the key they are joining on, they may find themselves with several times the amount of data they started with. The fix could be as simple as restructuring your team's query to make sure it properly creates 1:1 relationships, or...you may have to completely restructure your database. It is more likely the former option.
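To make the join problem concrete, here is a small Python sketch using an in-memory SQLite database. The `invoices` and `vendors` tables and their rows are made up for illustration. It shows two invoices turning into three joined rows because the join key is not unique on the vendor side, along with a helper that finds the offending key values.

```python
import sqlite3


def duplicated_keys(conn, table, key):
    """Return (key_value, count) for key values that appear more than once.

    `table` and `key` are interpolated into the SQL text, so pass only
    trusted identifiers, never user input.
    """
    query = (
        f"SELECT {key}, COUNT(*) FROM {table} "
        f"GROUP BY {key} HAVING COUNT(*) > 1 ORDER BY {key}"
    )
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE invoices (invoice_id INTEGER, vendor_id INTEGER, amount REAL);
    CREATE TABLE vendors (vendor_id INTEGER, vendor_name TEXT);
    INSERT INTO invoices VALUES (1, 10, 100.0), (2, 20, 250.0);
    -- vendor 10 appears twice, so vendor_id is not a unique key here
    INSERT INTO vendors VALUES (10, 'Acme'), (10, 'Acme Inc'), (20, 'Globex');
    """
)

invoice_rows = conn.execute("SELECT COUNT(*) FROM invoices").fetchone()[0]
joined_rows = conn.execute(
    "SELECT COUNT(*) FROM invoices i JOIN vendors v ON v.vendor_id = i.vendor_id"
).fetchone()[0]
offenders = duplicated_keys(conn, "vendors", "vendor_id")
```

Comparing `invoice_rows` with `joined_rows` before trusting a joined result is a cheap check, and `offenders` points at the key values that need to be cleaned up or joined on a more specific key.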
How to Get Rid of Duplicate Data in SQL
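The sketch below shows two common approaches, driven from Python against an in-memory SQLite database. The table and column names are hypothetical, chosen to match the invoice line item example above. The first approach uses a composite unique key to reject duplicates at insert time and routes them to an error table. The second cleans up a staging table that has no constraint. It relies on SQLite's built-in `rowid`, so other databases would need a different way to pick the surviving row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE invoice_lines (
        header_id        INTEGER NOT NULL,
        line_item_number INTEGER NOT NULL,
        amount           REAL,
        UNIQUE (header_id, line_item_number)  -- composite unique key
    );
    CREATE TABLE load_errors (
        header_id INTEGER, line_item_number INTEGER, amount REAL, reason TEXT
    );
    CREATE TABLE staging_lines (
        header_id INTEGER, line_item_number INTEGER, amount REAL
    );
    """
)


def load_lines(conn, rows):
    """Insert rows; send duplicates to load_errors. Return (loaded, rejected)."""
    loaded = rejected = 0
    for row in rows:
        try:
            conn.execute("INSERT INTO invoice_lines VALUES (?, ?, ?)", row)
            loaded += 1
        except sqlite3.IntegrityError:
            # The unique key refused the row, so keep a record of it
            conn.execute(
                "INSERT INTO load_errors VALUES (?, ?, ?, ?)",
                (*row, "duplicate key"),
            )
            rejected += 1
    conn.commit()
    return loaded, rejected


def remove_duplicates(conn):
    """Keep the first row per key in staging_lines and delete the rest."""
    conn.execute(
        """
        DELETE FROM staging_lines
        WHERE rowid NOT IN (
            SELECT MIN(rowid) FROM staging_lines
            GROUP BY header_id, line_item_number
        )
        """
    )
    conn.commit()
```

Rejecting duplicates at load time is usually the cheaper of the two, since the cleanup query has to scan the whole table every time it runs.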
Has your company ever purchased data from a data aggregator and found it filled with holes? Missing data is common across every industry. Sometimes it is due to system upgrades and new features being added, and sometimes it is just bad data gathering. Whatever the cause, it can really skew a data science project's results. What are your options then? You could ignore rows with missing data, but this might cost your company valuable insight, while including the gaps will produce incorrect conclusions. So, how do you win?
There are a few different thoughts on this. One is to simply put a random and reasonable number in place of nothing. This doesn’t really make sense, as it makes it difficult to tell what is being driven by what feature. A more common and reasonable practice is using the data set average. However, even this is a little misleading. For instance, on one project we were involved with, we were analyzing a large population of users and their sociometric data (income, neighborhood trends, shopping habits). The data was purchased from a credit card carrier, and about 15% of it was missing. So throwing it away was not in our best interest.
Instead, because we had each person's zip code, we were able to aggregate at a local level. This was a judgement call, and a good one in this case. We compared this to averaging the entire data set, and we got a much clearer picture of our population's features. The problem with a general average over several hundred thousand people is that you will eventually have some odd sways. Take income: if your data set is well distributed, your average income will be, well, average. Then people who live in richer neighborhoods, and whose missing income was filled with that average, may suddenly create their own classification. The difference between 400k and 50k (even when normalized) can drastically alter the rest of the features. Does it really make sense for someone who is making 50k a year to be purchasing over 100k of products a year? In the end, we would get a strange cluster of large spenders who made average income. When your focus is socio-economic factors, this can cause some major discrepancies.
How to Handle Missing Data with SQL
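A minimal sketch of the zip-code-level fill described above, run from Python against an in-memory SQLite database. The `customers` table and its five rows are made-up illustration data, not figures from the project. The query fills each missing income with the average income of the same zip code and marks which values were imputed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY, zip_code TEXT, income REAL
    );
    -- Made-up rows: one average-income area, one high-income area
    INSERT INTO customers VALUES
        (1, '98101', 50000), (2, '98101', 70000), (3, '98101', NULL),
        (4, '98039', 400000), (5, '98039', NULL);
    """
)

IMPUTE_SQL = """
SELECT c.customer_id,
       COALESCE(c.income, z.avg_income) AS income,
       c.income IS NULL                 AS was_imputed
FROM customers AS c
LEFT JOIN (
    -- AVG skips NULLs, so each zip code's mean uses only known incomes
    SELECT zip_code, AVG(income) AS avg_income
    FROM customers
    GROUP BY zip_code
) AS z ON z.zip_code = c.zip_code
ORDER BY c.customer_id
"""

imputed = conn.execute(IMPUTE_SQL).fetchall()
overall_avg = conn.execute("SELECT AVG(income) FROM customers").fetchone()[0]
```

With these rows, the two missing incomes are filled with their own area's average, whereas `overall_avg` would have assigned both people the same blended figure. Keeping the `was_imputed` flag lets later analysis separate observed values from filled ones.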
Data normalization is one of the first critical steps in making sure your data is sensible to run through most algorithms. Simply feeding in variables that could be anything from age to income to computer usage time creates the hassle of trying to compare apples to oranges. Trying to compare an income of 400k against an age of 40 years will create bad outputs. The numbers just don’t scale. Normalization makes your data more comparable. It takes the min and max of a data set and maps them to the 0 and 1 of a scale, so the rest of the numbers can be scaled in between. Using 0-1 lets your data science teams meld the data more smoothly, since they are no longer trying to compare scales that don't match. This is a necessary step in most cases to ensure success.
R Programming Normalization
Python (this can also depend on whether you are using NumPy, pandas, etc.)
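The min-max scaling described above maps each value x to (x - min) / (max - min). Here is a minimal sketch in plain Python with no third-party libraries. The function name and the choice to return all zeros for a constant column are assumptions made for this example.

```python
def min_max_normalize(values):
    """Scale values to the 0-1 range: (x - min) / (max - min)."""
    low, high = min(values), max(values)
    if high == low:
        # A constant column has no spread; returning zeros avoids dividing by 0
        return [0.0 for _ in values]
    span = high - low
    return [(v - low) / span for v in values]


ages = min_max_normalize([20, 40, 60])
incomes = min_max_normalize([50_000, 400_000, 225_000])
```

After scaling, `ages` and `incomes` both sit on the same 0-1 scale, so an algorithm no longer sees income as thousands of times more important than age simply because of its units.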
Data preparation can be one of the longer steps when preparing your team's data science project. However, once the data is cleaned, checked, and properly shaped, it is much easier to pull out features and create accurate insights. Preparation is half the battle. Once the data is organized, it becomes several times easier to mold. Good luck with your future data science projects, and feel free to give us a ring here in Seattle if you have more questions about your data science projects.
Future Learning! And Other Data Transformations
We wanted to supply some more tools to help you learn how to transform and engineer your data. Here is a great video that covers several data transforms. This particular video relies on the R programming language.
We are a team of data scientists and network engineers who want to help your functional teams reach their full potential!