Unstructured Data, and How to Analyze it!
Content creation and promotion play a huge role in a company's success at getting its product out there. Think about Star Wars and Marvel. Both franchises are as much commercials for their merchandise as they are plain high-quality content.
Companies post blogs, make movies, even run Pinterest accounts. All of this produces customer responses and network reactions that can be analyzed, melded with current data sets, and run through various predictive models to help a company better target users, produce promotional content, and alter products and services to be more in tune with the customer.
Developing a machine learning model starts with finding value and relationships in all the different forms of data your content produces, segmenting your users and responders, and melding all your data together. In turn, your company gains far more information than the standard balance sheet data provides.
Change Words to Numbers
Machine learning has produced a host of libraries that simplify the way your team performs data analysis. In fact, Python has several libraries that give programmers with high-level knowledge of data science and machine learning the opportunity to produce fast and meaningful analysis.
One great Python library for content data like blog posts, news articles, and social media posts is TextBlob. TextBlob has some great functions. Take a sentence like:
“Scary Monsters love to eat tasty, sweet apples”
You can use a few lines of TextBlob to pull out the nouns and the words used to describe those nouns.
How to use TextBlob to Analyze Text Data
This takes data that is very unstructured and hard to analyze and begins to create a more analysis-friendly data set. Other great uses of this library include projects such as chatbots.
From here, you can combine polarity, subjectivity, shares, and topic focus to see what types of social media posts, blog posts, etc., become the most viral.
Another library worth checking out is word2vec, which has implementations in Python, R, Java, and more. For instance, check out deeplearning4j.
Marketing Segmentation with Data Science
Social media makes once hard-to-get data, such as people's opinions on products, their likes, dislikes, gender, location, and job, much more accessible. Sometimes you may have to purchase it; other times, some sites are kind enough to let you take it freely.
In either case, this gives companies an open door to segmenting markets in much finer detail. This isn't based on small surveys of only 1,000 people; we are talking about millions, even billions, of people. Yes, there is a lot more data scrubbing required, but there is an opportunity to segment individuals and use their networks to support your company's products.
One example is a tweet we once sent to the SQL Server account, which quickly responded. Based on the fact that we interacted with SQL Server and talk so much about data science and data, you can probably assume we are into technology, databases, etc. This is basically what Twitter, Facebook, Google, and the rest do to place the right ads in front of you. They also combine cookies and other data sources like geolocation.
If you worked for Oracle, perhaps you would want us to see some posts about the benefits of switching to Oracle, or to be asked why someone prefers SQL Server over Oracle (we personally have very little preference; we have used both and find both useful). Whatever it may be, there are opportunities to swing customers. Now what if your content were already placed in front of the right people? Maybe you tag a user, or ask them to help you out or join your campaign. Involve them, and see how you can help them.
For instance, bloggers are always looking for ways to get their content out there. If your company involves them, or partners with them in a transparent way, your product now has access to a specific network. Again, this is another great place where data science and basic statistics come into play.
If you haven't tried tools like NodeXL, it is a great example of a model for finding strong influencers in specific networks. This tool is pretty nifty. However, it is limited, so you might want to build some of your own.
Utilizing the data gathered from various sites, and algorithms like k-nearest neighbors, PCA, and k-means, you can find the words used in profiles, posts, and shares, the companies your customers interact with, and more. The list goes on. It may be better to start with NodeXL, just to see what you are looking for.
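As a minimal sketch of segmenting users by the words in their profiles (the profile strings and cluster count are made up for illustration), you can vectorize profile text with TF-IDF and cluster it with k-means using scikit-learn:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical user profile descriptions.
profiles = [
    "data science sql databases analytics",
    "machine learning python data statistics",
    "cooking recipes baking desserts",
    "restaurants recipes food travel",
]

# Turn each profile into a TF-IDF vector, then group similar vectors.
vectors = TfidfVectorizer().fit_transform(profiles)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)  # profiles with similar vocabularies share a cluster label
```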
Now what is the value of doing all this analysis, data melding, and analytics?
ROI Of Content:
At the end of the day, you have plenty of questions to answer.
These aren't the easiest questions to answer. However, here is where you can turn the data from your social presence into value for your company:
Typical predictive analytics utilize standard business data (balance sheet, payroll, CRM, and operational data). This limits companies to the "what" happened, and not the "why." Managers will ask: why did the company see that spike in Q2? Or that dip in Q3? It is difficult to paint a picture when you are only looking at data that has very little insight into the why. Simply doing a running average isn't always great, and putting in seasonal factors is limited by domain knowledge.
However, data has grown, and having access to the "why" is now much more feasible. Everything from social media to CRMs to online news provides much better insight into why your customers are coming or going!
This data has a lot of noise, and it wouldn't really be worth it for humans to go through it all. This is where having an automated exploratory system will help out a lot.
Finding correlations between content, historical news, and company internal data would take analysts years. By the time they found any value, the moment would have passed.
Instead, an automated correlation discovery system will save your company time and be much better at finding value. You can use this system to find those small correlating factors that have a big effect. Maybe your customers are telling you what is wrong with your product, and you just aren't listening. Maybe you find a new product idea.
In the Acheron Analytics process, this is part of our second and third phases. We always look for as many possible correlations as we can, and then develop hypotheses and prototypes that lead to company value.
This process lets data help define a company's next steps. It provides managers with data-defended plans, ones they can take to their own leadership with confidence.
When it comes to analyzing your company's content and marketing investments, techniques like machine learning, sentiment analysis, and segmentation can help develop data-driven marketing strategies.
We hope this inspired some ideas on how to meld your company's data! Let us know if you have any questions.
Python is a great language for developers and scripters alike. It allows for large-scale design and OOP concepts, yet it was also developed to be very easy to read and great for quick scripts. This matters, because data scientists don't have all day to spend debugging. They do need to spend some time picking out which Python libraries will work best for their current projects. We at Acheron Analytics have written up a quick list of the eight most used libraries that can help your next machine learning project.
P.S. We had a busy week and couldn't get to an actual code example as we promised in our last post. However, we are working on that post! We will shortly have an example in R for a from-scratch algorithm.
Theano, according to Opensource.com, is one of the most heavily used machine learning libraries to date. The great thing about Theano is that it is written leaning on mathematical concepts and computer algebra. When the code is compiled, it can match C-level speed.
This is because it is written to take advantage of how compilers work: how a computer parses and converts tokens into parse trees, how it optimizes and merges similar sub-graphs, how it uses the GPU for computations, and several other optimizations. For the full list, check out the Theano main page.
For those who have used math-based languages like Mathematica and MATLAB, the coding structure won't seem too strange.
What is great is that Nvidia fully supports Theano and has a few helpful videos on how to use Theano with their GPUs.
When it comes down to it, machine learning and data science must have good data. How do you handle that data? Well, one great Python library is Pandas. It was one of the first data tools many of us at Acheron were exposed to, and it still has a great following. If you are an R programmer, you will enjoy this library: it gives you data frames, which make thinking about the data you are using much more natural.
Also, if you are a SQL or RDBMS person, Pandas naturally fits with your tabular view of data. Even if you are more of a Hadoop or MongoDB follower, Pandas just makes life easier.
It doesn't stop there: it handles missing data, time series, IO, and data transformations incredibly well. Thus, if you are trying to prepare your data for analysis, this Python library is a must.
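A tiny sketch of what that looks like in practice (the numbers are made up): fill missing values and aggregate, SQL-style, in a couple of lines.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "region": ["west", "west", "east", "east"],
    "sales": [100.0, np.nan, 80.0, 120.0],
})

# Replace the missing sale with the column mean, then group like SQL.
df["sales"] = df["sales"].fillna(df["sales"].mean())
totals = df.groupby("region")["sales"].sum()
print(totals)
```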
We also wanted to share a great Python cheat sheet we found; however, it would feel wrong to just stick it on our blog. Instead, here is a link to the best Python cheat sheet we have found yet! It even beats Datacamp's cheat sheets!
NumPy is another data managing library. You typically see it paired with Tensorflow, SciPy, matplotlib, and many other Python libraries geared toward deep learning and data science. This is because it is built to manage and treat data like matrices, again harking back to MATLAB and R. The purpose is to make it easy to do the complex matrix operations that neural networks and advanced statistics require.
Trying to handle those kinds of operations with plain multi-dimensional arrays or lists is not efficient.
Let's say you want to set up an identity matrix. That is one line of code in NumPy. Everything about the library is geared toward matrices and quick mathematical operations done in just a few lines. Coursera has a great course that you can use to further your knowledge of this library.
How to code an identity matrix in NumPy:
>>> import numpy as np
>>> np.eye(3)
array([[ 1., 0., 0.],
       [ 0., 1., 0.],
       [ 0., 0., 1.]])
This is the odd one out. Scrapy is not a mathematical library; it doesn't perform data analysis or deep learning. It does nothing you would think you would want in machine learning. However, it does one thing really well: crawl the web. Scrapy is built to make it easy to develop safe web crawlers (side note: make sure you read all the documentation; it is only a safe web crawling library if you configure it correctly, and that is something you have to research).
The web is a great source of unstructured, structured, and visual data. As long as a site approves of you crawling and doesn't mind you using its content (which we are not responsible for figuring out), you can gain a lot of insight into topics. You can use libraries that turn words into vectors, perform sentiment analysis, and so on. It is much more difficult than working with straightforward numbers, but it is also much richer. There is a lot to be gained from pictures, words, and unstructured data, and with that comes the task of getting the information out of the complex data.
That being said, Pattern is another specialized web mining scraper. It has tools for Natural Language Processing (NLP) and machine learning, with several built-in algorithms, and it really makes your life as a developer much easier!
We have discussed several libraries such as matplotlib, NumPy, and Pandas and how great they are for machine learning and data science. Now, imagine building an easy-to-use library on top of all of those, as well as several others. Well, that is what scikit-learn is. It pulls these libraries together to create easy access to complex data science algorithms and data visualization techniques. It can be used for clustering, transforming data, dimensionality reduction (reducing the number of features), ensemble methods, feature selection, and a lot of other classic data science techniques, all in basically a few lines!
The hardest part is making sure you have a virtual environment set up when you pip install!
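For example, a full train/test split plus an ensemble classifier really is just a few lines:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load a built-in data set, hold out a test split, fit an ensemble.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on the held-out data
```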
matplotlib and ggplot
Now you have done all this analysis and run all your algorithms. What now? How do you actually turn all this data into value? How do you inspire your executives and tell them "stories" full of "insight"? If you don't want to mess around with D3.js, Python has you covered with libraries like matplotlib and ggplot. Both are built to mimic MATLAB and R plotting functionality. Matplotlib has some great 3D graphs that will help you visualize your k-NN and PCA clusters.
Whether you are in the data exploration, hypothesis, or final product phase of a project, these libraries make life much easier. You can visualize your data, its quirks, and your final results!
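A quick sketch (the points are made up) of eyeballing clusters during exploration:

```python
import matplotlib
matplotlib.use("Agg")  # render to a file instead of a window
import matplotlib.pyplot as plt

# Two made-up clusters of points.
xs = [1.0, 1.2, 0.9, 5.0, 5.1, 4.8]
ys = [1.0, 0.8, 1.1, 5.0, 4.9, 5.2]

plt.scatter(xs, ys)
plt.title("Two clusters of users")
plt.xlabel("feature 1")
plt.ylabel("feature 2")
plt.savefig("clusters.png")
```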
We have discussed Tensorflow before on this blog when we talked about some common libraries used by data science professionals, but it doesn't hurt to talk about it again! The fact is, if you are in the world of machine learning, you have probably heard of, tried, or implemented some form of deep learning algorithm. Are they necessary? Not all the time. Are they cool when done right? Yes.
Tensorflow and Theano are very similar. The interesting thing about Tensorflow is that when you are writing in Python, you are really only designing a graph for the compiler to turn into C++ code and then run on either your CPU or GPU. This is what makes the library so effective and easy to work with: instead of having to write at the C++ or CUDA level, you can code it all in Python first.
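In modern TensorFlow the same idea shows up as `tf.function`, which traces your Python into a graph before it runs (a minimal sketch, assuming TensorFlow is installed):

```python
import tensorflow as tf

@tf.function  # traces the Python body into a TensorFlow graph
def affine(x):
    return 2.0 * x + 1.0

# The graph, not the Python, is what executes on CPU or GPU.
result = affine(tf.constant(3.0))
print(float(result))  # 7.0
```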
The difficulty comes in actually understanding how to properly set up a neural network, convolutional network, etc. A lot of questions come into play: which type of model? What type of data regularization is best? What level of dropout or robustness do you want? And are you going to purchase GPUs from Nvidia or try to make it work on CPUs? (Depending on your data size, you will most likely have to purchase hardware, or pay for AI-as-a-service tech from Google.)
These are just a few of the most commonly mentioned Python libraries utilized by academics and professionals. Do you agree? Feel free to share what languages, libraries, and tools you use, even if they aren't Python!
In the era of data science and AI, it is easy to skip over crucial steps such as data cleansing. However, this can cause major problems later down the data pipeline. The promise of magic-like data science solutions can overshadow the necessary steps required to get to the best final product. One such step is cleaning and engineering your data before it even gets placed into your system. Truthfully, this is not limited to data science. Whether you are doing data analytics, data science, machine learning, or just old-fashioned statistics, data is never whole and pure before refining. Just like putting bad, unprocessed petroleum into your car, putting unprocessed data into your company's systems will immediately or eventually wreak havoc. Whether that means actually causing software to fail or giving executives bad information, both are unacceptable.
We at Acheron Analytics wanted to share a few tips to ensure that whatever data science or analytics projects you take on, you and your team are successful. This post includes some brief examples in R, Python, and SQL; feel free to reach out with any questions.
Duplicate data is the scourge of any analyst, whether you are using Excel, MySQL, or Hadoop. Making sure your systems don't produce duplicate data is key.
There are several sources of duplicate data. The first is when the data is input into your company's data storage system: the same data may try to sneak its way in. This could be due to end-user error, a glitch in the system, a bad ETL, etc. All of this should be managed by your data system. Most people still use an RDBMS, and thus a unique key will prevent duplicates from being inserted. Sometimes this may require a combination of fields to check whether the data being input is a duplicate. For instance, if you are looking at vendor invoice line items, you probably shouldn't have the same line item number and header id twice. This can become more complicated when line items change (but even that can be accounted for). If you are analyzing social media post data, each snapshot you take may have the same post id but altered social interaction data (likes, retweets, shares, etc.). This touches on slowly changing dimensions, which is another great topic for another time. Feel free to read up more on the topic here.
In both cases, your systems should be calibrated to safely throw out the duplicate data and store the errors in some error table. All of this will save your team time and confusion later.
Besides the source data itself having duplicates, the other common duplicate occurs because of an analyst's query. If, by chance, they accidentally don't have a 1:1 or 1:many relationship on the key they are joining on, they may find themselves with several times the amount of data they started with. The fix could be as simple as restructuring your team's query to make sure it creates proper 1:1 relationships, or you may have to completely restructure your database. It is more likely the former.
How to Get Rid of Duplicate Data in SQL
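A sketch of the idea, run through SQLite from Python so it is self-contained (the table and column names are made up; the same `ROW_NUMBER()` pattern works in SQL Server, Postgres, and most modern databases):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE invoices (header_id INT, line_item INT, amount REAL);
    INSERT INTO invoices VALUES (1, 1, 10.0), (1, 1, 10.0), (1, 2, 5.0);
""")

# Keep one row per (header_id, line_item) key; ROW_NUMBER() tags repeats.
rows = con.execute("""
    SELECT header_id, line_item, amount FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY header_id, line_item ORDER BY amount
        ) AS rn
        FROM invoices
    ) WHERE rn = 1
""").fetchall()
print(rows)  # the duplicate (1, 1, 10.0) row appears only once
```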
Has your company ever purchased data from a data aggregator and found it filled with holes? Missing data is common across every industry; sometimes it is due to system upgrades and new features being added, sometimes just bad data gathering. Whatever it might be, it can really skew a data science project's results. What are your options? You could ignore rows with missing data, but that might cost your company valuable insight, while including the gaps will produce incorrect conclusions. So how do you win?
There are a few different schools of thought on this. One is to simply put a random but reasonable number in place of nothing. This doesn't really make sense, as it becomes difficult to tell what is being driven by which feature. A more common and reasonable practice is using the data set average. However, even this is a little misleading. For instance, on one project we were involved with, we were analyzing a large population of users and their sociometric data (income, neighborhood trends, shopping habits). About 15% of the data, which was purchased from a credit card carrier, was missing. So throwing it away was not in our best interest.
Instead, because we had the persons' zip codes, we were able to aggregate at a local level. This was a judgement call, and a good one in this case. We compared it to averaging over the entire data set, and we got a much clearer picture of our population's features. The problem with a general average over several hundred thousand people is that you will eventually get some odd sways. Take income: if your data set is a good distribution, you will end up with your average income being, well, average. Then, suddenly, people who may have lived in richer neighborhoods can create their own classification. The difference between 400k and 50k (even when normalized) can drastically alter the rest of the features. Does it really make sense for someone who is making 50k a year to be purchasing over 100k of products a year? In the end, we would get a strange cluster of large spenders who made average income. When your focus is socio-economic factors, this can cause some major discrepancies.
How to Handle Missing Data with SQL
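A sketch of the zip-code-level averaging described above, again run through SQLite so it is self-contained (the table, names, and numbers are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE users (zip TEXT, income REAL);
    INSERT INTO users VALUES
        ('98101', 40000), ('98101', 60000), ('98101', NULL),
        ('98102', 90000);
""")

# COALESCE swaps each NULL income for the average of that zip code,
# rather than the (more misleading) global average.
rows = con.execute("""
    SELECT zip,
           COALESCE(income,
                    (SELECT AVG(income) FROM users u2 WHERE u2.zip = users.zip)
           ) AS income
    FROM users
""").fetchall()
print(rows)  # the 98101 NULL becomes 50000.0, the local average
```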
Data normalization is one of the first critical steps to making sure your data makes sense to run through most algorithms. Simply feeding in variables that could be anything from age to income to computer usage time creates the hassle of comparing apples to oranges. Trying to compare 400k to 40 years will create bad outputs; the numbers just don't scale. Instead, normalization makes your data comparable. Min-max normalization takes the max and min of a data set and maps them to 1 and 0 on a new scale; the rest of the numbers are then scaled between them. Keeping everything between 0 and 1 allows your data science teams to meld the data more smoothly. They are no longer trying to compare scales that don't match. This is a necessary step in most cases to ensure success.
R Programming Normalization
Python (this can also depend on whether you are using NumPy, Pandas, etc.)
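A minimal min-max sketch in plain Python (the values are made up); with NumPy or Pandas the same thing is a one-liner over a whole column:

```python
# Hypothetical incomes on a wildly different scale from, say, age.
incomes = [40_000.0, 60_000.0, 400_000.0]

# Map the min to 0, the max to 1, and scale everything in between.
lo, hi = min(incomes), max(incomes)
normalized = [(x - lo) / (hi - lo) for x in incomes]
print(normalized)  # the min maps to 0.0, the max to 1.0
```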
Data preparation can be one of the longer steps in a data science project. However, once the data is cleaned, checked, and properly shaped, it is much easier to pull out features and create accurate insights. Preparation is half the battle: once the data is organized, it becomes several times easier to mold. Good luck with your future data science projects, and feel free to give us a ring here in Seattle if you have more questions.
Future Learning! And Other Data Transformations
We wanted to supply some more tools to help you learn how to transform and engineer your data. Here is a great video that covers several data transforms. This particular video relies on the R programming language.
We are a team of data scientists and network engineers who want to help your functional teams reach their full potential!