Classifying Data Creatively
In data science we have plenty of classification algorithms: SVM, logistic regression and random forest, to name a few. Most of them are supervised, which means the data sets need to come pre-labeled.
Who does that pre-labeling? How are you supposed to get hundreds of thousands of rows labeled if they don’t naturally come that way?
Part of a data scientist's job is not just algorithms and models, but also figuring out how the data will be designed and gathered. If your initiative is new, then it is time to put your thinking cap on and come up with some creative ways to collect data. The four examples below are a great place to start!
Calculate your Classification
In some cases, you are lucky: the label can be calculated from a metric. For instance, one project we worked on involved readmission of patients. The hospitals were trying to reduce their 30-day readmission rate. Here classification is easy. Either the patient was readmitted before the 30-day threshold, or they were not.
These cases are the easiest, and often the most straightforward. However, they rely on the data you already have containing the signal you are looking for, and that is far from always true! This is why you must plan your data science project out ahead of time. You need to ensure your product is already logging the right data points so that you can gain insights. Otherwise, why track the data?
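To make the readmission example concrete, here is a minimal sketch of how such a label could be computed. The record layout, the sample stays and the function name label_readmissions are our own assumptions for illustration, not the schema from the hospital project. We also assume a readmission on exactly day 30 counts as inside the window.

```python
from datetime import date

# Hypothetical admission records: (patient_id, admit_date, discharge_date).
stays = [
    ("p1", date(2023, 1, 1), date(2023, 1, 5)),
    ("p1", date(2023, 1, 20), date(2023, 1, 22)),
    ("p2", date(2023, 2, 1), date(2023, 2, 3)),
    ("p2", date(2023, 4, 1), date(2023, 4, 2)),
]


def label_readmissions(stays, window_days=30):
    """Label each stay 1 if the same patient is admitted again within
    window_days of discharge, otherwise 0."""
    # Group the stays by patient so each patient's visits can be ordered.
    by_patient = {}
    for patient_id, admit, discharge in stays:
        by_patient.setdefault(patient_id, []).append((admit, discharge))

    labels = {}
    for patient_id, visits in by_patient.items():
        visits.sort()  # chronological order by admission date
        for i, (admit, discharge) in enumerate(visits):
            # The next admission for this patient, if there is one.
            next_admit = visits[i + 1][0] if i + 1 < len(visits) else None
            readmitted = (
                next_admit is not None
                and 0 <= (next_admit - discharge).days <= window_days
            )
            labels[(patient_id, admit)] = int(readmitted)
    return labels


labels = label_readmissions(stays)
for key, value in sorted(labels.items()):
    print(key, value)
```

One caveat with a sketch like this: a stay that ends fewer than 30 days before your data stops gets a 0 here, even though the patient may still come back. In a real project you would likely drop or flag those rows.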
Create A Service or Crowd Source
Google and Netflix are not only really good at machine learning and data science. They are also really good at having consumers do their labeling and classification for them. Google's recent AutoDraw and Local Guides are great examples where Google gamified the process of data collection. We may think we are simply drawing silly stick figures, or helping others find good food. However, Google is able to take all that information and create even more services and tools with it.
Google learns even more from its search. Perhaps you have a date coming up and you want to know where to take the person. You probably Google “Best date places in city”. You just told Google that you are dating someone and looking for a place to eat in a specific city. Then you probably click through a few of the places Google recommends, which lets Google guess what salary you might be making. Finally, you probably put the address into Google Maps. The amount of information Google can deduce and classify from this one interaction is pretty large, and you gave it freely. Of course, Google isn't alone in creating a service that gathers data.
Some of you may not remember the old Netflix queue, back when you could only get movies through the mail. There was almost no product recommendation, if any. So how did they get all the classification data? Simple. We were inclined to put the movies in the order we wanted them, to ensure we got the right movie at the right time. With all this data, Netflix was able to provide better recommendations to users, and now we can't stop watching. Who knows, it might not be long before they are able to feed scripts into an algorithm and produce award-winning movie scripts in minutes.
Why? Because it is convenient and it saves money.
Pay for it
There are plenty of companies out there that you can purchase data from. It might come in a little messy (we say that from experience), and it won’t be cheap. However, it can be quite beneficial. We have worked in situations where companies bought data from credit card companies. This can help you tie people in real life to the data you have on them in your systems. Maybe you want to know what these people buy, or how much money your average user makes. This kind of classification is almost impossible to find unless you send out a survey. Even then, people have to answer honestly. Thus, purchasing data is not a bad way to go.
The hardest part of this method is matching people on both sides of the data, unless of course you have clear keys like social security numbers, or a combination of full name, birthday and sex. Those are usually the best ways to match two very different data sets of people.
There is one final method of classifying and labeling data. However, it is the most time-consuming for your employees.
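As a rough illustration of matching on a combined key, here is a minimal sketch that joins internal user records to purchased rows on full name, birthday and sex. The field names, the sample people and the helpers match_key and link_records are hypothetical. Real purchased data is usually messier and may need fuzzy matching, which this sketch does not attempt.

```python
from datetime import date


def match_key(record):
    """Build a normalized composite key from full name, birth date and sex."""
    # Lowercase the name and collapse repeated whitespace.
    name = " ".join(record["full_name"].lower().split())
    return (name, record["birth_date"], record["sex"].strip().upper())


def link_records(internal, purchased):
    """Return (user_id, purchased_row) pairs for unambiguous key matches."""
    # Index the purchased rows by key so each lookup is a dictionary access.
    index = {}
    for row in purchased:
        index.setdefault(match_key(row), []).append(row)

    matches = []
    for rec in internal:
        candidates = index.get(match_key(rec), [])
        # Only accept a match when exactly one purchased row shares the key.
        if len(candidates) == 1:
            matches.append((rec["user_id"], candidates[0]))
    return matches


# Hypothetical sample data, invented for this example.
internal = [
    {"user_id": 1, "full_name": "Jane  Doe",
     "birth_date": date(1990, 12, 10), "sex": "f"},
    {"user_id": 2, "full_name": "John Smith",
     "birth_date": date(1985, 6, 23), "sex": "M"},
]
purchased = [
    {"full_name": "jane doe", "birth_date": date(1990, 12, 10),
     "sex": "F", "avg_monthly_spend": 420.0},
    {"full_name": "Mary Major", "birth_date": date(1979, 3, 2),
     "sex": "F", "avg_monthly_spend": 130.0},
]

matches = link_records(internal, purchased)
print(matches)
```

Skipping keys that hit more than one purchased row is a deliberate choice here. Two people can share a name and a birthday, and a wrong match puts a wrong label into your training data.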
Have some poor analyst do it
This last example is real. It happens a lot, and you just feel bad for those poor souls. We were recently talking with a data scientist from a local e-commerce company, and they told us how they had a group of analysts labeling specific HTML data for 36,000 web pages. It is a slow and error-prone process. Truthfully, this is the last way a company should do it. Sometimes, though, there are no other options.
Part of the strategy when implementing a data science project should be how the data will be gathered. Sadly, not everyone is Google, Apple or Facebook, which track not only our online interactions, but also where we go. We take our phones with us everywhere, and they just continue to gather more data.
To Data Science Program Managers and Leadership
If you're a data science team program manager, make sure to think about how you're planning to classify your data ahead of time! Sometimes, you just get the data and have to use it. However, if you are lucky enough to be able to set up your own project, be creative! Think about scale and how you can use the power of people and users to classify your data for you. You might even be able to charge for your product or service.
If you need help devising a data gathering or integration strategy, we would love to work with you! Check out our data science services or just send us a question. We have data scientists and programmers with all different backgrounds in projects and specialities. We would love to help.
We are a team of data scientists and network engineers who want to help your functional teams reach their full potential!