Classifying Data Creatively
In data science we have plenty of classification algorithms: SVM, logistic regression and random forest, to name a few. These algorithms usually require that the data sets come pre-labeled.
Who does that pre-labeling? How are you supposed to get hundreds of thousands of rows labeled if they don’t naturally come that way?
Part of a data scientist's job is not just algorithms and models, but also figuring out how the data will be designed and gathered. If your initiative is new, then it is time to put your thinking cap on and come up with some creative ways to collect data. The four examples below are a great place to start!
Calculate your Classification
In some cases, you are lucky. The label can be calculated from a metric. For instance, one project we worked on involved readmission of patients. The hospitals were trying to reduce their 30-day readmission rate. Now classification is easy: yes, this patient was readmitted before the 30-day threshold, or no, they were not.
These cases are the easiest, and often the most straightforward. However, they rely on the data you already have containing the signal you are looking for, and that is far from always true! This is why you must plan your data science project out ahead of time. You need to ensure your product is already logging the right data points so that you can gain insights. Otherwise, why track the data?
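As a rough illustration of a calculated label, the 30-day readmission flag described above could be derived along the following lines. This is a minimal sketch, not the code from that project: the record layout (patient_id, admitted, discharged), the function name label_readmissions, the sample stays, and the choice to count a readmission on exactly day 30 as positive are all assumptions made for the example.

```python
from datetime import date


def label_readmissions(admissions, window_days=30):
    """Label each hospital stay with whether the same patient came back
    within `window_days` of discharge.

    `admissions` is a list of dicts with the hypothetical keys
    "patient_id", "admitted" and "discharged" (both datetime.date).
    Returns a list of (patient_id, discharged, readmitted) tuples.
    """
    # Group stays by patient so each stay can be compared with the next one.
    by_patient = {}
    for stay in admissions:
        by_patient.setdefault(stay["patient_id"], []).append(stay)

    labels = []
    for patient_id, stays in by_patient.items():
        stays.sort(key=lambda s: s["admitted"])
        # Pair every stay with the following stay (None for the last one).
        for current, following in zip(stays, stays[1:] + [None]):
            readmitted = (
                following is not None
                # Assumption: day 30 itself still counts as a readmission.
                and (following["admitted"] - current["discharged"]).days <= window_days
            )
            labels.append((patient_id, current["discharged"], readmitted))
    return labels


# Made-up sample data, for illustration only.
example_stays = [
    {"patient_id": "A", "admitted": date(2017, 1, 1), "discharged": date(2017, 1, 5)},
    {"patient_id": "A", "admitted": date(2017, 1, 20), "discharged": date(2017, 1, 25)},
    {"patient_id": "B", "admitted": date(2017, 1, 1), "discharged": date(2017, 1, 3)},
    {"patient_id": "B", "admitted": date(2017, 3, 15), "discharged": date(2017, 3, 20)},
]

for patient_id, discharged, readmitted in label_readmissions(example_stays):
    print(patient_id, discharged, readmitted)
```

Note that the label can only be computed if both admission and discharge dates were logged in the first place, which is exactly the planning point made above.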
Create A Service or Crowd Source
Google and Netflix are not only really good at machine learning and data science. They are also really good at having consumers do all their labeling and classification for them. Google's recent AutoDraw and Local Guides are great examples where Google gamified the process of data collection. We may think we are simply drawing silly stick figures or helping others find good food. However, Google is able to take all that information and create even more services and tools with it.
Google learns even more from its search. Perhaps you have a date coming up and you want to know where to take the person. You probably google “best date places in city”. You just told Google that you are dating someone and looking for a place to eat in a specific city. Then you probably click through a few places Google recommends, which lets Google guess what salary you might be making. Finally, you probably put the place you picked into Google Maps. The amount of information Google can deduce and classify from this one interaction is pretty large, and you gave it freely. Of course, Google isn't alone in creating a service that gathers data.
Some of you may not remember the old Netflix queue, back when you only got movies through the mail. There was almost no product recommendation, if any. So how did Netflix get all the classification data? Simple: we were inclined to put the movies in the order we wanted them, to ensure we got the right movie at the right time. With all this data, Netflix was able to provide better recommendations to users, and now we can't stop watching. Who knows, it might not be long before they can feed scripts into an algorithm and produce award-winning movie scripts in minutes.
Why? Because it was convenient and saved money.
Pay for it
There are plenty of companies out there that you can actually purchase data from. It might come in a little messy (we say that from experience), and it won’t be cheap. However, it can be quite beneficial. We have worked in situations where companies bought data from credit card companies. This can help you tie people in real life to the data you have on them in your systems. Maybe you want to know what these people buy, or how much money your average user makes. Well, this classification is almost impossible to find unless you send out a survey. Even then, people have to answer honestly. Thus, purchasing data is not a bad way to go.
The hardest part of this method is matching people on both sides of the data, unless of course you have clear keys like social security numbers, or a combination of full name, birthday and sex. Those are usually the best ways to match two very different data sets of people.
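To make the matching step concrete, here is a minimal sketch of joining an internal user table to a purchased data set on the composite key mentioned above (full name, birthday and sex). Everything in it is an assumption made for illustration: the field names, the sample people, the helper names match_key and match_records, and the decision to treat ambiguous matches as unmatched. Real purchased data is usually messier and may need fuzzier matching than this.

```python
from datetime import date


def match_key(person):
    """Build a normalized composite key from hypothetical fields
    "full_name", "birthday" (datetime.date) and "sex"."""
    # Lowercase the name and collapse repeated whitespace.
    name = " ".join(person["full_name"].lower().split())
    return (name, person["birthday"], person["sex"].strip().upper())


def match_records(internal, purchased):
    """Return (matched_pairs, unmatched_internal_rows).

    A row is matched only when exactly one purchased row shares its key;
    zero or several candidates are treated as unmatched (an assumption
    chosen here to avoid tying the wrong people together).
    """
    index = {}
    for row in purchased:
        index.setdefault(match_key(row), []).append(row)

    matched, unmatched = [], []
    for row in internal:
        candidates = index.get(match_key(row), [])
        if len(candidates) == 1:
            matched.append((row, candidates[0]))
        else:
            unmatched.append(row)
    return matched, unmatched


# Made-up sample data, for illustration only.
internal_users = [
    {"user_id": 1, "full_name": "Jane  Doe", "birthday": date(1990, 12, 10), "sex": "f"},
    {"user_id": 2, "full_name": "John Roe", "birthday": date(1985, 6, 23), "sex": "M"},
]
purchased_rows = [
    {"full_name": "jane doe", "birthday": date(1990, 12, 10), "sex": "F", "monthly_spend": 420},
]

matched, unmatched = match_records(internal_users, purchased_rows)
print(len(matched), "matched;", len(unmatched), "unmatched")
```

A unique identifier such as a social security number would replace the composite key entirely; the normalization in match_key only matters when you have to fall back on names and dates.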
There is one final method of classifying and labeling data. However, it is the most time consuming for your employees.
Have some poor analyst do it
This last example is real, it happens a lot, and you just feel bad for those poor souls. We were recently talking with a data scientist from a local e-commerce company, and they told us how they had a group of analysts labeling specific HTML data for 36,000 web pages. It is a slow and error-prone process. Truthfully, this is the last way a company should do it. Sometimes, though, there are no other options.
Part of the strategy when implementing a data science project should include how the data will be gathered. Sadly, not everyone is Google, Apple or Facebook, which track not only our online interactions but also where we go. We take our phones with us everywhere, and they just continue to gather and reap more data.
To Data Science Program Managers and Leadership
If you're a data science team program manager, make sure to think about how you're planning to classify your data ahead of time! Sometimes, you just get the data and have to use it. However, if you are lucky enough to set up your own project, be creative! Think about scale and how you can use the power of people and users to classify your data for you. You might even be able to charge for your product or service.
If you need help devising a data gathering or integration strategy, we would love to work with you! Check out our data science services or just send us a question. We have data scientists and programmers with all different backgrounds in projects and specialities. We would love to help.
Is your company looking to figure out who should become data scientists and how to start a team? You are not alone. Even Amazon and Airbnb are starting internal universities to teach more of their teams the value of data science. Maybe your company needs help setting up some internal classes to increase your data science and machine learning skill sets. Acheron provides multiple forms of internal education programs. They can be for managers or analysts. One form is a quick guide to running a data science team! This is for managers and executives who are starting, or already have, a data science team and want to ensure they are getting the best return on investment from their team and that their team members all feel challenged!
We took one subsection out and wanted to share a common question we get when we talk to executives: who are data scientists, and who should become one? One such client told us they had loads of scientists, but weren't sure how to turn them into data scientists, or who in their cohorts should really become one.
Below we will go over some of the top soft skills data scientists should have, and what type of personality someone should have before they enroll in some form of data science program, whether that is an internal program or an external one, like Galvanize or a university data science certificate. In the end, data science is a skill that companies will need to harness to keep up with competitors who are already successfully implementing data science into their upper-level strategy.
Who are Data Scientists?
Data scientists have to be driven individuals. They not only must be technically savvy, they also need to be proactively aware of their company’s nuances. If they happen to see a correlation or pattern, they will seek out how to access the data required and will bring possible projects up to their manager.
Being driven is great, especially when combined with curiosity. Data scientists love to ask why, and do not stop until they find the root cause. They are great at pinpointing the actual patterns in the noise. This is a necessary skill in order to peel apart the complexity and relationships various data sets may have. Occasionally, an individual may have a curious mind but lack the drive to act upon their inquiries.
Tolerance of Failure
Data science has a lot of similarities to the sciences, in the sense that there might be 99 failed hypotheses that lead to 1 successful solution. Some data-driven companies only expect their machine learning engineers and data scientists to create new algorithms or correlations every year to year and a half. This depends on the size of the task and the type of implementation required (e.g. process implementation, technical, policy, etc.). This means a data scientist must be willing to fail fast and often, similar to using the agile methodology. They have to constantly test, retest, and prove that their algorithms are correct.
The term data storyteller has become closely associated with data scientist. This skill subset fits within the general skill of communication. Data scientists have access to multiple data sources from various departments. This gives them the responsibility, and the need, to clearly explain what they are discovering to executives and SMEs in multiple fields. This requires taking complex mathematical and technological concepts and creating clear and concise messages that executives can act upon. That means not hiding behind jargon, but actually translating complex ideas into business speak.
Creative and Abstract Thinking
Creativity and abstract thinking help data scientists better hypothesize possible patterns and features they are seeing in their initial exploration phases. Combining logical thinking with minimal data points, data scientists can lead themselves to several possible solutions. However, this requires thinking outside of the box.
Data scientists have to be able to take large problems, like which ad to show to which customer, and then, based on hundreds of variables, effectively find the right solution. This means taking a larger problem and breaking it down into its smallest parts, getting rid of noise and of variables that don’t help create a clear pattern. This can sometimes be a messy process. Being able to stay focused on the bigger problem is key.
Who Should Become a Data Scientist
The skills required to be a data scientist are constantly evolving and many companies are trying to find out how to train new data scientists. In the end, the real question is, who should become a data scientist?
Data science requires constant learning, not just of technology, but also of new fields, specialties and situations. This is especially true as data science solutions integrate further into more and more departments of corporations. Becoming familiar with only one set of vocabulary and processes is not an option. Lacking some bearing in each field limits the hypotheses and logical assumptions a good data scientist needs to make.
If you are searching for a data scientist inside your company, they are probably already attempting to push into the field. With all the online material, classes, and meet-ups, a motivated individual would have already taken steps to get more involved. If they merely talk about it, but never act upon it, they will act similarly on a new project or idea.
There is some requirement for computational or technical abilities. Excel is a great tool, but there is a need to be able to use more powerful and customizable tools. This includes programming, data visualization and data storage tools. There is no need to be a software engineer. However, data scientists should have a general idea of how to make sure code is maintainable, robust and scalable.
Looking to start a data science team?
If you are looking to start a team of your own, feel free to comment or email us! We can do everything from pointing you toward the right readings if you want to do it yourself, to coming and joining you on your journey! Also, feel free to follow our blog. We will keep it up to date as we take on new projects and new questions about data science! If you email us a question, we will try to post about it!
We are a team of data scientists and network engineers who want to help your functional teams reach their full potential!