Creating an effective data strategy is not as simple as hiring a few data scientists and data engineers and purchasing a Tableau license. Nor is it just about using data to make decisions.
Creating an effective data strategy is about creating an ecosystem where getting to the right data, metrics and resources is easy. It’s about developing a culture that learns to question data and to look at a business problem from multiple angles before drawing a final conclusion.
Our data consulting team has worked with companies ranging from billion-dollar tech firms to healthcare providers and just about every type of company in between. We have seen the good, the bad and the ugly of data being used for strategy. We wanted to share some of the simple changes that can help improve your company’s approach to data.
Find A Balance Between Centralized And Decentralized Practices
Standards and over-centralization inevitably slow teams down. Making small changes to tables, databases and schemas might be forced through an overly complex process that keeps teams from being productive.
On the other hand, centralization can make it easier to implement new changes in strategy without having to go to each team and force them to take on a new process.
In our opinion, one of the largest advantages companies can gain is developing tools and strategies that strike a happy medium between centralized and decentralized practices. This usually involves creating standards that simplify development decisions and improve the management of common tasks every data team needs to perform, like documentation and data visualization, while at the same time decentralizing decisions that are often department- and domain-specific.
Here are some examples of opportunities to provide standardized tools and processes for otherwise unstandardized work.
Creating UDFs and Libraries For Similar Metrics
After working in several industries, including healthcare, banking and marketing, one thing you realize is that many teams are using the same metrics.
This could be across industries or at the very least across internal teams. The problem is every team will inevitably create different methods for calculating the exact same number.
This can lead to duplicate work, duplicate code and executives making conflicting decisions because their top-line metrics vary.
Instead of relying on each team to create its own process for calculating the various metrics, you could create centralized libraries that use the same fields to calculate the correct metrics. This standardizes the process while still providing enough flexibility for end-users to develop reports based on their specific needs.
This only works if the metrics are used consistently. For example, in the healthcare industry, metrics such as per member per month (PMPM) costs, readmission rates and bed turnover rates are used consistently. These are sometimes calculated by EMRs like Epic, but might still be recalculated by analysts for more specific cases, or by external consultants.
Creating functions or libraries that do this work can improve consistency and save time. Instead of having each team develop its own method, you can simply provide a framework that makes it easy to implement the same metrics.
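As a sketch, a shared metrics library might look like the following. The function names, signatures and inputs here are hypothetical illustrations, not taken from any specific EMR or vendor:

```python
def pmpm_cost(total_cost: float, member_months: int) -> float:
    """Per member per month (PMPM) cost: total spend in a period
    divided by the total months of member coverage in that period."""
    if member_months <= 0:
        raise ValueError("member_months must be positive")
    return total_cost / member_months


def readmission_rate(readmissions: int, total_discharges: int) -> float:
    """Share of discharges followed by a readmission within the
    tracking window (e.g. 30 days)."""
    if total_discharges <= 0:
        raise ValueError("total_discharges must be positive")
    return readmissions / total_discharges


# Every team calls the same function instead of re-deriving the formula:
print(pmpm_cost(120_000.0, 600))      # 200.0
print(readmission_rate(30, 300))      # 0.1
```

Because every team imports the same functions, a definition change (say, the readmission window) happens in one place rather than in a dozen team-owned SQL scripts.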
Automate Mundane But Necessary Tasks
Creating an effective data strategy is about making the usage and management of data easy.
A part of this process requires taking mundane tasks that all data teams need to do and automating them.
An example of this is creating documentation. Documentation is an important factor in helping analysts understand the tables and processes they are working with, and good documentation allows analysts to perform better analysis. However, documentation is often put off until the last minute or never done at all.
Instead of forcing engineers to document every new table, a great idea is creating a system that automatically scrapes the available databases on a regular interval and keeps track of which tables exist, who created them, what columns they have, and whether they have relationships to other tables.
This could be a project for the DevOps team to take on, or you could look into a third-party system such as dbForge Documenter for SQL Server. This doesn’t cover everything, and that tool in particular only works for SQL Server, but a similar tool can simplify a lot of people’s lives.
Teams will still need to describe what the tables and columns mean. But the initial work of actually going through and setting up the basic information can all be tracked automatically.
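To make the idea concrete, here is a minimal sketch of such a schema scraper. It uses SQLite’s built-in catalog so the example is self-contained; a real implementation against SQL Server or Postgres would query the INFORMATION_SCHEMA views instead, and would also record owners and foreign keys:

```python
import sqlite3


def scrape_schema(conn: sqlite3.Connection) -> dict:
    """Return {table_name: [column names]} for every user table.
    Run this on a schedule and diff the output to track new tables."""
    catalog = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info returns one row per column; index 1 is the name.
        cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
        catalog[table] = [c[1] for c in cols]
    return catalog


# Demo with an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (id INTEGER PRIMARY KEY, name TEXT)")
print(scrape_schema(conn))  # {'patients': ['id', 'name']}
```

Dumping this output into a wiki or catalog tool on a schedule gives analysts an always-current table inventory; humans only fill in the descriptions.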
This reduces necessary but repetitive work and makes everyone’s life a little easier.
Provide Easier Methods To Share And Track Analysis
This one is geared specifically towards data scientists.
Data scientists will often do their work in Jupyter notebooks and Excel files that only they have access to. In addition, many companies don’t enforce the use of a version control system like Git, so data scientists rarely version their work.
This limits the ability to share files as well as keep track of changes that can occur in one’s analysis over time.
In this situation, collaboration becomes difficult because co-workers are stuck passing files back and forth and version-controlling by hand. Typically that looks like files with suffixes like _20190101_final, _20190101_finalfile…
For those of you who don’t get the reference, hopefully you never will.
On top of this, since many of these Python scripts rely on multiple libraries, it can be a pain to ensure you pip install all the correct versions into your environment.
All of these small difficulties can honestly cost you a day or two of troubleshooting, depending on how complex the analysis you are trying to run is.
However, there are plenty of solutions!
There are actually a lot of great tools out there that can help your data science teams collaborate. This includes companies like Domino Data Lab.
Now, you can always use Git and virtual environments as well, but this demands that your data scientists be very proficient with those technologies, which is not always the case.
Either way, the goal is to let your teams work independently while still sharing their work easily.
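Even without a full collaboration platform, a lightweight habit that helps is pinning library versions and checking them before a run. A minimal sketch of such a check (the package names, versions and helper function here are made up for illustration; in practice the installed versions would come from importlib.metadata or pip freeze):

```python
def check_pins(requirements: str, installed: dict) -> list:
    """Compare pinned 'pkg==version' lines against what is installed
    and return human-readable mismatches."""
    problems = []
    for line in requirements.strip().splitlines():
        pkg, _, wanted = line.partition("==")
        have = installed.get(pkg)
        if have is None:
            problems.append(f"{pkg}: not installed (want {wanted})")
        elif have != wanted:
            problems.append(f"{pkg}: have {have}, want {wanted}")
    return problems


pins = """pandas==1.3.5
numpy==1.21.4"""

# Simulated environment: pandas matches, numpy is stale
print(check_pins(pins, {"pandas": "1.3.5", "numpy": "1.19.0"}))
# ['numpy: have 1.19.0, want 1.21.4']
```

Running a check like this at the top of a shared notebook turns a day of "it works on my machine" debugging into an immediate, readable error.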
Data Cultural Shift
Adding in new libraries and tools is not the only change that needs to happen when you are trying to create a company that is more data driven. A more important and much more difficult shift is cultural.
Changing how people look at and treat data is a key aspect, and a very challenging one. Here are a couple of reasons why.
For those who haven’t read the book How To Lie With Statistics, spoiler alert: it is really easy to make numbers tell the story you want.
There are a lot of ways you can do this.
A team can cherry-pick the statistics that support its agenda. Or perhaps a research team ignores confounding factors and reports a statistic that seems shocking only if you don’t consider all the other variables.
Being data driven as a company means developing a culture that looks at statistics and metrics and checks that nothing is interfering with the number. This is far from easy when it comes to data science and analytics.
Most metrics and statistics come with stipulations that could negate whatever message they seem to send. That is why creating a culture that looks at a metric and asks why is part of the process. If it were as simple as getting outputs and p-values, data scientists would be out of a job, because plenty of third-party products already find the best algorithm and do feature selection for you.
But that is not the only job of a data scientist. They are there to question every p-value and really dig into the why behind the numbers they are seeing.
Data Is Still Messy
Truth be told, data is still very messy. Even with today’s modern ERPs and applications, bad data sometimes gets through and can mislead managers and analysts.
This can happen for a lot of reasons: how the applications manage data, how system admins of those applications modified said systems, and so on. Even changes that seem insignificant from a business-process perspective can majorly impact how data is stored.
In turn, when data engineers pull data, they might not represent it accurately because of bad assumptions and limited knowledge of the source systems.
This is why just having numbers is not good enough. Teams also need a good sense of the business and the processes that create the data, to ensure messy data doesn’t make it into the tables analysts use directly.
Our perspective is that data analysts need confidence that the data they are looking at correctly represents the corresponding business processes. If analysts have to remove data, or consistently perform joins and add where clauses to accurately represent the business, then the data is not “self-service”. This is why, whenever data engineers create new data models, they need to work closely with the business to make sure the correct business logic is captured and represented in the base layer of tables.
That way, analysts can have near 100% trust in their data.
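One concrete way to build that trust is to validate raw records against business rules before they ever land in the base tables analysts query. A minimal sketch of that idea (the field names and rules here are hypothetical):

```python
def validate_rows(rows, required, checks):
    """Split raw rows into (clean, rejected) before loading the base layer.
    `required` lists mandatory fields; `checks` maps a field name to a
    predicate its value must satisfy."""
    clean, rejected = [], []
    for row in rows:
        missing = [f for f in required if row.get(f) in (None, "")]
        failed = [f for f, ok in checks.items()
                  if f in row and row[f] is not None and not ok(row[f])]
        (rejected if missing or failed else clean).append(row)
    return clean, rejected


raw = [
    {"patient_id": "p1", "cost": 120.0},
    {"patient_id": "",   "cost": 95.0},   # missing id -> rejected
    {"patient_id": "p3", "cost": -10.0},  # negative cost -> rejected
]
clean, rejected = validate_rows(
    raw, required=["patient_id"], checks={"cost": lambda c: c >= 0}
)
print(len(clean), len(rejected))  # 1 2
```

Rejected rows go to a quarantine table for engineers to investigate, while analysts query only the clean layer, so the "remove bad data in every query" burden never reaches them.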
At the end of the day, creating an effective data culture requires both a top-down and a bottom-up shift in thinking. At the executive level, decisions need to be made about which key areas can make access to data easier. Then teams can start becoming more proficient at actually using data to make decisions. We often find most teams spend too much time on data tasks that need to get done but could be automated. Improving your company’s approach to data can provide a large competitive advantage and free your analysts and data scientists to work on projects they enjoy and that help your bottom line!
If your team needs data consulting help, feel free to contact us! If you would like to read more posts about data science and data engineering, check out the links below!
Using Python to Scrape the Meet-Up API
The Advantages Healthcare Providers Have In Healthcare Analytics
142 Resources for Mastering Coding Interviews
Learning Data Science: Our Top 25 Data Science Courses
The Best And Only Python Tutorial You Will Ever Need To Watch
Dynamically Bulk Inserting CSV Data Into A SQL Server
4 Must Have Skills For Data Scientists
What Is A Data Scientist
We are a team of data scientists and network engineers who want to help your functional teams reach their full potential!