
How To Improve Your Data Driven Strategy

8/11/2019


 
Photo by Tabea Damm on Unsplash

Creating an effective data strategy is not as simple as hiring a few data scientists and data engineers and purchasing a Tableau license. Nor is it just about using data to make decisions.

Creating an effective data strategy is about creating an ecosystem where getting to the right data, metrics and resources is easy. It’s about developing a culture that learns to question data and to look at a business problem from multiple angles before reaching a final conclusion.

Our data consulting team has worked with companies ranging from billion-dollar tech firms to healthcare organizations and just about every type of company in between. We have seen the good, the bad and the ugly of data being used for strategy. We wanted to share some simple changes that can help improve your company’s approach to data.

Find A Balance Between Centralized And Decentralized Practices

Standards and over-centralization inevitably slow teams down. Making small changes to tables, databases and schemas might be forced through an overly complex process that keeps teams from being productive.

On the other hand, centralization can make it easier to implement new changes in strategy without having to go to each team and ask them to take on a new process.

In our opinion, one of the largest advantages a company can gain comes from developing tools and strategies that find a happy medium between centralized and decentralized practices. This usually involves creating standards that simplify development decisions and make it easier to manage common tasks every data team needs to perform, like documentation and data visualization, while decentralizing decisions that are often department and domain specific.

Here are a few examples of opportunities to provide standardized tools and processes for work that usually goes unstandardized.


Creating UDFs and Libraries For Similar Metrics
After working in several industries, including healthcare, banking and marketing, one thing you realize is that many teams are using the same metrics.

This could be across industries or at the very least across internal teams. The problem is every team will inevitably create different methods for calculating the exact same number.

This can lead to duplicated work and code, and to executives making conflicting decisions because their top-line metrics don’t match.

Instead of relying on each team to create its own process for calculating the various metrics, you can create centralized libraries that use the same fields to calculate the correct metrics. This standardizes the process while still providing enough flexibility for end-users to develop reports based on their specific needs.

This only works if the metrics are used consistently. For example, in the healthcare industry, metrics such as per member per month cost (PMPM), readmission rate, or bed turnover rate are used consistently. These are sometimes calculated by an EMR like Epic, but they might also be recalculated by analysts for more specific cases, or by external consultants.

Creating functions or libraries that do this work can improve consistency and save time. Instead of having each team develop its own method, you simply provide a framework that makes it easy to implement the same metrics.
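As a rough sketch of what such a library could look like in Python (the column names, the paid-amount field, and the 30-day readmission window are illustrative assumptions, not your organization's actual definitions), a shared module might expose the metrics as plain functions that every team imports:

```python
# metrics.py -- hypothetical shared metrics library; column names are assumptions
import pandas as pd


def pmpm_cost(claims: pd.DataFrame, member_months: int) -> float:
    """Per member per month cost: total paid amount divided by member months."""
    if member_months <= 0:
        raise ValueError("member_months must be positive")
    return claims["paid_amount"].sum() / member_months


def readmission_rate(admissions: pd.DataFrame, window_days: int = 30) -> float:
    """Share of admissions followed by another admission for the same patient
    within window_days. Expects 'patient_id' and a datetime 'admit_date' column."""
    ordered = admissions.sort_values(["patient_id", "admit_date"])
    next_admit = ordered.groupby("patient_id")["admit_date"].shift(-1)
    gap_days = (next_admit - ordered["admit_date"]).dt.days
    return float((gap_days <= window_days).mean())
```

Every team then calls the same function against its own data pull instead of re-deriving the formula in its own SQL or spreadsheets.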

Automate Mundane But Necessary Tasks

Creating an effective data strategy is about making the usage and management of data easy.
A part of this process requires taking mundane tasks that all data teams need to do and automating them.

An example of this is creating documentation. Documentation is an important factor in helping analysts understand the tables and processes they are working with. Good documentation allows analysts to perform better analysis. However, documentation is often put off until the last minute or never done at all.

Instead of forcing engineers to document every new table by hand, a great option is a system that automatically scans the available databases on a regular interval and keeps track of which tables exist, who created them, what columns they have, and whether they have relationships to other tables.

This could be a project for your DevOps team to take on, or you could look into a third-party tool such as dbForge Documenter for SQL Server. That doesn’t cover everything, and that tool in particular only works for SQL Server, but a similar tool can simplify a lot of people’s lives. Teams will still need to describe what each table and column means, but the initial work of gathering the basic information can be tracked automatically.
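As a sketch of what the home-grown version could look like (the connection string is a placeholder, and a daily JSON snapshot is just one possible output format), a scheduled job could pull table and column metadata from INFORMATION_SCHEMA and write a snapshot that analysts can browse or diff:

```python
# catalog_scan.py -- sketch of a scheduled metadata scan; connection string is a placeholder
import json
from datetime import date

import sqlalchemy as sa

ENGINE = sa.create_engine("mssql+pyodbc://user:password@my_dsn")  # placeholder

QUERY = """
SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
ORDER BY TABLE_SCHEMA, TABLE_NAME, ORDINAL_POSITION
"""


def scan_catalog() -> dict:
    """Group columns by table so each table gets one catalog entry."""
    catalog = {}
    with ENGINE.connect() as conn:
        for schema, table, column, dtype in conn.execute(sa.text(QUERY)):
            catalog.setdefault(f"{schema}.{table}", []).append(
                {"column": column, "type": dtype}
            )
    return catalog


if __name__ == "__main__":
    # One snapshot per day; diffing snapshots shows new or changed tables.
    with open(f"catalog_{date.today()}.json", "w") as f:
        json.dump(scan_catalog(), f, indent=2)
```

Running something like this on a schedule handles the tedious part of keeping the catalog current; people only have to fill in the descriptions.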
​
This can reduce necessary but repetitive work and make everyone’s life a little easier.



Provide Easier Methods To Share And Track Analysis

This one is geared specifically towards data scientists.

Data scientists will often do their work in Jupyter notebooks and Excel files that only they have access to. In addition, many companies don’t enforce the use of a repository like git, so data scientists don’t version control their work.

This limits the ability to share files as well as keep track of changes that can occur in one’s analysis over time. 

In this situation, collaboration becomes difficult because co-workers are stuck passing files back and forth and version controlling by hand. Typically that looks like file names with suffixes like _20190101_final, _20190101_finalfile…

For those of you who don’t recognize that pattern, hopefully you never will.

On top of this, since many of these Python scripts rely on multiple libraries, it can be a pain to pip install all the correct versions into your environment.
All of these small difficulties can honestly cost a day or two of troubleshooting, depending on how complex the analysis you are trying to run is.
However, there are plenty of solutions!

There are actually a lot of great tools out there that can help your data science teams collaborate. This includes companies like Domino Data Lab. 

Now, you can always use git and virtual environments as well, but that demands that your data scientists be proficient with those technologies, which is not always the case.

Either way, the goal is to allow your teams to work independently while still sharing their work easily.

Data Cultural Shift

Adding in new libraries and tools is not the only change that needs to happen when you are trying to create a company that is more data driven. A more important and much more difficult shift is cultural. 

Changing how people look at and treat data is a key aspect, and it is very challenging. Here are a couple of reasons why.

Data Lies

For those who haven’t read the book How to Lie with Statistics, spoiler alert: it is really easy to make numbers tell the story you want.

There are a lot of ways you can do this.

A team can cherry-pick the statistics that help their agenda triumph. Or perhaps a research team ignores confounding factors and reports a statistic that seems shocking only because the other variables weren’t considered.

Being data driven as a company means developing a culture that looks at statistics and metrics and checks that nothing is interfering with the numbers. This is far from easy when it comes to data science and analytics.

Most metrics and statistics come with stipulations that could negate whatever message they seem to carry. That is why creating a culture that looks at a metric and asks why is part of the process. If it were as simple as getting outputs and p-values, data scientists would be out of a job, because there are plenty of third-party products that find the best algorithm and do feature selection for you.

But that is not the only job of a data scientist. They are there to question every p-value and really dig into the why of the number they are seeing.

Data Is Still Messy

Truth be told, data is still very messy. Even with today’s modern ERPs and applications, bad data sometimes gets through and can mislead managers and analysts.

This can happen for a lot of reasons: how the applications manage data, how the system admins of those applications modified them, and so on. Even changes that seem insignificant from a business-process perspective can majorly impact how data is stored.

In turn, when data engineers pull data, they might not represent it accurately because of bad assumptions and limited knowledge.

This is why just having numbers is not good enough. Teams also need a good sense of the business and the processes that create the data, to ensure messy data doesn’t end up in the tables analysts use directly.

Our perspective is that data analysts need confidence that the data they are looking at correctly represents the corresponding business processes. If analysts have to remove data, or consistently add joins and where clauses just to represent the business accurately, then the data is not “self-service”. This is why, whenever data engineers create new data models, they need to work closely with the business to make sure the correct business logic is captured and represented in the base layer of tables.

That way, analysts can have near 100% trust in their data.

Conclusion

At the end of the day, creating an effective data culture requires both a top-down and a bottom-up shift in thinking. At the executive level, decisions need to be made about which key areas can make access to data easier. Then teams can start becoming more proficient at actually using data to make decisions. We often find that most teams spend too much time on data tasks that need to get done but could be automated. Improving your company’s approach to data can provide a large competitive advantage and give your analysts and data scientists the ability to work on projects they enjoy and that help your bottom line!

If your team needs data consulting help, feel free to contact us! If you would like to read more posts about data science and data engineering, check out the links below!

Using Python to Scrape the Meet-Up API
The Advantages Healthcare Providers Have In Healthcare Analytics
142 Resources for Mastering Coding Interviews
Learning Data Science: Our Top 25 Data Science Courses
The Best And Only Python Tutorial You Will Ever Need To Watch
Dynamically Bulk Inserting CSV Data Into A SQL Server
4 Must Have Skills For Data Scientists
What Is A Data Scientist

Using Python to Scrape the Meet-Up API

8/9/2019


 
We recently posted some ideas for projects you could take on, to add to your resume and help you learn more about programming.

One of those projects involved scraping the Meet-up and Eventbrite APIs to create an aggregate site of events.

This is a great project and it opens up the opportunity to take on several concepts. You could use this idea to make an alerting system — the user inputs their API keys to track local events they have an interest in. You could develop a site to predict which live acts will become highly popular, before they get there, by tracking metrics over time.

Honestly, the APIs give a decent amount of data, even to the point of giving you member names (and supposedly emails too, if the member is authenticated). It’s a lot of fun, and you can use this data as the basis of your own site!

To Start

To start this project, break down the basic pieces you will need to build the backend. More than likely you will need:
  • An API “Scraper”
  • Database interface
  • Operational database
  • Data-warehouse (optional)
  • ORM
To start out, you will need to develop a scraper class that is agnostic of the specific API call you’re making. That way, you can avoid having to make a specific class or script for each call. In addition, when the API changes, you won’t have to spend as much time going through every script to update every variable.

Instead, you’ll only need to go through and update the configurations.
That being said, we don’t recommend trying to build a perfectly abstracted class, with no hard-coded variables, right from the beginning. That can be difficult, and if anything goes wrong or doesn’t work, it is harder to debug because of the layers of abstraction.

We’ll start by trying to develop pieces that work.

The first decision you need to make is where the scraper will be putting the data. We’re creating a folder structure in which each day has its own folder.

You could use a general folder on a server, S3, or a similar raw file structure. These make it easy to store the raw data, which we’re saving as JSON files. Other storage formats, like CSV and TSV, are thrown off by the way the event description data is formatted.

Let’s take a look at the basic script, and think about how you could start better configuring and refactoring the codebase.
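The embedded script isn’t reproduced here, so below is a minimal sketch of what it might look like. The endpoint path, parameter names, and key handling are assumptions for illustration; check the current Meetup API documentation before relying on them.

```python
# scraper_sketch.py -- illustrative only; endpoint and parameters are assumptions
import json
import os
from datetime import date

import requests

API_KEY = "YOUR_MEETUP_API_KEY"  # hard-coded for now; this is what we'll move into config
BASE_URL = "https://api.meetup.com"


def scrape(endpoint="/find/upcoming_events", **params) -> dict:
    """Call one endpoint and return the parsed JSON response."""
    params["key"] = API_KEY
    response = requests.get(BASE_URL + endpoint, params=params)
    response.raise_for_status()
    return response.json()


def save_raw(payload: dict, name: str) -> str:
    """Dump the raw JSON into a folder named after today's date."""
    folder = os.path.join("data", str(date.today()))
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, f"{name}.json")
    with open(path, "w") as f:
        json.dump(payload, f)
    return path


if __name__ == "__main__":
    events = scrape(topic_category="tech")  # parameter name is illustrative
    save_raw(events, "upcoming_events")
```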
One place to start, right off the bat, is the API key. While you’re testing, it’s easy to hard-code your own API key. But if your eventual goal is to allow multiple users to access this data, then you will want their API keys set up instead.

The next portion you will want to update is the hard-coded references to the data you are pulling. This hard-coding limits the code to working with only one API call. One example is how we reference the different endpoints and the fields we would like to pull from what is returned.

For this example, we are just dumping everything to JSON. Perhaps you want to be more selective; in that case, you might want to configure which columns are attached to each endpoint. For example:
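The original configuration isn’t shown, but a hypothetical version might map each call to its endpoint and the fields you want to keep, along the lines of:

```python
# config.py -- hypothetical endpoint configuration; names and fields are illustrative
API_CONFIG = {
    "upcoming_events": {
        "endpoint": "/find/upcoming_events",
        "fields": ["id", "name", "local_date", "yes_rsvp_count", "venue"],
    },
    "groups": {
        "endpoint": "/find/groups",
        "fields": ["id", "name", "members", "category"],
    },
}
```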
This allows you to create a scraper that is agnostic of which API endpoint you are using. It puts the settings outside the code, which can be easier to maintain.

For example, what happens if Meet-up changes the API endpoints or column names? Well, instead of having to go into 10 different code files you can just change the config file.

The next stage is creating a database and an ETL process: a system that automatically parses the data from the JSON files into an operational-style database. This database can be used to help track events that you might be interested in. In addition, creating a data warehouse could help track metrics over time.
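As one possible sketch of that loader (SQLite and the field names are stand-ins for whatever operational database and schema you actually choose, and they assume the config example above), it could walk a day’s folder and upsert events:

```python
# load_events.py -- sketch of loading raw JSON into an operational database (SQLite as a stand-in)
import glob
import json
import sqlite3


def load_folder(db_path: str, folder: str) -> None:
    """Parse every JSON dump in `folder` and upsert the events it contains."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS events (
               event_id TEXT PRIMARY KEY,
               name TEXT,
               local_date TEXT,
               yes_rsvp_count INTEGER
           )"""
    )
    for path in glob.glob(f"{folder}/*.json"):
        with open(path) as f:
            payload = json.load(f)
        # The "events" key and field names assume the config sketch above; adjust to your pull.
        for event in payload.get("events", []):
            conn.execute(
                "INSERT OR REPLACE INTO events VALUES (?, ?, ?, ?)",
                (
                    event.get("id"),
                    event.get("name"),
                    event.get("local_date"),
                    event.get("yes_rsvp_count"),
                ),
            )
    conn.commit()
    conn.close()


if __name__ == "__main__":
    load_folder("meetup.db", "data/2019-08-09")
```

Rerunning the loader after each scrape keeps the operational table current.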

Perhaps you’re interested in the rate at which events have people RSVP, or how quickly events get sold out.

Based on that, you could analyze what types of descriptions or groups tend to run out of slots quickly.
There is a lot of fun analysis you could take on.

Over the next few weeks and months, we’ll be working to continue developing this project. This includes building a database, maybe doing some analysis, and more!

We hope you enjoyed this piece!

If you enjoyed this piece about software engineering, then consider these posts as well!
The Advantages Healthcare Providers Have In Healthcare Analytics
142 Resources for Mastering Coding Interviews
Learning Data Science: Our Top 25 Data Science Courses
The Best And Only Python Tutorial You Will Ever Need To Watch
Dynamically Bulk Inserting CSV Data Into A SQL Server
4 Must Have Skills For Data Scientists
What Is A Data Scientist
​
