
SQL Best Practices — Designing An ETL Video

2/9/2019

1 Comment

 
Data engineering has many facets. One of the most common projects a data engineer takes on is developing an ETL pipeline from an operational DB to a data warehouse. Our team wanted to cover the overarching design of an ETL.

What are the typical principal components, stages, considerations, etc?

We started by first writing Creating An ETL Part 1 (more to come), and we have now put together the video below that walks through the process. We wanted to discuss why each stage is important, what occurs when data goes from raw to staging, why we need a raw database, and so on.

Data engineering is a complex discipline that combines automation, programming, system design, databases, and analytics in order to ensure that analysts, data scientists and end-users have access to clean data.
​
This all starts with the basic ETL design.
We are practicing for a webinar that we will be hosting on ETL development with Python and SQL. The webinar itself will be much more technical and dive much deeper into each component described in the video. However, we wanted to see what using the whiteboard was like in case we need it.
If you would like to sign up for the free webinar we will be hosting it on February 23rd at 10 AM PT. Feel free to sign up here! If you have other questions please do contact us.

Read more about data science, healthcare analytics and automation below:
​How Algorithms Can Become Unethical and Biased
How To Develop Robust Algorithms
4 Must Have Skills For Data Scientists
Time Series Model Development: Basic Models
How To Measure The Accuracy Of Predictive Models
What is A Decision Tree


Best Practices — Creating An ETL Part 1

2/3/2019

0 Comments

 
An ETL (and its not-so-far-off cousin, the ELT) is a concept that is not usually taught in college, at least not in undergrad courses. Yet, ETLs play an important role in almost every company's day-to-day operations.

ETLs are the pipelines that populate data into business dashboards and algorithms that provide vital insights and metrics to managers. These pipelines go from very simple processes that extract data from operational databases into a data warehouse to complex analytical layers that support more complex transformations and applications.

The problem is, many of our IT graduates are not taught this very important skill, one they will more than likely use every day if they pursue a career in data science or data engineering. Although data science work doesn't technically include ETLs, a lot of the work data scientists do ends up crossing over with data engineering.
​
We wanted to cover the basics of an ETL and good ETL design.
What does a good ETL look like, what are the key components, what makes a pipeline maintainable and healthy? These are all important questions data engineers and database managers need to ask before taking on new projects.

This will be the first of several posts. In this post, we will be covering some basic concepts such as extracting/loading raw data and logging. These are important topics that have a lot of nuances that we would like to start to cover.

We will also be offering a free webinar on February 23 at 10 AM PT to cover this topic. We would love to have you attend!

Operational To Raw

Operational Database
All of the data your ETL ingests will have some source that represents real-life transactions of one sort or another. The data could represent patients visiting hospitals, people posting on social media or customers purchasing products. Some might ask why you can't just analyze data from the operational database instead of going through the extra steps of processing the data into another system.

The biggest issue is that trying to run analytical processes that aggregate and analyze large data sets from the operational database will slow down the operational database for customers. In addition, operational databases are heavily normalized, which makes them inefficient to query from an analytical standpoint. This is why a data warehouse of one form or another is developed.

Here is an example of one method for bulk inserting data into SQL Server.
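As a minimal sketch (assuming a hypothetical dbo.RawCustomers table whose columns mirror the CSV extract, and reusing the example file name from the naming convention discussed below), a BULK INSERT might look like this:

BULK INSERT dbo.RawCustomers
FROM 'C:\etl\raw_extracts\20170101_20160901_20161231_Customers.csv'
WITH (
    FIELDTERMINATOR = ',',  -- CSV delimiter
    ROWTERMINATOR = '\n',   -- one record per line
    FIRSTROW = 2            -- skip the header row
);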

The first step in the process is to extract the data into some form of raw format.

Raw File Extracts
How the data is pulled from the operational database can vary. However, one of the most consistent ways is to extract the data into a raw file like a CSV or XML file. Extracting the data into a raw file has several benefits. First, it limits the interactions outside systems have with the operational database, which avoids blocking important operational tasks that need to occur. In addition, it creates a snapshot of what the data looked like at the time of the pull. This can be beneficial if you ever need to reload data because of an issue in the pipeline.

Having this backup of data is beneficial just in case someone accidentally deletes crucial data; then you will be very happy to have raw extracts that are easy to reload. These raw extracts should be logged in a table that keeps track of which files have been loaded and where they live. That way you can easily automate any reloading. The files should also follow a naming standard.

An easy example would be DateExtracted_DateRangeInTheFile_BusinessObject (e.g. 20170101_20160901_20161231_Customers.csv)

Metadatabase And Logging
Tracking what is occurring inside your ETL process is a crucial step in developing an automated, maintainable and robust system. How extensive your tracking and logging needs to be depends on how large your company is and how many ETLs/ELTs and other data transformations need to occur. Larger companies might have a standardized tool like Airflow to help manage DAGs and logging.

However, if you are a start-up or a non-tech company, it will probably be OK to have a simplified logging system. The purpose of this system is to manage the files that need to be loaded or have been loaded, and to track which stored procedures have run, errors, dependencies, etc.

Having good insight into these aspects of your system will help you maintain it, improve it and backfill any data that has either been missed or loaded incorrectly. This means you need to actually create 3–4 tables (at least) that track what is running, how long it takes, how much computation it uses, etc. Especially as we push for more and more automated systems, this will be the key to making sure they are maintainable.
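As a rough sketch of what that might look like (the table and column names here are assumptions, not a prescribed schema), two simple tracking tables could be:

-- One row per raw extract file
CREATE TABLE dbo.EtlFileLog (
    FileLogId      INT IDENTITY(1,1) PRIMARY KEY,
    FileName       VARCHAR(255) NOT NULL,  -- e.g. 20170101_20160901_20161231_Customers.csv
    FilePath       VARCHAR(500) NOT NULL,
    BusinessObject VARCHAR(100) NOT NULL,  -- e.g. Customers
    LoadedFlag     BIT NOT NULL DEFAULT 0,
    LoadedDate     DATETIME NULL
);

-- One row per stored procedure or job run
CREATE TABLE dbo.EtlRunLog (
    RunLogId     INT IDENTITY(1,1) PRIMARY KEY,
    JobName      VARCHAR(100) NOT NULL,
    StartTime    DATETIME NOT NULL,
    EndTime      DATETIME NULL,
    RowsLoaded   INT NULL,
    ErrorMessage VARCHAR(4000) NULL
);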

Loading Raw Data Into Raw Tables
Once you have extracted the raw data from the operational databases and updated your metatable with the new file information, then you can start focusing on loading the data.

It can be tempting to add a lot of complex business logic to your initial loads into raw. It can seem to simplify the process and create fewer steps to load the data, so why add more steps? The truth is that adding too much logic too early can make it difficult to track where errors occur. Did the error occur in the extract, in the business logic or somewhere else? In addition, it becomes more difficult to add new logic because you have to analyze where the best place to put it will be.

All of these extra considerations make the overall system more difficult to maintain. That is why it is important to just load the raw data as is. If there is de-duplication logic or mapping that needs to happen then it can happen in the staging portion of the pipeline.
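For example, a hedged sketch of de-duplicating raw data as it moves into staging (the dbo.RawCustomers and stg.Customers names are assumptions) could use ROW_NUMBER to keep the most recent record per key:

INSERT INTO stg.Customers (CustomerId, CustomerName, LoadDate)
SELECT CustomerId, CustomerName, LoadDate
FROM (
    SELECT CustomerId
         , CustomerName
         , LoadDate
         , ROW_NUMBER() OVER (PARTITION BY CustomerId
                              ORDER BY LoadDate DESC) AS rn
    FROM dbo.RawCustomers
) d
WHERE rn = 1;  -- keep only the most recent record per customer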

The next steps after loading the data to the raw database are QA and loading data into the staging database. We will continue that discussion in our next post. If you want to learn more about ETL development/automation then sign up for our ETL webinar where we will be discussing the process of creating an ETL from end to end.

Read more about data science, healthcare analytics and automation below:
Calculating A Running Total SQL
How To Develop Robust Algorithms
4 Must Have Skills For Data Scientists
Time Series Model Development: Basic Models
How To Measure The Accuracy Of Predictive Models
What is A Decision Tree
Automation Best Practices

Calculating A Bucketed Median With SQL

1/29/2019

0 Comments

 
Calculating the median has never seemed straightforward in SQL. Some SQL engines don't even have a direct median function. Typically, you run a function on raw data, so at least you know how the aggregated data looks.

However, let's throw in a curveball. What if all the data was already aggregated and you weren't able to access the raw data? Instead, you only had a table at a higher level.
​

For example, let’s say the data looked like the data below.
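As a hedged, purely illustrative sketch (the Age_Exp table name and its Age and Total columns match the query further down; the values are made up), that aggregated table might look like:

CREATE TABLE Age_Exp (
    Age   INT,  -- the age bucket
    Total INT   -- how many people fall into that bucket
);

INSERT INTO Age_Exp (Age, Total)
VALUES (25, 100), (26, 250), (27, 300), (28, 150), (29, 200);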
It might seem strange to have data stored this way, but there are cases where condensing the data in this fashion might make sense. Let's say your team is storing billions of transactions per day and you want to run a median calculation over several years.

In this case, it makes more sense to keep the data aggregated because it will improve the performance. You might not even be able to properly store the transactions if it isn’t aggregated.
It could be tempting to "unroll" the data, so to speak: take the data and create a new table that has a row for each instance of a person at a specific age. This could cause your rows to grow significantly, to terabytes upon terabytes. It is also just another step that might not need to be done.
Instead, let's look at this query.

SELECT TOP(1) Age
FROM (
    SELECT Age
         , SUM(Total) OVER (ORDER BY Age) AS runsum
         , SUM(Total) OVER () AS total
    FROM Age_Exp
) AS d
WHERE runsum >= total / 2
ORDER BY Age;

This query is short and sweet, and you will also notice it brings in the running total we discussed recently (video). The goal of the median is to find the value halfway through all the data, so you need to know the total and find the value halfway through it.

The way this works: using the rolling sum, you can figure out where the halfway point ends up, because the running total eventually adds up to the total value.

Then, using the other analytic function that gets the overall sum, you can divide it in half and select only the first row at or past that point. That top value is the median.
​
We would love to know your thoughts on this problem. Also, if you have any problems you would like to discuss then please feel free to contact us! Thanks for reading.
Read more about data science, healthcare analytics and automation below:
​How Algorithms Can Become Unethical and Biased
How To Develop Robust Algorithms
4 Must Have Skills For Data Scientists
Time Series Model Development: Basic Models
How To Measure The Accuracy Of Predictive Models
What is A Decision Tree

Automation Best Practices

1/20/2019

0 Comments

 
by Ben Rogojan
Automation isn't always about automated cars and drones that will deliver our purchases to our doorsteps. The goal of automation is to make people's lives easier and keep them from having to come in to work on Saturdays. There are a lot of tasks in the workplace that can still be automated to avoid late nights and long weekends stuck at work. This doesn't mean that automation in itself is easy to implement. Developing automated systems is a discipline; it requires patience, robust design and clearly thinking through the process.
​
There are good practices that can be overlooked when implementing an automated system, and overlooking them can create a lot of tech debt. This will weigh down your system and waste the time of future developers who either have to maintain or recreate it. To avoid this, an automated system should have solid error tracking and QA and be easy to maintain. Here are a few great tips for developing an automated system.

Error And Logging Tracking
Logging and error tracking does not come across as exciting work. Even developing basic tracking functionality can seem tedious. However, it is a necessary step for developing an automated system. Logging and error tracking can take place in several layers.

The simplest form is to track everything that occurs in raw files. This has the least overhead because it doesn't need to be stored in a specific data system. It allows for low-level maintenance and future data parsing if required. Creating standardized logs allows large amounts of unstructured data to be stored outside of a database, and with modern technology and data storage systems, searching through these files is much faster than it used to be.
Having higher-level logs with less information stored in a standard data warehouse allows for tracking of issues and unhealthy systems. This can be used to detect issues such as slow load times, persistent bugs, etc. Unlike unstructured data, storing high-level issues in a data warehouse allows for easy views into trends and supports future analysis.
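As a minimal sketch, assuming a hypothetical dbo.EtlRunLog table that stores a start and end time for each job run, surfacing slow load times from those higher-level logs might look like:

SELECT JobName
     , AVG(DATEDIFF(minute, StartTime, EndTime)) AS avg_minutes
     , MAX(DATEDIFF(minute, StartTime, EndTime)) AS worst_minutes
FROM dbo.EtlRunLog
WHERE StartTime >= DATEADD(day, -30, GETDATE())  -- last 30 days of runs
GROUP BY JobName
ORDER BY avg_minutes DESC;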

Especially as more complex automated systems come to fruition, being able to track and analyze errors and trends in usage will be a game changer, whether for automated cars, e-commerce, or the world of IoT devices. The only way to manage the sheer volume of transactions and communication protocols between the various systems will be with automation. Error tracking and logging can't be an afterthought in these automated systems; otherwise, we will never be able to manage the holes in the systems we create.

QA And Alerting
Error logging is for issues that occur after release. The hope is that we squash as many bugs as possible before then. The problem is that automation places a lot of trust in the programmers and developers who set up the systems. Even when you try to think through all the possible ways automated systems can go wrong, there are always failures, so make sure to develop a QA suite that sweeps through the data and logs to confirm that whatever automated processes are occurring are occurring correctly.
​
There are basic QA tests that just make sure systems are running, and then there are more complex systems that can take more of a machine learning approach, looking for patterns and data profiles that could point to process issues. For instance, sometimes null values are normal, but sometimes they occur because of a specific issue. This could be an outage in a particular region, an update that works well on one OS and not another, or various other issues that might not be detected by standard integration and unit tests.

Thus, it is worth going more in depth and developing a QA system that can run basic tests to ensure systems are functioning, as well as test for more complex issues. Again, this is somewhere we would advocate implementing a form of machine learning/data science system that can profile data to detect non-specific issues.

Simple Vs. Complex
One of the reasons automation gets a bad reputation is the fact that, when poorly designed, it quickly becomes a nightmare. Anti-patterns, over-customizability, and poor design in general make code difficult to maintain. This will take more time to manage instead of saving time. Instead of dealing with a hands-off system, you are managing a tech-debt-ridden monstrosity that sucks up developer time.
​
If anyone recalls the Rube Goldberg machines we built in middle school or elementary school, you will recall that in order to do a simple task we over-complicated and over-designed the system. This is a great analogy for some automated systems. In some cases, the task you were attempting to automate could have been done more simply, or the system you developed is just too complex. Over-engineering can be a serious problem that makes it very difficult for future engineers to maintain the system. Over-abstracting code, developing too many functions and too many plausible paths can make code unmaintainable and untestable.

This often won’t be apparent until the developer that wrote the code leaves and the other programmers need to pick up where they left off.

This calls for good design that doesn't try to be too smart or over-engineered. Most professions suffer from this issue: chefs try to put too much on one dish, or architects try to put too many features into a building. At the end of the day, the best chefs, architects and engineers understand that sometimes simple exceeds elegant. This doesn't mean that concepts like abstraction and object orientation are bad. It's just that some developers place these concepts above maintainable code. There is always a balance.

Dependency Management
Automated systems and processes have a lot of moving parts. This means at some point some components will need to occur in sequence and some will need to wait until multiple dependencies have completed in order to ensure the safety and health of the system. As more and more automated systems get developed to interact with people, beyond just robot arms behind safety walls and yellow tape, their systems will have to have multiple dependencies to ensure that the tasks they are performing are both correct and safe.

No matter if the system is interacting with people or if it is operating in the confines of a computer, having simplistic dependency management that allows for the developers to easily draw the paths makes it much easier to ensure that the system won’t run when it shouldn’t.

Automation does make life simpler when it is developed purposefully and fully thought through. This means planning how the error will be tracked and how the system will manage and alert larger problems. Designing a robust and maintainable system is not easy. However, when correctly implemented it can improve your team’s productivity and simplify your life.

If instead, your automation project goes a bit off the rails, then please feel free to contact our team!

Interested In Reading More About Being A Better Data Scientist?
How To Grow As A Data Scientist
Boosting Bagging And Building Better Algorithms
How To Survive Corporate Politics As A Data Scientist
8 Top Python Libraries For Machine Learning
What Is A Decision Tree
4 Ways To Calculate A Rolling Sum With SQL

Problem Solving With SQL

1/10/2019

1 Comment

 
SQL is the language of data. It can be used to answer countless questions in any company (unless that company still runs 100% off mainframes).
The ubiquity of data makes SQL one of the most utilized technical skills whether you are a data engineer, analyst, or data scientist.

Even at companies that rely on larger data sets stored in systems and tools like Hadoop, Hive, and Presto, SQL remains important. It is one of the few "languages" you can learn once and only have to make minor syntax changes as you switch data storage systems. SQL can seem limited when you first start out because you haven't approached a diverse set of questions. Once you get past JOINs, loops and some other basic concepts in SQL like subqueries, analytic functions, and aggregation, it can seem like you understand everything you will need to know.

In an odd way, you are right. You know all the basic building blocks to do a lot of complex work. But just because you know the building blocks doesn’t mean you know all the applications. It is like saying you understand addition, multiplication, division, and subtraction thus you must also understand how algebra works.

Algebra uses all the basic concepts but applies them in a new way. In the same way, basic SQL concepts can be applied to complex problems, but it doesn’t mean you know how.
How these problems are solved isn’t always consistent. How you may answer a question in Oracle and Mysql could be different in Redshift or SQL Server. These problems can be analytical, performance and optimization or even design.

Just because you are using many of the same commands does not mean you are using the same approach.

Let’s talk about some of these problems.

Analytic Functions
Before we start, we wanted to talk about a useful tool to help solve a lot of problems.

These are called analytic functions. These might not be the first things you learn, but once you do, you will probably start using them everywhere. If you don’t already have a decent grasp of them then we recommend you get a quick intro.

One quick note: not all SQL engines have analytic functions built in, so you will have to use workarounds. We will discuss a few below.

Calculating A Rolling Sum
Let's say you want to calculate a rolling sum for the profits of a food truck on a daily basis. For now we won't complicate it by adding in month or year. Simply a rolling sum for all time.

How would you do that with SQL? Any ideas?

Well, there is actually a really simple way to do this with analytic functions.


SELECT
    SUM(Profit) OVER (PARTITION BY food_truck_id ORDER BY day) AS rolling_sum
    , food_truck_id
    , day
FROM f_daily_sales

That is it…you’re done!
Wow! That was really simple. Using this quick script you were able to calculate the rolling sum for all the food trucks. That was almost too easy!

Now, in case you don’t know, using this style of “function() over (partition by x order by y) “ clause is what is known as either an analytic or window function. It allows the end user to do something similar to a group by, but instead of grouping under a specific set of fields, it partitions by that set of fields and then applies a function using the order by. This function can be a row_number, rank, sum, avg, etc. All of these functions essentially work in a rolling fashion based on the order by statement. Row_number will add a row number for the group of fields and restart at one once the group of fields is different.

As we said, analytic functions are very useful. So useful that sometimes interviewers don’t like you using them. So let’s say you’re in an interview and the interviewer says that they want you to resolve this problem without the analytic function…

Did your stomach just drop? Maybe this was the only way you knew how to solve this problem. Take a moment and think about how you could solve this problem with the basic SQL tools you have. Think about what the SQL statement is actually asking.

SELECT
    day
    , SUM(Profit) AS rolling_sum
FROM (
    SELECT
        t1.day
        , t2.Profit
    FROM f_daily_sales t1
    JOIN f_daily_sales t2
        ON t1.day >= t2.day
) t3
GROUP BY day
​
In this case, using a self join with a greater-than-or-equal-to condition creates a table like the one below.
(Table: each day from t1 paired with the profit of every earlier or equal day from t2.)
As you can see, you will be able to group on the t1.day field and get a rolling sum effect. One of the major downsides depends on the amount of data: depending on how the data is structured and how far back you would like to go, this method could cause a system to run out of memory.
If by any chance this is a problem, here is another thought: you could create a summary table that aggregates the total per day.

This depends on how you are partitioning the rolling sum. For instance, this will work well as long as you know you will consistently be reporting the rolling sum on a monthly and yearly basis. This locks your design in, but it adds speed and reduces computation when you actually query. Again, this is more of a design question; the right answer depends on your overall goals.
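As a rough sketch of that idea (the f_daily_sales_summary name is an assumption), you could materialize a per-truck, per-day summary once and run the rolling sum over the much smaller table:

-- Aggregate transaction-level rows down to one row per truck per day
SELECT food_truck_id
     , day
     , SUM(Profit) AS daily_profit
INTO f_daily_sales_summary
FROM f_daily_sales
GROUP BY food_truck_id, day

-- The rolling sum now scans far fewer rows
SELECT food_truck_id
     , day
     , SUM(daily_profit) OVER (PARTITION BY food_truck_id ORDER BY day) AS rolling_sum
FROM f_daily_sales_summary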

Joining consecutive events
Another question you might be asked is to provide insight into consecutive events for patients, customers, or other entities. For instance, you might want to calculate the number of days between the times a customer visits a store or your website. Maybe you want a simple flag that dictates whether or not a patient is readmitted within x days of their initial visit. At a basic level, these have a similar goal.

You need to know how many days are between two different transactions.

Most transactions are lined up row by row, so you can’t simply go to the next row to figure out this information.
This is where an analytic function can work again. Using the row_number function, you can partition the data by each customer or patient and add a row number to each event, as in the temp table created below.

Using this as your base data, you can self join the table, using a +1 on the row number to join each row to the following row.

SELECT
    Customer_ID
    , ROW_NUMBER() OVER (PARTITION BY Customer_ID ORDER BY date) AS r_num
    , date
    , total_paid
INTO #Customer_Transactions
FROM Customer_Transactions


SELECT
    t1.Customer_ID
    , DATEDIFF(dd, t1.date, t2.date) AS days_to_next_visit  -- days between this visit and the next one
FROM #Customer_Transactions t1
JOIN #Customer_Transactions t2
    ON t1.Customer_ID = t2.Customer_ID
    AND t1.r_num + 1 = t2.r_num

With the table self-joined like this, each row can be looked at side by side with the one that follows it, and you can learn a lot more. You can answer questions like: how many days were there between each event on average, or how many times did customers return in less than 15 days?
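Building on the self-joined temp table above, a hedged example of answering those two questions per customer might look like:

SELECT
    t1.Customer_ID
    , AVG(DATEDIFF(dd, t1.date, t2.date) * 1.0) AS avg_days_between_visits
    , SUM(CASE WHEN DATEDIFF(dd, t1.date, t2.date) < 15 THEN 1 ELSE 0 END) AS returns_within_15_days
FROM #Customer_Transactions t1
JOIN #Customer_Transactions t2
    ON t1.Customer_ID = t2.Customer_ID
    AND t1.r_num + 1 = t2.r_num
GROUP BY t1.Customer_ID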

This can be a great starting point for analysis focused on retention or policy improvement. You can start to see where your problem areas are, and possibly notice trends between stores or doctors based on how they perform.

Another approach that some people like to take is looping through the data. This can get a little more complex and is usually slower, but it is another solution we won’t go over in this post.

These were two examples of solving SQL problems. We will continue to post various problems and different ways you can solve them. We might even re-approach the same problems above. That is what is great about SQL: depending on how your data is laid out, your limitations force you to use different solutions. You could have worked somewhere with SQL Server and gotten accustomed to temporary tables that allow you to put indexes on them. Then you move to a database that no longer allows temp tables and your performance goes way down.

We would love to come and help increase your team's analytical and SQL abilities. Please feel free to reach out! Also, please let us know if you have any questions or problems you are working through. We would love to poke around on new problems!

Interested In Reading More About Being A Better Data Scientist?
How To Grow As A Data Scientist
Boosting Bagging And Building Better Algorithms
How To Survive Corporate Politics As A Data Scientist
8 Top Python Libraries For Machine Learning
What Is A Decision Tree


Where Can We Reduce Medical Spending With Automation

12/29/2018

0 Comments

 

During our years working for hospitals and insurance providers, we saw many similar issues that caused unnecessarily increased costs. Healthcare providers, like many other industries, are facing drastically increased costs and decreased margins. Unlike tech companies, which had the ability to develop technology to automate and scale their basic operations, hospitals and healthcare providers have not. This has caused administrative and analytical costs to grow year over year because of the lack of automation and process improvement. Some of the biggest costs that we have seen weigh heavily on healthcare providers are billing and financial processes, fraud detection and third-party contract management. Automation and process improvements are needed in these three areas if healthcare providers want to start reducing some of their biggest costs.


Billing And Financial Analysis
Managing billing and financial analysis can be very tedious tasks that require a combination of accounting discipline and entrepreneurial spirit. They are necessary practices that need to happen to manage expenses and revenue.

The problem is that even billion-dollar healthcare organizations lack automated billing and financial processes. Instead, hundreds of thousands to millions of dollars are often manually managed in Excel spreadsheets. The processes to get the data and then slice and dice it waste analysts' time. There are lots of ways these financial tasks can start to weigh on financial teams, as they are not scalable. As hospitals and insurance providers merge and grow, the problem amplifies and becomes even more difficult to manage manually. Typically, hospitals approach this problem by increasing their staff rather than creating a process or system that can continue to scale basic financial analytics and billing. This leads to increased operational costs that are difficult to manage without laying off employees (and this isn't even considering the costs of hiring, firing, rehiring…).

The solution is automation (mic drop…).

Okay, automation is easier said than done.

It requires a combination of buy-in from multiple stakeholders to financially back the projects and trust from the directors and managers who read the reports and output of these automated systems. However, when executed well, these systems can reduce hundreds of hours of manual work.
​
It seems complicated, but oftentimes automation, when done well, really isn't. Many times it actually simplifies the overall workload for your analysts. There might need to be some upskilling in basic SQL. However, we are living in a data-driven world, and SQL is the data language (even NoSQL databases have SQL layers, because that is what we humans understand).

If your team doesn't have the skill set to build the tools themselves, then find it. Automation consultants and engineers can be found both internally and externally. Once the main systems are built, they can be maintained by people with less knowledge of programming and automation. Automation will help provide consistency, reduce waste and allow your analysts to focus on more important work.

Fraud Detection
Fraud detection and adjudication are necessary practices for insurance providers. This is because healthcare fraud costs billions of dollars a year. It comes from patients as well as healthcare providers that charge millions of dollars of upcoded and wasteful procedures purely to bolster their bottom line. In fact, there are even consultants and websites that specialize in helping healthcare providers bill creatively and max out their claims. These practices don't just increase costs for the insurance provider; at the end of the day, they increase costs for patients.

Insurance providers have to pay analysts to manually go through claims looking for possible patterns of abuse. If you notice, the same problem as in billing is occurring here: the manual step. Healthcare billing and claims processing have gotten too big to handle manually. Manually processing claims just doesn't scale. This is where the tools of automation, big data, data warehousing, and analytics work well. They allow a specialist to create systems that munge the data effectively and scale even as the data grows.

Currently, most insurance providers have teams for fraud detection, but they often can only go through a small percentage of claims (even billion-dollar companies usually only go through about 5–10% of claims manually).

Sometimes they will even hire outside firms to, again, manually go through and look for low-hanging fruit. For instance, billing consultants found that 78% of 99215 codes in Wisconsin (the highest-level established patient office visit) were incorrectly used. This is a very easy issue to spot, and it becomes very expensive very fast.
​
All of this is usually limited to manual processes: getting data pulls from databases into Excel and then slicing and dicing the output. The beauty of automated methods is that they can quickly reduce the number of claims that have to be processed manually.

​
What is even better, once you have created a system, you can replicate the results at a regular cadence. Even as analysts leave and new hires take their place, it is much easier to train them to understand the results of a system that works than to have them relearn what to look for in the raw data. This increases the number of claims an insurance provider adjudicates while at the same time improving the efficiency of the process.

Many think you need complex machine learning algorithms to detect fraud, even though when you first start developing models your data isn't even classified yet. When starting out, it is important to focus on cutting the number of targeted claims your analysts need to look at by 60, 70 or 80%. This doesn't take a complex algorithm. It requires developing basic business rules that can help sort through the false positives.
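A hedged sketch of one such business rule, reusing the 99215 example above (the claims table and its columns are assumptions), might simply flag providers whose share of top-level office visit codes looks unusually high:

SELECT provider_id
     , COUNT(*) AS office_visits
     , SUM(CASE WHEN procedure_code = '99215' THEN 1 ELSE 0 END) * 1.0 / COUNT(*) AS pct_99215
FROM claims
WHERE procedure_code IN ('99211', '99212', '99213', '99214', '99215')  -- established patient office visits
GROUP BY provider_id
HAVING SUM(CASE WHEN procedure_code = '99215' THEN 1 ELSE 0 END) * 1.0 / COUNT(*) > 0.5  -- arbitrary review threshold, not proof of fraud
ORDER BY pct_99215 DESC;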

After properly managing claims and tracking which ones are fraud and which aren't, it becomes easier to develop a machine learning model because your data is classified (fraud, not fraud).

Fraud detection and adjudication is a slow and costly process for many healthcare providers. Automation and well-executed analytics offer thousands to hundreds of thousands of dollars in savings.

Third Parties And Contract Management
One of the ways to get some systems automated and integrated is to use third parties. Hospitals, insurance providers and other healthcare institutions are not tech companies. They don't focus on developing technical tools to help in their day-to-day operations, and this is OK! What is not OK is constantly signing contracts with new third parties for redundant features. For instance, paying for multiple data visualization tools is pointless: Tableau, Qlik, and OBIEE all provide reporting features, and yet some companies use all three and more. Similar things can be said about relying on multiple financial systems.

Having redundant third-party contracts causes several issues. The first, and most obvious, is uncontrolled operational costs. Besides the upfront cost of signing multiple contracts, it also requires more employees to manage the systems, manage the contracts and deal with the billing. These contracts are also difficult to break without paying a hefty early-termination fee. Thus, it is very important to reduce redundancy, because the total cost will be more than the cost of the contract itself.

The second issue is that these systems don't have easy ways to access the data behind them. Lacking access to your own data is a problem; it is difficult to make good financial decisions if you don't know what is happening in your institution. When you rely on third-party software, you need to be aware of the terms of your contracts and the features of the software. Otherwise, your expenses will grow seemingly from nowhere and you won't be able to make good decisions.
​
Administrative costs like billing, fraud detection, and contract management are driving up healthcare costs for insurance providers, healthcare systems and patients. With a little bit of automation and process management, many of these costs can start to be mitigated. Automating processes like billing and financial analysis can be done with a combination of SQL and Python. Contract management requires a combination of process improvement and analysis of previous contracts. In the end, it can all lead to scalable cost savings that occur systematically and don't constantly need manual intervention.

Does your team need help automating your data systems and analytics? Then please contact us today!

Contact Us!

Read more about data science, healthcare analytics and automation below:
How To Use R To Develop Predictive Models
Web scraping With Google Sheets
What is A Decision Tree
​How Algorithms Can Become Unethical and Biased
How To Develop Robust Algorithms
4 Must Have Skills For Data Scientists

Reducing Healthcare Costs With Data Analytics

12/27/2018

0 Comments

 
In 2018 it is estimated the US spent 3.5 trillion dollars on healthcare, a cost that is growing at nearly 5% a year. Rapidly growing pharmaceutical and administrative costs are just some of the driving factors behind this growth.

​This leaves the question, how are we going to reduce healthcare costs? We can’t ignore the problem. Otherwise, we will be spending way more than $15 for a Tylenol in 20 years.


The first part starts with recognizing some of the largest problems in healthcare. We pointed these out in an infographic here (fraud, waste, and administrative costs). There are nearly a trillion dollars of spending that could be improved if the correct processes and policies are put into place. However, simply knowing what the problems are is not enough. We need to know where and why they are occurring. This brings us to the first step in starting a healthcare analytics project.
​

​Develop Accessible and Accurate Data Systems

In order to find out where the problems are, you need data that accurately represents what is occurring in your hospital or healthcare system. Relying purely on anecdotal evidence is no longer sufficient. It's great as a starting point and can guide data scientists and engineers to develop the correct data warehouses and dashboards. All of this requires quality data that is accessible to the correct users.

Developing a data system (ETLs, data warehouses, etc.) that is both accurate and accessible to the right people is a necessary step in confirming what the biggest cost-saving opportunities are.
This means thinking about what data you have, what data you need, who needs to access it and what questions you are hoping to answer. Not all of these can be answered right away, so it would be unwise to start by developing data warehouses that are heavily adulterated from their original formats. That is why most data teams develop systems with the same basic pattern (described below).

You start with raw flat files that get ingested into a raw database. That data is run through a QA process just to double-check that it is clean and makes sense. Once the data is QAed, it is processed into a set of staging tables with minimal business logic added. Typically, this is more of a mapping phase: anything that needs to be standardized is standardized, duplicated data is cleaned up and data normalization may occur. The goal here is to limit the amount of business logic and focus more on creating an accurate picture of what the data represents, whether that is procedures done on patients, supplies used by a hospital or computers and devices used by employees. This data is just an abstract depiction of what is occurring at the healthcare provider.

Once the data is loaded successfully and QAed again (yes, QA should happen any time data has an opportunity to be adulterated), it can be loaded into the tables that the right users have access to.
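As a hedged example of what one of those QA checks might look like (the raw_procedures table and its columns are assumptions), a single query can report row counts, duplicate keys and missing required fields:

SELECT COUNT(*) AS total_rows
     , COUNT(*) - COUNT(DISTINCT procedure_id) AS duplicate_keys
     , SUM(CASE WHEN patient_id IS NULL THEN 1 ELSE 0 END) AS missing_patient_ids
FROM raw_procedures;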

This base data layer will give your data scientists the ability to create analytical layers that sit on top of it and populate various aggregate and metric tables.
​
This basic outline is tried and true. More than likely, your data warehouse team is already supporting a system similar to the one we described. Typically the problem lies in accessibility and silos. The data might be produced for various teams like finance or IT. However, it might all exist in separate systems with different keys and structures. This makes it difficult to find value.
At this point, there needs to be a much larger initiative to provide the correct accessibility as well as clear connections between the data systems. This is why using different keys for the same data across systems is a bad idea. Going into this would require a whole other set of blog posts and discussions so we will leave this with the basic data system.

Define Problems Clearly
As data engineers and data scientists we often learn about the subject matter we are working on through osmosis. We have to understand what the data represents to a point. However, it is difficult for us to know all the nuances of a specific subject and to understand all the context around the problems we are trying to solve.

That is why it is important to try to clearly state the questions you are trying to answer as well as provide as much context as possible. It gives data professionals a better picture of the problem they are trying to solve. The better they understand the problem, the clearer the solution becomes.

Otherwise, they could spend hours going in circles or answering the wrong questions. This is actually pretty common, because sometimes what a stakeholder says is understood completely differently by a data professional, especially if they don't know the why. Then they have to come up with their own why to drive their analysis, which means they could provide the wrong support, or answers at the wrong granularity.

Create Concise Metrics
With all the modern verbiage floating around, it can be tempting to create algorithms and metrics that require too much adulteration of the original data, diluting the value it could offer.

Metrics that are concisely stated are also more easily understood by everyone. This allows managers to make better decisions because they understand what the metrics abstractly represent, rather than having to struggle to figure out why some random ratio or calculated value means they should target customer x or y.

This starts with a well-defined population. Whether that population is procedures, people, transactions, etc. It represents a specific group of entities. It is not always wise to look at an entire population first. Larger populations are more likely to have confounding factors that are harder to eliminate with simpler metrics. Developing metrics focused on specific populations to start also makes it easier to notice patterns in larger groups.

Review The Outcome
Analyzing the outcome and trends in the metrics that are developed can help drive new insights and policies. This requires that the outcomes are actually studied. Not just when the results are initially released but on a regular cadence. Typically, the first few times the metrics are examined there can be immediate benefits as policies are changed, entities are reached out to (like in the case of fraud) and hopefully cost savings are found.

After the initial release of the dashboards, there needs to be a plan for how often the results will be reviewed. Too often, and unnecessary actions will be taken before the previous ones have impacted the outcomes; not often enough (or not at all), and both the dashboard and the time spent developing it are wasted.

Make a plan with the necessary team members (usually a director or manager, some subject matter experts for context and one of the data team members). Having this mix will provide the correct amount of perspective while informing the director and data team member of any needs required from the subject matter team. The data team might need to update the metrics based on the findings.

Present The Results Simply
Data, programming, and algorithms can all get very complicated. As data scientists and engineers, we focus on the problem for so long that we can begin to think everyone understands it as well as we do. Yet, this is not usually the case.

That is why we need to take a step back from our work and attempt to view our research, models, and analytics from the perspective of a teammate who hasn’t stared at the same problem for the past three months. It can be tempting to put all the graphs and logic that were analyzed over the past three months into a presentation or report. But this is likely to confuse a stakeholder who hasn’t been involved in your healthcare analytics project.
 
Usually, all that is required is a few key numbers to drive decisions. When you present too many numbers, you risk burying the lede. It is important to focus on the key facts and figures first and, if support is needed, to provide it in a follow-up. This can be hard to do, because it feels like we as engineers didn't do a lot of work when we only show such a small slice of it. However, the key here is impact, not showing off.
​
Quality data and analytics help target and track savings opportunities in healthcare. When you start with accurate data and then develop concise metrics, your team has the chance to make informed decisions on policies that can positively impact patients' lives, and at the end of the day, that should be the goal.

We hope this helps you in your healthcare analytics project. Please feel free to reach out to our team if we can go in depth on any point! Our consulting team would be happy to help.

To read more data science and analytic articles see the below!
How To Use R To Develop Predictive Models
Web scraping With Google Sheets
What is A Decision Tree
​How Algorithms Can Become Unethical and Biased
How To Develop Robust Algorithms
4 Must Have Skills For Data Scientists

Reducing Healthcare Costs In 2019

12/22/2018

0 Comments

 
One of the few things Democrats and Donald Trump might agree on is reducing pharmaceutical costs.

Why? Healthcare costs continue to rise every year. Increasing pharmaceutical costs, technology improvements and wasteful procedures are costing the US more every year.

This increase in healthcare costs is starting to be felt by everyone. Some can't afford care, while others struggle to keep up with their medical bills. Those of you who are younger might not understand now, but in 30 years, when you go to the hospital in 2048 and get charged $200 for a single Tylenol pill, you will understand how ridiculous healthcare costs have become if we don't start fixing the problems now.
​
Why are healthcare costs continually going up?

There are many reasons driving up medical costs, and with many causes come many possible solutions to improve the situation. However, let's start by defining some of the biggest problems from a financial perspective. Below is an infographic of areas where there are massive opportunities to reduce healthcare costs. There are billions of dollars of spending that provide opportunities to reduce the overall US healthcare bill.
(Infographic: fraud, waste, and administrative costs.)

How Do You Reduce Healthcare Costs?

Reducing healthcare costs starts with recognizing the biggest problems, the ones that will have the biggest impact when solved. Problems like fraud, waste and inefficient systems are some of the largest costs in the healthcare system.

How do you recognize the biggest costs? Well, that is where data comes in! When we examine the successes of large corporations like Amazon, one of the differences between them and less data-driven companies is that they avoid phrases like "I think" without the data to back them up.
​
Our team will be posting on how you can reduce your healthcare costs using data in the next few days. If you have any questions on the subject matter before then, please reach out!

​Read More On Data Analytics Below!

How To Develop A Successful Healthcare Analytics Product
How To Use R To Develop Predictive Models
Web scraping With Google Sheets
What is A Decision Tree
​How Algorithms Can Become Unethical and Biased
How To Develop Robust Algorithms
4 Must Have Skills For Data Scientists


Improving Patient Flows With Data Science And Analytics

12/9/2018

0 Comments

 

Our team was recently asked how data analytics and data science can be used to improve bottlenecks and patient flows in hospitals. Healthcare providers and hospitals can have very complex patient flows. Many steps intertwine, resources constantly shift between tasks, and patient severity and new arrivals constantly reshuffle who needs to be seen first. This does not make it easy to approach process improvement in a hospital. This is a process problem, something industrial engineers and six sigma practitioners love. They love looking at healthcare process problems in Excel sheets with thousands of tabs and thousands of rows of data. However, we are no longer limited to doing our analysis and model development in Excel spreadsheets on thousands of rows of data. With access to more complete data sets, which in our experience can run upwards of billions of rows, and to more powerful computational systems, we can analyze patient flows and bottlenecks much more accurately and effectively.

Now, with tools like SQL, R and Python, we can analyze these data sets quickly.
But it's not just about the tools. In fact, with such powerful tools it can be tempting to try to make models and algorithms that can solve all the problems in one go. One of the big issues with this approach when looking at patient flows and bottlenecks in hospitals (or really any problem) is that it is far too general an angle. It makes it very difficult to assess when an analysis is finished and often keeps the data scientists and analysts spinning for weeks without getting a real answer.

The problem is that the scope of looking at everything makes it very difficult to manage and pinpoint issues. Instead of trying to attack all the processes and procedures a hospital has, it is a better idea to break out several general categories of procedures/patient flows/processes that you believe are likely to have bottlenecks. Hospitals have so many different possible paths and processes (I am going to use the word process to describe the patient flow below) that blindly looking for some sort of bottleneck will take forever (it is like looking for fraud in healthcare: if you try to do it too generally, it will be near impossible to find).
​
The first step is to find the problem areas. Without knowing what you want to target, it can be very difficult to know what the solution is. In a perfect world, your hospital has a database that tracks all the processes and procedures that are done. This makes it easy to develop a query or Jupyter notebook that can point out the main choke points, which will further help your team limit the amount of unnecessary work required. Once your team knows where the problems are, there is low-hanging fruit it can go after.

Abnormalities

Abnormalities, like inconsistent times for patient flows, whether for a specific doctor or in general, can indicate that there is a problem. Finding these specific outliers can be quite simple.

For instance, let's say you look for outliers in times for patient flow x and you hypothesize that specific days of the week, or times of the day, are more likely to have longer times for certain steps. Then you pull out those steps and analyze them at a finer time granularity to flag each process individually. You might find that summers see a decrease in productivity, that your ERs overflow during the 4th of July, or perhaps something much less obvious.

One key point here is to come up with a theory first, because having a clear question and hypothesis makes it much easier to look for evidence. With a clear question you know where to target further analysis. You can use a query to clean up your data and break it down to the granularity required. This might be at a hospital level, doctor level or maybe even down to the procedure level.
​
From there you could apply a basic algorithm that is good at highlighting outliers (like a basic IQR calculation or something more complex). Once you have found outliers, they can lead to further analysis into why there were longer or inconsistent times in specific processes. There are many plausible causes, but now you have picked a category of procedures, hypothesized and found a plausible weak point. Having these basic steps makes it much easier to move forward.
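As a minimal sketch of the IQR idea (the patient_flow_steps table and its columns are assumptions, and 1.5 times the IQR is the usual rule of thumb), flagging unusually long or short step durations might look like:

WITH quartiles AS (
    SELECT DISTINCT
           step_name
         , PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY duration_minutes)
               OVER (PARTITION BY step_name) AS q1
         , PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY duration_minutes)
               OVER (PARTITION BY step_name) AS q3
    FROM patient_flow_steps
)
SELECT p.step_name
     , p.patient_id
     , p.duration_minutes
FROM patient_flow_steps p
JOIN quartiles q
    ON p.step_name = q.step_name
WHERE p.duration_minutes > q.q3 + 1.5 * (q.q3 - q.q1)   -- unusually long
   OR p.duration_minutes < q.q1 - 1.5 * (q.q3 - q.q1);  -- unusually short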
Following these steps, you can repeat a similar process: theorize why you are seeing the outliers and what might be causing them, and further research the data. The cause could be bad processes, or having too few staff during the times of day when you need more people on hand (think queueing theory). Once you know which steps to look at, you can start putting next steps in place, such as process improvement teams who are now pointed at the exact problem, rather than simply sending in a team of analysts to follow around doctors and guess where the problem area is.

Chokepoints

Besides abnormalities, another common issue is that several processes might need the same resource. One way to locate these bottlenecks is based on the first point about abnormalities, as a bottleneck might be one of the problems causing the abnormality. However, choke points might also hide themselves in the fact that the step in question always runs long, and thus there is no abnormality. Instead, this analysis requires asking a simple question: are there steps in patient flows that overlap and seem to take a long time, or at least longer than expected? Certain steps might take a long time, like certain labs that take a while to run. There are others that shouldn't. Analyzing these steps could lead to hospitals putting in new suites or hiring new specialists to deal with the heavy load in certain areas.

​Conclusion


Improving patient flows is an important step toward reducing patient costs and improving patient satisfaction. By reducing the time patients spend in the hospital and healthcare system, you reduce the number of hours required for staff to take care of them, which should help reduce overall costs. Our team always looks at it as a reduction in cost to the patient, even though it should also, in turn, be a reduction in cost to the healthcare system. From our perspective, anything hospitals, insurance providers and consultants can do to help reduce healthcare costs in our current system needs to be done.
How To Develop A Successful Healthcare Analytics Product
The Problems With Machine Learning In Healthcare
How To Use R To Develop Predictive Models
Web scraping With Google Sheets
What is A Decision Tree
​How Algorithms Can Become Unethical and Biased
How To Develop Robust Algorithms
4 Must Have Skills For Data Scientists

The Problem With Machine Learning In Healthcare

8/26/2018

0 Comments

 
By Benjamin Rogojan
​

Recently an article from the Wall Street Journal has been floating around online discussing how models will run the world. I believe there is a lot of truth to that. Machine learning algorithms and models are becoming both ubiquitous and more trusted across industries. This, in turn, will lead to us as humans spending less time questioning the output and simply allowing the system to tell us the answer. We already rely on companies like Google, Facebook and Amazon to inform us on ideas for dates, friends' birthdays and what the best products are. Some of us don't even think twice about the answers we receive from these companies.

As a data engineer who works in healthcare, this is both exciting, and terrifying. In the past year and a half, I have spent my time developing several products to help healthcare professionals make better decisions. Specifically targeting healthcare quality, fraud and drug misuse.

As I was working on the various metrics and algorithms I was constantly asking myself a few questions.

How will this influence the patient treatment?
How will it influence the doctor’s decision?
Will this improve the long-term health of a population?

In my mind, most hospitals are run like businesses, but there is some hope that their goal isn't always the bottom line. My hope is that they are trying to serve their patients and communities first. If that is true, then the algorithms and models we build can't just be focused on the bottom line (as many are in other industries). Instead, they really need to take a moment to consider: how will this impact the patient, how will it impact their overall health, and will this metric change the behavior of a doctor in a way that might be negative?
For instance, the Washington Health Alliance, which does a great job of reporting on various methods to help improve healthcare from a cost perspective as well as a care perspective, wrote a report focused on improving healthcare costs by reducing wasteful procedures. That is a great idea!

In fact, I have worked on a similar project. That is when I started to think: what happens when some doctors over-adjust? I am sure many doctors will appropriately recalibrate their processes. However, what about the ones that don't?

What happens when some doctors try to correct their behavior too much and cause more harm than good because they don't want to be flagged as wasteful?
​
Could we possibly cause doctors to miss obvious diagnoses because they are so concerned about costing the hospital and patients too much money? Or worse, perhaps they will rely too strongly on their models to diagnose for them in the future. I know I have over-adjusted my behavior in the past when I was given criticism, so what is to stop a doctor from doing the same? There is a fine line somewhere between allowing a human to make a good decision and forcing them to rely on the thinking of a machine (like Google Maps; how many of us actually remember how to get anywhere without it?).

There is a risk of focusing more on the numbers and less on what the patients are actually saying. Doctors focusing too much on the numbers and not on the patient is a personal concern of mine. If a model is wrong for a company that is selling dress shirts or toasters, that means missing out on a sale and missing a quarterly goal.
​
A model being wrong in healthcare could mean someone dies or isn't properly treated. So, as flashy as it can be to create systems that help us make better decisions, I do often wonder if we as humans have the discipline not to rely on them for the final say.

As healthcare professionals and data specialists, we have an obligation not just to help our companies, but to consider the patient. We need to be not just data-driven but human-driven.

We might not be nurses and doctors but the tools we create now and in the future are going to be directly influencing nurse and doctor decisions and we need to take that into account. As data engineers, data scientists and machine learning engineers, we have the ability to make tools to amplify the abilities of the medical professionals we support and we can make a huge impact.
I agree that models will slowly start to run our world more and more (they already do for things like trading, some medical diagnoses, purchasing at Amazon and more). But that means we need to think through all the operational scenarios. We need to consider all the possible outcomes, both good and bad.

How To Develop A Successful Healthcare Analytics Product
How To Use R To Develop Predictive Models
Web scraping With Google Sheets
What is A Decision Tree
​How Algorithms Can Become Unethical and Biased
How To Develop Robust Algorithms
4 Must Have Skills For Data Scientists