Acheron Analytics
  • Home
  • Who We Are
  • Services
    • All Data Science Services
    • Fraud and Anomaly Detection
    • Data Engineering And Automation
    • Healthcare Policy/Program ROI Engine
    • Data Analytics As A Service
    • Data Science Trainings >
      • Python, SQL and R Trainings
      • ARIMA And Predictive Model Forecasting
  • Contact
  • Acheron Blog
  • Partners

Calculating A Bucketed Median With SQL

1/29/2019

0 Comments

 
Calculating the median has never seemed to be straight forward in SQL. Some SQL engines don’t even have a direct median function. Typically, you run a function on data that is raw so at least you know how the aggregated data looks.

However, let’s throw in a curveball. What if all the data that was already aggregated and you weren’t able to access the raw data. Instead, you only had a table that was high level.
​

For example, let’s say the data looked like the data below.
Picture
It might seem strange to have data stored this way, but there are cases where condensing the data in this fashion might make sense. Let’s say you’re team is storing billions of transactions per day and you want to run a median calculation over several years.

In this case, it makes more sense to keep the data aggregated because it will improve the performance. You might not even be able to properly store the transactions if it isn’t aggregated.
It could be tempting to “unroll” the data so to speak. To take the data and create a new table that has a row for each of the instances of a person at a specific age. This could cause your rows to grow significantly, to terabytes upon terabytes. Also is just another step that might not need to be done.
Instead, let’s look this query.

SELECT TOP(1) Age
   FROM  
       ( SELECT Age
               , SUM(Total) OVER (ORDER BY Age) AS runsum
            , SUM(Total) OVER () AS total
        FROM Age_Exp ) AS d  
​WHERE runsum >= total / 2 ORDER BY Age;

​This query is short and sweet and you will also notice it brings in the running total we discussed recently(video). The goal of the medium is to find the number halfway through all the data. So that means you need to know the total and find the number halfway through.

The way this works is by using the rolling sum you can figure out where the half-way point will end up because the running total will sum of the total value.

Then using the other analytic function that gets the sum you can divide it by half and only select the top value. The median will be the top value.
​
We would love to know your thoughts on this problem. Also, if you have any problems you would like to discuss then please feel free to contact us! Thanks for reading.
Read more about data science, healthcare analytics and automation below:
What is A Decision Tree
​How Algorithms Can Become Unethical and Biased
How To Develop Robust Algorithms
4 Must Have Skills For Data Scientists
Time Series Model Development: Basic Models
How To Measure The Accuracy Of Predictive Models
What is A Decision Tree
0 Comments



Leave a Reply.

    Subscribe Here!

    Our Team

    We are a team of data scientists and network engineers who want to help your functional teams reach their full potential!

    Archives

    November 2019
    September 2019
    August 2019
    July 2019
    June 2019
    May 2019
    April 2019
    February 2019
    January 2019
    December 2018
    August 2018
    June 2018
    May 2018
    January 2018
    November 2017
    October 2017
    September 2017
    August 2017
    July 2017
    June 2017
    May 2017
    April 2017
    March 2017
    February 2017

    Categories

    All
    Big Data
    Data Engineering
    Data Science
    Data Science Teams
    Executives
    Executive Strategy
    Leadership
    Machine Learning
    Python
    Team Work
    Web Scraping

    RSS Feed

    Enter your email address:

    Delivered by FeedBurner

  • Home
  • Who We Are
  • Services
    • All Data Science Services
    • Fraud and Anomaly Detection
    • Data Engineering And Automation
    • Healthcare Policy/Program ROI Engine
    • Data Analytics As A Service
    • Data Science Trainings >
      • Python, SQL and R Trainings
      • ARIMA And Predictive Model Forecasting
  • Contact
  • Acheron Blog
  • Partners