Calculating the median has never seemed to be straight forward in SQL. Some SQL engines don’t even have a direct median function. Typically, you run a function on data that is raw so at least you know how the aggregated data looks.
However, let’s throw in a curveball. What if all the data that was already aggregated and you weren’t able to access the raw data. Instead, you only had a table that was high level.
For example, let’s say the data looked like the data below.
It might seem strange to have data stored this way, but there are cases where condensing the data in this fashion might make sense. Let’s say you’re team is storing billions of transactions per day and you want to run a median calculation over several years.
In this case, it makes more sense to keep the data aggregated because it will improve the performance. You might not even be able to properly store the transactions if it isn’t aggregated.
It could be tempting to “unroll” the data so to speak. To take the data and create a new table that has a row for each of the instances of a person at a specific age. This could cause your rows to grow significantly, to terabytes upon terabytes. Also is just another step that might not need to be done.
Instead, let’s look this query.
SELECT TOP(1) Age
( SELECT Age
, SUM(Total) OVER (ORDER BY Age) AS runsum
, SUM(Total) OVER () AS total
FROM Age_Exp ) AS d
WHERE runsum >= total / 2 ORDER BY Age;
This query is short and sweet and you will also notice it brings in the running total we discussed recently(video). The goal of the medium is to find the number halfway through all the data. So that means you need to know the total and find the number halfway through.
The way this works is by using the rolling sum you can figure out where the half-way point will end up because the running total will sum of the total value.
Then using the other analytic function that gets the sum you can divide it by half and only select the top value. The median will be the top value.
We would love to know your thoughts on this problem. Also, if you have any problems you would like to discuss then please feel free to contact us! Thanks for reading.
Read more about data science, healthcare analytics and automation below:
What is A Decision Tree
How Algorithms Can Become Unethical and Biased
How To Develop Robust Algorithms
4 Must Have Skills For Data Scientists
Time Series Model Development: Basic Models
How To Measure The Accuracy Of Predictive Models
What is A Decision Tree
We are a team of data scientists and network engineers who want to help your functional teams reach their full potential!