Acheron Analytics
  • Home
  • Who We Are
  • Services
    • All Data Science Services
    • Fraud and Anomaly Detection
    • Data Engineering And Automation
    • Healthcare Policy/Program ROI Engine
    • Data Analytics As A Service
    • Data Science Trainings >
      • Python, SQL and R Trainings
      • ARIMA And Predictive Model Forecasting
  • Contact
  • Acheron Blog
  • Partners

Problem Solving With SQL

1/10/2019

1 Comment

 
SQL is the language of data. It can be used to answer countless questions in any company (unless that company still runs 100% off mainframes).
The ubiquity of data makes SQL one of the most utilized technical skills whether you are a data engineer, analyst, or data scientist.

Even as companies that rely on larger data sets that are in data systems and tools like Hadoop, Hive, and Presto. SQL remains an important. It is one of the few “languages” you can learn once and only have to make minor syntax changes as you switch data storage systems. SQL can seem limited when you first start out because you haven’t approached a diverse set of questions. Once you get past JOINs, Loops and some other basic concepts in SQL like subqueries analytic functions, and aggregation, it can seem like you understand everything you will need to know.

In an odd way, you are right. You know all the basic building blocks to do a lot of complex work. But just because you know the building blocks doesn’t mean you know all the applications. It is like saying you understand addition, multiplication, division, and subtraction thus you must also understand how algebra works.

Algebra uses all the basic concepts but applies them in a new way. In the same way, basic SQL concepts can be applied to complex problems, but it doesn’t mean you know how.
How these problems are solved isn’t always consistent. How you may answer a question in Oracle and Mysql could be different in Redshift or SQL Server. These problems can be analytical, performance and optimization or even design.

Just because you are using many of the same commands does not mean you are using the same approach.

Let’s talk about some of these problems.

Analytic Functions
Before we start, we wanted to talk about a useful tool to help solve a lot of problems.

These are called analytic functions. These might not be the first things you learn, but once you do, you will probably start using them everywhere. If you don’t already have a decent grasp of them then we recommend you get a quick intro.

One quick note: Not all SQL engines have analytic functions built in so you will have to use work arounds. We will discuss a few below.

Calculating a rolling sumLet’s say you want to calculate a rolling sum for the profits of a food truck on a daily basis. For now we won’t complicate it by adding in month or year. Simply a rolling sum for all time.

How would you do that with SQL? Any ideas?

Well, there is actually a really simple way to do this with analytic functions.


Select
    SUM(Profit) over (partition by food_truck_id order by day)  
   ,food_truck_id ,day  
FROM f_daily_sales

That is it…you’re done!
Wow! That was really simple. Using this quick script you were able to calculate the rolling sum for all the food trucks. That was almost too easy!

Now, in case you don’t know, using this style of “function() over (partition by x order by y) “ clause is what is known as either an analytic or window function. It allows the end user to do something similar to a group by, but instead of grouping under a specific set of fields, it partitions by that set of fields and then applies a function using the order by. This function can be a row_number, rank, sum, avg, etc. All of these functions essentially work in a rolling fashion based on the order by statement. Row_number will add a row number for the group of fields and restart at one once the group of fields is different.

As we said, analytic functions are very useful. So useful that sometimes interviewers don’t like you using them. So let’s say you’re in an interview and the interviewer says that they want you to resolve this problem without the analytic function…

Did your stomach just drop? Maybe this was the only way you knew how to solve this problem. Take a moment and think about how you could solve this problem with the basic SQL tools you have. Think about what the SQL statement is actually asking.

SELECT
   day ,SUM(Profit)
FROM  
      (
      Select  
           t1.day
         ,t2.Profit
     FROM f_daily_sales t1
     JOIN f_daily_sales t2
   On t1.day >=t2.day
   ) t3
Group by
   day
​
In this case you can use the self join with a greater than or equal to statement creates a table like the one below
Picture
As you can see you will be able to group on the t1.day field and get a rolling sum effect. One of the major downsides depends on the amount of data. Because depending on how the data is structured and how far back you would like to go, this method could cause a system to run out of memory.
If by any chance this is a problem. Here is another thought. You could create a summary table that basically aggregates the total per day.

This depends on you are partitioning the rolling sum. For instance, this will work well as long as you know you will constantly be reporting the rolling sum on a monthly and yearly basis. This would lock your design in, but add both speed and a reduction of the computer when you actually query. Again this is more of a design question. What is the right answer depending on your overall goals?

Joining consecutive events
Another question you might be asked is to provide insight into consecutive events for patients, customers, or other entities. For instance, you might want to calculate the number of days between the time a customer visits a store, or your website. Maybe you want a simple flag that dictates whether or not a patient is readmitted in x days after each their initial visit. Well at a basic level, these have a similar goal.

You need to know how many days are between two different transactions.

Most transactions are lined up row by row, so you can’t simply go to the next row to figure out this information.
This is where an analytic function can work again. Using the row number function you are able to partition the data by each customer and patient and add a row number to each event. What would like below?

Using this as your base data you can self join this table to the next row using a +1 on the row number row to join on the following row.

Select  
    Customer_ID
   ,row_number() over (partition by Customer_ID order by date) r_num  
   ,date ,total_paid
Into #Customer_Transactions  
From Customer_Transactions


Select
     t1.Customer_ID
    ,datediff(dd,t1.date ,t2.date) --this will calculate  the difference in days
From #Customer_Transactions t1
Join #Customer_Transactions t2
on t1.Customer_ID = t2.Customer_ID
​And t1.r_num = t2.r_num+1

With this self-joined the table and each row can be looked at side by side you can tell more information. You can answer questions like, how many days were in between each event on average, or how many times did customers return in less than 15 days.

This can be a great starting point into some analysis focused on retention or policy improvement. You can start to see where your problem areas are, and possibly notice trends between stores or doctors based on they perform.

Another approach that some people like to take is looping through the data. This can get a little more complex and is usually slower, but it is another solution we won’t go over in this post.

These were two examples of solving SQL problems. We will continue to post various problems and different ways you can solve them. We might even re-approach the same problems above. That is what is great about SQL, depending on how your data is laid out and your limitations force you to use different solutions. You could have worked somewhere with SQL Server and get accustomed to temporary tables that allow you to put indexes on them. Then you go to the database that no longer allows temp tables and your performance goes way down.

We would love to come and help increase your teams analytical and SQL abilities. Please feel free to reach out! Also, please let us know if you have any questions or problems you are working through. We would love to poke around on new problems!

Interested In Reading More About Being A Better Data Scientist?
How To Grow As A Data Scientist
Boosting Bagging And Building Better Algorithms
How To Survive Corporate Politics As A Data Scientist
8 Top Python Libraries For Machine Learning
What Is A Decision Tree

1 Comment
D. Daniel link
11/6/2020 09:09:14 am

I have a table of prices by sporadic dates. e.g. 110.3 on 2020-03-15, 113.4 on 2020-03-20. I would like to produce a query over a given date range that would roll each price daily until the next date when the price is updated.
A generated date range can join to the price table ordered by dates descending and returning the first price where price date <= daily dates; is there a more efficient way using analytic functions?

Reply



Leave a Reply.

    Subscribe Here!

    Our Team

    We are a team of data scientists and network engineers who want to help your functional teams reach their full potential!

    Archives

    November 2019
    September 2019
    August 2019
    July 2019
    June 2019
    May 2019
    April 2019
    February 2019
    January 2019
    December 2018
    August 2018
    June 2018
    May 2018
    January 2018
    November 2017
    October 2017
    September 2017
    August 2017
    July 2017
    June 2017
    May 2017
    April 2017
    March 2017
    February 2017

    Categories

    All
    Big Data
    Data Engineering
    Data Science
    Data Science Teams
    Executives
    Executive Strategy
    Leadership
    Machine Learning
    Python
    Team Work
    Web Scraping

    RSS Feed

    Enter your email address:

    Delivered by FeedBurner

  • Home
  • Who We Are
  • Services
    • All Data Science Services
    • Fraud and Anomaly Detection
    • Data Engineering And Automation
    • Healthcare Policy/Program ROI Engine
    • Data Analytics As A Service
    • Data Science Trainings >
      • Python, SQL and R Trainings
      • ARIMA And Predictive Model Forecasting
  • Contact
  • Acheron Blog
  • Partners