SQL is the language of data. It can be used to answer countless questions in any company (unless that company still runs 100% off mainframes).
The ubiquity of data makes SQL one of the most utilized technical skills whether you are a data engineer, analyst, or data scientist.
Even at companies that rely on larger data sets stored in systems and tools like Hadoop, Hive, and Presto, SQL remains important. It is one of the few “languages” you can learn once and only have to make minor syntax changes as you switch data storage systems. SQL can seem limited when you first start out because you haven’t approached a diverse set of questions. Once you get past JOINs, loops, and some other basic concepts like subqueries, analytic functions, and aggregation, it can seem like you understand everything you will need to know.
In an odd way, you are right. You know all the basic building blocks to do a lot of complex work. But knowing the building blocks doesn’t mean you know all the applications. It is like saying that because you understand addition, multiplication, division, and subtraction, you must also understand how algebra works.
Algebra uses all the same basic concepts but applies them in a new way. Likewise, basic SQL concepts can be applied to complex problems, but that doesn’t mean you know how.
How these problems are solved isn’t always consistent. The way you answer a question in Oracle or MySQL could be different in Redshift or SQL Server. These problems can be analytical, about performance and optimization, or even about design.
Just because you are using many of the same commands does not mean you are using the same approach.
Let’s talk about some of these problems.
Before we start, we want to talk about a useful tool that helps solve a lot of these problems: analytic functions. These might not be the first thing you learn, but once you do, you will probably start using them everywhere. If you don’t already have a decent grasp of them, we recommend getting a quick intro.
One quick note: not all SQL engines have analytic functions built in, so you will have to use workarounds. We will discuss a few below.
Calculating a rolling sum
Let’s say you want to calculate a rolling sum for the profits of a food truck on a daily basis. For now we won’t complicate it by adding in month or year: simply a rolling sum for all time.
How would you do that with SQL? Any ideas?
Well, there is actually a really simple way to do this with analytic functions.
SUM(Profit) over (partition by food_truck_id order by day)
That is it…you’re done!
Wow! That was really simple. Using this quick script you were able to calculate the rolling sum for all the food trucks. That was almost too easy!
Now, in case you don’t know, this style of “function() over (partition by x order by y)” clause is what is known as an analytic or window function. It lets you do something similar to a group by, but instead of collapsing rows under a specific set of fields, it partitions by that set of fields and applies a function across each partition in the order given by the order by. The function can be row_number, rank, sum, avg, etc. All of these functions work in a rolling fashion based on the order by clause; row_number, for example, numbers the rows within each partition and restarts at one when the partition changes.
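To see the rolling sum in action, here is a minimal runnable sketch using Python’s built-in sqlite3 driver (SQLite 3.25+ supports window functions). The table name and sample data are made up for illustration; they mirror the food truck example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE f_daily_sales (food_truck_id INT, day TEXT, Profit REAL)")
conn.executemany(
    "INSERT INTO f_daily_sales VALUES (?, ?, ?)",
    [(1, "2023-01-01", 100), (1, "2023-01-02", 50), (1, "2023-01-03", 25),
     (2, "2023-01-01", 80), (2, "2023-01-02", 20)],
)

# Rolling (cumulative) profit per truck: the window function sums each
# truck's profit from the earliest day up to the current row's day.
rows = conn.execute("""
    SELECT food_truck_id, day,
           SUM(Profit) OVER (PARTITION BY food_truck_id ORDER BY day) AS rolling_profit
    FROM f_daily_sales
    ORDER BY food_truck_id, day
""").fetchall()

for r in rows:
    print(r)
```

Truck 1’s running total climbs 100, 150, 175; truck 2 restarts at 80 because the partition changes.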
As we said, analytic functions are very useful. So useful that sometimes interviewers don’t like you using them. So let’s say you’re in an interview and the interviewer says that they want you to resolve this problem without the analytic function…
Did your stomach just drop? Maybe this was the only way you knew how to solve this problem. Take a moment and think about how you could solve this problem with the basic SQL tools you have. Think about what the SQL statement is actually asking.
SELECT t1.day, t1.Profit, t2.day AS prior_day, t2.Profit AS prior_profit
FROM f_daily_sales t1
JOIN f_daily_sales t2
  ON t1.day >= t2.day
In this case, a self join with a greater-than-or-equal-to condition pairs each day with every day on or before it.
From there you can group on the t1.day field, sum t2’s profit, and get a rolling-sum effect. The major downside is data volume: depending on how the data is structured and how far back you would like to go, this self join can balloon in size and cause a system to run out of memory.
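Here is the same workaround end to end, again sketched with sqlite3 and made-up sample data: the self join fans each day out against every earlier day, and the group by collapses it back into a running total with no window function involved.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE f_daily_sales (food_truck_id INT, day TEXT, Profit REAL)")
conn.executemany(
    "INSERT INTO f_daily_sales VALUES (?, ?, ?)",
    [(1, "2023-01-01", 100), (1, "2023-01-02", 50), (1, "2023-01-03", 25)],
)

# Self join: pair each day with every day on or before it for the same
# truck, then sum t2's profit to get the cumulative total per day.
rows = conn.execute("""
    SELECT t1.day, SUM(t2.Profit) AS rolling_profit
    FROM f_daily_sales t1
    JOIN f_daily_sales t2
      ON t1.food_truck_id = t2.food_truck_id
     AND t1.day >= t2.day
    GROUP BY t1.day
    ORDER BY t1.day
""").fetchall()

print(rows)
```

Note the quadratic blow-up: n days per truck produce roughly n²/2 joined rows before the group by, which is exactly the memory risk described above.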
If this becomes a problem, here is another thought: you could create a summary table that aggregates the total per day.
How well this works depends on how you are partitioning the rolling sum. For instance, it works well as long as you know you will consistently be reporting the rolling sum on a monthly and yearly basis. This locks your design in, but it adds speed and reduces compute when you actually query. Again, this is more of a design question: the right answer depends on your overall goals.
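One way that summary-table idea could look, sketched with sqlite3 (the `daily_summary` table name is our own invention for illustration): aggregate once to one row per day, then run the rolling-sum self join against the much smaller table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE f_daily_sales (food_truck_id INT, day TEXT, Profit REAL)")
conn.executemany(
    "INSERT INTO f_daily_sales VALUES (?, ?, ?)",
    [(1, "2023-01-01", 60), (2, "2023-01-01", 40), (1, "2023-01-02", 50)],
)

# One-time aggregation: collapse transaction-level rows to one row per day.
conn.execute("""
    CREATE TABLE daily_summary AS
    SELECT day, SUM(Profit) AS daily_profit
    FROM f_daily_sales
    GROUP BY day
""")

# The self join now runs against far fewer rows than the raw sales table.
rows = conn.execute("""
    SELECT t1.day, SUM(t2.daily_profit) AS rolling_profit
    FROM daily_summary t1
    JOIN daily_summary t2 ON t1.day >= t2.day
    GROUP BY t1.day
    ORDER BY t1.day
""").fetchall()

print(rows)
```

The trade-off is exactly the one described above: the summary table fixes your grain (here, per day across all trucks), so a question at a different grain needs a different table.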
Joining consecutive events
Another question you might be asked is to provide insight into consecutive events for patients, customers, or other entities. For instance, you might want to calculate the number of days between a customer’s visits to a store or your website. Maybe you want a simple flag that indicates whether or not a patient is readmitted within x days of their initial visit. At a basic level, these have a similar goal.
You need to know how many days are between two different transactions.
Most transactions are stored row by row, and a plain query can’t simply look at the next row to figure out this information.
This is where an analytic function can help again. Using the row_number function, you can partition the data by each customer or patient and number their events in date order.
Using this as your base data, you can self join the table to itself with a +1 on the row number to match each row to the following one.
-- #Customer_Transactions holds each customer's events numbered by
-- row_number() over (partition by Customer_ID order by date) as r_num
SELECT t1.Customer_ID
      ,t1.date
      ,t2.date AS next_date
      ,datediff(dd, t1.date, t2.date) --this will calculate the difference in days
FROM #Customer_Transactions t1
JOIN #Customer_Transactions t2
  ON t1.Customer_ID = t2.Customer_ID
 AND t2.r_num = t1.r_num + 1
With the table self-joined, each row can be looked at side by side with the following event, and you can answer questions like: how many days were between events on average, or how many times did customers return in less than 15 days?
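Here is a runnable sketch of the whole pattern with sqlite3 and made-up data. One caveat: `datediff(dd, …)` is SQL Server syntax, so in SQLite we subtract `julianday` values instead; the `numbered` CTE stands in for the temp table above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_transactions (Customer_ID INT, date TEXT)")
conn.executemany(
    "INSERT INTO customer_transactions VALUES (?, ?)",
    [(1, "2023-01-01"), (1, "2023-01-10"), (1, "2023-01-12"),
     (2, "2023-02-01"), (2, "2023-03-01")],
)

# Number each customer's events in date order, then join row n to row n + 1
# so every transaction sits next to that customer's following transaction.
rows = conn.execute("""
    WITH numbered AS (
        SELECT Customer_ID, date,
               ROW_NUMBER() OVER (PARTITION BY Customer_ID ORDER BY date) AS r_num
        FROM customer_transactions
    )
    SELECT t1.Customer_ID, t1.date, t2.date AS next_date,
           CAST(julianday(t2.date) - julianday(t1.date) AS INT) AS days_between
    FROM numbered t1
    JOIN numbered t2
      ON t1.Customer_ID = t2.Customer_ID
     AND t2.r_num = t1.r_num + 1
    ORDER BY t1.Customer_ID, t1.date
""").fetchall()

for r in rows:
    print(r)
```

From here, a `CASE WHEN days_between < 15 THEN 1 ELSE 0 END` gives you the readmission-style flag described above.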
This can be a great starting point for analysis focused on retention or policy improvement. You can start to see where your problem areas are, and possibly notice trends between stores or doctors based on how they perform.
Another approach some people take is looping through the data. That can get a little more complex and is usually slower, so it is a solution we won’t go over in this post.
These were two examples of solving SQL problems. We will continue to post various problems and different ways to solve them; we might even re-approach the same problems above. That is what is great about SQL: how your data is laid out and what limitations you face force you to use different solutions. You could work somewhere with SQL Server and get accustomed to temporary tables that let you put indexes on them, then move to a database that no longer allows temp tables and watch your performance drop.
We would love to come and help increase your team’s analytical and SQL abilities. Please feel free to reach out! Also, let us know if you have any questions or problems you are working through. We would love to poke around on new problems!
We are a team of data scientists and network engineers who want to help your functional teams reach their full potential!