Calculating the median has never been straightforward in SQL. Some SQL engines don't even have a direct median function. Typically, you run a function over raw data, so at least you know how the data you are aggregating looks. However, let's throw in a curveball: what if the data was already aggregated and you weren't able to access the raw rows? Instead, you only had a high-level table, say one with an Age column and a Total count of people recorded at each age.

It might seem strange to store data this way, but there are cases where condensing it in this fashion makes sense. Let's say your team is storing billions of transactions per day and you want to run a median calculation over several years. In that case it makes more sense to keep the data aggregated because it will improve performance; you might not even be able to store the transactions properly if they aren't aggregated.

It could be tempting to "unroll" the data, so to speak: take the summary and create a new table that has a row for each instance of a person at a specific age. This could cause your row count to grow significantly, to terabytes upon terabytes, and it is an extra step that might not need to be done. Instead, let's look at this query:

SELECT TOP(1) Age
FROM (
    SELECT Age
         , SUM(Total) OVER (ORDER BY Age) AS runsum
         , SUM(Total) OVER ()             AS total
    FROM Age_Exp
) AS d
WHERE runsum >= total / 2
ORDER BY Age;

This query is short and sweet, and you will notice it brings in the running total we discussed recently (video). The goal of the median is to find the value halfway through all the data, so you need to know the overall total and then find where you cross the halfway point. The running sum tells you how many people have been counted up to and including each age, and the second analytic function gives you the grand total. Filter to the rows where the running sum has reached half of that total, order by age, and take the top row: that age is the median.
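If it helps to trace the logic by hand, here is a minimal, self-contained sketch. The values are made up purely for illustration; only the table and column names (Age_Exp, Age, Total) come from the query above.

-- Hypothetical data: Total is the number of people recorded at each Age.
CREATE TABLE Age_Exp (Age INT, Total INT);
INSERT INTO Age_Exp (Age, Total) VALUES (20, 2), (25, 3), (30, 4), (35, 3);

SELECT TOP(1) Age
FROM (
    SELECT Age
         , SUM(Total) OVER (ORDER BY Age) AS runsum  -- 2, 5, 9, 12
         , SUM(Total) OVER ()             AS total   -- 12 on every row
    FROM Age_Exp
) AS d
WHERE runsum >= total / 2   -- keep the ages where the running count has reached 6
ORDER BY Age;                -- the smallest such age, 30, is the median

Unrolling this same example into twelve individual rows and taking the middle value gives the same answer, so the aggregated version saves the storage without changing the result.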
We would love to know your thoughts on this problem. Also, if you have any problems you would like to discuss then please feel free to contact us! Thanks for reading. Read more about data science, healthcare analytics and automation below:

What is A Decision Tree
How Algorithms Can Become Unethical and Biased
How To Develop Robust Algorithms
4 Must Have Skills For Data Scientists
Time Series Model Development: Basic Models
How To Measure The Accuracy Of Predictive Models
by Ben Rogojan

Automation isn't always about automated cars and drones that will deliver our purchases to our doorsteps. The goal of automation is to make people's lives easier so they don't have to come in to work on Saturdays. There are a lot of tasks in the workplace that can still be automated to avoid late nights and long weekends stuck at work. This doesn't mean that automation in itself is easy to implement. Developing automated systems is a discipline; it requires patience, robust design, and clearly thinking through the process. There are good practices that can be overlooked when implementing an automated system, and skipping them creates tech debt that will weigh down your system and waste the time of future developers who have to maintain or recreate it. To avoid this, an automated system should have solid error tracking, good QA, and be easy to maintain. Here are a few tips for developing an automated system.

Error Tracking And Logging

Logging and error tracking does not come across as exciting work. Even developing basic tracking functionality can seem tedious. However, it is a necessary step in developing an automated system. Logging and error tracking can take place in several layers. The simplest form is to track everything that occurs in raw files. This has the least overhead because nothing needs to be stored in a specific data system, which allows for low-maintenance storage and future data parsing if required. Creating standardized logs allows large amounts of unstructured data to be stored outside of a database, and with modern technology and storage systems, searching through these files is fast. Keeping higher-level logs with less information in a standard data warehouse allows you to track issues and unhealthy systems, such as slow load times and persistent bugs. Unlike the unstructured files, storing high-level issues in a data warehouse gives you easy views into trends and supports future analysis (a rough sketch appears below). Especially as more complex automated systems come to fruition, whether for automated cars, e-commerce, or the world of IoT devices, being able to track and analyze errors and usage trends will be a game changer. The only way to manage the sheer volume of transactions and communication between these systems will be with automation, so error tracking and logging can't be an afterthought. Otherwise, we will never be able to manage the holes in the systems we create.
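Here is one rough, hypothetical sketch of such a high-level error table and a trend query against it. Every table and column name below is made up for illustration, not taken from any particular system.

-- A warehouse table holding one row per high-level issue raised by an automated process.
CREATE TABLE process_error_log (
    process_name  VARCHAR(100),   -- which automated job or service raised the issue
    error_type    VARCHAR(50),    -- e.g. 'slow_load', 'job_failure', 'null_spike'
    error_message VARCHAR(500),
    occurred_at   DATETIME
);

-- Which processes are failing the most, week over week?
SELECT process_name
     , error_type
     , DATEPART(year, occurred_at) AS error_year
     , DATEPART(week, occurred_at) AS error_week
     , COUNT(*)                    AS error_count
FROM process_error_log
GROUP BY process_name, error_type, DATEPART(year, occurred_at), DATEPART(week, occurred_at)
ORDER BY error_count DESC;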
QA And Alerting

Error logging is for issues that occur after release; the hope is that we squash as many bugs as possible before they ever occur. The problem is that automation places a lot of trust in the programmers and developers who set up the systems. Even when you try to think through all the possible ways an automated system can go wrong, there are always failures, so it is worth developing a QA suite that sweeps through the data and the logs to make sure that whatever automated processes are occurring are occurring correctly. There are basic QA tests that just make sure systems are running, and then there are more complex checks that take more of a machine learning approach, looking for patterns and data profiles that could point to process issues. For instance, sometimes null values are normal, but sometimes they occur because of a specific issue: an outage in a particular region, an update that works well on one OS and not another, or various other problems that might not be detected by standard integration and unit tests. Thus it is worth going more in depth and developing a QA system that runs basic tests to ensure systems are functioning and also tests for these more complex issues. Again, this is somewhere we would suggest implementing a form of machine learning/data science system that can profile the data to detect non-specific issues.
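To make the idea of a basic data-profile check concrete, here is a minimal sketch of a null-rate test. The table name, column, and alert threshold are all hypothetical; the point is only the shape of the check.

-- Flag load dates where more than 5% of a key column arrived null.
SELECT load_date
     , COUNT(*) AS total_rows
     , SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS null_rows
FROM daily_events
GROUP BY load_date
HAVING SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) > 0.05 * COUNT(*)
ORDER BY load_date;

A scheduled job could run checks like this over each day's load and feed any rows it returns into the error log and alerting described above.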
Simple Vs. Complex

Automation, when poorly designed, quickly becomes a nightmare. Anti-patterns, over-customizability, and poor design in general make code difficult to maintain, and such a system takes more time to manage than it saves. Instead of dealing with a hands-off system you are managing a tech debt-ridden monstrosity that sucks up developer time.

If anyone recalls the Rube Goldberg machines we built in middle school or elementary school, you will recall that in order to do a simple task we over-complicated and over-designed the system. That is a great picture of some automated systems. In some cases, the task you were attempting to automate could have been done more simply, or the system you developed is just too complex. Over-engineering can be a serious problem that makes it very difficult for future engineers to maintain the system. Over-abstracting code, developing too many functions, and allowing too many plausible paths can make code unmaintainable and untestable. This often won't be apparent until the developer who wrote the code leaves and the other programmers need to pick up where they left off. Good design doesn't try to be too smart or over-engineered. Most professions suffer from this issue: chefs try to put too much on one dish, architects try to put too many features into a building. At the end of the day, the best chefs, architects, and engineers understand that sometimes simple exceeds elegant. This doesn't mean that concepts like abstraction and object-oriented design are bad. It's just that some developers place these concepts above maintainable code. There is always a balance.

Dependency Management

Automated systems and processes have a lot of moving parts. This means that at some point some components will need to run in sequence, and some will need to wait until multiple dependencies have completed in order to ensure the safety and health of the system. As more and more automated systems are developed to interact with people, beyond just robot arms behind safety walls and yellow tape, they will have to manage multiple dependencies to ensure that the tasks they perform are both correct and safe. Whether the system interacts with people or operates entirely within the confines of a computer, simple dependency management that allows developers to easily draw the execution paths makes it much easier to ensure that the system won't run when it shouldn't.

Automation does make life simpler when it is developed purposefully and fully thought through. This means planning how errors will be tracked and how the system will manage and alert on larger problems. Designing a robust and maintainable system is not easy. However, when correctly implemented it can improve your team's productivity and simplify your life. If instead your automation project goes a bit off the rails, please feel free to contact our team!

Interested In Reading More About Being A Better Data Scientist?

How To Grow As A Data Scientist
Boosting Bagging And Building Better Algorithms
How To Survive Corporate Politics As A Data Scientist
8 Top Python Libraries For Machine Learning
What Is A Decision Tree

4 Ways To Calculate A Rolling Sum With SQL

SQL is the language of data. It can be used to answer countless questions in any company (unless that company still runs 100% off mainframes). The ubiquity of data makes SQL one of the most utilized technical skills, whether you are a data engineer, analyst, or data scientist. Even as companies rely on larger data sets held in systems and tools like Hadoop, Hive, and Presto, SQL remains important. It is one of the few "languages" you can learn once and only have to make minor syntax changes as you switch data storage systems. SQL can seem limited when you first start out because you haven't yet approached a diverse set of questions.
Once you get past JOINs, loops, and some other basic concepts in SQL like subqueries, analytic functions, and aggregation, it can seem like you understand everything you will need to know. In an odd way, you are right: you know all the basic building blocks to do a lot of complex work. But just because you know the building blocks doesn't mean you know all the applications. It is like saying that because you understand addition, multiplication, division, and subtraction, you must also understand how algebra works. Algebra uses all the basic concepts but applies them in a new way. In the same way, basic SQL concepts can be applied to complex problems, but that doesn't mean you know how. How these problems are solved isn't always consistent: the way you answer a question in Oracle or MySQL could be different in Redshift or SQL Server. These problems can be analytical, about performance and optimization, or even about design. Just because you are using many of the same commands does not mean you are using the same approach. Let's talk about some of these problems.

Analytic Functions

Before we start, we wanted to talk about a useful tool that helps solve a lot of problems: analytic functions. These might not be the first things you learn, but once you do, you will probably start using them everywhere. If you don't already have a decent grasp of them, we recommend you get a quick intro. One quick note: not all SQL engines have analytic functions built in, so you may have to use workarounds. We will discuss a few below.

Calculating A Rolling Sum

Let's say you want to calculate a rolling sum of a food truck's profits on a daily basis. For now we won't complicate it by adding in month or year; it is simply a rolling sum for all time. How would you do that with SQL? Any ideas? Well, there is actually a really simple way to do this with analytic functions:

SELECT SUM(Profit) OVER (PARTITION BY food_truck_id ORDER BY day)
     , food_truck_id
     , day
FROM f_daily_sales

That is it, you're done! Using this quick script you were able to calculate the rolling sum for all the food trucks. That was almost too easy. Now, in case you don't know, this style of "function() OVER (PARTITION BY x ORDER BY y)" clause is what is known as an analytic or window function. It allows you to do something similar to a GROUP BY, but instead of collapsing rows under a specific set of fields, it partitions by that set of fields and then applies a function using the ORDER BY. This function can be ROW_NUMBER, RANK, SUM, AVG, and so on, and these functions all work in a rolling fashion based on the ORDER BY. ROW_NUMBER, for example, will number the rows within each group of fields and restart at one once the group of fields changes.
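If it helps to see what the window function actually returns, here is a minimal, self-contained sketch. The rows are hypothetical and only the column names match the query above.

SELECT food_truck_id
     , day
     , Profit
     , SUM(Profit) OVER (PARTITION BY food_truck_id ORDER BY day) AS running_profit
FROM (VALUES
        (1, '2019-11-01', 100),
        (1, '2019-11-02', 150),
        (1, '2019-11-03',  50),
        (2, '2019-11-01', 200)
     ) AS sample_sales (food_truck_id, day, Profit);
-- Truck 1 comes back as 100, 250, 300; the running sum restarts at 200 for truck 2.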
As we said, analytic functions are very useful. So useful that sometimes interviewers don't like you using them. So let's say you're in an interview and the interviewer says they want you to solve this problem without the analytic function. Did your stomach just drop? Maybe this was the only way you knew how to solve the problem. Take a moment and think about how you could solve it with the basic SQL tools you have, and think about what the SQL statement is actually asking.

SELECT day
     , SUM(Profit)
FROM (
    SELECT t1.day
         , t2.Profit
    FROM f_daily_sales t1
    JOIN f_daily_sales t2
      ON t1.day >= t2.day
) t3
GROUP BY day

In this case, the self join with a greater-than-or-equal-to condition pairs each day with itself and every earlier day, so grouping on the t1.day field and summing t2.Profit gives you the rolling sum effect. (For simplicity, this version drops the food_truck_id partition used in the earlier example.)
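Here is a minimal sketch of that self join in action, using three hypothetical days of profit so you can see the expansion it produces.

SELECT t1.day
     , SUM(t2.Profit) AS running_profit
FROM (VALUES (1, 100), (2, 150), (3, 50)) AS t1 (day, Profit)
JOIN (VALUES (1, 100), (2, 150), (3, 50)) AS t2 (day, Profit)
  ON t1.day >= t2.day
GROUP BY t1.day
ORDER BY t1.day;
-- Day 1 joins only to itself, day 2 to days 1 and 2, and day 3 to all three days,
-- so the sums come back as 100, 250, and 300, matching the rolling total above.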
One of the major downsides depends on the amount of data. Because the join pairs each day with every earlier day, this method multiplies the number of rows, and depending on how the data is structured and how far back you would like to go, it could cause a system to run out of memory. If that is a problem, here is another thought: you could create a summary table that pre-aggregates the total per day. Whether this works depends on how you are partitioning the rolling sum. For instance, it will work well as long as you know you will consistently be reporting the rolling sum on a monthly and yearly basis. This would lock your design in, but it adds speed and reduces the compute needed when you actually query. Again, this is more of a design question, and the right answer depends on your overall goals.

Joining consecutive events

Another question you might be asked is to provide insight into consecutive events for patients, customers, or other entities. For instance, you might want to calculate the number of days between the times a customer visits a store, or your website. Maybe you want a simple flag that dictates whether or not a patient was readmitted within x days of their initial visit. At a basic level, these have a similar goal: you need to know how many days passed between two different transactions. Most transactions are stored row by row, so you can't simply look at the next row to figure out this information. This is where an analytic function can work again. Using the ROW_NUMBER function you can partition the data by each customer or patient and add a row number to each event in date order. Using this as your base data, you can self join the table to itself, using a +1 on the row number to join each row to the following one.

SELECT Customer_ID
     , ROW_NUMBER() OVER (PARTITION BY Customer_ID ORDER BY date) AS r_num
     , date
     , total_paid
INTO #Customer_Transactions
FROM Customer_Transactions

SELECT t1.Customer_ID
     , DATEDIFF(dd, t2.date, t1.date) AS days_between  -- days since the previous visit
FROM #Customer_Transactions t1
JOIN #Customer_Transactions t2
  ON t1.Customer_ID = t2.Customer_ID
 AND t1.r_num = t2.r_num + 1

With the table self-joined this way, each row can be looked at side by side with the previous one, and you can answer questions like how many days there were between events on average, or how many times customers returned in less than 15 days. This can be a great starting point for analysis focused on retention or policy improvement: you can start to see where your problem areas are, and possibly notice trends between stores or doctors based on how they perform.
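To make the mechanics concrete, here is a minimal self-contained sketch using three hypothetical visits for a single customer (the customer ID and dates are made up for illustration).

SELECT t1.Customer_ID
     , t2.date                        AS prior_visit
     , t1.date                        AS this_visit
     , DATEDIFF(dd, t2.date, t1.date) AS days_since_prior
FROM (VALUES (42, CAST('2019-11-01' AS date), 1),
             (42, CAST('2019-11-04' AS date), 2),
             (42, CAST('2019-11-20' AS date), 3)) AS t1 (Customer_ID, date, r_num)
JOIN (VALUES (42, CAST('2019-11-01' AS date), 1),
             (42, CAST('2019-11-04' AS date), 2),
             (42, CAST('2019-11-20' AS date), 3)) AS t2 (Customer_ID, date, r_num)
  ON t1.Customer_ID = t2.Customer_ID
 AND t1.r_num = t2.r_num + 1;
-- Returns two rows: a 3 day gap (Nov 1 to Nov 4) and a 16 day gap (Nov 4 to Nov 20).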
Another approach that some people like to take is looping through the data. This can get a little more complex and is usually slower, but it is another solution; we won't go over it in this post.

These were two examples of solving SQL problems. We will continue to post various problems and different ways you can solve them, and we might even re-approach the same problems above. That is what is great about SQL: how your data is laid out and what limitations you face force you to use different solutions. You could have worked somewhere with SQL Server and gotten accustomed to temporary tables that allow you to put indexes on them, then move to a database that doesn't allow temp tables and watch your performance go way down.

We would love to come and help increase your team's analytical and SQL abilities, so please feel free to reach out! Also, please let us know if you have any questions or problems you are working through. We would love to poke around on new problems!

Interested In Reading More About Being A Better Data Scientist?

How To Grow As A Data Scientist
Boosting Bagging And Building Better Algorithms
How To Survive Corporate Politics As A Data Scientist
8 Top Python Libraries For Machine Learning
What Is A Decision Tree