By Ben Rogojan
Data science is just one of the modern data driven fields in our new data world. Another job that is even more prevalent than data scientist is data engineer. Now being a data engineer does not have all the hype behind it of being a data scientist. However, companies like Google, Facebook, Amazon and Netflix all need great data engineers!
Data engineering requires a combination of knowledge from data warehousing, to programming in order to ensure the data systems are designed well and are as automated as possible.
The question is: How do you prepare for an interview for a data engineering position?
Many of the questions will require you to understand data warehouses , scripting, ETL development, and possibly some NO-SQL if the company uses a different form of data storage system like CouchDB.
In case you are preparing for a data engineering interview here are some questions that might help you. We are focusing on conceptual questions. However, you should also work on some technical skills like SQL, python, etc.
How would you approach developing a new analytical product as a data engineer?
As a data engineer you control what is possible in the final product. A data scientist can’t build algorithms or metrics without having the data at the right granularity and the right data.
This means a data engineer needs to understand the entire product. A data engineer can’t just get away with building systems based off requirements. They need to ask why they are building certain tables and objects.
If the stakeholders already have a general outline of what they want it is very helpful. If they don’t have one, we would want to work with them to develop a general idea of what metrics and algorithms will exist. This drives all the major decisions. What data should be pulled it, how long should it be stored, should it be archived, etc.
Once a general outline exists, then the next step would be drilling into the why of each metric. This is because as you’re building different tables at different data granularities, certain issues might arise. Should the unique keybe on columns A and B, or A, B and C. Well, that depends, why is this important? What does that row signify? Is it customer level, store level, maybe brand level?
Once your team has gone through the process of working on the outline with your stakeholders and gained an understanding of the why. The next step is to think through as many operational scenarios as possible.
Will you need to reload data ever? Do your ETLs allow for it? Is it efficient? What happens when X occurs? How do you handle case Y?
You can’t spend all day doing this, but trying to think of as many issues that could occur helps develop a more robust system. It also helps create a system that actually meet requirements.
From there, it’s about developing the design, creating test cases, testing the tables, stored procedures and scripts and then pushing to production. How that occurs usually depends team to team.
What is the difference between an operational database and a data warehouse?
If you took a database course in college you probably learned about how to set up a standard normalized database. This style of database is optimized for transactions that involve Insert, Update and Delete SQL statements. This is a standard operational databases. They need to focus on making transactions quickly without getting bogged down by calculations and data manipulations. Thus, their design is a little more cumbersome for an analysts. Generally you will have to join several tables just to get a single data point.
A data warehouse is not concerned as much with dealing with millions of fast transactions every second. Instead, a data warehouse is usually built to support a data analytics product and analysis. This means performance is not geared towards transactions, but aggregations, calculations and select statements. A data warehouse will have a slightly denormalized structure compared to an operational database. In most data warehouses a majority of tables will take on two different attributes, a historical transaction table and tables that contain categorical style data. We reference these as fact and dimension tables. The fact table is essentially in the center. Unlike in a normalized database where you might have to join across several tables to get one data point. A standard data warehouse usually has a focus on the fact tables and all the dimension tables join to provide categorical information to the fact table(it’s also typically bad practice to join fact table to fact table, but sometimes it can occur if the data is created correctly). Here is an example of the structure of a data warehouse.
These are not the only tables that exist in a data warehouse. There are aggregate tables, snapshots, partitions, and more. The goal is usually a report or dashboard that can be automatically updated quickly.
Data warehouses also have a lot of other nuances like slowly changing dimensions. However, that is a whole other can of worms.
Tell us about a time you had performance issues with an ETL and how did you fix it?
As a data engineer you will run into performance issues. Either you developed an ETL when the data was smaller and it didn’t scale or you’re maintaining older architecture that is not scaling. ETLs are built up of multiple components, multiple table inserts, merges, and updates. This makes it difficult to tell exactly where the ETL issue is occurring. The first step is identifying the problem, so you need to figure out where is the bottleneck occurring.
Hopefully..whoever set up your ETL has an ETL log table somewhere that tracks when components finish. This makes it easy to spot bottle necks and the biggest time sucks. If not, it will not be easy to find the issue. Depending on the urgency of the issue, we would recommend setting up an ETL log table and then rerunning to identify the issue. If the fix is needed right away, then you will probably just have to go piece by piece of the ETL to try to track down the long running component. This also depends on how long the ETL takes to run. There are ways you can approach that as well depending on the what component relies on what component
Issues vary wildly from table locks, slow transactions, loops getting stuck, etc.
Once you have identified the issue, then you need to figure out a fix. This depends on the problem but the solutions could require adding an index, removing an index, partitioning tables, batching the data in smaller pieces (or sometimes even larger pieces it seems counter intuitive but this would depend on table scans and indexes). Depending on the storage system you are using it is good to look into the activity monitor to see what is happening on an I/O level. This will give you a better idea of the problem.
When you look at the activity monitor you can see if there is any data being processed at all? Is there too much data being processed, none, table locks? Any of these issues can choke an ETL and would need to be addressed.
If you Google some of the performance issues then you will find some people blaming the architecture for a lot of the problems. We don’t disagree with them. However, this doesn’t mean you throw in the towel. There are always various ways to manage performance if you can’t touch the actual structure. Even beyond indexes there are some parallel processing methods that can be used to speed up the ETL or adding temporary support tables to lighten the load.
Question Number 4? Are you experienced in python, powershell, bash, and or Java?
This question is more to note that it is important to have a scripting language on your side as a data engineer. Typically, most shops we have worked in won’t just rely on ETL tools like SSIS. Instead, they will often use custom scripts and libraries to load data. It might seem like overkill, but it is often easier. Using tools like SSIS are great if you need all the fancy functions like sending emails and some of the other add-ons. Even these could be scripted instead of written.
So we do recommend having some scripting abilities as well. It allows you to easily automate data flow and analytical tasks.
These are just some general questions to help get ready for your data engineering interview. Besides these questions you should also look into the concept of slowly changing dimensions, automation with python or powershell, some basic linux commands and design concepts. If you do have a data engineer interview, then we wish you good luck!
Call To Action:
Our team has experience developing multiple types of data pipelines from start to finish. We have experience creating effective data architecture and thinking about the end use cases. If you need help building a data warehouse or an entire data analytics products please reach out!
How To Grow As A Data Scientist
How To Survive Corporate Politics As A Data Scientist
Boosting And Bagging: How To Develop More Robust Machine Learning Models
8 Top Python Libraries For Machine Learning
What Is A Decision Tree
Other Data Science Services Our Team Provides
By Ben Rogojan
Analytics and data science are important tools in healthcare. Not just because they might be able to predict when someone will have a heart attack. Good data science and analytics are important tools because they can help make better decisions when it comes to spending and reducing inefficiencies in healthcare. Being able to reduce the cost of healthcare for the providers would in turn allow them to allow more people access to healthcare.
The temptation healthcare has when looking at all the other industries that have been using analytics and algorithms for a long time like finance/banking and ecommerce is to try to take on all the data problems in healthcare in their entirety. This in turn leads to large investments in data teams, data lakes, and other data centric initiatives that don’t always succeed. Similar to one of the reason the DC movies are failing to capture audiences attention like Marvel. They rushed into trying to create something similar to what Marvel created in a decade in a third of the time. Similarly, many healthcare companies try to set up data science shops and attempt to replicate what Google, Amazon and some of the other big tech companies have been doing for more than a decade in a few years.
It takes time to set up analytical teams and data systems at that proficiency level (let’s not even get started on data labeling). Instead of trying to create an entire universe of data analytics and algorithms in 1 year. We have had success by picking one or two problems to start with (like selling books online but in healthcare), pick a point of view to approach the problem and just tell a simple story. If your data teams are able to do this and succeed, then you will be that much closer to success and have started to gain supporters and buy in from managers. Healthcare is complex, so start by focusing on a few problems, and a limited perspective before investing millions of dollars in a large team.
Pick One Problem To Take One
When it comes to healthcare analytics(or any analytics), it is important to pick a specific problem and then break that problem into further bite sized pieces. For example, if you are trying to develop an algorithm that detects fraud. It can be very difficult to design a model, algorithm or system to abstract the concept of fraud as a whole into a single mathematical expression. Especially when the healthcare interactions are rarely classified fraud or not fraud.
That doesn’t stop you from doing analytics, or predictive modeling. It just changes your approach. Instead of developing complex data science models, focus on what has a good ROI first. A basic system that can help first reduce the amount of claims analysts have to wade through for fraud or point out larger systemic problems at a higher level can help know where to first focus your efforts. In addition, they can be labeling and noting patterns through the interactions they get through.
After spending some time looking at those cases, your analytics teams will have a further understanding of the problem as well as a lot of great best practices and labeled data. In the end, this might seem slow, but slow is better than failing.
Focus On One Perspective
When developing analytics of any kind there are lots of different angles and points of view that you can take. For instance, in healthcare you could focus on the healthcare providers (hospitals, ERs, etc), patients or types of procedures. Your team might not have the resources to develop a tool, dashboard or algorithm to approach all these different angles right away. Thus, when taking on a new project like “predicting patient readmissions”, pick a category that will benefit your team the most. Often times focusing on the provider first is one of the better options. This is because it is much easier to change provider behavior compared to patient behavior. When you alter one providers behavior the impact it has is much larger compared to changing one patients behavior. Both of these are two reasons why often focusing on analytics on the provider level first can be helpful. In addition, like was discussed above, it could help focus and pinpoint some of the more specific problems at the patient level.
Your Final Product Should Have A Clear Story
Once you have developed a final data points of the product, that product needs to be able to tell a story. Honestly, this is still a difficult part for me. However, it is a key concept that makes a huge difference. Even if you’re an amazing programmer, data engineer, or algorithm developer, if your product doesn’t portray a clear story to your end-users, then it could get ignored or misinterpreted. Usually, the flow I have seen that is successful for reports is have a high level of what the problem is, maybe include the cost of the problem if that is applicable. This gets everyone on the same page and also catches their attention. If you can state how big the problem is and how much it is costing them, it will draw the end users in.
Then you can either further go into the problem and break it down into subsets and then go into the next steps.
The next steps don’t have to be a direct solution. It might be a list of procedures that are problematic and causing a high amount of readmissions or fraud. This leads to the next possible project which is analyzing the why. Once you have figured out the possible why and solution, then you can refer back to the report and see if things changed.
Caption: Taking a quick detour, lets look at a few dashboards from actual companies. The dashboard above is from Health Catalyst. The top portion is attempting to state whether or not the cohort is currently meeting the target for various readmissions. This kind of states if there is a problem or not. It is pretty typical for medical analytics to do a comparison approach. The middle section is a little bit of a disconnect for me. It doesn't really flow with the story they are trying to tell. It talks about interventions, but this doesn't really connect with the first set of graphs. However, the last section that is a bar graph melded with a line chart makes sense because it tells you the history. So maybe you're not hitting your target readmission numbers, but you are improving!
Overall, the way you approach healthcare analytics or developing algorithms to predict events like patient readmission is no different than any other industry. It can be tempting to try to take on entire problems like fraud, or patient readmission as a whole. Healthcare’s massive amount of data and complexity of transaction does not make this easy. In order to manage this complexity breaking down the problem and deciding the point of view you plan to take will help increase your success and the rate of results.
Does your team need help developing healthcare analytics? Perhaps you are looking to measure provider quality or analyze patient patterns. If so, then contact our team today! We have a team of healthcare analytical specialist who can help create the tools you need.
If you want to read more about data science or analytics check out the articles below:
How Men's Wearhouse Could Use Data Science
How To Use R To Develop Predictive Models
Web scraping With Google Sheets
What is A Decision Tree
How Algorithms Can Become Unethical and Biased
How To Develop Robust Algorithms
4 Must Have Skills For Data Scientists
Let’s say a group of researchers, or data scientists discover that the mortality rate in Florida is 20 deaths out of 1000 people a year compared to Washington State where it is 9.8 deaths out of 1000 people.
Being very concerned these researchers put in a proposal for millions of dollars to try to figure out how to decrease this mortality rate in Florida. However, they forgot to dig a little deeper. They forgot to exam the average age of the populations of the two states. If Florida’s average age is 52 and Washington’s is 25 that might play a role in the mortality rate. In this case, the age played the role of a confounding variable.
A variable that is not considered but plays a role in the outcome of an event is considered a confounding factor.
In epidemiology, a confounding variable refers to a variable that is a risk factor for a disease or is associated to the exposure of the disease but is not the actual exposure. Confounding variables are not limited to a disease. When developing algorithms for product recommendations, A/B testing or market segmentation confounding variables can creep in and mislead data scientists and analysts looking to create effective algorithms.
This can lead to invalid conclusions and incorrect comparisons. For instance, in the case of death rates in Washington and Florida, you are comparing two different populations. This is more like comparing apples to oranges. Although in this case you are comparing people to people, the populations have different attributes that make them unfair to compare.
In order to mitigate this there are two specific options. One is to develop the populations into more focused categories like age. Another method that could be used to standardize measurements like death rates is to use a direct adjustment rate. Using a direct adjustment rate allows you to create a standardized
A Real Life Example Of Confounding VariablesLet’s suppose you want to compare an outcome in two populations. In this case, let’s say you are comparing the case fatality rates of a trauma hospital to that of a general hospital. This example is actually based off real studies done because hospitals were having a difficult time fighting to keep their trauma centers open.
Let’s suppose that the trauma hospital has a case fatality rate of 8.6% where as the general hospital has a case fatality rate of 5.6%. From this perspective it would seem as if it would be safer if we went to a general hospital as exposure to a trauma hospital seems to lead to a greater chance of death. This was what some people assumed by reading these numbers and it is not such a far off conclusion based off the numbers.
However, in this case, we are ignoring the level of injury or trauma that the patients who visit both hospitals may have when they come in. In this case, we will refer to this as the triage level. There are technically 5 levels. In order to reduce the amount of math, I am only going to consider 1–4. Hopefully…there aren’t too many patients dying when they are considered level 5.
Looking at the spread of different patients who went to either a trauma or general hospital. You might notice that there is a distinct difference between which type of patients went where. We are seeing a much higher proportion of patients that visited trauma hospitals were triage level 1 compared to general hopsitals(320 patients for trauma hospitals and 80 patients for general hospitals). That makes sense, you would want to refer more of your triage level 1 cases to a trauma hospital.
However, the difference between the two data sets becomes more apparent in the next table.
Here we see the difference in death rates per triage level. So although the overall case fatality rate is much higher in the Trauma Hospitals, the overall percentage of deaths for triage level 1s is higher in the general hospital(9.4% for Trauma Hospitals compared to 25% for General Hospitals). That means, if there would have been more patients brought to the general hospital that are triage level 1, there is a good chance their case fatality rate would be worse.
Now the next question is, how do we adjust the rates so they more fairly and concisely compare each case fatality rate per hospital type?
That is where this final table comes into play. The goal will be to create one standardized populations by adding the two populations together. Then, we use the percentage rates per death for each of the hospital types on the new standardized population.
This creates a fairer comparison of the two populations. In the end, you end up without about 51 total deaths out of 1202 people for trauma hospitals and 123 deaths out of 1202 people for general hospitals.
This in turn provides a very different story and a more fairly compared one at that. Now the new death rates are 4.3% for trauma hospitals compared to 10.3% for general hospitals.
When these numbers are provided to hospital administrators, they will have a clearer picture on what is actually occuring. With this information they can make a more informed decision without having to ask to much about context and if they do, then you can explain your methodology.
However, this method avoids bad assumptions being made without good questions being asked first.
Confounding variables are present in every area of research where correlations are being examined. Thus, it is important to look past the initial outputs and question whether there are possible underlying reasons for the numbers you are seeing.
When developing a report for directors and managers it is important to consider how you will simplify complex concepts like confounding variables. It is about telling a complete story. This occurs when you are able to concisely state a point across multiple data points and graphs.
If your team is looking for a team of data experts who can help develop your healthcare analytics, then feel free to reach out today.
If You’re interested in reading more about data science:
Time Series Model Development: Basic Models
How To Measure The Accuracy Of Predictive Models
What is A Decision Tree
How Algorithms Can Become Unethical and Biased
How Men’s Wearhouse Could Use Data Science Cont
Introduction To Time Series In R
How To Develop Robust Algorithms
4 Must Have Skills For Data Scientists
We are a team of data scientists and network engineers who want to help your functional teams reach their full potential!