By Ben Rogojan
Data science is just one of the modern data-driven fields in our new data world. Another job that is even more prevalent than data scientist is data engineer. Being a data engineer does not carry all the hype of being a data scientist. However, companies like Google, Facebook, Amazon, and Netflix all need great data engineers!
Data engineering requires a combination of knowledge, from data warehousing to programming, to ensure that data systems are designed well and are as automated as possible.
The question is: How do you prepare for an interview for a data engineering position?
Many of the questions will require you to understand data warehouses, scripting, ETL development, and possibly some NoSQL if the company uses a nonrelational data storage system like CouchDB.
If you are preparing for a data engineering interview, here are some questions that might help you. We are focusing on conceptual questions; however, you should also work on technical skills like SQL, Python, etc.
How would you approach developing a new analytical product as a data engineer?
As a data engineer, you control what is possible in the final product. A data scientist can’t build algorithms or metrics without the right data at the right granularity.
This means a data engineer needs to understand the entire product. A data engineer can’t just get away with building systems based on requirements. They need to ask why they are building certain tables and objects.
It is very helpful if the stakeholders already have a general outline of what they want. If they don’t, we would want to work with them to develop a general idea of what metrics and algorithms will exist. This drives all the major decisions: what data should be pulled in, how long it should be stored, whether it should be archived, and so on.
Once a general outline exists, the next step is drilling into the why of each metric. This is because, as you build different tables at different data granularities, certain issues will arise. Should the unique key be on columns A and B, or on A, B, and C? Well, that depends: why is this important? What does a row signify? Is it customer level, store level, maybe brand level?
Once your team has worked through the outline with your stakeholders and gained an understanding of the why, the next step is to think through as many operational scenarios as possible.
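As a rough illustration of how that grain decision shows up in a table definition, consider the sketch below. The `daily_sales` table and its columns are hypothetical, and an in-memory SQLite database stands in for whatever storage system you actually use:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table. The grain question: is a row unique per
# (customer, store), or per (customer, store, brand)?
conn.execute("""
    CREATE TABLE daily_sales (
        customer_id INTEGER,
        store_id    INTEGER,
        brand_id    INTEGER,
        amount      REAL,
        UNIQUE (customer_id, store_id, brand_id)  -- chosen grain
    )
""")
conn.execute("INSERT INTO daily_sales VALUES (1, 10, 100, 25.0)")
conn.execute("INSERT INTO daily_sales VALUES (1, 10, 200, 40.0)")  # different brand: allowed

# Repeating the same customer/store/brand violates the chosen grain.
try:
    conn.execute("INSERT INTO daily_sales VALUES (1, 10, 100, 5.0)")
except sqlite3.IntegrityError as e:
    print("grain violation:", e)
```

If the answer to "what does a row signify?" had been customer-and-store level instead, the `UNIQUE` constraint would drop `brand_id`, and the second insert above would have failed too.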
Will you need to reload data ever? Do your ETLs allow for it? Is it efficient? What happens when X occurs? How do you handle case Y?
You can’t spend all day doing this, but trying to think of as many issues as could occur helps you develop a more robust system. It also helps create a system that actually meets requirements.
From there, it’s about developing the design, creating test cases, testing the tables, stored procedures, and scripts, and then pushing to production. How that happens varies from team to team.
What is the difference between an operational database and a data warehouse?
If you took a database course in college, you probably learned how to set up a standard normalized database. This style of database is optimized for transactions that involve INSERT, UPDATE, and DELETE SQL statements. This is a standard operational database. It needs to focus on executing transactions quickly without getting bogged down by calculations and data manipulations. Thus, its design is a little more cumbersome for an analyst: generally, you will have to join several tables just to get a single data point.
A data warehouse is not as concerned with handling millions of fast transactions every second. Instead, a data warehouse is usually built to support a data analytics product and analysis. This means performance is geared not toward transactions but toward aggregations, calculations, and SELECT statements. A data warehouse will have a slightly denormalized structure compared to an operational database.
In most data warehouses, the majority of tables take on one of two roles: historical transaction tables and tables that contain categorical-style data. We refer to these as fact and dimension tables. The fact table is essentially in the center. Unlike in a normalized database, where you might have to join across several tables to get one data point, a standard data warehouse focuses on the fact tables, and the dimension tables join in to provide categorical information to the fact table. (It’s also typically bad practice to join fact table to fact table, but sometimes it can occur if the data is structured correctly.) Here is an example of the structure of a data warehouse.
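As a minimal sketch of that star layout, the snippet below builds one fact table and one dimension table in an in-memory SQLite database; the table and column names are invented for illustration. The analytical query at the end shows the typical shape of warehouse work: aggregate the fact table, join out to a dimension for its categorical attributes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension table: categorical attributes (here, about stores).
cur.execute("CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, region TEXT)")

# Fact table: one row per transaction, keyed to the dimension.
cur.execute("""
    CREATE TABLE fact_sales (
        sale_id  INTEGER PRIMARY KEY,
        store_id INTEGER REFERENCES dim_store(store_id),
        amount   REAL
    )
""")

cur.executemany("INSERT INTO dim_store VALUES (?, ?)",
                [(1, "West"), (2, "East")])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                [(1, 1, 100.0), (2, 1, 50.0), (3, 2, 75.0)])

# Aggregate the fact table; join the dimension for its labels.
for row in cur.execute("""
        SELECT d.region, SUM(f.amount)
        FROM fact_sales f
        JOIN dim_store d ON d.store_id = f.store_id
        GROUP BY d.region
        ORDER BY d.region"""):
    print(row)
# → ('East', 75.0) then ('West', 150.0)
```

A real warehouse would have many dimensions (date, customer, product, and so on) all joining to the central fact table in the same way.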
These are not the only tables that exist in a data warehouse. There are also aggregate tables, snapshots, partitions, and more. The goal is usually a report or dashboard that can be updated automatically and quickly.
Data warehouses also have a lot of other nuances like slowly changing dimensions. However, that is a whole other can of worms.
Tell us about a time you had performance issues with an ETL and how did you fix it?
As a data engineer, you will run into performance issues. Either you developed an ETL when the data was smaller and it didn’t scale, or you’re maintaining older architecture that is not scaling. ETLs are built from multiple components: multiple table inserts, merges, and updates. This makes it difficult to tell exactly where an ETL issue is occurring. The first step is identifying the problem, so you need to figure out where the bottleneck is.
Hopefully, whoever set up your ETL created an ETL log table somewhere that tracks when components finish. This makes it easy to spot bottlenecks and the biggest time sucks. If not, it will not be easy to find the issue. Depending on the urgency, we would recommend setting up an ETL log table and then rerunning the ETL to identify the problem. If the fix is needed right away, you will probably just have to go through the ETL piece by piece to track down the long-running component. This also depends on how long the ETL takes to run; there are ways to approach that as well, depending on which components rely on which.
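A bare-bones version of such an ETL log table might look like the sketch below. The component names and the SQLite storage are stand-ins for your actual pipeline; the idea is simply that every component records when it started and finished, so the slowest step is one query away:

```python
import sqlite3
import time
from contextlib import contextmanager

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE etl_log (
        component   TEXT,
        started_at  REAL,
        finished_at REAL,
        seconds     REAL
    )
""")

@contextmanager
def logged_step(name):
    """Record start/finish times for one ETL component."""
    start = time.time()
    yield
    end = time.time()
    conn.execute("INSERT INTO etl_log VALUES (?, ?, ?, ?)",
                 (name, start, end, end - start))
    conn.commit()

# Wrap each component; slow steps then show up in etl_log.
with logged_step("load_staging"):
    time.sleep(0.01)  # stand-in for the real work
with logged_step("merge_fact"):
    time.sleep(0.02)

# Sort by duration to find the bottleneck.
for component, _, _, seconds in conn.execute(
        "SELECT * FROM etl_log ORDER BY seconds DESC"):
    print(f"{component}: {seconds:.3f}s")
```

Most ETL tools and orchestrators provide something equivalent out of the box; the point is to have it before the performance problem appears, not after.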
Issues vary wildly from table locks, slow transactions, loops getting stuck, etc.
Once you have identified the issue, you need to figure out a fix. The solution depends on the problem, but it could require adding an index, removing an index, partitioning tables, or batching the data into smaller pieces (or sometimes even larger pieces; it seems counterintuitive, but it depends on table scans and indexes). Depending on the storage system you are using, it is good to look at the activity monitor to see what is happening at an I/O level. This will give you a better idea of the problem.
When you look at the activity monitor, you can check whether any data is being processed at all. Is too much data being processed? None? Are there table locks? Any of these issues can choke an ETL and would need to be addressed.
If you Google some of these performance issues, you will find people blaming the architecture for a lot of the problems. We don’t disagree with them. However, this doesn’t mean you throw in the towel. There are always ways to manage performance if you can’t touch the actual structure. Even beyond indexes, there are parallel-processing methods that can speed up an ETL, or you can add temporary support tables to lighten the load.
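Batching is one of the easier fixes to sketch. The toy example below (made-up table, SQLite standing in for the real target) chunks a large insert so each transaction stays small, which keeps lock windows and transaction-log growth down:

```python
import sqlite3

def batches(rows, size):
    """Yield successive chunks so each INSERT/commit touches fewer rows."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER, val TEXT)")

# A generator standing in for a large source extract.
rows = ((i, f"row-{i}") for i in range(10_000))

for chunk in batches(rows, 1_000):
    conn.executemany("INSERT INTO target VALUES (?, ?)", chunk)
    conn.commit()  # shorter transactions -> shorter lock windows

print(conn.execute("SELECT COUNT(*) FROM target").fetchone()[0])  # → 10000
```

Whether a smaller or larger batch size wins depends on the database: per-commit overhead pushes you toward bigger batches, while locking and log pressure push you toward smaller ones, so it is worth measuring rather than guessing.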
Are you experienced in Python, PowerShell, Bash, and/or Java?
This question is mostly a reminder that it is important to have a scripting language on your side as a data engineer. Most shops we have worked in won’t rely solely on ETL tools like SSIS. Instead, they will often use custom scripts and libraries to load data. It might seem like overkill, but it is often easier. Tools like SSIS are great if you need the fancy features like sending emails and some of the other add-ons. But even those features can be scripted instead.
So we do recommend having some scripting ability. It allows you to easily automate data flows and analytical tasks.
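For example, a bare-bones load script in Python might look like the sketch below; the file contents, table name, and columns are invented for illustration, with an in-memory buffer standing in for a real extract file:

```python
import csv
import io
import sqlite3

# Toy "file" standing in for a daily extract; in practice you'd open a real path.
raw = io.StringIO("id,amount\n1,10.5\n2,20.0\n")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE loads (id INTEGER, amount REAL)")

# Parse the extract and cast each field to the target column type.
reader = csv.DictReader(raw)
conn.executemany(
    "INSERT INTO loads VALUES (:id, :amount)",
    [{"id": int(r["id"]), "amount": float(r["amount"])} for r in reader],
)
conn.commit()

print(conn.execute("SELECT SUM(amount) FROM loads").fetchone()[0])  # → 30.5
```

Wire a script like this to a scheduler (cron, Airflow, a PowerShell scheduled task) and you have the core of an automated data flow without any heavyweight tooling.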
These are just some general questions to help you get ready for your data engineering interview. Besides these, you should also look into slowly changing dimensions, automation with Python or PowerShell, some basic Linux commands, and design concepts. If you do have a data engineering interview, we wish you good luck!
Call To Action:
Our team has experience developing multiple types of data pipelines from start to finish. We have experience creating effective data architecture and thinking about the end use cases. If you need help building a data warehouse or an entire data analytics product, please reach out!
Other Data Science Services Our Team Provides
We are a team of data scientists and network engineers who want to help your functional teams reach their full potential!