Loading files into databases is not exactly exciting work. It can be tedious and boring, and it is often constrained by tools like SSIS that break when even a small change is made. Yet it is one of the core responsibilities of BI and data engineering teams: they need to load files from operational databases into data warehouses, and they do this by developing pipelines that process and transform data.

One problem we see is that teams have multiple file types, data spec versions, and data providers, each of which requires a different pipeline just to load into their raw database. This forces teams to manage 10, 20, maybe even 50 smaller pipelines that are all subject to changes made by the data providers. In addition, each new data provider requires an entirely new pipeline to be developed, along with the infrastructure around each pipeline needed to log and track its status. This can quickly become very difficult to manage and tedious to build.

Instead of attempting to manage multiple pipelines, another solution is to develop a central system that manages the various insert statements required to load data into each raw table, kind of like we explain in the video below. The system essentially acts as a lookup that matches flat files to their insert queries. This removes the need for separate, redundant pipelines that would each need their own insert statement.
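To make that concrete, here is a minimal sketch of the lookup idea in Python. Every table, column, and file name in it is made up for illustration, and SQLite is used only so the snippet runs as is; in practice the connection would point at your raw database (for example SQL Server through pyodbc) and the mapping table would live in your metadata database.

```python
import csv
import sqlite3

# Illustrative only: in a real system this connection points at the raw database.
conn = sqlite3.connect(":memory:")

# Metadata table mapping a file's business object and spec version to the
# insert statement for its raw table.
conn.execute("""
    CREATE TABLE file_insert_map (
        business_object TEXT,
        spec_version    INTEGER,
        insert_sql      TEXT
    )
""")
conn.execute("CREATE TABLE raw_customers (customer_id TEXT, name TEXT, signup_date TEXT)")
conn.execute(
    "INSERT INTO file_insert_map VALUES (?, ?, ?)",
    (
        "customers",
        1,
        "INSERT INTO raw_customers (customer_id, name, signup_date) VALUES (?, ?, ?)",
    ),
)

def load_flat_file(path, business_object, spec_version):
    """Look up the insert statement registered for this file type and load every row."""
    row = conn.execute(
        "SELECT insert_sql FROM file_insert_map "
        "WHERE business_object = ? AND spec_version = ?",
        (business_object, spec_version),
    ).fetchone()
    if row is None:
        raise ValueError(f"No insert registered for {business_object} v{spec_version}")
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        conn.executemany(row[0], [tuple(r) for r in reader])
    conn.commit()

# Example call (hypothetical file following the naming convention discussed later):
# load_flat_file("20170101_20160901_20161231_Customers.csv", "customers", 1)
```

The point is that when a provider changes a spec or a new provider shows up, you add or update a row in the mapping table instead of building and deploying yet another pipeline.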
With this approach, one main system loads all the various tables. You can also put version numbers on each of the files that come in. This could be set up by tracking each version in the metadata table, which then lets you tie a raw file to a specific version of its insert statement.

Loading multiple file types with individual pipelines requires a lot of tedious development. Whether it is done through SSIS or another system, the constant development quickly weighs heavily on data engineers. Instead of spending time developing analytical pipelines that drive clear value for your company, they will be stuck maintaining and developing operational pipelines, which reduces their impact and their overall value.

Thank you so much for reading! We hope you enjoyed the video. Please feel free to send us any questions; we would love to help you design your next automation system. Also, if you liked this, feel free to check out our video on using bulk insert!

For further reading and videos on data science, SQL and Python:
How To Develop Robust Algorithms
Dynamically Bulk Inserting CSV Data Into A SQL Server
SQL Best Practices - Designing An ETL - Part 1
How Algorithms Can Become Unethical and Biased
4 Must Have Skills For Data Scientists
What is A Decision Tree
By Ben Rogojan

We recently did an AMA on Reddit, and the most common question that came up was: what is the difference between a data scientist and a data engineer? So we wanted to write a more in-depth post on the subject. There are a lot of data specialist positions that sound similar and use similar tools, so it can be difficult to know what each role should actually do. In addition, smaller companies might be limited in how many data engineers or data scientists they can hire, which means the specific tasks and goals often start to intermingle. This can make it much more difficult to clearly differentiate the two roles. So we wanted to go over how the two positions differ by discussing the goals, mindsets, tools and backgrounds of data engineers and data scientists.

Before we go into the differences, a quick preface. The truth is, many data scientists and data engineers will perform the tasks of other technical roles. A data scientist may need to develop an ETL, and a data engineer might need to develop an API and front-end. So the distinctions we point out below are just to make clear where the technical differences are.

Goals
The goals of a data engineer are much more task and development focused. Data engineers build automated systems and model data structures to allow data to be efficiently processed. This means the goal of a data engineer is to create and develop tables and data pipelines to support analytical dashboards and other data customers (like data scientists, analysts, and other engineers). In that sense it is similar to most engineering roles: there is a lot of design, assumptions, limitations, and development required to perform a final task, and each design and solution has its own set of limitations, even if they can all perform the end task.

In comparison, data scientists tend to be question focused, in the sense that they are looking for ways to reduce costs, increase profits, improve the customer experience or improve business efficiency. This means they need to ask and then answer questions (ask a question, hypothesize, and then conclude). They ask questions like: what impacts patient readmission? Would a customer spend more if shown ad A vs. ad B? Is there a faster route to deliver packages? Skipping over the rest of the process, the goal from there is to find an answer to whatever question is posed, which might be a final conclusion or might just lead to more questions. Throughout the process, data scientists analyze data, gather supporting evidence, and develop a conclusion to the question.

Tools
This is where things can get confusing. Data scientists and data engineers both often rely on Python and SQL. However, how the two roles use these skills varies, and again, this ties back to the mindset differences. Python is a very robust language with libraries that help manage operational tasks as well as analytical ones. Data scientists will use libraries like Pandas and scikit-learn, whereas data engineers will use Python to manage pipelines. This is where libraries like Airflow and Luigi come in handy. Similarly, data scientists' queries will be ad hoc and question focused, whereas data engineers' queries will be focused on cleaning up and transforming data. Now, there are other tools both data specialists might use, including Tableau, Jupyter notebooks and a handful of others. At the end of the day, it is how they use them that makes the distinction (a short sketch contrasting the two is included at the end of this post).

Backgrounds
Now, one other common question when it comes to the differences between data engineers and data scientists is what background is required. Data engineering and data science both require some understanding of data and programming, even if only at a limited scope. However, there are some distinctions that go beyond programming, specifically for data scientists. Because a data scientist works more like a researcher, having a research-based background is a benefit. This might be in economics, psychology, epidemiology, etc. Combine a research background with SQL, Python and a good sense of business, and you have a data scientist. Now, these are not set-in-stone degrees; in fact, we have run into data scientists with all sorts of degrees. Most employers will prefer to hire a data scientist with at least a master's degree that has some sort of technical or mathematical focus. Data engineering positions usually won't require a master's degree. Data engineering is more about being a developer, which requires much more practical experience than theoretical knowledge, so gaining a master's does not supply the same value.

An Example
Let's say the director of a healthcare company decides they would like to figure out how to reduce the number of patients readmitted within 30 days of their original visit. From a data point of view, there are a couple of things that need to occur. Data scientists will need to figure out what drives patient readmission; that is the question they will be trying to answer. Based on the conclusions they reach, they will work with the business to develop metrics and policies to help improve patient readmission rates. Data engineers will be developing tables to help the data scientists answer that question, while at the same time developing analytical tables to help track past and future patient readmission metrics. How these metrics are created will be driven by the answers the data scientists get.

Data scientists and data engineers have plenty of differences. They have different goals and backgrounds, but that is where the value of utilizing both together comes from. The fact that data engineers focus on engineering robust systems allows data scientists to query data easily and analyze it efficiently. Their partnership is what brings companies value from data. We hope this post was helpful! Please feel free to reach out with any questions you may have.
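To make the tools distinction above a bit more concrete, here is a small illustrative sketch. The file and column names (visits.csv, readmitted_30d, and so on) are invented for the example and are not from any real project.

```python
import pandas as pd

visits = pd.read_csv("visits.csv")  # hypothetical extract of patient visits

# A data scientist's usage tends to be ad hoc and question focused:
# "does 30-day readmission vary by age band?"
readmit_rate_by_age = (
    visits.assign(age_band=pd.cut(visits["age"], bins=[0, 40, 65, 120]))
          .groupby("age_band", observed=True)["readmitted_30d"]
          .mean()
)
print(readmit_rate_by_age)

# A data engineer's usage tends to be a repeatable transformation step that a
# pipeline tool (Airflow, Luigi, etc.) runs on a schedule.
def clean_visits(df: pd.DataFrame) -> pd.DataFrame:
    """Standardize and de-duplicate rows before loading them into a staging table."""
    return (
        df.drop_duplicates(subset=["visit_id"])
          .assign(discharge_date=lambda d: pd.to_datetime(d["discharge_date"]))
    )
```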
Read more about data science, healthcare analytics and automation below:
How To Develop Robust Algorithms
Dynamically Bulk Inserting CSV Data Into A SQL Server
SQL Best Practices - Designing An ETL - Part 1
How Algorithms Can Become Unethical and Biased
4 Must Have Skills For Data Scientists
What is A Decision Tree

Data engineering has many facets. One of the most common projects a data engineer takes on is developing an ETL pipeline from an operational database to a data warehouse. Our team wanted to cover the overarching design of an ETL: what are the typical components, stages, and considerations? We started by writing Creating An ETL Part 1 (more to come), and we now have a video, below, that walks through the process. We wanted to discuss why each stage is important, what happens when data goes from raw to staging, why we need a raw database at all, and so on.

Data engineering is a complex discipline that brings together automation, programming, system design, databases, and analytics in order to ensure that analysts, data scientists and end users have access to clean data. This all starts with basic ETL design. We are practicing for a webinar that we will be hosting on ETL development with Python and SQL. The webinar itself will be much more technical and dive much deeper into each component described in the video, but we wanted to see what using the whiteboard was like in case we need it.
If you would like to sign up for the free webinar, we will be hosting it on February 23rd at 10 AM PT. Feel free to sign up here! If you have other questions please do contact us.

Read more about data science, healthcare analytics and automation below:
How Algorithms Can Become Unethical and Biased
How To Develop Robust Algorithms
4 Must Have Skills For Data Scientists
Time Series Model Development: Basic Models
How To Measure The Accuracy Of Predictive Models
What is A Decision Tree

An ETL (and its not-so-far-off cousin, the ELT) is a concept that is not usually taught in college, at least not in undergrad courses. Yet ETLs play an important role in almost every company's day-to-day operations. ETLs are the pipelines that populate the business dashboards and algorithms that provide vital insights and metrics to managers. These pipelines range from very simple processes that extract data from operational databases into a data warehouse, to complex analytical layers that support more involved transformations and applications. The problem is that many IT graduates are never taught this very important skill, one they will more than likely use every day if they pursue a career in data science or data engineering. Although data science work doesn't technically include ETLs, a lot of the work data scientists do ends up crossing over with data engineering.

We wanted to cover the basics of an ETL and good ETL design. What does a good ETL look like, what are the key components, and what makes a pipeline maintainable and healthy? These are all important questions data engineers and database managers need to ask before taking on new projects. This will be the first of several posts. In this post, we will cover some basic concepts such as extracting and loading raw data and logging. These are important topics with a lot of nuances that we would like to start to cover. We will also be offering a free webinar on February 23 at 10 AM PT to cover this topic. We would love to have you attend!

Operational To Raw

Operational Database
All of the data your ETL ingests has some source that represents real-life transactions of one sort or another. The data could represent patients visiting hospitals, people posting on social media, or customers purchasing products. Some might ask why you can't just analyze data from the operational database instead of going through the extra steps of processing it into another system. The biggest issue is that running analytical processes that aggregate and analyze large data sets against the operational database will slow it down for customers. In addition, operational databases are heavily normalized, which makes them inefficient to query from an analytical standpoint. This is why a data warehouse of one form or another is developed (for one method of bulk inserting data into SQL Server, see our post on dynamically bulk inserting CSV data). The first step in the process is to extract the data into some form of raw format.

Raw File Extracts
How the data is pulled from the operational database can vary, but one of the most consistent ways is to extract it into a raw file like a CSV or XML file. Extracting the data into a raw file has several benefits. First, it limits how much outside systems interact directly with the operational database, which avoids blocking important operational tasks that need to occur. In addition, it creates a snapshot of what the data looked like at the time of pulling.
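As a rough illustration of that extract step, here is a minimal Python sketch. The table, column, and file names are hypothetical, `conn` stands in for any DB-API connection to the operational database (for example via pyodbc), and in practice the tracking table would usually live in a separate metadata database rather than on the same connection.

```python
import csv
from datetime import date

def extract_to_raw_file(conn, business_object, start_date, end_date, out_dir="."):
    """Dump one business object's rows to a dated CSV and record the file for loading.

    start_date and end_date are assumed to be datetime.date objects.
    """
    # File name follows DateExtracted_DateRangeInTheFile_BusinessObject
    file_name = (
        f"{date.today():%Y%m%d}_"
        f"{start_date:%Y%m%d}_{end_date:%Y%m%d}_{business_object}.csv"
    )
    path = f"{out_dir}/{file_name}"

    cursor = conn.cursor()
    cursor.execute(
        f"SELECT * FROM {business_object} WHERE updated_at BETWEEN ? AND ?",
        (start_date, end_date),
    )
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cursor.description])  # header row
        writer.writerows(cursor.fetchall())

    # Record the extract so the load step (and any future reload) knows it exists.
    cursor.execute(
        "INSERT INTO etl_file_log (file_name, file_path, loaded) VALUES (?, ?, 0)",
        (file_name, path),
    )
    conn.commit()
    return path
```

A real pipeline would add error handling and would typically write the log row through the metadata database described below.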
Such a snapshot is beneficial if you ever need to reload data because of an issue in the pipeline, and having this back-up also helps in case someone accidentally deletes crucial data; then you will be very happy to have raw extracts that are easy to reload. These raw extracts should be tracked in a table that records which files have been loaded and where they exist, so that you can easily automate any reloading. The files should also follow a naming standard. An easy example would be DateExtracted_DateRangeInTheFile_BusinessObject (e.g. 20170101_20160901_20161231_Customers.csv).

Metadatabase And Logging
Tracking what is occurring inside your ETL process is a crucial step in developing an automated, maintainable and robust system. How extensive your tracking and logging needs to be depends on how large your company is and how many ETLs/ELTs and other data transformations need to occur. Larger companies might have a standardized tool like Airflow to help manage DAGs and logging. However, if you are a start-up or a non-tech company, it will probably be OK to have a simplified logging system. The purpose of this system is to manage the files that need to be loaded or have been loaded, and to track which stored procedures have run, errors, dependencies, etc. Having good insight into these aspects of your system will help you maintain it, improve it, and backfill any data that has been missed or loaded incorrectly. In practice this means creating 3–4 tables (at least) that track what is running, how long it takes, what it computes, etc. Especially as we push for more and more automated systems, this will be the key to keeping them maintainable.

Loading Raw Data Into Raw Tables
Once you have extracted the raw data from the operational databases and updated your metadata table with the new file information, you can start focusing on loading the data. It can be tempting to add a lot of complex business logic to your initial loads into raw; it seems to simplify the process and create fewer steps, so why add more? The truth is that adding too much logic too early can make it difficult to track where errors occur. Did the error happen in the extract, in the business logic, or somewhere else? It also becomes harder to add new logic later, because you have to analyze where the best place to put it is. All of these extra considerations make the overall system more difficult to maintain. That is why it is important to load the raw data as is. If there is de-duplication logic or mapping that needs to happen, it can happen in the staging portion of the pipeline.

The next steps after loading the data into the raw database are QA and loading the data into the staging database. We will continue that discussion in our next post. If you want to learn more about ETL development and automation, sign up for our ETL webinar, where we will discuss the process of creating an ETL from end to end.

Read more about data science, healthcare analytics and automation below:
Calculating A Running Total SQL
How To Develop Robust Algorithms
4 Must Have Skills For Data Scientists
Time Series Model Development: Basic Models
How To Measure The Accuracy Of Predictive Models
What is A Decision Tree
Automation Best Practices