The rise of self-service analytics is a significant selling point for data warehousing, automated data integrations, and drag-and-drop dashboards. In fact, the largest software IPO of 2020 was a data warehousing company: Snowflake. The question is, how do you get your data from external application data sources into a data warehouse like Snowflake? The answer is ETLs and ELTs.

ETLs (Extract, Transform, Load) are far from new, but they remain a vital aspect of Business Intelligence (BI). With ETLs, data from different sources can be consolidated in a single place for analytics programs to act on and surface key business insights. ELTs involve the exact same steps as ETLs, just in a slightly different order; the major difference lies in when the transform step occurs. We will discuss in depth what the T stands for shortly. Speaking abstractly, it refers to business logic, data pivoting, and transformations that often take a lot of time and resources to maintain. In this article, we will cover the various ETL and ELT tools that you can use.

Different Types Of ETL and ELT Tools

When it comes to styles of ETL and ELT tools, there is a vast array of options, and not all of them require code.

No Code/Low Code

That's right, there are plenty of ETL and ELT tools that fall into the low-code/no-code category. These tools range from drag and drop to GUI based. Some examples include Fivetran and SSIS (both discussed below). You can often do everything from scheduling to dependency management without really knowing code (or what you are doing at all). There are pros and cons to these types of tools. In particular, they can be quite rigid if you need more complex functionality that would be easy to implement in code. That said, most of them will still allow you to write custom code or SQL. This is arguably one of the more important factors, as you will rarely be able to get away from the "T" portion of an ELT; that step requires some form of business layer to be implemented.

Workflow Automation Code Frameworks

After low-code/no-code ETLs, there are workflow automation frameworks such as Airflow and Luigi. Both are Python libraries that manage a lot of the heavy lifting, infrastructure-wise, for automation. For example, Airflow provides dependency management, scheduling, various operators that connect to cloud sources and destinations, logging, and a dashboard to help track how your jobs are doing. Each of these components might otherwise take a team of engineers to develop. With all of it available on pip install, much of the boring work is already done, and you can focus on developing ETLs.

Custom Code

This option is rarely a good choice unless your pipelines seldom need to run. Developers using this method write all their code from scratch and set up jobs to run with cron. You can use a language like Python, or even PowerShell (which we have seen), to automate your processes, and from there have your SQL files called by your preferred wrapper language. Of course, this also means you will need to develop your own logging and error-handling system, and perhaps even a meta database. If you don't have complex dependencies, this really isn't a problem. However, most ETL pipelines need an order of operations, which means you will run into issues. A minimal sketch of this approach follows below.
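To make that tradeoff concrete, here is a rough sketch of the custom-code route: a single Python script invoked by cron, with hand-rolled logging and error handling. Everything specific in it is a made-up placeholder, including the file paths, the sales table, and sqlite3 standing in for a real warehouse driver.

```python
# etl_job.py -- scheduled by cron, e.g.: 0 2 * * * /usr/bin/python3 /opt/etl/etl_job.py
import csv
import logging
import sqlite3  # stand-in for your actual warehouse driver

# With custom code, even basic logging is yours to set up and maintain.
logging.basicConfig(
    filename="/var/log/etl_job.log",  # hypothetical log location
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

def extract(path):
    # Read raw rows from a source file (a real job might hit an API or database).
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # The "T": business logic you end up writing and maintaining yourself.
    return [{"id": r["id"], "amount": round(float(r["amount"]), 2)} for r in rows]

def load(rows, db_path):
    # Assumes a sales table already exists in the target database.
    conn = sqlite3.connect(db_path)
    conn.executemany("INSERT INTO sales (id, amount) VALUES (:id, :amount)", rows)
    conn.commit()
    conn.close()

if __name__ == "__main__":
    try:
        rows = transform(extract("/data/incoming/sales.csv"))
        load(rows, "/data/warehouse.db")
        logging.info("loaded %d rows", len(rows))
    except Exception:
        logging.exception("ETL run failed")  # hand-rolled error handling
        raise
```

Even this small example hints at the maintenance burden: the scheduling, logging, error handling, and any dependency logic are all yours to own. With those styles in mind, in the next section we will discuss 7 different tools that range from no code to code-based frameworks.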
7 ETL And ELT Tools

Airflow

Airflow is a workflow scheduler that supports both task definitions and dependencies in Python. It was written at Airbnb in 2014 to execute, schedule, and distribute tasks across several worker nodes, and it entered the Apache Incubator in 2016. Airflow not only supports calendar scheduling but also comes with a clean web dashboard that allows users to view current and past task states. Using what it calls operators, your team can write plain Python while benefiting from the Airflow framework. For example, you can create a Python function, wrap it in a PythonOperator, and quickly set which tasks it depends on, when it should run, and several other parameters. In addition, all of the logging and tracking is already taken care of by Airflow's infrastructure.

Luigi

Luigi is an execution framework that allows you to write data pipelines in Python. This workflow engine supports task dependencies and includes a central scheduler, along with a detailed library of helpers for building data pipelines with MySQL, AWS, and Hadoop. Not only is it easy to depend on the tasks defined in its repos, but it is also very convenient for code reuse: you can easily fork execution paths and use the output of one task as the input of the next. The framework was written by Spotify and became open source in 2012. Many popular companies such as Stripe, Foursquare, and Asana use the Luigi workflow engine. Minimal sketches of both frameworks follow below.
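To show how the two frameworks express dependencies, here are two short sketches. They are illustrative only: the DAG id, task names, schedule, and file paths are made up, and the Airflow example uses the Airflow 1.x-style import path for PythonOperator.

```python
# Airflow: two Python tasks where load runs only after extract succeeds.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.x path

def extract():
    print("extracting...")  # placeholder for a real extract step

def load():
    print("loading...")  # placeholder for a real load step

dag = DAG(
    dag_id="example_etl",              # hypothetical name
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",        # calendar scheduling
)

extract_task = PythonOperator(task_id="extract", python_callable=extract, dag=dag)
load_task = PythonOperator(task_id="load", python_callable=load, dag=dag)

extract_task >> load_task  # dependency: extract must finish first
```

The `>>` line is how Airflow wires up the dependency described above. Luigi expresses the same idea with a `requires` method, which also lets one task read another task's output directly:

```python
# Luigi: Transform consumes the file produced by Extract.
import luigi

class Extract(luigi.Task):
    def output(self):
        return luigi.LocalTarget("data/raw.csv")  # hypothetical path

    def run(self):
        with self.output().open("w") as f:
            f.write("id,amount\n1,9.99\n")  # stand-in for a real extract

class Transform(luigi.Task):
    def requires(self):
        return Extract()  # Luigi runs Extract first

    def output(self):
        return luigi.LocalTarget("data/clean.csv")

    def run(self):
        # self.input() is Extract's output, used as this task's input.
        with self.input().open() as src, self.output().open("w") as dst:
            for line in src:
                dst.write(line.lower())

if __name__ == "__main__":
    luigi.build([Transform()], local_scheduler=True)
```

Running the Luigi script with `local_scheduler=True` executes the whole chain locally; in production you would point the tasks at Luigi's central scheduler instead.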
SSIS

SSIS, or SQL Server Integration Services, is Microsoft's workflow automation tool. It was developed to make automation easy for developers: SSIS provides drag-and-drop tasks and data transformations that only require a few parameters to be filled out, and its GUI makes it very easy to see which tasks depend on which. Because SSIS only allows a limited number of built-in data transformations, it also offers a custom code transformation, so data engineers aren't limited to the basic transforms SSIS offers.

Talend

Talend is an ETL tool with a similar feel to SSIS. It has drag-and-drop blocks that you can easily select and use for destinations, sources, and transformations. It connects to various data sources and can even help manage and integrate real-time data streams like Kafka. Talend's claim to fame is that it is 7x faster and 1/5 the cost. However, when it comes down to it, most products will state something similar, and it can take a bit of fine-tuning to get that optimal performance. At the end of the day, your performance depends more on who builds your pipelines and who designs your data warehouses than on the product you use.

Fivetran

Fivetran is a highly comprehensive ELT tool that is becoming more popular every day. Fivetran allows efficient collection of customer data from related applications, websites, and servers; the collected data is then transferred to other tools for analytics, marketing, and warehousing purposes. Beyond that, Fivetran has plenty of functionality. It has your typical source-to-destination connectors, and it allows for both pushing and pulling of data. The pull connectors draw from data sources using a variety of methods, including ODBC, JDBC, and multiple API methods. Fivetran's push connectors receive data that a source sends, or pushes, to them; in push connectors, such as Webhooks or Snowplow, source systems send data to Fivetran as events. Most importantly, Fivetran allows for different types of data transformations, putting the T in ELT. It supports both scheduled and triggered transformations, and depending on the transformations you use, there are other features like version control, email notifications, and data validations.

Stitch

Stitch was developed to take a lot of the complexity out of ETLs. One way Stitch does this is by removing the need for data engineers to create pipelines that connect to APIs like those of Salesforce and Zendesk. It also attaches to many databases, such as MySQL. But it's not just the broad set of API connectors that makes Stitch easy to use. Stitch also removes a lot of the heavy lifting of setting up cron jobs for when tasks should run, and it manages much of the logging and monitoring. ETL frameworks like Airflow do offer similar features, but they are much less straightforward in tools like Airflow and Luigi. Stitch is driven almost entirely through a GUI, which can make it a more approachable option for non-data engineers. It lets you add rules and set times for when your ETLs will run.

Airbyte

Airbyte is a new open-source (MIT) EL+T platform that started in July 2020. It has a fast-growing community, and it distinguishes itself through several significant choices. Airbyte's connectors are usable out of the box through a UI and an API, with monitoring, scheduling, and orchestration; the team's ambition was to support 50+ connectors by the end of 2020. These connectors run as Docker containers, so they can be built in the language of your choice. Airbyte's components are also modular, and you can use subsets of its features to better fit your data infrastructure (e.g., orchestration with Airflow, Kubernetes, or Airbyte's own). Similar to Fivetran, Airbyte integrates with dbt for the transformation piece, hence the EL+T. Contrary to Singer, Airbyte uses a single open-source repo to standardize and consolidate all development from the community, leading to higher-quality connectors, and it built a compatibility layer with Singer so that Singer taps can run within Airbyte. Airbyte's goal is to commoditize ELT by addressing the long tail of integrations, aiming to support 500+ connectors by the end of 2021 with the help of its community.

In Conclusion

Today's corporations demand easy and quick access to data. This has led to an increasing demand for turning data into self-serviceable systems, and ETLs play a vital part in those systems: they ensure analysts and data scientists have access to data from multiple application systems, which makes a huge difference and lets companies gain new insights. There are tons of options as far as tools go. If you're just starting to plan how your team will build out your BI and data warehouse infrastructure, take some time to figure out which tools are best for you. What you pick will have a lasting impact on who you hire and how easy your system is to maintain. So take your time, make sure you understand the pros and cons of the tools you pick, and from there start designing based on your business's needs.

If you are interested in reading more about data science or data engineering, then read the articles below.

How To Use AWS Comprehend For NLP As A Service
4 SQL Tips For Data Scientists
What Are The Benefits Of Cloud Data Warehousing And Why You Should Migrate
5 Great Libraries To Manage Big Data With Python
What Is A Data Warehouse And Why Use It
Kafka Vs RabbitMQ
SQL Best Practices — Designing An ETL Video