Acheron Analytics
  • Home
  • Who We Are
  • Services
    • All Data Science Services
    • Fraud and Anomaly Detection
    • Data Engineering And Automation
    • Healthcare Policy/Program ROI Engine
    • Data Analytics As A Service
    • Data Science Trainings >
      • Python, SQL and R Trainings
      • ARIMA And Predictive Model Forecasting
  • Contact
  • Acheron Blog
  • Partners

Hadoop Vs Relational Databases

7/25/2019

Big data has moved from just being a buzzword to a necessity that executives need to figure out how to wrangle. 

Today, the adoption of big data technologies and tools has seen significant growth, with over 40% of organizations implementing big data as forecasted by Forrester. IDC, meanwhile, predicts that the big data and business analytics market will hit an all-time high of $274.3 billion in 2022, up from the $189.1 billion it is expected to reach this year.

With this push for big data and big data analytics, finding the right system, best practices and data models that give analysts and engineers access to the treasure troves of data can be difficult. Do you use traditional databases, columnar databases, or some other data storage system?

Let’s start this discussion by comparing a traditional relational database to Hadoop (specifically, Hadoop paired with a layer like Presto or Hive).

What is Apache Hadoop?
Hadoop is an open-source framework, built around a distributed file system, that allows for the distributed storage and processing of big data sets.

Hadoop is designed to scale up from a single server to many machines, with each machine offering local storage and computation. Apache Hadoop comes with a distributed file system (HDFS) and other components like MapReduce (a framework for parallel computation using key-value pairs), YARN and Hadoop Common (Java libraries).
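
To make the key-value idea behind MapReduce a little more concrete, here is a minimal single-machine sketch in plain Python. It is not Hadoop's API; the function names and the sample lines are assumptions made up for illustration. It only mimics the three steps (map, shuffle, reduce) that the framework would spread across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) key-value pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


# Hypothetical input; on a real cluster each line could live on a different node.
lines = ["big data is big", "data is everywhere"]

mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Because each call to map_phase only looks at its own line, and each call to reduce_phase only looks at one key, both steps can run in parallel on separate machines.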

Presto
Presto is a distributed SQL query engine that can sit on top of data systems like HDFS (Hadoop), Cassandra, and even traditional relational databases. It lets analysts use the benefits of Hadoop without having to understand the complexities and intricacies of what is going on under the hood. It also lets engineers use abstractions such as tables to organize data in a more traditional data warehouse format.

What is a Relational Database (RDB)?
A relational database is built from a set of formally described tables, from which data can be accessed or reassembled in many different ways without needing to reorganize the tables themselves. I know this kind of sounds weird, but in its simplest form, the relational model is the basis for SQL as well as for database management systems like Microsoft SQL Server, Oracle and MySQL.
You will also see the term RDBMS, which stands for relational database management system: the software that manages a database built on the relational model. For most people, an RDBMS is simply what a typical database looks like.

The Differences

Data architecture and volume

Unlike an RDBMS, Hadoop is not a database, but rather a distributed file system that can store and process massive amounts of data across clusters of computers. An RDBMS, by contrast, takes a structured approach in which data is stored in rows and columns, organized into tables, and updated with SQL. This structured approach limits how much data an RDBMS can comfortably store and process, whereas Hadoop, paired with MapReduce or Spark, is built to handle very large volumes of data.

Data Variety
Data variety refers to the types of data being processed. There are three main types: structured, semi-structured, and unstructured. A relational database can manage and process structured and semi-structured data, but only in limited volume, and it is limited in how it handles unstructured data. Hadoop, however, can manage and process all three types. As a matter of fact, Hadoop has become one of the most common ways to manage and process huge volumes of unstructured data.

Data Warehouses And Hadoop
As stated earlier, Hadoop on its own isn’t a database. However, thanks to open-source projects like Hive and Presto, you can abstract the file system into a table-like format that is accessible with SQL.
This has allowed many companies to start switching over parts or all of their data warehouses to Hadoop.
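
As a loose illustration of what "abstracting the file system into a table" means, the sketch below treats a folder of CSV files as one table, with each subfolder acting as a date partition. This is plain Python, not Hive or Presto; the ds=YYYY-MM-DD folder naming follows a common Hive-style convention, and the file names, columns and rows are all assumptions made up for the example.

```python
import csv
import tempfile
from pathlib import Path


def scan_table(root):
    """Yield every row under root, adding the partition value as a column."""
    for part_dir in sorted(Path(root).glob("ds=*")):
        ds = part_dir.name.split("=", 1)[1]
        for data_file in sorted(part_dir.glob("*.csv")):
            with data_file.open(newline="") as handle:
                for row in csv.DictReader(handle):
                    yield {**row, "ds": ds}


# Hypothetical data: one directory per date partition.
partitions = {
    "2019-07-24": [("1", "Analyst")],
    "2019-07-25": [("1", "Analyst"), ("2", "Engineer")],
}

with tempfile.TemporaryDirectory() as root:
    for ds, rows in partitions.items():
        part_dir = Path(root) / f"ds={ds}"
        part_dir.mkdir()
        with (part_dir / "part-0000.csv").open("w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["employee_id", "position"])
            writer.writerows(rows)

    # Roughly: SELECT employee_id, position FROM positions WHERE ds = '2019-07-25'
    result = [
        (row["employee_id"], row["position"])
        for row in scan_table(root)
        if row["ds"] == "2019-07-25"
    ]

print(result)
```

The files never change shape; the "table" is just a way of reading them. That is the part a SQL layer hides from the analyst.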

Why?

It is for accessibility, and in the hope of getting better performance on cheaper machines. Whether or not this actually works varies from company to company and from data management team to data management team.

Although systems like Hadoop promise better performance, there are a lot of downsides that don’t always get discussed.



Weaknesses in RDBMS and Hadoop
Before we get started, it’s going to seem as if I dislike Hadoop. This is not the case; I am just going to be pointing out some of the largest pitfalls and weaknesses.

Technical Abilities
We will get into the technical difficulties shortly. But before we cover the technical cons, we want to discuss the talent issue.

Both Hadoop and traditional relational databases require technical know-how. Now, this might be up for debate, but generally speaking, most relational databases are arguably easier to use.
This is because there are so few moving pieces in comparison. With Hadoop you need to think about managing the cluster, the Hadoop nodes, security, Presto or whatever interface you are using, and several other administrative tasks that take up lots of time and skill.

In comparison, most relational database systems like SQL Server or Oracle are somewhat more straightforward. Security is built in, performance tuning is built in and, most importantly, there is a larger talent pool of people who understand how to manage and use standard DBs.

Why do you think interfaces like Presto and Hive, both of which are very SQL-like, exist? It’s because data professionals needed a way to interface with Hadoop that was much more familiar.

So the biggest issue most companies face is not the complexity of Hadoop but the lack/cost of talent that can operate Hadoop correctly. 

Security issues
Unlike an RDBMS, Hadoop faces a lot of security problems, which can be challenging when managing complex applications. In fact, the original Hadoop releases had no authentication system set up, under the assumption that the system would be running in a safe environment.

More recent releases do have access and permissions, authentication and encryption modules. However, they aren’t that straightforward to use and usually require a decent amount of ramp-up. This can make Hadoop difficult to support and scale if you are just using it out of the box without any form of third-party vendor like Hortonworks ($$$).

Functional Issues
Hadoop is designed around the concept of write once, read many. It is not designed for write once, update many. So for data specialists who are used to having the ability to update, forget about it.

For those who aren’t into data modeling, the issue this causes might not be apparent right away. Nor is it all that exhilarating to understand…
But not being able to run an update statement limits a lot of modeling that can be beneficial from the perspective of data volume.

For example (we are about to go granular), let’s say you want to track someone’s promotions in a company. Traditionally, in an RDBMS you can simply track the employee_id, position, and start and end date of said position. You don’t need to track all the days in between. When a position is switched, you can update the end date, then add a new row for that employee with the new position and the start date, leaving the end date null. It would look like the below.
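
A minimal sketch of that model, using Python's built-in sqlite3 module as a stand-in for any RDBMS. The table name employee_position, the column names start_date and end_date, and every sample value are assumptions made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE employee_position (
        employee_id INTEGER,
        position    TEXT,
        start_date  TEXT,
        end_date    TEXT   -- NULL while the position is current
    )
    """
)

# Initial hire: one row with an open end date.
conn.execute(
    "INSERT INTO employee_position VALUES (1, 'Analyst', '2018-01-15', NULL)"
)

# Promotion: close the old row with an UPDATE, then add the new position.
conn.execute(
    "UPDATE employee_position SET end_date = ? "
    "WHERE employee_id = ? AND end_date IS NULL",
    ("2019-07-01", 1),
)
conn.execute(
    "INSERT INTO employee_position VALUES (1, 'Senior Analyst', '2019-07-01', NULL)"
)

rows = conn.execute(
    "SELECT employee_id, position, start_date, end_date "
    "FROM employee_position ORDER BY start_date"
).fetchall()
for row in rows:
    print(row)
```

The UPDATE statement is the piece that matters here: it changes one existing row in place instead of rewriting the table.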

This took two rows of data and we now have all the information required.

In comparison, there are a couple of ways you could store similar information using Presto.

One method is to store a person’s position every day in a date partition. The downside here is that you essentially need a row for every day the person was in said position, so you end up storing a massive amount of data. If you have 10,000 employees, that is 10,000 rows daily.
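
To put rough numbers on that, here is a back-of-the-envelope comparison. Only the 10,000 employees figure comes from the text; the one-year window and the single position change per employee are assumptions chosen to keep the arithmetic simple.

```python
EMPLOYEES = 10_000          # figure from the text
DAYS = 365                  # assumption: one non-leap year of history
POSITIONS_PER_EMPLOYEE = 2  # assumption: each employee changes position once

# Daily partitions: one row per employee per day.
daily_partition_rows = EMPLOYEES * DAYS

# Start/end date model: one row per employee per position held.
range_model_rows = EMPLOYEES * POSITIONS_PER_EMPLOYEE

print(f"daily partitions: {daily_partition_rows:,} rows")
print(f"start/end dates:  {range_model_rows:,} rows")
```

Under these assumptions the daily partition approach stores millions of rows where the start and end date model stores tens of thousands, and the gap widens with every additional day of history.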

Another method would be to use a data model similar to the RDBMS one. That is, have only one row for every employee_id and position combination, with a start and end date. However, this method only works if you have access to the previous day’s information. This is a limiting factor.
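
Here is one way that second method could look without an update statement, sketched in plain Python. The function name next_snapshot, the tuple layout and the sample rows are assumptions for illustration. The idea is that today's table is rebuilt in full from yesterday's table plus today's changes, which is why yesterday's data has to be available.

```python
def next_snapshot(previous_rows, changes, today):
    """Return today's full table given yesterday's rows and today's changes.

    previous_rows: list of (employee_id, position, start_date, end_date)
    changes: dict mapping employee_id -> new position effective today
    """
    new_rows = []
    for emp_id, position, start, end in previous_rows:
        if end is None and emp_id in changes:
            # Instead of an UPDATE, rewrite the open row with an end date...
            new_rows.append((emp_id, position, start, today))
            # ...and append the new open-ended position.
            new_rows.append((emp_id, changes[emp_id], today, None))
        else:
            # Unchanged rows are copied over as they were.
            new_rows.append((emp_id, position, start, end))
    return new_rows


# Hypothetical contents of yesterday's partition.
yesterday = [
    (1, "Analyst", "2018-01-15", None),
    (2, "Engineer", "2017-03-01", None),
]

today_rows = next_snapshot(yesterday, {1: "Senior Analyst"}, "2019-07-01")
for row in today_rows:
    print(row)
```

Note that every row gets written again each day, changed or not, and this sketch does not handle new hires. The row count stays small, but the amount of rewriting does not.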

In the end, you are probably storing much more data than is really necessary, performing a lot of unneeded transactions, or both. The point here is that although Hadoop can provide some advantages, it’s not always the best tool.




