Acheron Analytics
  • Home
  • Who We Are
  • Services
    • All Data Science Services
    • Fraud and Anomaly Detection
    • Data Engineering And Automation
    • Healthcare Policy/Program ROI Engine
    • Data Analytics As A Service
    • Data Science Trainings >
      • Python, SQL and R Trainings
      • ARIMA And Predictive Model Forecasting
  • Contact
  • Acheron Blog
  • Partners

Hadoop Vs Relational Databases

7/25/2019

4 Comments

 
Big data has moved from just being a buzzword to a necessity that executives need to figure out how to wrangle. 

Adoption of big data technologies and tools has grown significantly. Forrester forecast that over 40% of organizations would be implementing big data, while IDC predicts that the big data and business analytics market will hit an all-time high of $274.3 billion in 2022, up from the $189.1 billion it’s expected to reach this year.

With this push for big data and big data analytics, finding the right system, best practices and data models that give analysts and engineers access to the treasure troves of data can be difficult. Do you use traditional databases, columnar databases, or some other data storage system?

Let’s start this discussion by comparing a traditional relational database to Hadoop (specifically Hadoop partnered with a layer like Presto or Hive).

​

​What is Apache Hadoop?
Hadoop is an open-source framework, built around a distributed file system, that allows for the distributed storage and processing of big data sets.

Hadoop is designed to scale from a single server to many machines, with each server offering local storage and computation. Apache Hadoop comes with a distributed file system (HDFS) and other components like MapReduce (a framework for parallel computation over key-value pairs), YARN and Hadoop Common (Java libraries).
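
To make the key-value idea behind MapReduce a little more concrete, here is a minimal single-machine sketch in plain Python. It does not use Hadoop at all; the function names (map_phase, shuffle, reduce_phase) are our own illustrative choices, and a real cluster would run the map and reduce steps in parallel across many nodes.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) key-value pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group emitted values by key, which the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    # On a cluster, each line (or block of lines) could be mapped on a different node.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

counts = word_count(["big data big clusters", "data on clusters"])
print(counts)
```

The point of the pattern is that the map and reduce functions only ever see one key at a time, which is what lets the work be spread across machines.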

Presto 
Presto is a distributed SQL query engine that can sit on top of data systems like HDFS, Cassandra, and even traditional relational databases. It gives analysts the benefits of Hadoop without having to understand the complexities and intricacies of what is going on under the hood. It also lets engineers use abstractions such as tables to organize data in a more traditional data warehouse format.

What is a Relational Database (DB)?
A relational DB is formed from a set of defined tables from which data can be accessed or reassembled in various ways without needing to reorganize the database tables. I know this kind of sounds weird, but in its simplest form, the relational model is the basis for SQL as well as for database management systems like Microsoft SQL Server, Oracle and MySQL.
You will also see the term RDBMS, which stands for relational database management system: the software that manages a database built on the relational model. For most people, an RDBMS is simply what a typical database looks like.

The Differences

Data architecture and volume

Unlike an RDBMS, Hadoop is not a database, but rather a distributed file system that can store and process massive amounts of data across clusters of computers. An RDBMS, by contrast, takes a structured approach: data is stored in tables made of rows and columns and can be queried and updated with SQL. This structured approach limits the ability of a traditional RDBMS to store and process very large amounts of data, whereas Hadoop, with MapReduce or Spark, can handle large volumes of data.

Data Variety
Data variety refers to the types of data being processed. There are three main types: structured, semi-structured, and unstructured. A relational DB can manage and process structured data and, in limited volume, semi-structured data, but it is limited in managing unstructured data. Hadoop, however, can manage and process all three types. As a matter of fact, Hadoop is widely used for managing and processing huge volumes of unstructured data.

Data Warehouses And Hadoop
As stated earlier, Hadoop on its own isn’t a database. However, thanks to open source projects like Hive and Presto, you can abstract the file system into a table-like format that is accessible with SQL.
This has allowed many companies to start switching over parts or all of their data warehouses to Hadoop.

Why?

It is for accessibility, and in the hope of better performance on cheaper machines. Whether or not this actually works varies from company to company and from data management team to data management team.

Although systems like Hadoop promise better performance, there are a lot of downsides that don’t always get discussed.




​Weakness in RDBMS and Hadoop
Before we get started, it’s going to seem as if I dislike Hadoop. This is not the case; I am just going to be pointing out some of the largest pitfalls and weaknesses.

Technical Abilities
We will get into the technical difficulties shortly. But before we cover the technical cons, we wanted to discuss the talent issue.

Both Hadoop and traditional relational databases require technical know-how. Now, this might be up for debate, but generally speaking, most relational databases are arguably easier to use.
This is because there are so few moving pieces in comparison. With Hadoop you need to think about managing the cluster, the Hadoop nodes, security, Presto or whatever interface you are using, and several other administrative tasks that take up lots of time and skill.

In comparison, most relational database systems like SQL Server or Oracle are somewhat more straightforward. Security is built in, performance tuning is built in and, most importantly, there is a larger talent pool of people who understand how to manage and use standard DBs.

Why do you think interfaces like Presto and Hive that are both very SQL like exist? It’s because data professionals needed a way to interface with Hadoop that was much more familiar.

So the biggest issue most companies face is not the complexity of Hadoop but the lack/cost of talent that can operate Hadoop correctly. 

Security issues
Unlike RDBMS, Hadoop faces a lot of security problems which can be challenging when managing complex applications. In fact, the original Hadoop releases had no authentication system set up under the assumption that the system would be running in a safe environment. 

More recent releases do have access and permissions, authentication and encryption modules, but they aren’t that straightforward to use and usually require a decent amount of ramp-up. This can make Hadoop difficult to support and scale if you are running it out of the box without a paid third-party distribution like Hortonworks.

Functional Issues
Hadoop is designed around the concept of write once, read many. It is not designed for write once, update many. So for data specialists who are used to having the ability to update, forget about it.

For those who aren’t into data modeling, the issue this causes might not be apparent right away. Nor is it all that exhilarating to understand…
But not being able to run an update statement limits a lot of modeling that can be beneficial from the perspective of data volume.

For example (we are about to go granular), let’s say you want to track someone’s promotions in a company. Traditionally, in an RDBMS you can simply track the employee_id, position, and start and end date of said position. You don’t need to track all the days in between. When a position changes, you update the end date on the existing row and add a new row for that employee with the new position and the start date, leaving the end date null.

This took two rows of data and we now have all the information required.
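
Here is a minimal sketch of that pattern using Python’s built-in sqlite3 module as a stand-in for any RDBMS. The table name, column names, employee and dates are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee_position ("
    "employee_id INTEGER, position TEXT, start_date TEXT, end_date TEXT)"
)

# Initial hire: end_date stays NULL while the position is current.
conn.execute(
    "INSERT INTO employee_position VALUES (?, ?, ?, NULL)",
    (1, "Analyst", "2018-01-15"),
)

def promote(conn, employee_id, new_position, change_date):
    """Close the open row with an UPDATE, then insert the new open row."""
    conn.execute(
        "UPDATE employee_position SET end_date = ? "
        "WHERE employee_id = ? AND end_date IS NULL",
        (change_date, employee_id),
    )
    conn.execute(
        "INSERT INTO employee_position VALUES (?, ?, ?, NULL)",
        (employee_id, new_position, change_date),
    )

promote(conn, 1, "Senior Analyst", "2019-07-01")

rows = conn.execute(
    "SELECT employee_id, position, start_date, end_date "
    "FROM employee_position ORDER BY start_date"
).fetchall()
for row in rows:
    print(row)
```

The UPDATE inside promote is exactly the step that a write-once system does not give you.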

In comparison, there are a couple of ways you could store similar information using Presto.

One method is to store a person’s position every day in a date partition. The downside here is that you will essentially need a row for every day the person was in said position, so you will be storing a massive amount of data. If you have 10,000 employees, that is 10,000 rows daily.
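
A quick back-of-the-envelope comparison shows how fast the daily approach grows. The 10,000 employees come from the example above; the one-year window and the average of two positions per employee are assumptions we picked purely for illustration.

```python
def daily_snapshot_rows(employees, days):
    """One row per employee per day in a date-partitioned table."""
    return employees * days

def date_range_rows(employees, avg_positions_per_employee):
    """One row per employee and position, with a start and end date."""
    return employees * avg_positions_per_employee

snapshot_total = daily_snapshot_rows(employees=10_000, days=365)
range_total = date_range_rows(employees=10_000, avg_positions_per_employee=2)

print(f"daily snapshots after one year: {snapshot_total:,} rows")
print(f"start/end date model:           {range_total:,} rows")
```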

Another method would be to use a data model similar to the RDBMS one. That is, have only one row for every employee_id and position combination, with a start and end date. However, this method only works if you have access to the previous day’s information. This is a limiting factor.
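
To see why the previous day’s information is needed, here is a rough sketch of how such a table could be rebuilt each day without an UPDATE statement. It is plain Python over lists of dicts rather than real Presto or Hive code, and the function name and row layout are hypothetical.

```python
def build_partition(previous_rows, todays_positions, today):
    """Build today's full partition from yesterday's rows plus today's positions.

    previous_rows: list of dicts with employee_id, position, start_date, end_date
    todays_positions: dict mapping employee_id to the current position
    """
    new_rows = []
    still_open = set()
    for row in previous_rows:
        row = dict(row)  # copy, so yesterday's partition is never modified
        if row["end_date"] is None:
            if todays_positions.get(row["employee_id"]) == row["position"]:
                still_open.add(row["employee_id"])
            else:
                row["end_date"] = today  # position changed or employee left
        new_rows.append(row)
    # Anyone without a still-open row gets a new open row starting today.
    for employee_id, position in todays_positions.items():
        if employee_id not in still_open:
            new_rows.append({"employee_id": employee_id, "position": position,
                             "start_date": today, "end_date": None})
    return new_rows

yesterday = [{"employee_id": 1, "position": "Analyst",
              "start_date": "2018-01-15", "end_date": None}]
partition = build_partition(yesterday, {1: "Senior Analyst"}, "2019-07-01")
for row in partition:
    print(row)
```

Notice that every run rewrites the whole table into a new partition, which is where the extra storage and extra transactions come from.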

In the end, you are probably storing much more data than is really necessary and/or performing a lot of transactions that are unneeded. The point here is that although Hadoop can provide some advantages, it’s not always the best tool.




10 Amazing Programming Projects To Help You Learn How to Program

7/12/2019

0 Comments

 
One of the common questions we get when it comes to learning how to program, is: “What are some good ideas for projects to build?”

Now, we hear the common cliche answers often, like “build a chess game” or “command line interface”. There is nothing wrong with these answers.

However, we think these examples don’t match modern programming needs. A big portion of modern software is SaaS and web apps. This means you need to know how to program online.
There are a lot more complexities that go into programming a website, or app, that has users, requires servers, authentication, and databases. This forces you to interact with technologies you’ll never need when developing a command line tic-tac-toe game.

Some of this can also be managed by AWS and other third parties that are commonly used by large corporations, which again exposes you to technologies that are useful and heavily used in industry. This will be far more practical on a resume, as well as help you learn how to use new technologies.

Tip: If you pick a complex project, then focus on building one feature at a time. Building an entire website or app all at once is difficult. Start by building a login page, or maybe the main landing page after the user logs in. If you try to take on the whole project at once then you will likely fail.

Entertainment

1. A web scraper that posts top 10 blogs without human intervention

One of the issues we find with some project recommendation posts is that they recommend projects that aren’t implemented in a way that excites the programmer to continue development. For instance, I see that a lot of people recommend building a web scraper.
Once you’ve built that web scraper and scraped the data, what are you going to do with it?

Instead of just scraping the data, why not build a website with that data? It doesn’t have to be fancy or get a lot of views. This scraper could pull the data into a database and then select the most popular posts. From there it could copy the title, along with a few sentences, and then create a post that it shares online. This would be an impressive and simple project that you can actually show off.
You’ve now shown that you can do more than just code a small segment of a system. Instead, you can think through an entire system. You need to consider how you are going to automate the process, manage the database, create the website and select the posts. This also allows you to actually have a tangible end product.
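
As a rough sketch of the middle of that pipeline (database in, top posts out), something like the following could work. We skip the scraping itself and assume the scraper already produced a list of dicts; the field names, the score column and the example.com URLs are placeholders we made up.

```python
import sqlite3

def save_posts(conn, posts):
    """Store scraped posts; re-scraping the same URL overwrites the old row."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts "
        "(url TEXT PRIMARY KEY, title TEXT, summary TEXT, score INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO posts VALUES (:url, :title, :summary, :score)",
        posts,
    )

def top_posts(conn, limit=10):
    """Pick the most popular posts, using whatever score the scraper collected."""
    return conn.execute(
        "SELECT title, summary, url FROM posts "
        "ORDER BY score DESC, title LIMIT ?",
        (limit,),
    ).fetchall()

def render_digest(rows):
    """Turn the selected rows into the text of a blog post."""
    lines = ["Top posts"]
    for rank, (title, summary, url) in enumerate(rows, start=1):
        lines.append(f"{rank}. {title}: {summary} ({url})")
    return "\n".join(lines)

scraped = [
    {"url": "https://example.com/a", "title": "Post A", "summary": "First.", "score": 5},
    {"url": "https://example.com/b", "title": "Post B", "summary": "Second.", "score": 9},
    {"url": "https://example.com/c", "title": "Post C", "summary": "Third.", "score": 1},
]
conn = sqlite3.connect(":memory:")
save_posts(conn, scraped)
selected = top_posts(conn, limit=2)
digest = render_digest(selected)
print(digest)
```

From here, the remaining pieces are the scraper on one side and the publishing and scheduling on the other.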

Without a tangible end product, it’s really easy to become unmotivated and simply stop at only a web scraper.

In addition, you never know, maybe your site will become popular!

Skills: Database, web scraper, automation, web development (for the blog), and general programming.

2. An event-alert system using Meetup and Eventbrite APIs

Have you ever wanted to go to a band or comedian show, but realized it was last week? Maybe there was a free conference in your area on data science or big data and you missed out because you forgot to check.

Why not make your own aggregator using the Meetup and Eventbrite APIs, that will warn you when keywords are in event descriptions or titles? Now, I assume both Meetup and Eventbrite have similar options. But it is always fun to try to build your own system.
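
The keyword-matching core of such an alert system can be quite small. The sketch below deliberately does not call either API; it assumes you have already fetched events from both services and normalized them into dicts with title and description keys, which is our own made-up format.

```python
def matching_events(events, keywords):
    """Return (title, matched_keywords) for every event mentioning a keyword."""
    wanted = [keyword.lower() for keyword in keywords]
    hits = []
    for event in events:
        text = f"{event.get('title', '')} {event.get('description', '')}".lower()
        matched = [keyword for keyword in wanted if keyword in text]
        if matched:
            hits.append((event["title"], matched))
    return hits

# Hypothetical events, already normalized from the two services.
events = [
    {"source": "meetup", "title": "Big Data Night",
     "description": "Talks on Hadoop and Presto."},
    {"source": "eventbrite", "title": "Pottery Class",
     "description": "Beginner wheel throwing."},
]
alerts = matching_events(events, ["hadoop", "data science"])
print(alerts)
```

Normalizing both sources into one shape first is a design choice that keeps the alert logic independent of either API.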

You can customize the system to work the way you want, and maybe even allow other people to make their own alerts by making this a website. What we enjoy about this project is that you can practice working with two different APIs. This will allow you to compare and contrast what you like and dislike about them. That way, if you’re ever in charge of building an API, you’ll have a better picture of what works and what doesn’t.

Skills: APIs, database, automation, web development, and general programming.

3. A 9GAG copy cat
You don’t always need to try to reinvent the wheel when creating your own projects. A simple project like a site that lets you log in, post photos and GIFs, and scroll through a feed provides an opportunity to create a solid base site first. Then you can add lots of interesting features like following, liking, and search. Search in particular would be a great chance to learn how recommendation systems and machine learning work!

It’s always fun to try and replicate popular sites. In fact, it is actually a great way to learn because you have to reverse-engineer each feature. Reverse engineering is a great skill, because as a software engineer you will constantly be maintaining other people’s code and you will need to get in their heads.

Skills: Machine learning (for recommendation system), database, automation, web development, and general programming.

Retail Type Sites

4. A gift recommendation app

Have you ever struggled to find the right gift for your friend? What if you could create a website that helps predict what to buy a friend for a gift? It could allow the end user to either create an account or just get a gift recommendation.

Again, this allows for the opportunity to create an account which requires authentication, database development, etc.

Also, another great part about this project is you can use Amazon’s API for affiliate links. This will allow you to do a few things. One, learn about how to use APIs and get you comfortable with reading API documentation. Two, if you do it well, you can get a commission for each product someone buys.

This project also has an opportunity to try to create a basic machine learning model. You can create a quiz of sorts that tries to figure out what the best gift is and then, based on whether people click the suggested gift or not, let the model learn from the response rate.
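
One very simple way to start, before reaching for a real machine learning library, is to track the click rate per quiz result and recommend whatever has been clicked most often. The class below is a sketch under that assumption; the profile labels and gift names are invented, and this is a stand-in for a proper model rather than a recommendation of one.

```python
from collections import defaultdict

class GiftRecommender:
    """Recommend the gift with the best observed click rate for a quiz profile."""

    def __init__(self, gifts):
        self.gifts = list(gifts)
        self.shown = defaultdict(int)
        self.clicked = defaultdict(int)

    def record(self, profile, gift, was_clicked):
        """Log one recommendation and whether the user clicked it."""
        self.shown[(profile, gift)] += 1
        if was_clicked:
            self.clicked[(profile, gift)] += 1

    def click_rate(self, profile, gift):
        # Add-one smoothing: a gift that was never shown starts at 0.5, not 0.
        key = (profile, gift)
        return (self.clicked[key] + 1) / (self.shown[key] + 2)

    def recommend(self, profile):
        return max(self.gifts, key=lambda gift: self.click_rate(profile, gift))

recommender = GiftRecommender(["book", "headphones", "mug"])
recommender.record("reader", "book", True)
recommender.record("reader", "book", True)
recommender.record("reader", "mug", False)
print(recommender.recommend("reader"))
```

A real version would also need to keep showing less-proven gifts now and then, otherwise it never learns about them.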

Skills: APIs, database, general programming, and app development.

5. A site for bartering and trading

Think OfferUp, but instead of money, why not create a website that only allows trades? This concept will force you to develop several features that need some thought. You won’t be able to just attack this project without a plan.

How will people post? Where will people find recently posted items? How will people search? All of these are separate features you can build. In addition, you need to think about how users will interact, and maybe even how they actually make the trade.

The idea doesn’t have to be 100% practical for real life — it needs to be practical in the sense of improving your skill set as a programmer.

Skills: Database, web development, general programming, and app development (if you choose to make it an app).

B2B

6. Invoice and contract management system

Contract and invoice management are very complex processes. Contracts can have a lot of nuanced clauses and stipulations that can be difficult to track.

This makes this a very good project, even if you simplify it to some of its core components. Having to translate a complex business process into software is not easy. But it is what makes this project a good challenge.

Again, we wouldn’t overcomplicate this. Take a basic feature, like inputting the terms of a contract, and develop this part first. Then you can add other features like invoice tracking, contract analytics and forecasting.

Skills: Process management, database, web development, and general programming.

7. Task management system

Task boards like KanbanFlow are built from several modular features, which makes a task board a great project to play around with. It will take a little work to get started, as you will need to set up a UI that is robust and dynamic as well. In fact, this project would be more of a two-person job: one person to work on the front end and another person to work on the back end.

Don’t let that discourage you! This is actually a chance for you to work on your communication and teamwork skills. You will need to talk through designs to make sure you both fully understand them, and that you know where your modules will be connecting.
This is always more challenging than it seems.

Skills: Communication, front end, database, web development, and general programming.

8. A job board

Any project that has to support several different types of users adds an interesting design aspect. How will you ensure that the site meets the needs of employers as well as those of prospective job seekers? Like most of the other projects, you don’t need to focus on all of it at once. Start out by building the ability to create a job posting. Then you can go and focus on the job seekers and how they respond.

Skills: Database, web development, and general programming.

9. A website that forecasts profits based on standardized data sets

There are a lot of data sets that are very standardized for most companies. This includes accounting data which is usually based on cost centers, accounts, line descriptions, and finally the actual transaction cost.
What is great about the standardization of any data set is that it makes it easy to create analytics on top of said data sets. Why not create a standardized dashboard that can help companies predict spend, see monthly outgoings, and possibly help them improve their spending?
For this project you will probably have to spend a lot of time learning about how to make sure you keep your data secure. Of course, we recommend first trying to build the modules that focus on ingesting and standardizing the data and displaying it, before you go too deep into security. That’s a rabbit hole you may never escape!
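
For a first version of the forecasting module, even a moving average over monthly totals gives you something to put on a dashboard. The sketch below assumes transactions arrive as dicts with an ISO date string and an amount, which is our own simplified format; real accounting exports will have more fields, and a moving average is only a placeholder for a proper forecasting model.

```python
from collections import defaultdict

def monthly_totals(transactions):
    """Sum transaction amounts per calendar month (keyed as 'YYYY-MM')."""
    totals = defaultdict(float)
    for transaction in transactions:
        month = transaction["date"][:7]  # '2019-04-03' becomes '2019-04'
        totals[month] += transaction["amount"]
    return dict(sorted(totals.items()))

def forecast_next_month(totals, window=3):
    """Predict next month's spend as the average of the last `window` months."""
    if not totals:
        raise ValueError("need at least one month of data")
    recent = list(totals.values())[-window:]
    return sum(recent) / len(recent)

transactions = [
    {"date": "2019-04-03", "cost_center": "IT", "amount": 100.0},
    {"date": "2019-04-20", "cost_center": "HR", "amount": 50.0},
    {"date": "2019-05-10", "cost_center": "IT", "amount": 200.0},
    {"date": "2019-06-01", "cost_center": "IT", "amount": 250.0},
]
totals = monthly_totals(transactions)
prediction = forecast_next_month(totals)
print(totals, prediction)
```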
Skills: Forecasting, business logic, database, web development, and general programming.

Game Ideas

10. Snake
If you had a cellphone in the early 2000s, you’ve probably played Snake. It’s a simple game but you can always try to make things more complex! First, start by just trying to make the game.
This will require you to figure out how to develop a game online. This neon Snake by Sebastian Opperman is a great place to start. But after that, maybe you can add some cool new features like special items or special powers.
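
Whatever you use to draw the game, the core rules fit in one small function that advances the snake by a single tick. Here is a sketch in Python with no rendering at all; the grid coordinates, the tuple layout and the function name are our own assumptions, and for a browser game you would port the same logic to JavaScript.

```python
def step(snake, direction, food, width, height):
    """Advance the game one tick.

    snake is a list of (x, y) cells with the head first.
    Returns (new_snake, ate_food, alive).
    """
    head_x, head_y = snake[0]
    dx, dy = direction
    new_head = (head_x + dx, head_y + dy)

    hit_wall = not (0 <= new_head[0] < width and 0 <= new_head[1] < height)
    ate_food = new_head == food
    # The tail cell moves away this tick unless the snake grows.
    body = snake if ate_food else snake[:-1]
    hit_self = new_head in body

    if hit_wall or hit_self:
        return snake, False, False
    return [new_head] + body, ate_food, True

snake = [(2, 2), (1, 2), (0, 2)]
snake, ate, alive = step(snake, (1, 0), food=(3, 2), width=5, height=5)
print(snake, ate, alive)
```

Keeping the rules separate from the drawing code also makes special items or powers easier to add later.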

This would be a chance to play around and have fun. This project won’t be as technical from the standpoint of having lots of users that sign up and use your site. However, it is a good challenge to figure out how to make a game run online.

Skills: Web development, general programming, and UI.

We do hope this list inspires you to create an awesome new project that you can add to your resume and talk about in interviews. Maybe we’ll see you as the next CEO of a billion dollar startup!




