
5 Articles That Can Help You Improve Your Data Strategy

3/14/2021

Startaê Team on Unsplash

Even small companies these days manage an average of 47.81 terabytes of data. Whether you're a small company or a trillion-dollar behemoth, data is driving decisions. But as data ecosystems become more complex, having the right data strategy is the foundation for succeeding with data.

Developing a data strategy involves more than just considering which data warehouse or ETL tools you will use. You will also need to think through the various use cases and business initiatives your company is taking on.


But where do you start with your data strategy?

In this article, we have shared several articles your teams may enjoy that focus on developing and improving your data team's strategy. 
​
1. The 5 Mistakes Ruining Your Data-Driven Strategy
Companies of all sizes have embraced using data to make decisions. However, according to a 2019 report from Goldman Sachs, it’s actually quite difficult for businesses to use data to build a sustainable competitive advantage.

Our team has worked with and for companies across industries. We’ve seen the good, the bad, and the ugly of data strategy. We’ve seen teams implement successful data lifecycles, dashboards, machine learning models, and metrics. We’ve also had to come in and untangle, delete, migrate, and upgrade entire data systems.

Throughout these projects, we've seen several issues pop up repeatedly: a lack of data governance; bad data; overly complex Excel documents; a lack of alignment between data teams and the business; and an overabundance of dashboards, leading to confused decision-making.
All of these data issues compound over time and slowly erode a team or company’s ability to trust and use their data.
​

In this article, we’ll discuss some of these issues as well as possible solutions your teams can implement to improve your overall data lifecycle.

Read More Here

​2. How To Modernize Your Data Architecture 

​Data is continuing to prove to be a valuable asset for businesses of all sizes.

I say that in part because consulting firms like McKinsey have found in their research that companies using AI and analytics can attribute 20% of their earnings to them.

Similarly, I have been able to consult for several clients and help them find new revenue sources as well as cost reduction opportunities.

There is one catch.

You will need to develop some form of data infrastructure or update your current one to make sure you can fully harness all the benefits that the modern data world has to offer.

Just to clarify, I don't mean you need the fanciest and most expensive data tooling. Sometimes I have steered clients toward much simpler and more cost-effective data analytics tooling.
​

In this article, we will discuss what you should avoid when building your data architecture and which questions you should be asking yourselves as you try to build out your future data infrastructure.
Read More Here

​3. 17 Questions You Need To Ask About Your Data Analytics Strategy
There are plenty of cliches about data and its likeness to oil or companies being data-driven. Is there truth to all this hype about data strategy, predictive modeling, data visualization, and machine learning?

In our experience, these cliches are true. In the past few years, we have already helped several small and medium-sized businesses take their data and develop new products, gain invaluable insights and create new opportunities for their businesses that they didn’t have before.

Many small and medium-sized businesses are starting to take advantage of easy access to cloud computing technologies such as AWS, which let your teams perform data analysis more easily, from anywhere, using the same technology billion-dollar corporations use, at a fraction of the cost.

So what are you doing to improve your business data strategy today?

To help answer this question, our team has put together a data strategy assessment that will highlight where your team is doing well and where it can improve its data strategy.
​

Read More Here

​​4. How To Improve Your Data Science Teams' Efficiency


Companies of all sizes are looking into implementing data science and machine learning into their products, strategies, and reporting.

However, as companies start managing data science teams, they quickly realize there are a lot of challenges and inefficiencies that said teams face.

Although it has been nearly a decade since the over-referenced "data scientist is the sexiest job" article, there are still a lot of inefficiencies that slow data scientists down.
​

Data scientists still struggle to collaborate and communicate with peers across departments. Also, the explosion of data sources inside companies has only made it more difficult to manage data governance. Finally, the lack of a coherent and agreed-upon process in some companies makes it difficult for teams to get on the same page.

All of these pain points can be fixed. There are tools and best practices that can help improve your data science teams' efficiencies. In this article, we will discuss these problems and how your team can approach them so you can optimize your data science team's output.
Read More Here

​5. Developing A Data Analytics Strategy For Small Businesses And Start-ups


If you're a small business or start-up, you're probably reading articles about companies using data science, data analytics, and machine learning to increase their profits and reduce their costs. In fact, McKinsey just came out with a study that found the companies they surveyed could attribute 20% of their bottom line to AI implementations. All those trendy and hyped-up words are proving to be effective for companies of all sizes.

As data consultants, we have had the opportunity to help multiple clients in industries like healthcare, insurance, and transportation realize similar gains and cost savings. All of which started with us helping them determine what was the best data strategy for them.

In this article, we wanted to take you through a few of the steps we walk clients through to help them figure out their future data strategy. We hope this article can help you take into consideration what your goals are and perhaps how data can help you achieve those goals in 2021.
Read More Here


How Will You Improve Your Data Strategy This Year?

Using data to make better decisions gives companies a competitive advantage. However, this depends on the quality of the data and the robustness of the data processes behind it.
Simply creating dashboards, data warehouses, and machine learning models is not sufficient to make data-driven decisions.

Teams need to consider their data life cycles and the processes used to manage each step. Creating test cases, clear goals, and well-defined processes can help improve your team's performance and strategy. No one wants to get bogged down with too many processes and bureaucracy, but having no plan or strategy for your team's data life cycle will also fail.
To avoid these problems, consider reading the articles above.

If you are interested in reading more about data science or data engineering, then read the articles below.
What Are ETLs and Why You Should Use Them
4 SQL Tips For Data Scientists
What Are The Benefits Of Cloud Data Warehousing And Why You Should Migrate
5 Great Libraries To Manage Big Data With Python
What Is A Data Warehouse And Why Use It
Hiring Data Science Guide – A Guide For Interviewing And Onboarding.
Kafka Vs RabbitMQ
SQL Best Practices — Designing An ETL Video

​

MLOps Best Practices - Ideas to Keep in Mind When Developing a ML Pipeline

3/10/2021

Photo by Karo Kujanpaa on Unsplash
By Travis Wolf
Introduction — “What is MLOps? — DevOps for ML” — Kaz Sato
Challenges arise as the production of machine learning models scale up to an enterprise level. MLOps plays a role in mitigating some of the challenges like handling scalability, automation, reducing dependencies, and streamlining decision making. Simply put, MLOps is like the cousin of DevOps.
It's a set of practices that unify the process of ML development and operation.
This article serves as a general guide for anyone looking to develop their next machine learning pipeline, offering short summaries that introduce the core ideas of MLOps.

1. Communication and collaboration between roles — “ML products are a team effort”

Production of a successful machine learning lifecycle is a lot like racing in Formula One. From the outside, it appears that the driver is the only one responsible for getting the car around the track, but in reality, there are upwards of 80 team members behind the scenes. Developing an enterprise-level ML product is similar. The data scientist sits in the driver's seat, directing how the model will be built every step of the way. However, this assumes data scientists have expertise in every step of development, which is commonly not the case. The driver is not going to get out and perform maintenance on the car; they need a team of engineers, mechanics, and strategists to be successful.
A successful data science team likewise consists of roles that bring different skill sets and responsibilities to the project. Subject matter experts, data scientists, software engineers, and business analysts each play an important role, but they don't all use the same tools or share the same background knowledge, which makes communicating effectively with one another difficult. That is why practicing collaboration and communication between these roles at every step of the way is what gets the car around the track as quickly as possible.

2. Establish Business Objectives — “Have a clear goal in mind”

Every business has key performance indicators (KPIs). These are measurable values that reflect how well a company is achieving its objectives. The first step in the machine learning lifecycle is taking a business question and determining how it can be solved with ML: "What do you want from the data?" Machine learning models are evaluated with metrics like accuracy, precision, and recall, so how can their predictions be translated into real-world metrics that are easily understood by project members outside of the data team?

Problems arise for data science teams all the time when they struggle to prove to stakeholders and upper management how their model is providing value to the company. Often, a model falls short because there weren't clear objectives during the exploratory and maintenance phases of development. To combat this, broad business questions like "How do we get people to stay longer on a web page?" need to be translated into performance metrics that can be set as goals for the model to strive for. The point of this practice is to give data engineers and data scientists a foundational starting point to work from and to avoid the risk of solving a problem that doesn't serve the business in the long run.
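As a rough illustration of translating model metrics into business terms, here is a minimal sketch. It assumes scikit-learn is available, and the dollar values per conversion and per outreach are purely hypothetical placeholders:

from sklearn.metrics import precision_score, recall_score

# Toy labels: 1 = user converted/stayed, 0 = user churned.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

VALUE_PER_CONVERSION = 120.0  # hypothetical revenue per retained user
COST_PER_OUTREACH = 5.0       # hypothetical cost of acting on a prediction

predicted_positives = sum(y_pred)
expected_revenue = precision * predicted_positives * VALUE_PER_CONVERSION
outreach_cost = predicted_positives * COST_PER_OUTREACH

print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")
print(f"Estimated net impact: ${expected_revenue - outreach_cost:,.2f}")

Framing the output as dollars rather than an F1 score is what lets stakeholders outside the data team judge whether the model serves the business objective.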

3. Obtaining the Ground Truth — “Validate the dataset”

Arguably the most important step in developing any machine learning model is verifying the labels of the dataset. Ground truthing is imperative so that the trained model produces predictions that accurately reflect the real world.

Legitimizing the source of the data and labeling it correctly can be an arduous process, and maybe even the most time-consuming of all. That is why it is important to estimate the time and resources required early in the development process; depending on the size of the dataset, labeling can become a real hindrance. For example, if the model is trained to detect objects in a picture, obtaining the ground truth involves labeling each observation in the dataset with a bounding box, which has to be done manually.
After the model is put into production, its performance might drift away from the original predictions and no longer reflect the population it was trained on. In this case, retraining the model on new labels will be necessary and will cost time and resources.

4. Choosing the Right Model — “Experiment and Reproduce”

Different challenges require different tools, and in this case, different algorithms. Algorithms range from the very simple, like linear regression, to advanced deep learning neural networks and everything in between. Each has its advantages and disadvantages and will pose certain considerations affecting MLOps.

From an MLOps perspective, the best way to narrow down the algorithm depends on two things: what kind of data it will operate on, and how well it fits into the CI/CD pipeline. Keeping these two principles in mind will ultimately reduce dependencies and make life easier when it comes to the deployment phase.

This process will involve a lot of experimentation and validation, and the result ultimately needs to be reproducible by the DevOps team for deployment.

Practice experimenting with simple models and working up in complexity. This will help you find the best balance between effectiveness and use of resources. The goal is to end up with a model that is plug and play (like an API), scalable, compatible with the production environment, and whose inputs and outputs are easy to understand.
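As a minimal, hedged sketch of that "simple first, then more complex" workflow, the snippet below compares two candidate models with cross-validation and a fixed random seed so the experiment can be reproduced. It assumes scikit-learn and uses a built-in toy dataset as a stand-in for your own data:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Candidates ordered from simple to more complex.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f} (+/- {scores.std():.3f})")

Logging these scores (or tracking them in an experiment tracker) is what makes the eventual model choice defensible and repeatable.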

5. Determine the Type of Deployment — “Continuous Integration and Delivery”

There are typically two methods to consider when deploying a model once it reaches the production stage: embedding the model into an application, or staging it as Software-as-a-Service, or in this case "model-as-a-service". Each has its advantages, disadvantages, costs, and constraints.

But it is important to have this in mind before setting off on the development of a machine learning pipeline, because certain software frameworks will only support specific packages. The production environment needs to be compatible with the model of choice.

MLOps addresses the challenges that arise once the model is ready to enter production. It borrows the continuous integration and continuous delivery (CI/CD) principle commonly used in DevOps. The difference is that with ML, the data is continuously being updated as well as the models, whereas traditional software only requires CI/CD for code.

Once the model has been trained and evaluated to perform in a way that delivers value to its user, the next step is deploying the pipeline in a way that can be continuously integrated as the data changes over time. This adds some challenges and needs for MLOps to continuously deliver newly trained and validated models.
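To make the "model-as-a-service" option concrete, here is a minimal sketch of wrapping a trained model in an HTTP endpoint. It assumes Flask and joblib; the model file, route name, and port are hypothetical, and a real deployment would sit behind the CI/CD pipeline described above:

import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.joblib")  # hypothetical artifact produced by the pipeline

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body like {"features": [[5.1, 3.5, 1.4, 0.2]]}
    features = request.get_json()["features"]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)

A client then POSTs a JSON payload of feature values to /predict, and the pipeline's job is to rebuild and redeploy this service whenever a newly validated model artifact is produced.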


6. Containerization — “Works every time, all the time”

Open-source software packages are often very rigid with their dependencies. In turn, this forces software to rely on the exact package versions and modules. Keeping with the theme of streamlining and standardization in MLOps, containerization serves as a way to automate the machine learning environments from development into production. Technologies like Docker serve as a way to practice containerization.

Tools like Docker provide an isolated environment for the model and its accompanying applications to run in production, ensuring that environment will always have the required packages available.

Each module of the pipeline can be kept in a container that keeps the environment variables consistent. This reduces the number of dependencies the model needs to work correctly. When multiple containers are in play at deployment, Kubernetes serves as a great MLOps tool.


Conclusion

In the past year, MLOps has seen exponential growth.

More businesses at an enterprise level are looking to invest in MLOps.

MLOps will serve to streamline, automate, and help scale their ML pipelines. The field is growing every day, and not everything could be covered in this article. However, keep some of these ideas in mind so that when your team is considering its next project, you can introduce the concept of MLOps.

If you aren’t sure what you want to do with your data, then feel free to reach out to us. I would be happy to help outline some possibilities with you for free.
Drop some time on my calendar today!

Also, feel free to read more about data science and data engineering in the articles below

How Your Team Can Take Advantage Of Your Data Without Hiring A Full-Time Engineer
What Are The Benefits Of Cloud Data Warehousing And Why You Should Migrate
5 Data Analytics Challenges Companies Face in 2021 With Solutions
How To Write Better SQL - Episode 1

References
Introducing MLOps, by Mark Treveil, Nicolas Omont, Clément Stenac, Kenji Lefevre, Du Phan, Joachim Zentici, Adrien Lavoillotte, Makoto Miyazaki, and Lynn Heidmann (2020).

7 ETL and ELT Tools To Move Your Data Into Your Datawarehouse

11/20/2020

The rise in self-service analytics is a significant selling point for data warehousing, automatic data integrations, and drag-and-drop dashboards. In fact, the largest software IPO of 2020 was a data warehousing company called Snowflake.
The question is how do you get your data from external application data sources into a data warehouse like Snowflake?

The answer is ETLs and ELTs.

ETLs (Extract, Transform, Load) are far from new but they remain a vital aspect of Business Intelligence (BI). With ETLs, data from different sources can be grouped into a single place for analytics programs to act on and realize key business insights.

ELTs have the exact same steps as ETLs, just in a slightly different order; the major difference lies in when the transform step occurs. We will discuss in depth what the T stands for shortly. Abstractly, though, it refers to the business logic, data pivoting, and transformations that often take a lot of time and resources to maintain.

In this article, we will cover the various ETL and ELT tools that you can use.

Different Types Of ETL  and ELT Tools
When it comes to styles of ETL and ELT tools, there is a vast array of options.

Not all of which require code.

No Code/Low Code

That's right, there are plenty of ETL and ELT tools that fall into the low-code/no-code category. These tools range from drag-and-drop to GUI-based. Some examples include Fivetran and SSIS (which we will discuss below).

You can often handle everything from scheduling to dependency management without really knowing code (or what you are doing at all). There are pros and cons to these types of tools.

In particular, they can be quite rigid if you need more complex functionality that would be easy to implement in code.

That being said, most of them will still allow you to write custom code or SQL. This is arguably one of the more important factors, as you will rarely be able to get away from the "T" portion of an ELT. That step requires some form of business layer to be implemented.

Workflow Automation Code Frameworks 
After low-code/no-code ETLs, there are workflow automation frameworks such as Airflow and Luigi. Both are Python libraries that manage a lot of the heavy lifting, infrastructure-wise, for automation.

For example, Airflow provides dependency management, scheduling, various operators that connect to cloud sources and destinations, logging and a dashboard to help track how your jobs are doing.

Each of these components might otherwise take a team of engineers to develop. However, with all of it available via a simple pip install, much of the boring work is already done.

From here you can focus on developing ETLs.

Custom Code
This option is rarely a good choice, unless your pipelines rarely need to run. Developers using this method write all their code from scratch and set up jobs to run with cron. You can use a language like Python or even PowerShell (which we have seen) to automate your processes.

From there, you can have your SQL files called by your preferred wrapper language. Of course, this also means you will need to develop your own logging and error-handling system, and perhaps even a metadata database.

If you don't have complex dependencies, this really isn't a problem. However, most ETL pipelines need an order of operations, which means you will run into issues.
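For illustration, here is a minimal sketch of what this custom-code approach tends to look like: a small Python wrapper that runs a SQL file and logs the outcome, scheduled from cron. It assumes SQLAlchemy, and the file name, connection string, and cron entry are hypothetical:

# A minimal custom-code ETL wrapper. Scheduled externally, e.g. with a cron
# entry such as:  0 2 * * * /usr/bin/python3 /opt/etl/run_daily_load.py
import logging
from sqlalchemy import create_engine, text

logging.basicConfig(filename="etl.log", level=logging.INFO)

def run_sql_file(path: str, connection_string: str) -> None:
    # Assumes the file holds a single SQL statement.
    engine = create_engine(connection_string)
    with open(path) as f, engine.begin() as conn:
        conn.execute(text(f.read()))

if __name__ == "__main__":
    try:
        run_sql_file("transform_orders.sql", "postgresql://user:pass@host/db")
        logging.info("Daily load succeeded")
    except Exception:
        logging.exception("Daily load failed")
        raise

Everything beyond this, such as retries, dependencies between jobs, alerting, and backfills, is exactly the work that frameworks like Airflow and Luigi already handle for you.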

In the next section, we will discuss 7 different tools that range from no-code options to code-based frameworks.

7 ETL And ELT Tools


Airflow
​
Airflow is a workflow scheduler that supports both task definitions and dependencies in Python.

It was written at Airbnb in 2014 to execute, schedule, and distribute tasks across several worker nodes. Made open source in 2016, Airflow not only supports calendar scheduling but is also equipped with a clean web dashboard that allows users to view current and past task states.
Using what it calls operators, your team can utilize Python while benefiting from the Airflow framework.

For example, you can create a Python function, call it with the PythonOperator, and quickly set which tasks it depends on, when it should run, and several other parameters.
In addition, all of the logging and tracking is already taken care of by Airflow's infrastructure.
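Here is a minimal sketch of what that looks like, written against the Airflow 1.x-style imports that were current when this post was published; the DAG and task names are hypothetical:

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def extract():
    print("pulling data from the source system")

def load():
    print("loading data into the warehouse")

with DAG(
    dag_id="example_etl",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Dependency declaration: load runs only after extract succeeds.
    extract_task >> load_task

The web dashboard then shows the state of each task on every scheduled run, with logs captured automatically.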


Luigi

Luigi is an execution framework that allows you to write data pipelines in Python.

This workflow engine supports task dependencies and includes a central scheduler, along with a detailed library of helpers for building data pipelines in MySQL, AWS, and Hadoop. Not only is it easy to depend on the tasks defined in its repos, it's also very convenient for code reuse; you can easily fork execution paths and use the output of one task as the input of another.

This framework was written by Spotify and became open source in 2012. Many popular companies such as Stripe, Foursquare, and Asana use the Luigi workflow engine.
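As a minimal sketch (with hypothetical file names), a Luigi pipeline chains tasks through requires(), and each task's output() target is what lets Luigi skip work that has already completed:

import luigi

class ExtractOrders(luigi.Task):
    def output(self):
        return luigi.LocalTarget("orders_raw.csv")

    def run(self):
        # Stand-in for pulling data from a source system.
        with self.output().open("w") as f:
            f.write("order_id,amount\n1,9.99\n")

class TransformOrders(luigi.Task):
    def requires(self):
        return ExtractOrders()

    def output(self):
        return luigi.LocalTarget("orders_clean.csv")

    def run(self):
        # Trivial transform: read the upstream output and rewrite it.
        with self.input().open() as src, self.output().open("w") as dst:
            dst.write(src.read().upper())

if __name__ == "__main__":
    luigi.build([TransformOrders()], local_scheduler=True)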



SSIS

SSIS, or SQL Server Integration Services, is Microsoft's workflow automation tool. It was developed to make automation easy, providing developers with drag-and-drop tasks and data transformations that just require a few parameters to be filled out. SSIS's GUI also makes it very easy to see which tasks depend on which.

Because SSIS only allows for a limited number of built-in data transformations, it also offers a custom code transformation so data engineers aren't limited to the basic transforms it ships with.



Talend


Talend is an ETL tool with a similar feel to SSIS. It has drag-and-drop blocks that you can easily select and use for destinations, sources, and transformations. It connects to various data sources and can even help manage and integrate real-time data from systems like Kafka.

Talend's claim to fame is that it is 7x faster and ⅕ the cost. However, when it comes down to it, most products will state something similar, and it can take a little fine-tuning to get that optimal performance. At the end of the day, your performance depends more on who builds your pipelines and who designs your data warehouses than on the product you use.


Picture

Fivetran

Fivetran is a highly comprehensive ELT tool that is becoming more popular every day. Fivetran allows efficient collection of customer data from related applications, websites, and servers. The data collected is then transferred to other tools for analytics, marketing, and warehousing purposes.

Not only that, Fivetran has plenty of functionality. It has your typical source-to-destination connectors, and it allows for both pushing and pulling of data. The pull connectors pull from data sources using a variety of methods, including ODBC, JDBC, and multiple API methods.

Fivetran’s push connectors receive data that a source sends, or pushes, to them. In push connectors, such as Webhooks or Snowplow, source systems send data to Fivetran as events.

Most importantly, Fivetran allows for different types of data transformations, putting the T in ELT. It also allows for both scheduled and triggered transformations. Depending on the transformations you use, there are also other features like version control, email notifications, and data validations.


Picture

Stitch

Stitch was developed to take a lot of the complexity out of ETLs. One of the ways Stitch does this is by removing the need for data engineers to create pipelines that connect to APIs like in Salesforce and Zendesk.

It also attaches to a lot of databases as well like MySQL. But it’s not just the broad set of API connectors that makes Stitch easy to use.

Stitch also removes a lot of the heavy lifting as far as setting up cron jobs for when the task should run as well as manages a lot of logging and monitoring. ETL frameworks like Airflow do offer some similar features. However, these features are much less straightforward in tools like Airflow and Luigi.
Stitch is managed almost entirely through a GUI, which can make it a more approachable option for non-data engineers. It does allow you to add rules and set times when your ETLs will run.


Picture

Airbyte

Airbyte is a new open-source (MIT) EL+T platform that started in July 2020. It has a fast-growing community and it distinguishes itself by several significant choices:

Airbyte's connectors are usable out of the box through a UI and an API, with monitoring, scheduling, and orchestration. Their ambition is to support 50+ connectors by the end of 2020. These connectors run as Docker containers, so they can be built in the language of your choice. Airbyte's components are also modular, and you can decide to use subsets of the features to better fit your data infrastructure (e.g., orchestration with Airflow or K8s or Airbyte's…)

Similar to Fivetran, Airbyte integrates with DBT for the transformation piece, hence the EL+T. In contrast to Singer, Airbyte uses one single open-source repo to standardize and consolidate all developments from the community, leading to higher-quality connectors. They have built a compatibility layer with Singer so that Singer taps can run within Airbyte.

Airbyte’s goal is to commoditize ELT, by addressing the long tail of integrations. They aim to support 500+ connectors by the end of 2021 with the help of its community.

In Conclusion

Today's corporations demand easy and quick access to data. This has led to increasing demand for turning data into self-serviceable systems.

ETLs play a vital part in that system. They ensure analysts and data scientists have access to data from multiple application systems. This makes a huge difference and lets companies gain new insights.

There are tons of options as far as tools go, and if you're just starting to plan how your team will move forward with your BI and data warehouse infrastructure, you should take some time to figure out which tools are best for you.

What you pick will have a lasting impact on who you hire and how easy your system is to maintain. Thus, take your time and make sure you understand the pros and cons of the tools you pick.
From there, you can start designing based on your business's needs.

If you are interested in reading more about data science or data engineering, then read the articles below.
How To Use AWS Comprehend For NLP As A Service
4 SQL Tips For Data Scientists
What Are The Benefits Of Cloud Data Warehousing And Why You Should Migrate
5 Great Libraries To Manage Big Data With Python
What Is A Data Warehouse And Why Use It
Kafka Vs RabbitMQ
SQL Best Practices — Designing An ETL Video

Hadoop vs HDFS vs HBase vs Hive - What Is The Difference?

5/15/2020

Photo by Hunter Harritt on Unsplash
​

With technology changing rapidly, more and more data is being obtained regularly.
A recent study suggests that around 2.7 Zettabytes of data exist today in the digital universe!

Therefore, companies now require different software to manage huge amounts of data. They are constantly looking for ways to process and store massive amounts of data and distribute it across different servers so that departments can easily work with it and derive helpful results.


​In today's article, we will discuss Hadoop, HDFS, HBase, and Hive, and how they help us process and store large amounts of data to extract useful information.

​

Photo by Davi Mendes on Unsplash
HADOOP 
Apache Hadoop is open-source software that offers various utilities to facilitate using a network of multiple computers to solve big data problems.

Hadoop also provides a software framework for distributed computing as well as distributed storage. To do so, it divides a file into several blocks stored across a cluster of machines. To achieve fault tolerance, Hadoop replicates these blocks across the cluster. It then performs distributed processing by dividing a job into several smaller, independent tasks.

These tasks run in parallel over the cluster of computers. As discussed earlier, Hadoop distributes processing of large data sets across a cluster so that multiple machines work simultaneously. To process any data on Hadoop, the client submits the program and data to the cluster: HDFS stores the data, the MapReduce function processes it, and YARN divides up the tasks.

Let's discuss the workings of Hadoop in detail:
  1. HDFS: HDFS, or the Hadoop Distributed File System, is a master-slave topology with two daemons running: DataNode and NameNode.
  2. MapReduce: an algorithm that processes your big data in parallel on the distributed cluster and then combines the intermediate results into a final output (see the sketch after this list).
  3. YARN: YARN splits resource management and job monitoring/scheduling into separate daemons. It can scale beyond just a few thousand nodes because YARN federation allows a user to wire multiple sub-clusters into one big cluster, so many independent clusters can be used together for one larger job.
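As a hedged illustration of the MapReduce idea, here is a minimal word-count sketch using the mrjob Python library (an assumption on our part; plain mapper and reducer scripts submitted through Hadoop Streaming work just as well):

from mrjob.job import MRJob

class WordCount(MRJob):
    def mapper(self, _, line):
        # Emit a (word, 1) pair for every word in the input line.
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # Hadoop groups values by key, so summing gives the total per word.
        yield word, sum(counts)

if __name__ == "__main__":
    WordCount.run()

The mapper emits (word, 1) pairs, the framework shuffles and sorts them by key across the cluster, and the reducer sums the counts for each word.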

HDFS 

As mentioned earlier, HDFS is a master-slave topology running two daemons: DataNode and NameNode. The DataNode runs on the slave nodes and stores the data in Hadoop. At startup, each DataNode connects to the NameNode and keeps checking for requests to access data. The NameNode, on the other hand, stores the directory of files in the file system. It is also responsible for tracking where the file data resides across the cluster; however, it does not store the data contained in those files.

With the Hadoop Distributed File System, you can write data once to the server and then read and use it many times. HDFS is a great choice for the high volumes of data required today. The reason is that HDFS works with a main node and multiple worker nodes on a cluster of commodity hardware. All the nodes are usually organized within the same physical rack in the data center; when the data is broken into blocks, the blocks are distributed among different nodes for storage. The blocks are also replicated across nodes to reduce the chance of failure.

HDFS uses validations and transaction logs to ensure data integrity across the cluster. Usually, there is only one NameNode, possibly alongside a DataNode, running on one physical server, while all other servers run only DataNodes.

HBase 

Although Hadoop is great for working with big data sets, it only performs batch processing, so data can only be accessed sequentially. The good thing is that the Hadoop ecosystem includes other applications, such as HBase, which can randomly access huge data files according to the user's requirements. This is extremely useful for gaining concrete insights, which is exactly why 97.2% of organizations are investing in big data-related tools and software.

HBase is an open-source, column-oriented database built on top of the Hadoop file system, and it is horizontally scalable. The data model of HBase is very similar to that of Google's Bigtable design. It not only provides quick random access to great amounts of unstructured data, but also leverages the same fault tolerance provided by HDFS.

HBase is part of the Hadoop ecosystem that provides read and write access in real-time for data in the Hadoop file system. Many big companies use HBase for their day-to-day functions for the very same reason. Pinterest, for instance, works with 38 clusters of HBase to perform around 5 million operations every second!

What's even greater is that HBase provides low-latency access to single rows from among millions of records. To do this, HBase internally uses hash tables and then provides random access to indexed HDFS files.

HIVE

Now, while Hadoop is very scalable, reliable, and great for extracting data, its learning curve is too steep for it to be cost-efficient and time-effective on its own. Another great alternative is Apache Hive. Hive is data warehouse software that allows users to quickly and easily write SQL-like queries to extract data from Hadoop.

The main purpose of this open-source framework is to process and store huge amounts of data. With plain Hadoop, you implement SQL-like queries using the MapReduce Java API, while with Apache Hive you can bypass the Java and simply access data using SQL-like queries.

The way Apache Hive works is very simple: it translates a program written in HiveQL into one or more Java MapReduce, Spark, or Tez jobs. It then organizes the data into HDFS tables and runs the jobs on a cluster to produce results. Hive is a simple way of applying structure to large amounts of unstructured data and then performing SQL-based queries on it. Since it offers a JDBC (Java Database Connectivity) interface, it can easily integrate with traditional data center technologies.
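As a minimal sketch of what this looks like in practice, the snippet below runs a HiveQL query from Python. It assumes the PyHive client library and a reachable HiveServer2 endpoint; the host and table names are hypothetical:

from pyhive import hive

conn = hive.Connection(host="hive-server.example.com", port=10000)
cursor = conn.cursor()

# HiveQL looks like ordinary SQL; Hive compiles it into MapReduce/Tez jobs.
cursor.execute("""
    SELECT department, COUNT(*) AS visit_count
    FROM visits
    GROUP BY department
""")

for department, visit_count in cursor.fetchall():
    print(department, visit_count)

cursor.close()
conn.close()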

Some of the most important components of Hive are: 
  1. Metastore: This is where the schemas of Hive tables are stored. The Hive Metastore holds all information regarding the partitions and tables in the warehouse. By default, it runs in the same process as the Hive service. 
  2. SerDe: SerDe, or Serializer/Deserializer, gives Hive instructions on how each record should be processed.
​
CONCLUSION

In the above article, we discussed Hadoop, Hive, HBase, and HDFS. All of these open-source tools are designed to process and store big data and derive useful insights.

Hadoop handles file storage and grid-compute processing with sequential operations. Hive provides an SQL-like interface on top of Hadoop so you can bypass Java coding. HBase is a column-based distributed database system modeled after Google's Bigtable, which is great for randomly accessing Hadoop files. Lastly, HDFS is the master-slave file system that stores files in a Hadoop environment.
​
All these frameworks are based on big data technology since their main purpose is to store and process massive amounts of data. 

​If you would like to read more about data science, cloud computing and technology, check out the articles below!
Data Engineering 101: Writing Your First Pipeline
5 Great Libraries To Manage Big Data
What Are The Different Kinds Of Cloud Computing
4 Simple Python Ideas To Automate Your Workflow
4 Must Have Skills For Data Scientists
SQL Best Practices --- Designing An ETL Video
5 Great Libraries To Manage Big Data With Python
Joining Data in DynamoDB and S3 for Live Ad Hoc Analysis

6 Tips to Start Your Career as a Data Scientist

3/23/2020


This is the era of Industry 4.0, and technology is developing at a rapid rate. The skills required to work with these technologies need to evolve as well; otherwise, they will become obsolete. Data science, in particular, is growing rapidly this decade, and honestly, you can expect it to become even more popular in the near future.

In fact, Analytics Training suggests that about 30 million TB of data is generated every single day by over 6 billion devices connected to the internet! So, with the rise of big data comes the need for highly skilled specialists who can interpret this data and extract useful information from it. This is exactly where a data scientist comes in.

In this article, we will discuss the top 6 tips to start a career as a data scientist. So let's begin:

What is a Data Scientist?

Before we move on to the tips, let us first understand what a data scientist is. A data scientist is basically an analytical data expert who is responsible for the collection, analysis, and interpretation of big data.

Although this field was not on the radar a few decades ago, its sudden popularity reflects how businesses are now adopting big data. A data scientist is a sort of magician who uses all this unstructured information to boost revenue by extracting useful business insights. The job is an offshoot of various traditional technical roles spanning maths, computer science, and statistics.


Photo by Clint Adair on Unsplash

Top 6 Tips to Start Career as a Data Scientist


Now, if you are interested in routinely solving new, challenging problems, data science is definitely a field you can make a career in. Research suggests there is huge demand in the data science market, which the United States leads, requiring over 190,000 data scientists in 2020 alone! Here are some tips that you can follow:

1. Choosing a Role: 
Choosing the right role is very important in this field. You can be a machine learning expert, a data engineer, a data visualization specialist, and so on. Depending on your work experience and study background, you can choose the role that suits you. For instance, if you have studied software engineering, getting into data engineering wouldn't be difficult for you. In order to make the best choice, it is suggested that you talk to people who are already in the field.

2. Take Up a Course: 
Now that you have chosen a role, the next thing is to put dedicated effort into learning the work you will be required to do in that role. The awesome thing about data science is that you can learn a great deal about it online. In fact, according to a recent study by IBM, the demand for data scientists is expected to increase by 28% this year! 
You can go to Udacity, IBM, Coursera, edX, etc. and take certifications or courses to polish your skills. All these online learning platforms offer coursework, assignments, quizzes, case studies, and a comprehensively prepared study flow. Taking two or three courses can bring a lot of value to your work.

3. Build Your Portfolio:
Once you have completed your courses, the next logical step is to build your data science portfolio. It is important to bring a well-thought-out portfolio when you are working in this super-competitive field. Since you may not yet hold relevant industry experience, you can share personal data science projects that demonstrate your skill set. You can also share the course projects created while completing your online certifications. And lastly, you can create some volunteer projects and showcase them in your portfolio.

4. Network with Relevant People: 
Now that you are ready to jump into the race, it is essential to network with industry peers and even get support and advice from them. The best way to do so is to attend events and meetups in the field. This is important because your peers will not only keep you motivated but also help you overcome hurdles. We have some meetups in Dublin, under the name of Dublin Data Science, so you can search for the ones in your area and jump right in.
5. Master Soft Skills: 
Some of the skills required to establish a career in data science are obvious: you need to be an expert at coding in a certain language or have a sound understanding of how the technology works. But there are some lesser-known skills that you need to master in order to stand out from the crowd. Soft skills such as creative thinking, time management, and innovation are important for working in this sector.

Photo by Shahadat Rahman on Unsplash

You see, data scientists have to approach routine problems by fusing creative thinking with logical concepts to build best-fit solutions. In fact, recent research by RS News suggests that employers now rate each skill on a scale from 0-10 when you go for a job interview. Therefore, mastering soft skills like these is also very important to make you stand out.

6. Follow the Field Experts:
No matter what point of learning or expertise you are at, it is important to draw on the right sources of knowledge. One of the most useful sources can be the informative blogs run by data scientists. Interestingly, experts in this field are quite active online, and they frequently update their followers with new findings and advancements in data science. Some of the best data scientists to follow include:
  1. Yoshua Bengio - He is the founder of ApSTAT Technologies and has 22 years of experience as a university professor. He has also worked with MIT and AT&T as a machine learning researcher in the past.
  2. Geoffrey Hinton - He holds a PhD in AI from Edinburgh and is one of the pioneers of the backpropagation concept used in deep learning simulations as well as the algorithms employed for training neural nets. He also worked on artificial neural networks with the Google AI team in 2013.
  3. Peter Norvig - He is a data scientist currently employed as a research director at Google. In the past, he also worked at NASA as head of the Computational Sciences Division, so the level of authority is genuinely high.
  4. Jurgen Schmidhuber - He is an AI specialist currently working on RNNs and machine learning (currently used by Baidu, Google, IBM, and Microsoft). So far, he has published about 333 peer-reviewed papers and earned the Helmholtz Award (2013) as well as the IEEE Neural Networks Pioneer Award (2016).

So Are You Ready To Start Your Data Science Journey?
In this era, the demand for data science is huge, and employers are investing both time and money in data scientists. Therefore, it is important for you to take the right steps in order to enjoy exponential growth. Before you step in, have a look at the tips we have shared above and follow them so you can start off without making any costly mistakes.


If you would like to learn more about data science or data in general, please feel free to read some of the posts below!

What Is Dask, And How Can It Help Data Scientists?
​Data Engineering 101: How To Develop Your First Data Pipeline
How Do Machine Learning Algorithms Learn Bias?
Personalization With Contextual Bandits
How To Survive Corporate Politics As A Data Scientist
What Is A Decision Tree

​​

Data Science Use Cases That Are Improving the Finance Industry

11/10/2019


Over the years, the application of data science methods has produced limitless innovations across industries. For the finance world, this is nothing new; the use of quants and algorithmic trading has been going on for decades. However, with increased computation and cheap storage, the ability to automate and improve these models has drastically changed.
The finance industry has been benefiting from integrating data science and statistics into its processes for decades.
 
Several use cases are already helping enhance the customer experience, improve models, and fight the data imbalance problem. These use cases are set to shape the future and give rise to new processes in the time to come.

The success witnessed in these use cases comes from the ability to find the right algorithms, assemble suitable datasets, and build well-organized infrastructure. Data science is making significant inroads within the financial services industry, and the solutions offered are seeing more implementation with Artificial Intelligence, Machine Learning, and Reinforcement Learning.

With the rapid increase in computing power and ability to store larger amounts of data, financial companies have a lot of opportunities to both improve the customer experience as well as increase profits. Here are a few great examples of how companies are using data science in the finance industry today.

1. Process Automation — Improving Scale Of Information (Robo Advisors And Customer Chatbots)

Process automation might not seem like a data science topic.

However, for almost every predictive model, there is an automated process required to ensure that the data being ingested is constantly up to date. 

The new technologies being developed seek to automate repetitive processes, replace manual tasks, and enhance productivity. Consequently, data science is also enabling several brands to scale up solutions, optimize costs, and improve the customer experience. Below are some of the major process automation use cases of data science in the finance world:
  • Robo-Advisors
  • Chatbots
  • Call-Center Automation.
  • Predictive model selection
  • Paperwork automation, and more.

A few great examples come from large corporations automating processes, including JPMorgan Chase, which unveiled its Contract Intelligence (COiN) platform. The platform makes use of natural language processing (NLP).

The platform offers a solution to processing legal documents and extraction of vital data from them. A typical manual review would approximately take 360,000 labor hours for 12,000 annual commercial credit agreements. But, with the new Contract Intelligence (COiN) platform, the time to review a similar number of contracts would be only a few hours.

BNY Mellon is also integrating process automation into its banking network. This system is responsible for several operational improvements and for $300,000 in annual savings.

Another example is Wells Fargo, which is using an AI-driven chatbot. The chatbot communicates with users via the Facebook Messenger platform and offers assistance regarding accounts and passwords.

All of these systems will inevitably have some form of models that are amplified by process automation.

2. Improving Back And Forward Testing

In a live market, traders with a keen interest in trying new trading ideas will often develop models and backtest the outcomes to determine whether a system would be profitable. Backtesting means applying a trading process to past data to verify how it would have performed during a specific time frame. Nowadays, backtesting in trading platforms offers:
  • The ability to test ideas with a few keystrokes and gain insight into how effective a system can be without risking funds.
  • The evaluation of simple ideas, for instance, the performance of moving average crossover on systems with varying inputs.

Backtesting remains a valuable option available in many third-party tools. Dividing data into multiple sets (out-of-sample and in-sample testing) gives traders a practical and useful approach to evaluating a trading idea and system.

As a result, most traders apply optimization techniques during backtesting to evaluate systems, integrating user-defined input variables that permit system "tweaks."
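To make the moving average crossover example concrete, here is a minimal backtesting sketch with pandas. The CSV of daily closing prices is hypothetical, and transaction costs and slippage are ignored:

import pandas as pd

# Hypothetical file with "date" and "close" columns of daily prices.
prices = pd.read_csv("prices.csv", parse_dates=["date"], index_col="date")

fast = prices["close"].rolling(20).mean()
slow = prices["close"].rolling(50).mean()

# Long when the fast average is above the slow one, flat otherwise.
# shift(1) ensures today's signal is only traded on the next bar.
position = (fast > slow).astype(int).shift(1).fillna(0)

daily_returns = prices["close"].pct_change().fillna(0)
strategy_returns = position * daily_returns

print("Buy-and-hold return:", (1 + daily_returns).prod() - 1)
print("Crossover strategy return:", (1 + strategy_returns).prod() - 1)

The 20- and 50-day windows are the user-defined inputs a trader would "tweak" during optimization, which is exactly why the out-of-sample and forward performance tests discussed below matter.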

While backtesting offers valuable information, it may be misleading as it is just a part of the process of evaluation. 

Forward performance and out-of-sample testing offer more confirmation of the effectiveness of a system and can give a real sense of a system's performance before real money is involved.

Forward performance offers traders a new range of out-of-sample information for system evaluation. It simulates actual trading and entails how such systems would act in an active marketplace. 

A key aspect of forward performance testing is following the system's logic exactly, which helps to precisely evaluate the process.

A good correlation amongst backtesting, forward performance testing, and out-of-sample testing results is key to defining the practicability of a trading system. The continual use of forward performance testing and out-of-sample testing offers an extra safety layer before placing the system for market use. Positive outcomes alongside good correlation amongst out-of-sample and in-sample backtesting with forward performance testing enhance the prospect of a system performing well in actual trading.

3. Fighting The Imbalance Problem For Fraud Detection

The detection of financial fraud using data science is another major use case that is both currently being used and constantly improved.
 
Even with new security measures such as adding a chip to cards, the current predictive models attempting to reduce fraudulent claims are far from perfect.

It is still predicted that online card fraud will rise to the tune of $32 billion in 2020.

With data science, issues like fraud detection are typically treated as classification problems: predicting a discrete class label for a given data observation.

Solving such problems with data science requires creating models intelligent enough to accurately classify transactions as either fraudulent or legitimate based on the transaction details.

This data science approach also faces a significant challenge known as imbalanced data.

In real-world finance, it usually arises from data in which the vast majority of transactions are classified as legitimate while only a small portion are fraudulent. This is a similar problem across most fraud domains, including insurance and healthcare.

For imbalanced data, the bottom line is that a prediction might be 99% accurate, yet the 1% it gets wrong could be exactly the fraudulent claims being labeled as non-fraud.

Investment in technology for tackling fraud has been on the rise over the years. Nevertheless, since imbalanced data presents a unique case where most variables add no context, it is always best to start with some EDA (Exploratory Data Analysis) before using any prediction models. This allows data scientists to see trends in the fraudulent claims themselves. Oftentimes you can take a large data set of valid and invalid transactions and find traits that are unique to a subset of those transactions.

From there you can build a model on that subset of data to provide improved balance.

In addition, after EDA, there are several approaches to dealing with imbalanced data. Three popular approaches stand out: combined class methods, oversampling, and undersampling.
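As a minimal sketch of two of those approaches, the snippet below trains one model with class weighting and one with random oversampling on a synthetic 99%/1% dataset. It assumes only scikit-learn and NumPy, and real fraud work would add careful validation on untouched data:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy data: roughly 1% "fraud" (label 1), 99% "legit" (label 0).
X, y = make_classification(n_samples=20000, weights=[0.99], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Option 1: penalize minority-class mistakes more heavily via class weights.
weighted = LogisticRegression(class_weight="balanced", max_iter=1000)
weighted.fit(X_train, y_train)

# Option 2: randomly oversample the minority class before training.
minority_idx = np.where(y_train == 1)[0]
majority_idx = np.where(y_train == 0)[0]
boosted = resample(minority_idx, replace=True, n_samples=len(majority_idx), random_state=42)
balanced_idx = np.concatenate([majority_idx, boosted])
oversampled = LogisticRegression(max_iter=1000)
oversampled.fit(X_train[balanced_idx], y_train[balanced_idx])

print("Class-weighted model:")
print(classification_report(y_test, weighted.predict(X_test)))
print("Randomly oversampled model:")
print(classification_report(y_test, oversampled.predict(X_test)))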
​
In Conclusion

The benefits of data science in the finance world include an enhanced customer experience today and more optimized systems and processes in the future. Gartner forecasts that by 2020, a massive 84% of customer dealings with any enterprise will happen without another human involved. With data science set to drive this revolution, the finance world can further leverage its applications to deliver practical solutions for the proper use of data, security threats, and more. The future of the finance world is set to witness even more significant growth.

If you enjoyed this post, consider reading some of these posts!
Healthcare Fraud Detection With Python 
5 Uses Cases For DynamoDB
Learning Data Science And Machine Learning With Youtube And Coursera
Hadoop Vs Relational Database
How Algorithms Can Become Unethical and Biased
Top 10 Business Intelligence (BI) Implementation Tips​
5 Great Big Data Tools For The Future – From Hadoop To Cassandra


Memberful Analytics - Gaining Insights Into Retention, Adoption and Forecasting

9/3/2019

Photo by Adeolu Eletu on Unsplash
Memberful is an amazing plugin that can be used by non-technical users to sidestep the hard work required to set up a subscription business. 

It is a plug-and-play system that takes away a lot of headaches and easily integrates with WordPress, Mailchimp, and Stripe.

One of Memberful's limiting factors is its lack of historical analytics. Memberful provides a lot of great data but limited analytics on retention, adoption, market segmentation, and forecasting future profits based on growth or decline rates.



This lack of awareness of these crucial metrics can be detrimental to your company. How can you make good strategic decisions if you are unaware of how your product is being adopted or retained over time?


Being able to track adoption and retention over time can help you see where you are making good investments or strategic changes. Without being able to look back, how can you ask whether that last marketing campaign was successful and get a good answer?

Our team has developed a report that provides insights into your Memberful data. The report covers retention and adoption, forecasts future profits, and segments your user base (depending on your customer data).

With these insights, your team can begin to more confidently make decisions as you will know when changes are working vs. not.

If you need help analyzing your Memberful data, then contact us today! We would be happy to discuss your needs.

If you would prefer to read more about analytics, data science and data engineering then please read the below!

Hadoop Vs Relational Database
How Algorithms Can Become Unethical and Biased
Top 10 Business Intelligence (BI) Implementation Tips​
5 Great Big Data Tools For The Future - From Hadoop To Cassandra
Creating 3D Printed WiFi Access QR Codes with Python
The Interview Study Guide For Data Engineers

How To Improve Your Data Driven Strategy

8/11/2019




Photo by Tabea Damm on Unsplash

Creating an effective data strategy is not as simple as hiring a few data scientists and data engineers and purchasing a Tableau license. Nor is it just about using data to make decisions.

Creating an effective data strategy is about creating an ecosystem where getting to the right data, metrics, and resources is easy. It's about developing a culture that learns to question data and to look at a business problem from multiple angles before reaching a final conclusion.

Our data consulting team has worked with companies ranging from billion-dollar tech companies to healthcare organizations and just about every type of company in between. We have seen the good, the bad, and the ugly of data being used for strategy. We wanted to share some simple changes that can help improve your company's approach to data.

Find A Balance Between Centralized And Decentralized Practices

Standards and over-centralization inevitably slow teams down. Small changes to tables, databases, and schemas might be forced through an overly complex process that keeps teams from being productive.
​
On the other hand, centralization can make it easier to implement new changes in strategy without having to go to each team and then force them to take on a new process.

In our opinion, one of the largest advantages companies can gain is developing tools and strategies that strike a happy medium between centralized and decentralized. This usually involves creating standards that simplify development decisions and improve the management of common tasks every data team needs to perform, like documentation and data visualization, while decentralizing decisions that are often department- and domain-specific.

Here are some examples where there are opportunities to provide standardized tools and processes for unstandardized topics.


​Creating UDFs and Libraries For Similar Metrics
After working in several industries, including healthcare, banking, and marketing, one thing you realize is that many teams use the same metrics.

This could be across industries or at the very least across internal teams. The problem is every team will inevitably create different methods for calculating the exact same number.

This can lead to duplicate work and code, and to executives making conflicting decisions because top-line metrics vary.

Instead of relying on each team to create its own process for calculating the various metrics, you could create centralized libraries that use the same fields to calculate the correct metrics. This standardizes the process while still providing enough flexibility for end users to develop reports based on their specific needs.

This only works if the metrics are used consistently. For example, in the healthcare industry, metrics such as per member per month (PMPM) costs, readmission rates, or bed turnover rates are used consistently. These are sometimes calculated by an EMR like Epic, but they might still be recalculated by analysts for more specific cases, or by external consultants.

Creating functions or libraries that do this work can improve consistency and save time. Instead of having each team develop its own method, you simply provide a framework that makes it easy to implement the same metrics.
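As a rough sketch of what such a shared library might look like in Python (the function names, inputs, and simplified formulas below are our own illustrations, not an established standard):

# metrics.py - a hypothetical shared metric library that every team imports,
# so the calculation logic lives in exactly one place.

def pmpm_cost(total_paid_amount, member_months):
    """Per member per month (PMPM) cost: total paid dollars divided by member months."""
    if member_months <= 0:
        raise ValueError("member_months must be positive")
    return total_paid_amount / member_months

def readmission_rate(readmissions, total_discharges):
    """Share of discharges followed by a readmission within the tracking window."""
    if total_discharges <= 0:
        raise ValueError("total_discharges must be positive")
    return readmissions / total_discharges

# Every team calls the same functions with its own inputs, so the top-line
# numbers are computed the same way everywhere.
print(round(pmpm_cost(total_paid_amount=1_250_000, member_months=4_800), 2))

Because every team imports the same functions, a change to a metric definition happens in one place and rolls out everywhere.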

Automate Mundane But Necessary Tasks

Creating an effective data strategy is about making the usage and management of data easy.
A part of this process requires taking mundane tasks that all data teams need to do and automating them.

An example of this is documentation. Documentation is an important factor in helping analysts understand the tables and processes they are working with, and good documentation allows analysts to perform better analysis. However, documentation is often put off until the last minute or never done at all.

Instead of forcing engineers to document every new table by hand, a great option is to create a system that automatically scrapes the available databases on a regular interval and keeps track of what tables exist, who created them, what columns they have, and whether they have relationships to other tables.
​
This could be a project for the DevOps team to take on, or you could look into a third-party tool such as dbForge Documenter for SQL Server. That doesn’t cover everything, and that particular tool only works for SQL Server, but a similar tool can simplify a lot of people’s lives.
Teams will still need to describe what the tables and columns mean, but the initial work of gathering the basic information can all be tracked automatically.
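A minimal sketch of that kind of metadata scrape, using SQLAlchemy’s inspector (the connection string below is a placeholder, and in practice you would write the results to a documentation table or wiki rather than print them):

# doc_scraper.py - crawl database metadata on a schedule and record what tables
# exist, which columns they have, and how they relate to other tables.
from sqlalchemy import create_engine, inspect

# Placeholder connection string - point this at your own warehouse or operational DB.
engine = create_engine("postgresql://readonly_user:password@warehouse-host/analytics")
inspector = inspect(engine)

catalog = []
for table_name in inspector.get_table_names(schema="public"):
    catalog.append({
        "table": table_name,
        "columns": [col["name"] for col in inspector.get_columns(table_name, schema="public")],
        "foreign_keys": inspector.get_foreign_keys(table_name, schema="public"),
    })

# In a real pipeline you would store this somewhere durable (a docs table, a wiki page)
# and diff it against the previous run to flag new or changed tables.
for entry in catalog:
    print(entry["table"], entry["columns"])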
​
Reducing this necessary but repetitive work makes everyone’s life a little easier.



Provide Easier Methods To Share And Track Analysis

This one is geared specifically towards data scientists.

Data scientists will often do their work in Jupyter notebooks and Excel files that only they have access to. In addition, many companies don’t require the use of a repository like Git, so data scientists rarely version control their work.

This limits the ability to share files as well as keep track of changes that can occur in one’s analysis over time. 

In this situation, collaboration becomes difficult because co-workers are stuck passing files back and forth and rolling their own version control. Typically that looks like files with suffixes like _20190101_final, _20190101_finalfile…

For those of you who don’t get it, you hopefully never will have to.

On top of this, since many of these Python scripts rely on multiple libraries, it can be a pain to make sure you pip install all the correct versions into your environment.
All of these small difficulties can honestly cost you a day or two of troubleshooting, depending on how complex the analysis you are trying to run is.
However, there are plenty of solutions!

There are actually a lot of great tools out there that can help your data science teams collaborate. This includes companies like Domino Data Lab. 

Now, you can always use Git and virtual environments as well, but this demands that your data scientists be proficient with those technologies, which is not always the case.

Either way, the goal is to let your teams work independently while still sharing their work easily.

Data Cultural Shift

Adding in new libraries and tools is not the only change that needs to happen when you are trying to create a company that is more data driven. A more important and much more difficult shift is cultural. 

Changing how people look at and treat data is a key aspect, and it is very challenging. Here are a couple of reasons why.

Data Lies

For those who haven’t read the book How to Lie with Statistics, spoiler alert: it is really easy to make numbers tell the story you want.

There are a lot of ways you can do this.

A team can cherry-pick the statistics that support its agenda. Or perhaps a research team ignores confounding factors and reports a statistic that only seems shocking because the other variables weren’t considered.

Being data-driven as a company means developing a culture that looks at statistics and metrics and checks that nothing is interfering with the number. This is far from easy.
When it comes to data science and analytics, most metrics and statistics carry stipulations that could negate whatever message they seem to convey. That is why creating a culture that looks at a metric and asks why is part of the process. If it were as simple as getting outputs and p-values, data scientists would be out of a job, because plenty of third-party products already find the best algorithm and do feature selection for you.

But that is not the only job of a data scientist. They are there to question every p-value and really dig into the why behind the number they are seeing.

Data Is Still Messy

Truth be told, data is still very messy. Even with today’s modern ERPs and applications, data is messy, and sometimes bad data gets through that can mislead managers and analysts.

This can happen for many reasons: how the applications manage data, how the system admins of those applications modified them, and so on. Even changes that seem insignificant from a business-process perspective can have a major impact on how data is stored.

In turn, when data engineers pull data, they might not represent the business accurately because of bad assumptions and limited knowledge.

This is why just having numbers is not good enough. Teams also need a good sense of the business and of the processes that create the data, to ensure messy data never makes it into the tables analysts use directly.

Our perspective is that data analysts need confidence that the data they are looking at correctly represents the corresponding business processes. If analysts have to remove data, or consistently add joins and WHERE clauses just to represent the business accurately, then the data is not “self-service.” This is why, whenever data engineers create new data models, they need to work closely with the business to make sure the correct business logic is captured and represented in the base layer of tables.
​
That way, analysts can have near 100% trust in their data.

Conclusion

At the end of the day, creating an effective data culture requires both a top-down and a bottom-up shift in thinking. At the executive level, decisions need to be made about which key areas can make access to data easier. Then teams can start becoming more proficient at actually using data to make decisions. We often find that most teams spend too much time on data tasks that need to get done but could be automated. Improving your company’s approach to data can provide a large competitive advantage and free your analysts and data scientists to work on projects they enjoy and that help your bottom line!

If your team needs data consulting help, feel free to contact us! If you would like to read more posts about data science and data engineering, check out the links below!

Using Python to Scrape the Meet-Up API
The Advantages Healthcare Providers Have In Healthcare Analytics
142 Resources for Mastering Coding Interviews
Learning Data Science: Our Top 25 Data Science Courses
The Best And Only Python Tutorial You Will Ever Need To Watch
Dynamically Bulk Inserting CSV Data Into A SQL Server
4 Must Have Skills For Data Scientists
What Is A Data Scientist
0 Comments

Using Python to Scrape the Meet-Up API

8/9/2019

0 Comments

 
We recently posted some ideas for projects you could take on, to add to your resume and help you learn more about programming.
​
One of those projects involved scraping the Meet-up and Eventbrite APIs to create an aggregate site of events.

This is a great project and it opens up the opportunity to take on several concepts. You could use this idea to make an alerting system — the user inputs their API keys to track local events they have an interest in. You could develop a site to predict which live acts will become highly popular, before they get there, by tracking metrics over time.

Honestly, the APIs give a decent amount of data, even to the point of giving you member names (and supposedly emails too, if the member is authenticated). It’s a lot of fun — you can use this data as the basis of your own site!

To Start
To start this project, break down the basic pieces you will need to build the backend. More than likely you will need:
  • An API “Scraper”
  • Database interface
  • Operational database
  • Data-warehouse (optional)
  • ORM
To start out you will need to develop a scraper class. This class should be agnostic of the specific API call you’re making. That way, you can avoid having to make a specific class or script for each call. In addition, when the API changes, you won’t have to spend as much time going through every script to update every variable.

Instead, you’ll only need to go through and update the configurations.
That being said, we don’t recommend trying to develop a perfectly abstract class right away. Building a class with no hard-coded variables from the start is difficult, and if anything goes wrong it is harder to debug because of the layers of abstraction.

We’ll start by trying to develop pieces that work.

The first decision you need to make is where the scraper will be putting the data. We’re creating a folder structure in which each day has its own folder.

You could use a general folder on a server, S3, or a similar raw file store. These make it easy to keep the raw data as JSON files. Other storage formats, like CSV and TSV, are thrown off by the way the description data is formatted.

Let’s take a look at the basic script below. Then think about how you could better configure and refactor the codebase.
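A minimal version of that script might look like the following. The endpoint, parameters, and folder names are illustrative assumptions, and the Meetup API has changed over the years, so treat this as a pattern rather than copy-paste production code.

# scraper.py - first pass: pull events from the Meetup API and dump the raw
# JSON into a folder named after today's date.
import json
import os
from datetime import date

import requests

API_KEY = "YOUR_API_KEY"             # hard-coded for now; the first thing to refactor out
BASE_URL = "https://api.meetup.com"  # illustrative; check the current API docs

def scrape_endpoint(endpoint, params):
    """Call one endpoint and return the parsed JSON payload."""
    response = requests.get(f"{BASE_URL}/{endpoint}", params={**params, "key": API_KEY})
    response.raise_for_status()
    return response.json()

def dump_raw(payload, name):
    """Write the raw payload into a dated folder, one file per pull."""
    folder = os.path.join("raw_data", date.today().isoformat())
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, f"{name}.json")
    with open(path, "w") as f:
        json.dump(payload, f)
    return path

if __name__ == "__main__":
    events = scrape_endpoint("find/upcoming_events", {"lat": 47.6, "lon": -122.3})
    print(dump_raw(events, "upcoming_events"))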
One place to look right off the bat is the API key. While you’re testing, it’s easy to hard-code your own API key. But if your eventual goal is to allow multiple users to access this data, then you will need a way to supply each user’s API key.

The next portion you will want to update is the hard-coded references to the data you are pulling. This hard-coding limits the code to working with only one API call. One example is how we pull the different endpoints and reference which fields to keep from what is returned.

For this example, we are just dumping everything in JSON. Perhaps you want to be very choosy — in that case, you might want to configure what columns are attached to each field.
​
For example:
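A simple configuration might look like the following (the endpoint and field names here are made up for illustration):

# config.py - endpoint settings live outside the scraper logic, so adding a new
# API call means editing configuration, not code.
API_CONFIGS = {
    "upcoming_events": {
        "endpoint": "find/upcoming_events",    # illustrative endpoint
        "fields": ["id", "name", "time", "yes_rsvp_count", "venue"],
        "output_name": "upcoming_events",
    },
    "groups": {
        "endpoint": "find/groups",
        "fields": ["id", "name", "members", "category"],
        "output_name": "groups",
    },
}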
This allows you to create a scraper that is agnostic of which API endpoint you are using. It puts the settings outside the code, which is easier to maintain.

For example, what happens if Meet-up changes the API endpoints or column names? Well, instead of having to go into 10 different code files you can just change the config file.

The next stage is creating a database and an ETL process to load and store all the data: a system that automatically parses the data from the JSON files into an operational-style database. This database can be used to track events you might be interested in. In addition, creating a data warehouse could help track metrics.
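A rough, self-contained sketch of that parsing step might look like this (using SQLite so it runs anywhere; the payload shape and field names are assumptions that follow the hypothetical config above):

# load_events.py - parse the raw JSON dumps into an operational table.
import json
import sqlite3
from pathlib import Path

conn = sqlite3.connect("events.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS events (
           event_id TEXT PRIMARY KEY,
           name TEXT,
           event_time INTEGER,
           yes_rsvp_count INTEGER
       )"""
)

# Walk every dated folder produced by the scraper and upsert the events it found.
for raw_file in Path("raw_data").glob("*/upcoming_events.json"):
    payload = json.loads(raw_file.read_text())
    for event in payload.get("events", []):   # payload shape is an assumption
        conn.execute(
            "INSERT OR REPLACE INTO events (event_id, name, event_time, yes_rsvp_count) "
            "VALUES (?, ?, ?, ?)",
            (event.get("id"), event.get("name"), event.get("time"), event.get("yes_rsvp_count")),
        )

conn.commit()
conn.close()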

Perhaps you’re interested in the rate at which events have people RSVP, or how quickly events get sold out.

Based on that, you could analyze what types of descriptions or groups tend to run out of slots quickly.
There is a lot of fun analysis you could take on here.

Over the next few weeks and months, we’ll be working to continue developing this project. This includes building a database, maybe doing some analysis, and more!

We hope you enjoyed this piece!

If you enjoyed this piece about software engineering, then consider these posts as well!
The Advantages Healthcare Providers Have In Healthcare Analytics
142 Resources for Mastering Coding Interviews
Learning Data Science: Our Top 25 Data Science Courses
The Best And Only Python Tutorial You Will Ever Need To Watch
Dynamically Bulk Inserting CSV Data Into A SQL Server
4 Must Have Skills For Data Scientists
What Is A Data Scientist
​
0 Comments

Hadoop Vs Relational Databases

7/25/2019

4 Comments

 
Big data has moved from just being a buzzword to a necessity that executives need to figure out how to wrangle. 

Today, the adoption of big data technologies and tools has grown significantly, with over 40% of organizations implementing big data as forecasted by Forrester, while IDC predicts that the big data and business analytics market will hit an all-time high of $274.3 billion in 2022, up from the $189.1 billion it is expected to reach this year.

With this push for big data and big data analytics, finding the right system, best practices, and data models that give analysts and engineers access to these treasure troves of data can be difficult. Do you use traditional databases, columnar databases, or some other data storage system?
​
Let’s start this discussion by comparing a traditional relational database to Hadoop (specifically, Hadoop paired with a layer like Presto or Hive).

​

​What is Apache Hadoop?
Hadoop is an open-source framework built around a distributed file system that allows for the distributed storage and processing of big data sets.

Hadoop is designed to scale from single servers up to thousands of machines, each offering local storage and computation. Apache Hadoop comes with a distributed file system (HDFS) and other components like MapReduce (a framework for parallel computation over key-value pairs), YARN, and Hadoop Common (Java libraries).

Presto 
Presto is a distributed SQL query engine that can sit on top of data systems like HDFS, Cassandra, and even traditional relational databases. It lets analysts take advantage of Hadoop without having to understand the complexities of what is going on under the hood, and it gives engineers abstractions such as tables to organize data in a more traditional data warehouse format.
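For example, once a query layer like Presto is in place, an analyst can hit the data with plain SQL from Python. Here is a sketch using the PyHive client, where the host, catalog, schema, and table names are all placeholders:

# query_presto.py - illustrate how Presto lets analysts work in SQL even though
# the underlying data lives in HDFS.
from pyhive import presto   # pip install 'pyhive[presto]'

conn = presto.connect(
    host="presto-coordinator.internal",   # placeholder coordinator host
    port=8080,
    username="analyst",
    catalog="hive",
    schema="events",
)
cursor = conn.cursor()
cursor.execute(
    "SELECT event_date, COUNT(*) AS rsvps "
    "FROM rsvp_events "
    "GROUP BY event_date "
    "ORDER BY event_date DESC "
    "LIMIT 7"
)
for row in cursor.fetchall():
    print(row)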

What is a Relational Database (DB)?
A relational database is built from a set of defined tables from which data can be accessed or reassembled in many different ways without needing to reorganize the tables themselves. In its simplest form, the relational model is the basis for SQL and for database management systems like Microsoft SQL Server, Oracle, and MySQL.
You will also see the term RDBMS, which stands for relational database management system: the software that manages a database built on the relational model. In practice, when most people say “database,” they mean an RDBMS.

The Differences

Data architecture and volume

Unlike an RDBMS, Hadoop is not a database but a distributed file system that can store and process massive amounts of data across clusters of computers. An RDBMS takes a structured approach in which data is stored in rows and columns that can be updated with SQL and presented in different tables. That structure limits how much data a relational database can practically store and process, whereas Hadoop, with MapReduce or Spark, can handle very large volumes.

Data Variety
Data variety refers to the types of data being processed. Broadly, there are three: structured, unstructured, and semi-structured. A relational database can manage and process structured and semi-structured data in limited volumes, and it is limited when it comes to unstructured data. Hadoop, on the other hand, can manage and process all three: structured, unstructured, and semi-structured. In fact, Hadoop has become one of the most common ways to manage and process huge volumes of unstructured data.

Data Warehouses And Hadoop
As stated earlier, Hadoop on its own isn’t a database. However, thanks to open-source projects like Hive and Presto, you can abstract the file system into a table-like format that is accessible with SQL.
This has allowed many companies to start switching parts or all of their data warehouses over to Hadoop.

Why?

Accessibility, plus the hope of better performance on cheaper machines. Whether that actually pans out varies from company to company and from data management team to data management team.

Although systems like Hadoop promise better performance, there are a lot of downsides that don’t always get discussed.




​Weakness in RDBMS and Hadoop
Before we get started, it’s going to seem as if I dislike Hadoop. That is not the case; I am just going to point out some of the largest pitfalls and weaknesses.

Technical Abilities
We will get into the technical difficulties shortly. But before we cover the technical cons, we wanted to discuss the talent issue.

Both Hadoop and traditional relational databases require technical know-how. Now, this might be up for debate, but generally speaking, most relational databases are easier to use.
This is because there are far fewer moving pieces. With Hadoop you need to think about managing the cluster, the Hadoop nodes, security, Presto or whatever interface you are using, and several other administrative tasks that take up a lot of time and skill.

In comparison, most relational database systems like SQL Server or Oracle are somewhat more straightforward. Security is built in, performance tuning is built in, and, most importantly, there is a larger talent pool of people who understand how to manage and use standard databases.

Why do you think SQL-like interfaces such as Presto and Hive exist? Because data professionals needed a more familiar way to interface with Hadoop.

So the biggest issue most companies face is not the complexity of Hadoop itself but the scarcity and cost of talent that can operate Hadoop correctly.

Security issues
Unlike an RDBMS, Hadoop comes with a number of security challenges, which can make managing complex applications difficult. In fact, the original Hadoop releases had no authentication system at all, under the assumption that the system would be running in a safe environment.

More recent releases do have access controls, permissions, authentication, and encryption modules, but they aren’t straightforward to use and usually require a decent amount of ramp-up. This can make Hadoop difficult to support and scale if you are running it out of the box without a third-party distribution like Hortonworks ($$$).

Functional Issues
Hadoop is designed around the concept of write once, read many. It is not designed for write once, update many. So for data specialists who are used to having the ability to update records: forget about it.

For those who aren’t into data modeling, the issue this causes might not be apparent right away, nor is it all that exhilarating to understand.
But not being able to run an update statement rules out a lot of modeling patterns that keep data volumes down.

For example (we are about to go granular), let’s say you want to track someone’s promotions within a company. Traditionally, in an RDBMS you can simply track the employee_id, the position, and the start and end dates of that position. You don’t need to track all the days in between. When a position changes, you update the end date and add a new row for that employee with the new position and its start date, leaving the new end date null. It would look like the below.
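For illustration, with made-up values:

employee_id | position       | start_date | end_date
1001        | Data Analyst   | 2017-03-01 | 2019-06-30
1001        | Senior Analyst | 2019-07-01 | NULL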

This took two rows of data and we now have all the information required.

In comparison, there are a couple of ways you could store similar information using Presto.

One method is to store a person’s position every day in a date partition. The downside is that you essentially need a row for every day the person was in that position, which means storing a massive amount of data. If you have 10,000 employees, that is 10,000 new rows daily.

Another method is to use a data model similar to the RDBMS one: a single row for every employee_id and position combination, with a start and end date. However, this only works if you have access to the previous day’s information, which is a limiting factor.

In the end, you are probably storing much more data than is really necessary and/or performing a lot of unneeded transactions. The point here is that although Hadoop can provide some advantages, it’s not always the best tool.

If you enjoyed this piece about data engineering, then consider these posts as well!

142 Resources for Mastering Coding Interviews
Learning Data Science: Our Top 25 Data Science Courses
The Best And Only Python Tutorial You Will Ever Need To Watch
Dynamically Bulk Inserting CSV Data Into A SQL Server
4 Must Have Skills For Data Scientists
What Is A Data Scientist
​
Solving The Balanced Bracket Problem


4 Comments