
Data Tools For Data Scientists and Machine Learning Engineers

3/16/2017

Are you looking for great tools for machine learning, data science or data visualization? 

Currently, there is an overwhelming number of options. Do you pick TensorFlow or Theano? Tableau or Qlik? MongoDB or Hadoop? How do you know which data tool to use? Many of them are good; some are not so good. In the end, you might need an algorithm just to pick which data science tools are best for your team.

We wanted to go over a few technologies we personally love to work with. This is by no means all of them, but these are definitely some of the best options out there. Every use case is different, though, so give us a call if you are trying to decide which tools would work best for your company or project! We would love to help.

Libraries and Languages

TensorFlow

TensorFlow itself is not a language. It is built on C++ and Nvidia’s CUDA, and the library is typically used from Python. However, the computation is not actually executed in Python. Python just lets the end user design the data flow graph that will then be run by the much faster lower-level code. Some people even drop down to the raw C++ level to further optimize their run times.

Even the overall design style of TensorFlow can feel a little wonky if you are a Python programmer. Compared to plain Python, you might feel like you are writing in a more model- or math-based language like Mathematica or MATLAB. You will declare several sets of variables before any of them actually holds a value. This can be a little jarring for Python purists. Overall, though, the basics are pretty easy to pick up.

The difficulty comes in understanding how to properly set up a neural network, convolutional network, etc. A lot of questions come into play: which type of model? What kind of data regularization is best? What level of dropout or robustness do you want? And are you going to purchase GPUs from Nvidia or try to make it work on CPUs? (Depending on your data size, you will most likely have to purchase them, or pay for AI-as-a-service tech from Google.)
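
To make that declare-then-run style concrete, here is a minimal sketch in the 1.x style TensorFlow used at the time of writing (the names and values are purely illustrative):

    # Nothing is computed while the graph is being described; values only
    # flow once the graph is handed to a session.
    import tensorflow as tf

    # Describe the data flow graph: placeholders stand in for future inputs.
    x = tf.placeholder(tf.float32, name="x")
    y = tf.placeholder(tf.float32, name="y")
    loss = tf.square(x - y)  # still just a node in the graph, not a number

    # Only now does the faster, lower-level runtime actually execute anything.
    with tf.Session() as sess:
        print(sess.run(loss, feed_dict={x: 3.0, y: 1.0}))  # prints 4.0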


R

You can’t get too far in the data science world without finding a few programmers who enjoy R. It is a very well-developed language that is great for statisticians and CS majors alike. It doesn’t offer the cool factor of Python and other more ‘modern’ machine learning languages, but it is a great workhorse, tried and true.

It can give some newbies a false understanding of data science. Libraries for ensembling and boosting algorithms require minimal knowledge of the algorithms themselves. This is great if you know why you are picking each algorithm; otherwise, that convenience can lead to a false sense of understanding.
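
The same convenience exists on the Python side. As a purely illustrative sketch (scikit-learn with a toy dataset, rather than an R package), a gradient-boosted model takes only a few lines, whether or not you understand what boosting is actually doing:

    # An ensemble model in a handful of lines; the library hides all of the
    # algorithmic detail. Dataset and parameters here are illustrative only.
    from sklearn.datasets import load_iris
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = GradientBoostingClassifier(n_estimators=100)  # why 100 trees? many users never ask
    model.fit(X_train, y_train)
    print(model.score(X_test, y_test))  # accuracy on held-out data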

As a side note, SQL Server actually just added the ability to run some R functions inside a query. We are not 100% sure about the performance, but it would be pretty cool to run a lot of the data analysis before the data even leaves the system.


Caffe

The first time I was exposed to Caffe for deep learning was back when I took a computational neuroscience class. We were programming in MATLAB, but one of the other students happened to show me the work he was doing in his lab, all of it built on the Caffe framework. Most of the work involves modifying configuration files: you can alter what type of network layers you are using, how many neurons, the dropout, etc. For those accustomed to working from the command line, it runs pretty smoothly, and it works on both GPUs and CPUs.


Data Visualization Tools

Tableau

Tableau is arguably one of the most popular data visualization tools for analysts, no matter their proficiency with technology. Whether you are an engineer who develops complex neural networks or a business analyst who prefers to model in Excel, Tableau is a friendly and easy-to-use data visualization tool. It allows the end user to develop visually appealing, interactive reports that help executives make decisions quickly.

On top of that, if your company has its own Tableau server, it allows for quick and easy data sharing through beautiful and effective reports. If you want the underlying data itself, Tableau also lets you download CSVs, screenshots, PDFs, etc. You can even have reports emailed to you on a specific cadence.

This tool was built with the end user in mind. One of my favorite features is Tableau Public, which lets you share your reports publicly. Obviously, you can’t do this with company data, but there are plenty of fun open data sets that can be used to make some beautiful and effective data reports. Check it out!
(Image credit: https://public.tableau.com/profile/gabe.dewitt#!/)

D3.js

One of our employees was first introduced to D3.js in college, when Professor Jeff Heer gave his class a one-hour lecture on the library. He was instantly sold. The power D3 had to display data and let the end user drill into specific facts was amazing. It was his first exposure to data visualization done in this manner.


Sure, before this he had seen MATLAB charts and Excel graphs, but nothing like this. He found he could use D3 to create graphs that were appealing, informative, and interactive, with real benefits for the end user. Plus, unlike Tableau and other data viz tools, D3.js allowed for almost unlimited customization.

We use it, along with some other JS libraries, when customers are trying to avoid the steep costs of Tableau. There are some limits, though. For instance, D3 runs on the client side, which means it cannot handle the sheer magnitude of data that Tableau can.

DOMO

Domo has some similarities to Tableau. It gives the end user very pretty graphs and is typically used for KPI reports. It integrates with 1000+ data sources and manages large amounts of data quite easily. From there, it quickly melds data and creates pre-formatted KPI reports that can be shared across its internal platform. This is great, especially if your team doesn’t have the resources to develop highly effective reports from scratch. Within minutes, your team can have standardized reports from tools like Salesforce, Concur, etc. In addition, if integrated properly, your company may be able to reduce its number of reporting tools, and thus its maintenance, development, and design costs.

There is some ability for customization. However, it is limited compared to most other data visualization tools, which will drive typical developers crazy. We love being able to get down to the system level and actually modify what each small component does, rather than being limited by buttons. But if your team can’t afford an extra person to create reports, this tool will save a large amount of resources.

Data Storage Tools


Hadoop

You can’t say “big data” without at least one person bringing up Hadoop. Hadoop is not for the faint of heart. It does an amazing job of distributing the storage of very large data sets over computer clusters. However, unless you are comfortable with Java and command-line environments, it is not an easy beast to wrangle. It requires a heavy amount of configuration and tuning to ensure it works optimally on your company’s systems. The reward at the end, though, is well worth it: the ability to access data quickly, even through hardware failures, is pretty hard to beat. For small companies, however, the cost of maintaining it and employing a Hadoop specialist would probably be too large.
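
For a taste of what a job looks like without writing Java, Hadoop Streaming lets you submit work in other languages. Here is a minimal word-count sketch using the third-party Python library mrjob (this assumes a configured cluster, or mrjob's local test mode):

    # word_count.py: a classic MapReduce word count. mrjob ships the mapper
    # and reducer to Hadoop Streaming; locally, run: python word_count.py input.txt
    from mrjob.job import MRJob

    class MRWordCount(MRJob):
        def mapper(self, _, line):
            # Emit (word, 1) for every word in the input line.
            for word in line.split():
                yield word.lower(), 1

        def reducer(self, word, counts):
            # Hadoop groups by key, so summing gives the total per word.
            yield word, sum(counts)

    if __name__ == "__main__":
        MRWordCount.run()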



MongoDB

At one point in time, data storage was very expensive, and databases had to be finely tuned to manage every byte and bit. Thus, the relational model became one of the standard choices for data storage.

Now it is 2017, and thanks to both hardware and software advances, storage has become much cheaper. Suddenly, keeping large masses of unstructured data is feasible, and it can be beneficial when designed well. MongoDB is a document-store database: instead of storing rows, it stores an entire document in one place. This means you no longer have to join across tables just to get two related data points; they all live in the same document. (This used to be considered bad practice because it means a lot of duplicated data; now the cheap storage makes the trade worthwhile, and read speed increases dramatically.) There are still plenty of pitfalls with MongoDB, though, including security, storage overhead, and the fact that it is not ACID compliant.
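
A quick sketch with the pymongo driver shows the idea; the connection string, database, and field names are purely illustrative:

    # One document holds a customer and their orders together; a single read
    # replaces what would be a join in a relational database.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # assumes a local MongoDB
    db = client["shop"]

    db.customers.insert_one({
        "name": "Ada",
        "email": "ada@example.com",
        "orders": [  # embedded, and yes, possibly duplicated across documents
            {"item": "keyboard", "price": 45.00},
            {"item": "monitor", "price": 180.00},
        ],
    })

    print(db.customers.find_one({"name": "Ada"}))  # customer plus every order, one lookup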


Oracle

Oracle here could stand in for most other RDBMSs. However, I find Oracle one of the best databases for managing big data, for many reasons: everything from the underlying architecture of the DB itself to the ability to manage and manipulate the configuration of objects inside is better tuned in Oracle. SQL Server is great for beginners, but it just doesn’t do what Oracle can.

