In our last post, we discussed getting a goal from high-level stakeholders and making sure it comes to you as a clear and concise statement or one-pager. This makes it easier for data engineers and product developers to see why they are creating what they are creating. So in this step, we will be discussing the data source as well as starting our discussion on the value of designing prototypes. These topics hinge on each other: the prototype can only represent metrics based on your data source.

Prototyping

Designing a prototype is a key step in developing any product. In this case, I am referring purely to a design, not an MVP. The prototype doesn't need to be functional at all. Instead, it is a tool used to sell to executives. This could be a mock report with somewhat accurate numbers, a set of slides that demonstrate how an application could work, etc. The point is to have something your audience can see and can visualize using. This will help you sell your final product, even before you have it. Oddly enough…this can be more valuable than your actual technical work, because if you can't sell an idea to your stakeholders, it won't be picked up.

How much time you spend developing your prototype can also depend on your company culture as well as what the company's focus is. If the company's focus is to create analytics to sell to other companies, then they will spend a lot of time on prototyping (or at least, they should). This is because they have to convince the purchasing company to continue paying for the product for years to come. Also, if your company is very rigorous when it comes to developing internal tools, then you, again, might spend a lot of time designing and prototyping.

However, there are many companies that will draw up a dashboard on a whiteboard and call that the prototype. There is nothing wrong with that, but the final product might not get the same buy-in. This eventually leads to the final product being forgotten quickly. When stakeholders are brought into parts of the process and forced to dedicate time to it (as long as the product has a clear impact), they will continue to use the tool after it is finished, because they will feel more attached to it as well as understand it better. This isn't a fact I can cite research for, but I have at least seen it to be true anecdotally. When end-users weren't forced to put time into thinking about the product, they can easily dismiss it in the future. This, in turn, wastes the time of the engineers and the resources of the company.

Prototyping is a key step in this process. It helps crystallize what exactly the data engineers are trying to create and the impact it will have. In addition, it forces the stakeholders to own more than just the concept of the product. The limitation on what the prototype can be, when referring to data products and algorithms, is the data source, which we will talk about next.

Read The Rest Here
Our team has collected several posts on learning data science. These posts will cover courses, books, and YouTube videos. We hope they can help you on your journey.

25 OF THE BEST DATA SCIENCE COURSES ONLINE

Bootcamps and Specializations

1. Introduction to Probability and Data

This course introduces you to sampling and exploring data, as well as basic probability theory and Bayes' rule. You will examine various types of sampling methods, and discuss how such methods can impact the scope of inference. A variety of exploratory data analysis techniques will be covered, including numeric summary statistics and basic data visualization. You will be guided through installing and using R and RStudio (free statistical software), and will use this software for lab exercises and a final project. The concepts and techniques in this course will serve as building blocks for the inference and modeling courses in the Specialization. Take The Course

2. Full Statistics Courses

In this Specialization, you will learn to analyze and visualize data in R and create reproducible data analysis reports; demonstrate a conceptual understanding of the unified nature of statistical inference; perform frequentist and Bayesian statistical inference and modeling to understand natural phenomena and make data-based decisions; communicate statistical results correctly, effectively, and in context without relying on statistical jargon; critique data-based claims and evaluate data-based decisions; and wrangle and visualize data with R packages for data analysis. Take The Courses

3. The Data Scientist's Toolbox

In this course you will learn how to program in R and how to use R for effective data analysis. You will learn how to install and configure the software necessary for a statistical programming environment, and describe generic programming language concepts as they are implemented in a high-level statistical language. The course covers practical issues in statistical computing, which includes programming in R, reading data into R, accessing R packages, writing R functions, debugging, profiling R code, and organizing and commenting R code. Topics in statistical data analysis will provide working examples. Take The Course

Read More Here
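The first course above starts with basic probability and Bayes' rule, which is easy to state but worth seeing with numbers at least once. Here is a minimal sketch in Python; every rate below is made up purely for illustration.

```python
# Bayes' rule with made-up numbers: a test is 99% sensitive and 95% specific
# for a condition affecting 1% of people. Given a positive test, how likely
# is the condition?
p_condition = 0.01               # prior: 1% prevalence (assumed)
p_pos_given_condition = 0.99     # sensitivity
p_pos_given_healthy = 0.05       # false positive rate (1 - specificity)

# Total probability of testing positive
p_pos = (p_pos_given_condition * p_condition
         + p_pos_given_healthy * (1 - p_condition))

# Bayes' rule: P(condition | positive) = P(positive | condition) * P(condition) / P(positive)
p_condition_given_pos = p_pos_given_condition * p_condition / p_pos
print(f"P(condition | positive test) = {p_condition_given_pos:.3f}")  # ~0.167
```

Even with a 99% sensitive test, the low base rate keeps the posterior under 17%, which is exactly the kind of intuition these courses try to build.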
LEARNING DATA SCIENCE: OUR FAVORITE DATA SCIENCE BOOKS

In data science, there are many topics to cover, so we wanted to focus on several specific ones. This post will cover books on Python, R programming, big data, SQL, and just some generally good reads for data scientists.

Data Science Books

As a data scientist, you have a very important role. Your goal is to provide your company insights into improving its bottom or top line. The problem is, we can make data say anything we want. It can be very easy to manipulate data to prove that our feature was effective, and it can be tempting if the company incentivizes that type of behavior. Thus, a great general read for data scientists (and really anyone in our modern world) is Naked Statistics. It is similar to the much older book How To Lie With Statistics, which you can read for free. We prefer Naked Statistics because it is a little more modern and covers much more complex statistical debauchery than its older counterpart. It just goes to show that numbers are at your whim, and you have a lot of responsibility to make sure your numbers are right. If something seems amiss with your data…it probably is. Rather than reporting it out right away, think about how you might unknowingly be misrepresenting the facts.

Another similar book is Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are.

LEARNING DATA SCIENCE: OUR FAVORITE RESOURCES FROM FREE TO NOT

Data science has many facets: statistics, data cleansing, programming, system design, and really…almost anything else data related, depending on how large the company is. This post will discuss our favorite resources for these topics. Now, most of these courses and books are primers for topics like statistics, Python, and data science in general. They really will only provide the base knowledge. At the end of the day, real practical experience is one of the few things that will really train your data science knowledge. You should learn as much as you can from these resources, then apply for as many internships and entry-level positions as possible and study for interviews. You will learn much more and gain more than just technical knowledge. You will also gain a lot of business experience.

Free Statistics Courses

Let's start with learning/reviewing basic statistical concepts. Many of you have probably taken a statistics course or two in college. But you might not remember everything clearly, so it's a good idea to review from the beginning. It can be tempting to try to start taking on complex statistical concepts and models. But most algorithms and models require some sort of accuracy and hypothesis testing. This means you actually need to be able to understand concepts like p-values vs. t-values, z-statistics vs. t-statistics, ROC vs. AUC, random variables, etc. These all seem like basic concepts, and maybe you kind of remember these words. However, we find they often get forgotten, as many of us focus more on learning how to implement models in Python and R than on basic statistics. Although these two skills do not necessarily rely on each other, you can start to assume you understand what the p-value means when you run models in either language without fully grasping its importance.

Statistics Courses

1. Khan Academy

This is why we recommend at least going back and walking through the Khan Academy statistics section. They cover concepts like hypothesis testing, t-statistics vs. z-statistics, confidence intervals, etc. Khan is always a good place to start because the videos are a great combination of visual and audio examples. Personally, there aren't too many books we like when it comes to pure statistics. In the R programming section of this resource list, we will reference our favorite R + statistics book.

2. Duke University On Coursera

For full courses that are free, you can try Duke University's statistics courses on Coursera. This is actually several courses that cover multiple types of statistics, like classical vs. Bayesian. These are two different methods that are worth looking into.
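To make the hypothesis-testing review above concrete, here is a minimal sketch of a two-sample t-test in Python with scipy; the data is simulated, so every number is made up for illustration.

```python
# A minimal hypothesis-testing sketch using simulated data (all numbers
# here are made up for illustration).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control = rng.normal(loc=100.0, scale=15.0, size=200)  # e.g., baseline group
variant = rng.normal(loc=104.0, scale=15.0, size=200)  # e.g., after a change

# Two-sample t-test: is the difference in means plausibly due to chance?
t_stat, p_value = stats.ttest_ind(control, variant)
print(f"t-statistic: {t_stat:.2f}, p-value: {p_value:.4f}")
```

The p-value here is the probability of seeing a difference at least this large if the two groups actually had the same mean; it is not the probability that your change "worked", which is exactly the distinction the courses above drill into.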
Python Videos, Books And Courses

Python is an interesting topic. The thing about Python is that there are so many plausible sub-sections of the programming language. For instance, when we prepare for interviews, we always like to clarify which type of Python questions we will be dealing with. Will we be asked questions that focus on operational concepts, analytics, optimization, algorithms and data structures, or possibly data science algorithms? All of these are different topics that have different styles of interview questions. Getting a question on how to traverse a binary tree is very different from having to implement a decision tree algorithm.

As a data scientist, you will typically benefit from the analytical and operational aspects of Python. The operational portion will provide you with the ability to automate the boring stuff (as the cliché book is titled). This book is great for really…any data-focused person. Data scientists, business analysts, business intelligence engineers, and database developers can all benefit from automation. Now, you don't need to use Python: if you're in a Windows environment there is PowerShell, and Linux has bash. Learning some form of scripting language helps improve your workflow and design thinking.

OUR FAVORITE PYTHON BOOKS, COURSES AND YOUTUBE VIDEOS

Python is a common language that is used by both data engineers and data scientists. This is because it can automate the operational work that data engineers need to do, and it has the algorithms, analytics, and data visualization libraries required by data scientists. In both roles, the need to manage, automate, and analyze data is made easier by only a few lines of code. So much so that one of the books we have read and seen in many data-focused practitioners' libraries is Automate The Boring Stuff With Python. The book covers Python basics and some simple automation tips. This is especially good for business analysts who work heavily in Excel. There are also books by O'Reilly that are a great overview of the basics.

Read More Here
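As a small taste of the "automate the boring stuff" idea both of these posts recommend, here is a sketch that sweeps CSV exports into dated folders. The paths and file pattern are hypothetical; adjust them to your own environment.

```python
# Collect all CSV exports from a downloads folder into a dated archive
# folder. Paths here are hypothetical placeholders.
import shutil
from datetime import date
from pathlib import Path

downloads = Path.home() / "Downloads"                  # assumed source folder
archive = Path.home() / "reports" / str(date.today())  # e.g., reports/2019-04-15
archive.mkdir(parents=True, exist_ok=True)

for csv_file in downloads.glob("*.csv"):
    # Move each CSV into today's archive folder
    shutil.move(str(csv_file), str(archive / csv_file.name))
    print(f"Moved {csv_file.name} -> {archive}")
```

A dozen lines like these, scheduled daily, are often the difference between an analyst who spends every morning dragging files around and one who doesn't.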
As a data engineer, data scientist, or just a general data team, how do you go from a high-level goal to a final data product? What is the process taken to successfully create a product that provides insights and action? Just to clarify, a data product can be a model, a dashboard, a web app/API, or even a simple Excel output. But it all has to start somewhere.

Note: In this series we will be walking through creating a data product. In this case we will be basing it on a company that has both online and in-person sales. We will be creating videos, posts, and designs, like an actual project, going through the SQL, Python, and other possible steps. If you are interested in keeping up with all our posts, then please sign up here.

Intro:

Let's say we are a company like Ikea, Bernhardt, or Dorel. A furniture company in the modern era would probably have an e-commerce website and in-person locations that sell furniture. A top-level director or VP at this company will decide they want to increase sales of product category X or Y, or they want to improve sales in a specific region, or perhaps improve profit margins, reduce costs, etc. This is where the ball starts rolling for metric and KPI development.

High Level Goals

The problem with metrics is that you can track everything. You can track down to very specific granularities. But this doesn't always provide value. In order to provide value, you need to first define what is important to track, and in order to know what to track, you need a high-level goal of what you are trying to improve. That is why the first step in the process is deciding what metrics will help track and support the business goals.

In order to track and create these metrics, the data team will need a general understanding of what the business team is trying to do. Nothing needs to be solidified, but it is good for the stakeholders to know what is important and what strategies they are looking to put into place. In addition, having some general goals like "we want to increase the average sale per person by 5%" is good because it makes it easier to foresee what possible strategies might work (but it also makes it easier to point out when a strategy failed). So step one is making sure these goals are written out somewhere as a proposal. It doesn't have to be more than a page. It just needs to clearly state the background of the goal, the why, etc. This will help the data team decipher what the business team is looking for.

Why:

We pointed this out earlier, but the reason it is important for the business team to have a clear idea of what they are trying to do is that there are so many metrics that could be tracked. This can cause a few issues. A good data team won't start work until you have provided more context. However, a less experienced team might try to do the work and struggle to create anything of value. They might spend six months creating a product that has too many metrics or the wrong metrics, or maybe they just never finish, because it is really hard to make forward progress when you don't know where you are going. This is why it is important in step one for the business team to have a clear ask. Just to show some of the metrics that could be tracked, we have listed them out below. They all could provide value, but it depends on what the overall goal is.

Metrics:

There are a lot of possible metrics out there, all strange abbreviations with different rules, that somehow all track very similar things from different angles. Let's go over a few that the furniture store might be interested in using (a rough sketch of how some could be computed follows the list).

Average Revenue Per User (ARPU) — This would be a pretty standard metric. However, if in this case we are adding in the idea of some form of ad campaign, then there is another angle we might want to consider: when the person was exposed to the ad. We will discuss how to analyze this in a later post.

Average Purchases Per Week/Month/Year Per Store — This metric can be tricky. It is very dependent on how much your average product costs and the types of up-sell opportunities there are. For instance, a car is very expensive; very few people are going to buy two at once. However, people may be interested in spending $1,000 for a DVD player to be included, or a specific set of rims, etc. On Amazon, by contrast, you are usually buying products in the $20-$200 range, which makes it easy to bundle an extra book or two, or a pan when you buy a spatula.

Customer Acquisition Cost (CAC) — This metric is used to calculate the cost of acquiring new customers. It is important because one goal a company might have is to reduce its overall CAC. Depending on how this is approached, it can tell you whether ads are getting more effective or possibly whether products are getting better. It would depend on what the company did prior to the shift in CAC.

Number Of New Visitors, i.e. Acquisition — This would be a whole number and not a difficult metric to calculate.

Percent Of New Users Compared To Base — Compared to the number of new visitors metric, viewing things as a percentage can tell you more fairly how your growth is going.

These are just a few examples of metrics, and we will continue to walk through this process.
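As promised above, here is a rough sketch of how a few of these metrics could be computed with pandas. The orders table, its columns, and the marketing spend figure are all hypothetical.

```python
# A rough sketch of a few of the metrics above, computed with pandas.
# The orders table, its columns, and the spend figure are hypothetical.
import pandas as pd

orders = pd.DataFrame({
    "customer_id":     [1, 1, 2, 3, 3, 4],
    "store_id":        [10, 10, 10, 20, 20, 20],
    "revenue":         [250.0, 80.0, 1200.0, 60.0, 40.0, 500.0],
    "is_new_customer": [True, False, False, True, True, False],
})

# Average Revenue Per User (ARPU): total revenue / distinct customers
arpu = orders["revenue"].sum() / orders["customer_id"].nunique()

# Percent Of New Users Compared To Base: share of distinct customers
# whose first order flagged them as new
pct_new = orders.drop_duplicates("customer_id")["is_new_customer"].mean() * 100

# Customer Acquisition Cost (CAC): marketing spend / new customers acquired
marketing_spend = 1500.0  # assumed figure for illustration
new_customers = orders.loc[orders["is_new_customer"], "customer_id"].nunique()
cac = marketing_spend / new_customers

print(f"ARPU: ${arpu:.2f}, new users: {pct_new:.0f}%, CAC: ${cac:.2f}")
```

In practice these calculations would run as queries against the warehouse rather than an in-memory frame, but the definitions carry over directly.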
Up Next:

In our next article we will start to discuss taking data from an operational database and moving it into a data warehouse. This is a key step in analyzing data because it makes the data available to data engineers and analysts. We will discuss ETLs, data models, etc. (there is a small preview sketch below). Please scroll to the bottom if you would like to sign up for our future articles and videos, where we will continue the process of developing a data product.
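As a tiny preview of that operational-database-to-warehouse step, here is a hedged sketch of an extract-transform-load pass in Python. The connection strings, table names, and columns are all hypothetical placeholders, not a prescription for the upcoming article.

```python
# A tiny preview of an extract-transform-load (ETL) pass. Connection
# strings, tables, and columns are hypothetical placeholders.
import pandas as pd
from sqlalchemy import create_engine

source = create_engine("postgresql://user:pass@oltp-host/shop")         # operational DB
warehouse = create_engine("postgresql://user:pass@dwh-host/analytics")  # data warehouse

# Extract: pull yesterday's orders from the operational database
orders = pd.read_sql(
    "SELECT order_id, customer_id, store_id, total, created_at "
    "FROM orders WHERE created_at >= CURRENT_DATE - 1",
    source,
)

# Transform: a simple daily revenue rollup per store
daily = orders.groupby("store_id", as_index=False)["total"].sum()

# Load: append the rollup into a warehouse fact table
daily.to_sql("fact_daily_store_sales", warehouse, if_exists="append", index=False)
```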
Data science and programming are such rapidly expanding specialties that it is hard to keep up with all the articles that come out from Google, Uber, Netflix, and one-off engineers. We have been reading several over the past few weeks and wanted to share some of our top blog posts for this week in April 2019! We hope you enjoy these articles.

Building and Scaling Data Lineage at Netflix
By: Di Lin, Girish Lingappa, Jitender Aswani

Imagine yourself in the role of a data-inspired decision maker staring at a metric on a dashboard, about to make a critical business decision, but pausing to ask a question — "Can I run a check myself to understand what data is behind this metric?" Now, imagine yourself in the role of a software engineer responsible for a micro-service which publishes data consumed by a few critical customer-facing services (e.g. billing). You are about to make structural changes to the data and want to know who and what downstream of your service will be impacted.

Read More Here

DeepMind and Google: the battle to control artificial intelligence
By Hal Hodson

One afternoon in August 2010, in a conference hall perched on the edge of San Francisco Bay, a 34-year-old Londoner called Demis Hassabis took to the stage. Walking to the podium with the deliberate gait of a man trying to control his nerves, he pursed his lips into a brief smile and began to speak: "So today I'm going to be talking about different approaches to building…" He stalled, as though just realizing that he was stating his momentous ambition out loud. And then he said it: "AGI".

Read More Here

Learning Data Science: Our Favorite Resources From Free To Not

Today we wanted to cover some of our favorite resources for data science. As the title suggests, these resources will be from free to not. Some people like buying books and other people prefer online courses. So we have created this list of data resources that range from books to courses, from free to not.

Read More Here

Object Detection with 10 lines of code
By Moses Olafenwa

One of the important fields of Artificial Intelligence is Computer Vision. Computer Vision is the science of computers and software systems that can recognize and understand images and scenes. Computer Vision is also composed of various aspects such as image recognition, object detection, image generation, image super-resolution, and more. Object detection is probably the most profound aspect of computer vision due to the number of practical use cases. In this tutorial, I will briefly introduce the concept of modern object detection, the challenges faced by software developers, the solution my team has provided, as well as code tutorials to perform high-performance object detection.

Read More Here

How Apache Airflow Distributes Jobs on Celery workers
By Hugo Lime

Discover what happens when Apache Airflow performs task distribution on Celery workers through RabbitMQ queues. Apache Airflow is a tool to create workflows such as an extract-load-transform pipeline on AWS. A workflow is a directed acyclic graph (DAG) of tasks, and Airflow has the ability to distribute tasks on a cluster of nodes. Let's see how it does that.

Read More Here
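The DAG-of-tasks idea the article describes is easy to picture with a toy example. Below is a minimal, hypothetical Airflow 1.x-style sketch; the DAG id, tasks, and commands are placeholders of ours, not taken from the article.

```python
# A toy Airflow 1.x-style DAG: three placeholder tasks chained into a
# simple extract -> transform -> load graph.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="example_etl",            # hypothetical pipeline name
    start_date=datetime(2019, 4, 1),
    schedule_interval="@daily",
)

extract = BashOperator(task_id="extract", bash_command="echo extract", dag=dag)
transform = BashOperator(task_id="transform", bash_command="echo transform", dag=dag)
load = BashOperator(task_id="load", bash_command="echo load", dag=dag)

# The >> operator declares edges in the DAG; with a Celery executor these
# tasks can run on different worker nodes.
extract >> transform >> load
```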
Capturing Special Video Moments with Google Photos

Recording video of memorable moments to share with friends and loved ones has become commonplace. But as anyone with a sizable video library can tell you, it's a time-consuming task to go through all that raw footage searching for the perfect clips to relive or share with family and friends. Google Photos makes this easier by automatically finding magical moments in your videos—like when your child blows out the candle or when your friend jumps into a pool—and creating animations from them that you can easily share with friends and family.

Read More Here

Uber Case Study: Choosing the Right HDFS File Format for Your Apache Spark Jobs
By Scott Short

As part of our effort to create better user experiences on our platform, members of our Maps Data Collection team use a dedicated mobile application to collect imagery and its associated metadata to enhance our maps. For example, our team captures images of street signs to improve the efficiency and quality of our maps data in order to facilitate a more seamless trip experience...

Read More Here

You created a machine learning application. Now make sure it's secure.
By Ben Lorica and Mike Loukides

In a recent post, we described what it would take to build a sustainable machine learning practice. By "sustainable," we mean projects that aren't just proofs of concept or experiments. A sustainable practice means projects that are integral to an organization's mission: projects by which an organization lives or dies. These projects are built and supported by a stable team of engineers, and supported by a management team that understands what machine learning is, why it's important, and what it's capable of accomplishing.

Read More Here

Developing A Data Science Career Framework
By Adam McElhinney

At Uptake, data scientists are at the core of what we do. To that end, it's very important that we have a good definition of the following: what does a data scientist do; how is a data scientist's performance evaluated; and how does a data scientist progress in their career. Once you have these definitions, they can be used as the basis for all of your hiring, development, compensation, exit, and promotion decisions.

Read More Here

Diagnosing Heart Disease Using ML Explainability Tools and Techniques
By Rob Harrand

Of all the applications of machine learning, diagnosing any serious disease using a black box is always going to be a hard sell. If the output from a model is a particular course of treatment (potentially with side effects), or surgery, or the absence of treatment, people are going to want to know why. This dataset gives a number of variables along with a target condition of having or not having heart disease. Below, the data is first used in a simple random forest model, and then the model is investigated using ML explainability tools and techniques.

Read More Here

Thank you so much for reading. If you are interested in getting updates about our favorite articles, then sign up here for weekly newsletters.