When developing predictive models and algorithms, whether linear regression or ARIMA, it is important to quantify how well the model fits future observations. One of the simplest ways to measure how correct a model is starts with the error between the predicted value and the actual value. From there, several methodologies take this difference and extract further meaning from it. Quantifying the accuracy of an algorithm is an important step in justifying the use of that algorithm in production.

We will be using the accuracy function from the R programming language as our basis. Its output contains several abbreviations that might not seem so friendly, so we will walk through some of them below. In addition, you can watch us explain the same errors in video format in RStudio!

Mean Absolute Error (MAE)

The mean absolute error is one of the simpler errors to understand. It takes the absolute difference between the actual and forecasted values and finds the average. Taking the absolute value is important because it prevents error values from canceling each other out. For instance, if you were to average 1 and -1, you would get 0, because the 1 and -1 essentially cancel each other. To avoid this, we use the absolute value.

We wanted to demo how you find the MAE both mathematically and in SQL. The SQL expression below finds the same value as the MAE, and it may be simpler to read than the formal mathematical notation:

Avg(Abs(Actual - Forecast))

Root Mean Squared Error (RMSE)

The root mean squared error is somewhat similar to the MAE. Both take the difference between the actual and the forecast. However, the RMSE then squares each difference, finds the average of all the squares, and takes the square root. It might seem like squaring and then taking the square root would cancel each other out, but that isn't the case: the RMSE essentially punishes larger errors. Another way to phrase that is that it puts a heavier weight on larger errors.

For example, compare two sets of forecasts whose errors are identical except for a single observation. Even when that extra error is only 1, the RMSE pulls away from the MAE, and if the extra error were 5, 6, or some larger number, the gap between the RMSE and the MAE would grow even larger. This is because squaring the error makes it grow faster than linearly, so an error increase of 1 has a greater effect at each step (e.g., going from 3 to 4, then from 4 to 5). This is why the RMSE punishes larger errors.

Below is again the SQL notation of the RMSE (note that SQL's Power function needs the exponent spelled out):

Sqrt(Avg(Power(Actual - Forecast, 2)))
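To make these first two measures concrete, here is a minimal R sketch that computes the MAE and RMSE by hand. The actual and forecast vectors are made-up numbers for illustration only, not data from this article:

# Illustrative data: actual observations and a model's forecasts
actual <- c(100, 110, 120, 130, 140)
fcst   <- c(102, 107, 125, 128, 150)

errors <- actual - fcst

# MAE: average of the absolute errors
mae <- mean(abs(errors))       # 4.4

# RMSE: square the errors, average them, then take the square root
rmse <- sqrt(mean(errors^2))   # about 5.33

Notice that the single large error of 10 pulls the RMSE well above the MAE, which is exactly the heavier weighting of large errors described above.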
Mean Absolute Percentage Error (MAPE)

One issue you may run into with both the RMSE and the MAE is that they can simply become large numbers that don't really say all that much. What does an RMSE of 597 mean? How bad or good is that? Part of this is because you need to compare it to other models. Another issue is that the RMSE is based on the difference between the actual and the forecast, which, depending on your data, could be on very different scales. For instance, if you are creating a model for a billion-dollar corporation, your error will be much larger than one for a company that only grosses six figures.

In this case, the mean absolute percentage error is a good method in the sense that it expresses the error as a percentage of the actual value. This provides a more standardized error measure. For instance, if the error was 10 and the actual value was 100, the percentage would be 10%; if the error was 100 and the actual value was 1000, the measure would still be 10%. This provides a little more context than the RMSE and the MAE, which can help better explain the model's accuracy. The SQL notation is listed below:

Avg(Abs(Actual - Forecast) / Abs(Actual)) * 100

Mean Absolute Scaled Error (MASE)

The mean absolute scaled error is the last error we will be discussing today. The MASE is slightly different from the other three: it compares the MAE of the model you are testing to the MAE of the naive model. The naive model simply forecasts each observation using the previous observation. The MASE is the ratio of your model's MAE over the naive model's MAE. When the MASE is equal to 1, your model has the same MAE as the naive model, so you almost might as well pick the naive model. If the model's MASE is 0.5, that suggests your model is about twice as accurate as just picking the previous value. This error skips the step of running several models and instead automatically compares your model to a baseline. It provides a little more context than the MAE, RMSE, and MAPE.
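As a rough sketch, here is how you might compute the MAPE and MASE by hand in R, using the same illustrative vectors as before. Note that this version scales by the naive MAE of the same short series; formal definitions of the MASE scale by the naive MAE over the training data, so treat this as a simplified illustration:

actual <- c(100, 110, 120, 130, 140)
fcst   <- c(102, 107, 125, 128, 150)

# MAPE: average absolute error as a percentage of the actual values
mape <- mean(abs(actual - fcst) / abs(actual)) * 100   # about 3.5%

# Naive model: forecast each point with the previous observation,
# so its absolute errors are the one-step differences of the actuals
naive_mae <- mean(abs(diff(actual)))                   # 10

# MASE: the model's MAE scaled by the naive model's MAE
mase <- mean(abs(actual - fcst)) / naive_mae           # 0.44, better than naive

If you fit a model with the forecast package in R, its accuracy function reports all four of these measures (along with a few others) in one call, which is the output this article is based on.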
Overall, these four errors tell a story that can help you decide whether your algorithm or model is a good fit. There are still other factors to consider, but I do hope this helped simplify these strange abbreviations. If you have any more questions, or have other statistics or programming questions, please feel free to reach out!

Other great reads about data science:

What is A Decision Tree
How Algorithms Can Become Unethical and Biased
How Men's Wearhouse Could Use Data Science Cont
Introduction To Time Series In R
How To Develop Robust Algorithms
4 Must Have Skills For Data Scientists
Our team wanted to start introducing the concept of time series and forecasting in R. This first video is an intro to the time series object in R. It goes over the various parameters the time series object takes and discusses some of the nuances that can be defined in those parameters. We will be creating an entire series that covers more complex models like ARIMA, ETS, and other forecasting models that can be used to better predict time series. If you have any questions on R or seasonal forecasting, please feel free to reach out!
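As a quick taste of what the video covers, here is a minimal sketch of creating a monthly time series object with R's built-in ts function. The data values are made up; frequency and start are two of the parameters the object takes:

# 24 made-up monthly observations
sales <- c(112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118,
           115, 126, 141, 135, 125, 149, 170, 170, 158, 133, 114, 140)

# frequency = 12 declares the data monthly; start = c(2018, 1)
# anchors the series at January 2018
sales_ts <- ts(sales, start = c(2018, 1), frequency = 12)

sales_ts              # prints with month and year labels
frequency(sales_ts)   # 12
plot(sales_ts)        # quick line plot of the series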
If you're interested in reading more about data science:

How To Grow As A Data Scientist
Boosting Bagging And Building Better Algorithms
How To Survive Corporate Politics As A Data Scientist
8 Top Python Libraries For Machine Learning
What Is A Decision Tree

Fraud costs companies billions of dollars every year. Medical fraud alone is estimated to cost the US 68 billion dollars annually. Relying only on manual detection of fraud is rarely a cost-effective remedy to this problem. In fact, in many cases it costs more to find fraud than the remuneration received once the case is finished. This is far from efficient and makes it very difficult for companies to approach fraudulent claims and transactions in any industry. Going beyond the hype and sizzle of big data and analytics, fraud detection is where these buzzwords provide major ROI.

In the world of big data, it is key to utilize hybrid processes that involve developing algorithms to help quickly sort out the claims or transactions that are most likely to be fraudulent. This can save human reviewers hundreds to thousands of hours of manual search time, allowing your team to focus its efforts on other pressing issues. Our team has experience developing these systems, which can help pinpoint the transactions that need to be analyzed and decrease the man-hours required to tread through all of your transactions.

Call To Action

If your department is looking to develop a system to better classify fraud, then contact us today! We would love to help develop your product. Our background in data engineering, data analytics, and data science makes us a strong partner for these projects.

As data grows, the ability of your employees to turn that data into cost savings and revenue makes a huge difference to your bottom line. Upskilling your workers can drastically increase the value they provide your teams and your entire company. SQL, Python, R, and Tableau are just a few modern analytical tools that can help your employees more efficiently discover hidden value in this data-driven world. As data grows in volume, velocity, and variation, being able to wrangle data skillfully provides a large impact for a business. Using these tools effectively can provide insights into customers, supply chains, fraudulent behaviors, and a list of other business problems that open up opportunities for your company.

Our team has developed coursework to help upskill your teams so they can take advantage of these tools and new data sources. We use real-life examples from healthcare, IT operations, HR, accounting, etc. in order to challenge your teams and help them further develop their analytical tool belts.

Call To Action

If your department is looking to develop your employees' analytical abilities with Python, R, SQL, or other data analytics tools, then contact us today! We would love to help invest in your employees' abilities.

Engineering accurate and robust data products like visualizations, algorithms, and reports is a key step in informing directors and managers so that they can make decisions confidently. These data analytics products often require multiple phases, including a data engineering phase, a research phase, and a design phase. However, most analytics teams are backed up with operating activities that bar them from developing these new, valuable insights. Our team has experience successfully developing multiple products, from fraud detection to forecasting and predictive analytics (to name a few). We have engineered and implemented products in various systems and understand the value good data analytics can provide your company.
Call To Action

If your department is looking to develop a new data analytics product, then contact us today! We would love to help develop your product.
Accurately forecasting costs, sales, user growth, patient readmissions, etc. is an important step in providing directors with actionable information. This can be difficult to model by hand or in Excel. In addition, traditional methods like moving averages might not provide enough insight into the various trends and seasonality. Models like ARIMA and ETS give analysts the ability to predict more accurately and robustly by considering multiple factors like seasonality and trend. What is even better is that languages like R and Python make it much easier for analysts and data teams to avoid all the work they would usually have to do by hand. This can reduce the time to develop a model by more than half and increase accuracy.

Our team has developed a course to help upskill your analysts in R programming, ARIMA, and ETS. It covers not only the programming aspect, but also many of the important topics in time series forecasting, like stationarity, autocorrelation, and unit roots. This class will help educate your team and improve their ability to use R and develop models for forecasting.
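As a rough sketch of how little code this takes, here is how an analyst might fit ARIMA and ETS models to a monthly series with R's forecast package. The sales_ts name is a placeholder for your own ts object (like the one built earlier):

library(forecast)

# Automatically search for a reasonable ARIMA specification
fit_arima <- auto.arima(sales_ts)

# Fit an exponential smoothing (ETS) model
fit_ets <- ets(sales_ts)

# Forecast the next 12 months with each model
fc_arima <- forecast(fit_arima, h = 12)
fc_ets   <- forecast(fit_ets, h = 12)

# Compare in-sample accuracy measures (MAE, RMSE, MAPE, MASE, ...)
accuracy(fit_arima)
accuracy(fit_ets)

plot(fc_arima)   # plot the forecast with prediction intervals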
Call To Action

If your department is looking to develop and improve your forecasts and upskill your employees, contact us today! We would love to help instruct you and your teams.