Creating an effective data strategy is not as simple as hiring a few data scientists and data engineers and purchasing a Tableau license. Nor is it just about using data to make decisions. Creating an effective data strategy is about creating an ecosystem where getting to the right data, metrics, and resources is easy. It's about developing a culture that learns to question data and to look at a business problem from multiple angles before reaching a final conclusion.

Our data consulting team has worked with everyone from billion-dollar tech companies to healthcare organizations and just about every type of company in between. We have seen the good, the bad, and the ugly of data being used for strategy. We want to share some of the simple changes that can improve your company's approach to data.

Find A Balance Between Centralized And Decentralized Practices

Standards and over-centralization inevitably slow teams down. Small changes to tables, databases, and schemas might be forced through an overly complex process that keeps teams from being productive. On the other hand, centralization makes it easier to roll out changes in strategy without having to go to each team and force them to adopt a new process.

In our opinion, one of the largest advantages a company can gain comes from developing tools and strategies that find a happy medium between centralized and decentralized. This usually means creating standards that simplify development decisions and improve the management of common tasks every data team needs to perform, such as documentation and data visualization, while decentralizing the decisions that are department- and domain-specific. Here are some examples of opportunities to provide standardized tools and processes for unstandardized topics.

Creating UDFs And Libraries For Similar Metrics

After working in several industries, including healthcare, banking, and marketing, one thing you realize is that many teams use the same metrics. This could be across industries or, at the very least, across internal teams. The problem is that every team will inevitably create a different method for calculating the exact same number. This leads to duplicate work, duplicate code, and executives making conflicting decisions because top-line metrics don't agree.

Instead of relying on each team to build its own process for calculating the various metrics, you can create centralized libraries that use the same fields to calculate the correct metrics. This standardizes the process while still giving end users enough flexibility to build reports around their specific needs. This only works if the metrics are used consistently. In the healthcare industry, for example, metrics such as per-patient-per-month costs (PMPM), readmission rates, and bed turnover rates are used consistently. These are sometimes calculated by an EMR like Epic, but they might be recalculated by analysts for more specific cases, or by external consultants. Creating functions or libraries that do this work easily can improve consistency and save time. Instead of having each team develop its own method, you simply provide a framework that makes it easy to implement the same metrics.
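As a rough illustration of what such a shared library can look like, here is a minimal Python sketch. The module name, function names, and inputs (total_cost, member_months, readmissions, discharges) are hypothetical; the point is that every team calls the same functions instead of re-deriving the formulas.

    # metrics_library.py - a minimal sketch of a shared metrics module (names are illustrative)

    def pmpm_cost(total_cost, member_months):
        """Per-patient-per-month cost: total cost spread over the months of enrollment."""
        if member_months <= 0:
            raise ValueError("member_months must be greater than zero")
        return total_cost / member_months

    def readmission_rate(readmissions, discharges):
        """Share of discharges that resulted in a readmission within the tracked window."""
        return readmissions / discharges if discharges else 0.0

    # Every team imports the same functions rather than re-implementing them:
    # from metrics_library import pmpm_cost
    # pmpm_cost(total_cost=1_250_000, member_months=10_000)  # -> 125.0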
Automate Mundane But Necessary Tasks

Creating an effective data strategy is about making the usage and management of data easy. Part of this process is taking the mundane tasks that all data teams need to do and automating them. One example is documentation. Documentation is an important factor in helping analysts understand the tables and processes they are working with, and good documentation allows analysts to perform better analysis. However, documentation is often put off until the last minute or never done at all.

Instead of forcing engineers to document every new table by hand, a better idea is to create a system that automatically scans the available databases on a regular interval and keeps track of what tables exist, who created them, what columns they have, and whether they have relationships to other tables. This could be a project for your DevOps team to take on, or you could look into a third-party system such as dbForge's documentation tool for SQL Server. That doesn't cover everything, and that particular tool only works for SQL Server, but a similar tool can simplify a lot of people's lives. Teams will still need to describe what each table and column means, but the initial work of gathering the basic information can be tracked automatically. Reducing this necessary but repetitive work makes everyone's life a little easier.
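A minimal sketch of that kind of metadata scan, assuming a SQL Server database reachable through pyodbc and an ODBC connection string you supply. INFORMATION_SCHEMA is standard; the scheduling, snapshot storage, and ownership lookups are left out.

    import pyodbc  # assumes the pyodbc driver is installed; the connection string is supplied by you

    CATALOG_QUERY = """
        SELECT table_schema, table_name, column_name, data_type
        FROM INFORMATION_SCHEMA.COLUMNS
        ORDER BY table_schema, table_name, ordinal_position
    """

    def scrape_schema(connection_string):
        """Return (schema, table, column, type) rows describing every table in the database."""
        conn = pyodbc.connect(connection_string)
        try:
            return conn.cursor().execute(CATALOG_QUERY).fetchall()
        finally:
            conn.close()

    # Run this on a schedule (cron, Airflow, etc.) and diff each snapshot against the last
    # to see which tables and columns appeared, changed, or disappeared.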
Provide Easier Methods To Share And Track Analysis

This one is geared specifically toward data scientists. Data scientists often do their work in Jupyter notebooks and Excel files that only they have access to. In addition, many companies don't require some form of version control like git, so data scientists never version their work. This limits the ability to share files and to keep track of how an analysis changes over time. Collaboration becomes difficult because co-workers are stuck passing files back and forth and version controlling by hand. Typically that looks like files with suffixes like _20190101_final and _20190101_finalfile... for those of you who don't get it, hopefully you never will. On top of this, since many of these Python scripts rely on multiple libraries, it can be a pain to pip install all the correct versions into your environment. Depending on how complex the analysis is, these small difficulties can honestly cost you a day or two of troubleshooting.

However, there are plenty of solutions. There are a lot of great tools that can help your data science teams collaborate, including products from companies like Domino Data Lab. You can always use git and virtual environments as well, but that demands that your data scientists be proficient with those technologies, which is not always the case. Either way, the goal is the same: let your teams work independently while sharing their work easily.

Data Cultural Shift

Adding new libraries and tools is not the only change that needs to happen when you are trying to create a more data-driven company. A more important, and much more difficult, shift is cultural. Changing how people look at and treat data is a key challenge. Here are a couple of reasons why.

Data Lies

For those who haven't read the book How to Lie with Statistics, spoiler alert: it is really easy to make numbers tell the story you want. There are a lot of ways to do it. A team can cherry-pick the statistics that support its agenda. Or a research team can ignore confounding factors and report a statistic that seems shocking only because the other variables were never considered.

Being data driven as a company means developing a culture that looks at statistics and metrics and checks that nothing is interfering with the number. This is far from easy. In data science and analytics, most metrics and statistics come with stipulations that could negate whatever message they seem to carry. That is why creating a culture that looks at a metric and asks why is part of the process. If it were as simple as getting outputs and p-values, data scientists would be out of a job, because plenty of third-party products will find the best algorithm and do feature selection for you. But that is not the only job of a data scientist. They are there to question every p-value and really dig into the why behind the number they are seeing.

Data Is Still Messy

Truth be told, data is still very messy. Even with today's modern ERPs and applications, bad data gets through and can mislead managers and analysts. This can happen for a lot of reasons: how the applications manage data, how system admins modified those applications, and so on. Even changes that seem insignificant from a business-process perspective can have a major impact on how data is stored. In turn, when data engineers pull data, they might not represent it accurately because of bad assumptions and limited knowledge.

This is why just having numbers is not good enough. Teams also need a good sense of the business and the processes that create the data, so messy data never reaches the tables analysts use directly. Our perspective is that data analysts need confidence that the data they are looking at correctly represents the corresponding business processes. If analysts have to remove data, or consistently add joins and where clauses to accurately represent the business, then the data is not "self-service". This is why, whenever data engineers create new data models, they need to work closely with the business to make sure the correct business logic is captured in the base layer of tables. That way, analysts can have near 100% trust in their data.

Conclusion

At the end of the day, creating an effective data culture requires both a top-down and a bottom-up shift in thinking. At the executive level, decisions need to be made about the key areas where access to data can be made easier. Then teams can work on becoming more proficient at actually using data to make decisions. We often find that most teams spend too much time on data tasks that need to get done but could be automated. Improving your company's approach to data can provide a large competitive advantage and free your analysts and data scientists to work on projects they enjoy and that help your bottom line! If your team needs data consulting help, feel free to contact us! If you would like to read more posts about data science and data engineering, check out the links below.
Using Python to Scrape the Meet-Up API
The Advantages Healthcare Providers Have In Healthcare Analytics
142 Resources for Mastering Coding Interviews
Learning Data Science: Our Top 25 Data Science Courses
The Best And Only Python Tutorial You Will Ever Need To Watch
Dynamically Bulk Inserting CSV Data Into A SQL Server
4 Must Have Skills For Data Scientists
What Is A Data Scientist
Recently, our team of data consultants had an awesome opportunity to present to a class of future data scientists at Galvanize Seattle. One student who came to hear our talk was Rebecca Njeri. Below, she shares tips on how to design a data science project.

To Begin, Brainstorm Data Project Ideas

To begin your data science project, you will need an idea to work on. To get started, brainstorm possible ideas that might interest you. During this process, go as wide and as crazy as you can; don't censor yourself. Once you have a few ideas, you can narrow them down to the most feasible or interesting one. You could brainstorm ideas around these prompts:

Questions To Help You Think Of Your Next Data Science Projects
Write A Proposal

Write a proposal along the lines of the Cross Industry Standard Process for Data Mining (the CRISP-DM standard), which has the following steps.

Business Understanding

What are the business needs you are trying to address? What are the objectives of the data science project? For example, if you are at a telecommunications company that needs to retain its customers, can you build a model that predicts churn? Maybe you are interested in using live data to better predict which coupons to offer which customers at the grocery store.

Data Understanding

What kind of data is available to you? Is it stored in a relational or NoSQL database? How large is your data? Can it be stored and processed on your hard drive, or will you need cloud services? Are there any confidentiality issues or NDAs involved if you are working in partnership with a company or organization? Can you find a new data set online that you could merge in to increase your insights?

Data Preparation

This stage involves doing a little exploratory data analysis and thinking about how your data will fit into the model you have chosen. Is the data stored in types that are compatible with the model? Are there missing values or outliers? Are these naturally occurring discrepancies, or errors that should be corrected before fitting the data to a model? Do you need to create dummy variables for categorical variables? Will you need all the variables in the data set, or are some dependent on each other?

Modeling

Choose a model and tune the parameters before fitting it to your training set of data. Python's scikit-learn library is a good place to get model algorithms. With larger data, consider using Spark ML.

Evaluation

Withhold a test set of data to evaluate the model's performance. Data Science Central has a great post on different metrics that can be used to measure model performance. The confusion matrix can help with considering the cost-benefit implications of the model's performance.
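To make the modeling and evaluation steps concrete, here is a minimal scikit-learn sketch. The churn.csv file and the churned column are placeholders; a real project would also tune hyperparameters and look at more than one metric.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import train_test_split

    # Placeholder data: load your own labeled data set here.
    df = pd.read_csv("churn.csv")
    X = df.drop(columns=["churned"])   # feature columns
    y = df["churned"]                  # 1 if the customer left, 0 otherwise

    # Withhold a test set so the evaluation reflects unseen data.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    model = RandomForestClassifier(n_estimators=200, random_state=42)
    model.fit(X_train, y_train)

    # The confusion matrix separates false positives from false negatives,
    # which is where the cost-benefit discussion starts.
    print(confusion_matrix(y_test, model.predict(X_test)))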
Deployment/Prototyping

Deployment and implementation are some of the key components of any data-driven project. You have to get past the theory and algorithms and actually integrate your data science solution into the larger environment. Flask and Bootstrap are great tools to help you deploy your data science project to the world.

Planning Your Data Science Projects

Keep a timeline with To Do, In Progress, Completed, and Parking sections. Hold a self-scrum (lol) each morning to review what you accomplished the previous day and set a goal for the new day. It can also help to find a friend to scrum with and help you keep track of your metrics. Goals and metrics help you hold yourself accountable and ensure that you actually follow through and get your project done.

Track Your Progress

Create a GitHub repo for your project; your proposal can be incorporated as the README. Commit your work at whatever frequency makes you comfortable, and keep track of how much progress you are making on your metrics. A repo will also make it easier to show your code to friends and mentors for a code review.

Knowing When To Stop Your Project

It may be good to work on your project with a minimum viable product in mind. You may not get all the things on your To Do list accomplished, but having an MVP can help you know when to stop. When you have learned as much as you can from a project, even if you don't have the perfect classification algorithm, it may be more worthwhile to invest in a new project.

Some Examples Of Data Driven Projects

Below are some links to GitHub repos of data science capstones:

Mememoji
Predicting Change in Rental Price Units in NYC
Bass Generator

All the best with your new data science project! Feel free to reach out if you need someone to help you plan your new project. Want to be further inspired on your next data-driven project? Check out some of our other data science and machine learning articles. You never know what might inspire you.

Practical Data Science Tips Creatively Classify Your Data
25 Tips To Gain New Customers
How To Grow Your Data Science Or Analytics Practice
Python is a great language for developers and scripters alike. It allows for large-scale design and OOP concepts, yet it was also designed to make it easy to read and write quick scripts. This is great, because data scientists don't have all day to spend debugging. They do, however, need to spend some time picking out which Python libraries will work best for their current projects. We at Acheron Analytics have written up a quick list of the eight most used libraries that can help with your next machine learning project.
P.S. We had a busy week and couldn't get to an actual code example this week as we promised in our last post. However, we are working on that post! We will shortly have an example in R of a from-scratch algorithm.

Theano

Theano, according to Opensource.com, is one of the most heavily used machine learning libraries to date. The great thing about Theano is that it is written leaning on mathematical concepts and computer algebra. When the code is compiled, it can match C-level performance, because it is written to take advantage of how compilers work: how a computer parses and converts tokens into parse trees, how it optimizes and merges similar sub-graphs, how it uses the GPU for computations, and several other optimizations. For the full list, check out the Theano main page. For those who have used math-based tools like Mathematica and MATLAB, the coding style won't seem too strange. What is great is that Nvidia fully supports Theano and has a few helpful videos on how to use Theano with their GPUs.

Pandas

When it comes down to it, machine learning and data science need good data, and you have to handle that data somehow. One great Python library for this is Pandas. It was one of the first data libraries many of us at Acheron were exposed to, and it still has a great following. If you are an R programmer, you will enjoy it: it gives you data frames, which makes thinking about the data you are using much more natural. If you are a SQL or RDBMS person, it naturally fits your tabular view of data, and even if you are more of a Hadoop or MongoDB follower, Pandas just makes life easier. It doesn't stop there: it handles missing data, time series, IO, and data transformations incredibly well. If you are trying to prepare your data for analysis, this library is a must. We also wanted to share a great Python cheat sheet we found; it would feel wrong to just stick it on our blog, so here is a link to the best Python cheat sheet we have found yet. It even beats DataCamp's cheat sheets!
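A minimal sketch of that day-to-day Pandas work, assuming a hypothetical sales.csv with order_date and revenue columns: load the file, handle missing values, and roll the data up as a time series.

    import pandas as pd

    # Hypothetical file and columns, just to show the typical flow.
    df = pd.read_csv("sales.csv", parse_dates=["order_date"])
    df["revenue"] = df["revenue"].fillna(df["revenue"].median())         # handle missing data
    monthly = df.set_index("order_date")["revenue"].resample("M").sum()  # time series rollup
    monthly.to_csv("monthly_revenue.csv")                                # IO back out
    print(monthly.head())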
NumPy

NumPy is another data-handling library. You typically see it paired with TensorFlow, SciPy, matplotlib, and many other Python libraries geared towards deep learning and data science. That is because it is built to manage and treat data as matrices. Again, this goes back to MATLAB and R: the purpose is to make the complex matrix operations required by neural networks and advanced statistics easy to do. Trying to handle those kinds of operations with plain multi-dimensional lists is not efficient. Say you want to set up an identity matrix: that is one line of code in NumPy. Everything about the library is geared towards matrices and quick mathematical operations done in just a few lines. Coursera has a great course that you can use to further your knowledge of this library.

How to code an identity matrix:

    np.identity(3)
    array([[ 1.,  0.,  0.],
           [ 0.,  1.,  0.],
           [ 0.,  0.,  1.]])

Scrapy

This one is the odd one out. Scrapy is not a mathematical library; it doesn't perform data analysis or deep learning. It does none of the things you would normally associate with machine learning. However, it does one thing really well: crawl the web. Scrapy is built to make it easy to develop safe web crawlers (side note: make sure you read all the documentation, it is only a safe web crawling library if you configure it correctly, and that is something you have to research). The web is a great source of unstructured, structured, and visual data. As long as a site approves of you crawling it and doesn't mind you using its content (which we are not responsible for figuring out), you can gain a lot of insight into many topics. You can use libraries that turn words into vectors to help perform analysis, sentiment analysis, and so on. It is much more difficult than working with straightforward numbers, but it is also much richer: there is a lot to be gained from pictures, words, and unstructured data, and with that comes the task of getting the information out of that complex data. That being said, Pattern is another specialized web mining scraper. It has tools for natural language processing (NLP) and machine learning, includes several built-in algorithms, and really makes your life as a developer much easier!

scikit-learn

We have discussed several libraries, such as matplotlib, NumPy, and Pandas, and how great they are for machine learning and data science. Now imagine an easy-to-use library built on top of all of those, as well as several other handy libraries. That is what scikit-learn is. It is a compilation of these libraries that provides easy access to complex data science algorithms and data visualization techniques. It can be used for clustering, transforming data, dimensionality reduction (reducing the number of features), ensemble methods, feature selection, and a lot of other classic data science techniques, all in basically a few lines. The hardest part is making sure you have a Python virtual environment set up when you pip install!
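To show what "a few lines" looks like, here is a minimal sketch that scales a toy data set, reduces it to two dimensions, and clusters it. The data is synthetic and the parameter choices are arbitrary.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))                               # synthetic stand-in for real features

    X_scaled = StandardScaler().fit_transform(X)                # transforming data
    X_2d = PCA(n_components=2).fit_transform(X_scaled)          # dimensionality reduction
    labels = KMeans(n_clusters=3, n_init=10).fit_predict(X_2d)  # clustering

    print(labels[:10])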
matplotlib and ggplot

Now you have done all this analysis and run all your algorithms. What now? How do you actually turn all this data into value? How do you inspire your executives and tell them "stories" full of "insight"? If you don't want to mess around with D3.js, Python has you covered with libraries like matplotlib and ggplot. Both are built to mimic MATLAB and R functionality. Matplotlib has some great 3D graphs that will help you visualize your k-NN and PCA clusters. In the data exploration, hypothesis, and final-product phases of a project, these libraries make life much easier: you can visualize your data, its quirks, and your final results.
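A minimal matplotlib sketch of that kind of quick visual check: a 3D scatter plot colored by cluster label. The points and labels here are synthetic; in practice they would come from your own projection and clustering step.

    import matplotlib.pyplot as plt
    import numpy as np

    # Synthetic points standing in for, say, a 3-component PCA projection.
    rng = np.random.default_rng(1)
    points = rng.normal(size=(150, 3))
    clusters = rng.integers(0, 3, size=150)    # pretend cluster labels

    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")      # matplotlib's built-in 3D axes
    ax.scatter(points[:, 0], points[:, 1], points[:, 2], c=clusters)
    ax.set_xlabel("PC1")
    ax.set_ylabel("PC2")
    ax.set_zlabel("PC3")
    plt.show()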
Tensorflow... again!

We have discussed Tensorflow before on this blog, when we talked about some common libraries used by data science professionals, but it doesn't hurt to talk about it again. The fact is, if you are in the world of machine learning, you have probably heard of, tried, or implemented some form of deep learning algorithm. Are they necessary? Not all the time. Are they cool when done right? Yes.

Tensorflow and Theano are very similar. The interesting thing about Tensorflow is that when you are writing Python, you are really only designing a graph for the framework to compile into C++ code and then run on either your CPU or GPU. This is what makes it so effective and easy to work with: instead of having to write at the C++ or CUDA level, you can code it all in Python first. The difficulty comes in actually understanding how to properly set up a neural network, convolutional network, and so on. A lot of questions come into play: which type of model, what kind of regularization is best, what level of dropout or robustness you want, and whether you are going to purchase GPUs from Nvidia or try to make it work on CPUs. (Depending on your data size, you will most likely have to purchase hardware or pay for AI-as-a-service from Google.)
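A minimal sketch of that write-Python-get-a-graph idea, using the modern tf.function decorator; the function itself is arbitrary, and older TensorFlow 1.x code built its graphs explicitly with sessions instead.

    import tensorflow as tf

    @tf.function  # traces this Python function into a graph TensorFlow can optimize and run on CPU or GPU
    def affine(x, w, b):
        return tf.matmul(x, w) + b

    x = tf.constant([[1.0, 2.0]])
    w = tf.constant([[0.5], [0.25]])
    b = tf.constant([0.1])
    print(affine(x, w, b))  # tf.Tensor([[1.1]], shape=(1, 1), dtype=float32)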
These are just a few of the most commonly mentioned Python libraries utilized by academics and professionals. Do you agree? Feel free to share which languages, libraries, and tools you use, even if they aren't Python!

In the era of data science and AI, it is easy to skip over crucial steps such as data cleansing. However, this can cause major problems in your applications later down the data pipeline. The promise of near-magical data science solutions can overshadow the necessary steps required to get to the best final product. One such step is cleaning and engineering your data before it even gets placed into your system. Truthfully, this is not limited to data science: whether you are doing data analytics, data science, machine learning, or just old-fashioned statistics, data is never whole and pure before refining. Just like putting bad, unprocessed petroleum into your car, putting unprocessed data into your company's systems will either immediately or eventually wreak havoc (here are some examples). Whether that means software actually failing or executives getting bad information, both are unacceptable.
We at Acheron Analytics wanted to share a few tips to ensure that whatever data science or analytics projects you take on, you and your team are successful. This post includes some brief examples in R, Python, and SQL; feel free to reach out with any questions.

Duplicate Data

Duplicate data is the scourge of any analyst, whether you are using Excel, MySQL, or Hadoop. Making sure your systems don't produce duplicate data is key. There are several sources of duplicate data. The first is when the data is input into your company's data storage system: the same data may try to sneak its way in due to end-user error, a glitch in the system, a bad ETL, and so on. All of this should be managed by your data system. Most people still use an RDBMS, so a unique key will prevent duplicates from being inserted. Sometimes this requires a combination of fields to check whether the incoming data is a duplicate. For instance, if you are looking at vendor invoice line items, you probably shouldn't have the same line item number and header ID twice. This becomes more complicated when line items change (but even that can be accounted for). If you are analyzing social media post data, each snapshot you take may have the same post ID but altered interaction data (likes, retweets, shares, etc.). This touches on slowly changing dimensions, which is another great topic for another time; feel free to read up more on the topic here. In both cases, your systems should be calibrated to safely throw out the duplicate data and store the errors in an error table. All of this will save your team time and confusion later.

Besides the source data itself containing duplicates, the other common source of duplicates is an analyst's query. If, by chance, they don't have a 1:1 or 1:many relationship on the key they are joining on, they may find themselves with several times the amount of data they started with. The fix could be as simple as restructuring the team's query so it creates proper 1:1 relationships, or you may have to completely restructure your database. It is usually the former.

How to Get Rid of Duplicate Data in SQL
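For teams working in Python rather than SQL, the same de-duplication idea looks roughly like this pandas sketch; the file name and the header_id and line_number key columns are placeholders.

    import pandas as pd

    # Hypothetical invoice line items; header_id + line_number is the business key.
    lines = pd.read_csv("invoice_lines.csv")

    # Separate the offending rows, then keep only the first occurrence of each key.
    dupes = lines[lines.duplicated(subset=["header_id", "line_number"], keep=False)]
    clean = lines.drop_duplicates(subset=["header_id", "line_number"], keep="first")

    dupes.to_csv("duplicate_rows_for_review.csv", index=False)  # park the rejects like an error table
    print(len(lines), len(clean))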
Missing Data

Has your company ever purchased data from a data aggregator and found it filled with holes? Missing data is common across every industry. Sometimes it is due to system upgrades and new features being added; sometimes it is just bad data gathering. Whatever the cause, it can really skew a data science project's results. What are your options? You could ignore rows with missing data, but that might cost your company valuable insight, while including the gaps will produce incorrect conclusions. So how do you win?

There are a few schools of thought on this. One is to simply put a random but reasonable number in place of the missing value. This doesn't really make sense, as it becomes difficult to tell which feature is driving what. A more common and reasonable practice is using the data set average. However, even this can be misleading. For instance, on one project we were analyzing a large population of users and their sociometric data (income, neighborhood trends, shopping habits). About 15% of the data purchased from a credit card carrier was missing, so throwing it away was not in our best interest. Instead, because we had each person's zip code, we were able to fill the gaps with averages computed at the local level. This was a judgment call, and a good one in this case. We compared it to averaging over the entire data set, and we got a much clearer picture of our population's features.

The problem with a general average over several hundred thousand people is that you eventually get some odd sways. Take income: if your data set has a good distribution, you will end up with your average income being, well, average. Then, suddenly, people who live in richer neighborhoods may create their own classification. The difference between 400k and 50k (even when normalized) can drastically alter the rest of the features. Does it really make sense for someone making 50k a year to be purchasing over 100k of products a year? In the end, we would have gotten a strange cluster of large spenders who supposedly made average income. When your focus is socio-economic factors, this can cause some major discrepancies.

How to Handle Missing Data with SQL
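For Python users, the zip-code-level fill described above can be sketched in pandas roughly like this; the file and column names are placeholders.

    import pandas as pd

    users = pd.read_csv("users.csv")  # hypothetical file with income, zipcode, and other features

    # Fill missing income with the average income of that person's zip code,
    # falling back to the overall average for zip codes with no known incomes.
    zip_avg = users.groupby("zipcode")["income"].transform("mean")
    users["income"] = users["income"].fillna(zip_avg).fillna(users["income"].mean())

    print(users["income"].isna().sum())  # should print 0 afterwards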
Data Normalization
Data normalization is one of the first critical steps to making sure your data makes sense to run through most algorithms. Simply feeding in variables that could be anything from age to income to computer usage time creates the hassle of comparing apples to oranges. Trying to weigh 400k against 40 years will create bad outputs; the numbers just don't share a scale. Instead, normalization makes your data comparable. It takes the max and min of a data set, maps them to 1 and 0, and scales the rest of the numbers in between. Working on a 0-1 scale lets your data science teams blend the data more smoothly, because they are no longer comparing scales that don't match. In most cases this is a necessary step for success.

R Programming Normalization
Python Normalization (this can also depend on whether you are using NumPy, Pandas, etc.)
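A minimal sketch of min-max scaling with plain pandas (scikit-learn's MinMaxScaler does the same job); the columns here are made up for illustration.

    import pandas as pd

    df = pd.DataFrame({"age": [22, 35, 58, 41], "income": [38_000, 52_000, 400_000, 75_000]})

    # Min-max normalization: map each column's minimum to 0 and its maximum to 1.
    normalized = (df - df.min()) / (df.max() - df.min())
    print(normalized)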
Final Thoughts

Data preparation can be one of the longer steps in preparing your team's data science project. However, once the data is cleaned, checked, and properly shaped, it is much easier to pull out features and create accurate insights. Preparation is half the battle: once the data is organized, it becomes several times easier to mold. Good luck with your future data science projects, and feel free to give us a ring here in Seattle if you have more questions about them.

Future Learning! And Other Data Transformations

We wanted to supply some more tools to help you learn how to transform and engineer your data. Here is a great video that covers several data transforms; this particular video relies on the R programming language.