Web scraping and utilizing various APIs are great ways to collect data from websites and applications that can later be used in data analytics. One company well known for web scraping is HiQ. HiQ crawls various public websites to collect data and provide analytics for companies on their employees. They help companies find top talent, using site data from LinkedIn and other public sources to gather the information their algorithms need.
However, they ran into legal issues when LinkedIn sent them a cease and desist and put technical measures in place to slow down HiQ's web crawlers. HiQ subsequently sued LinkedIn and won! The judge ruled that as long as the data was public, it was fair game to scrape. This was quite the win for scrapers in general. So how can your company take advantage of online public data, especially when your team might not have a programming background?
Web scraping typically requires a solid understanding of HTTP requests, faking headers, complex regex statements, HTML parsers, and database management skills.
There are programming languages that make this much easier, such as Python. This is because Python offers libraries like Scrapy and BeautifulSoup that make scraping and parsing HTML easier than old-school web scrapers. However, it still requires proper design and a decent understanding of programming and website architecture. Let's say your team does not have programming skills. That is ok! One of our team members recently gave a webinar at Loyola University to demonstrate how to scrape web pages without programming. Instead, Google Sheets offers several useful functions that can help scrape web data. If you would like to see the video of our webinar, it is below. If not, you can continue reading to learn how to use Google Sheets to scrape websites. The functions you can use for web scraping with Google Sheets are ImportFeed, ImportXML, and ImportHTML.
All of these functions will scrape websites based on the different parameters provided to the function.

Web Scraping With ImportFeed

The ImportFeed Google Sheets function is one of the easier functions to use. It only requires access to Google Sheets and a URL for an RSS feed, the kind typically associated with a blog. For instance, you could use our RSS feed "http://www.acheronanalytics.com/2/feed".

How do you use this function? An example is given below.

=ImportFeed("http://www.acheronanalytics.com/2/feed")

That is all that is needed! There are some other tips and tricks that can help clean up the data feed, as you will get more than just one column of information (one example is shown at the end of this section). For now, this is a great start at web scraping.

Do The Google Sheet Import Functions Update?

All of these import functions automatically update their data every 2 hours. A trigger function can be set to increase the cadence of updates, although this requires more programming. That is it in this case! From here, it is all about how your team uses it. Make sure you engineer a solid data scraping system.
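For example, ImportFeed also accepts optional query, headers, and num_items arguments (per Google Sheets' documentation for the function), which can trim the feed down to a single column. Using the same feed URL as above, one way to pull just the ten most recent post titles is:

=IMPORTFEED("http://www.acheronanalytics.com/2/feed", "items title", FALSE, 10)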
Web Scraping With ImportXML

The ImportXML function in Google Sheets is used to pull out specific data points using XPath queries against a page's HTML elements, ids, and classes. This requires some understanding of HTML and XML parsing, and it can be a little frustrating, so we created a step-by-step guide for web scraping HTML this way. Here are some examples from an EventBrite page.
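The original screenshots are not reproduced here, but the general pattern is =IMPORTXML(url, xpath_query). The URL and XPath below are purely illustrative placeholders (a made-up EventBrite event page and the page's top-level heading); swap in the real page and the ids or classes you actually want to target:

=IMPORTXML("https://www.eventbrite.com/e/your-event-tickets-12345", "//h1")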
The truth about using this function is that it requires a lot of time. Thus, it requires planning and designing a good Google Sheet to ensure you get the maximum benefit from using it. Otherwise, your team will end up spending its time maintaining the sheet rather than working on new things.
Web Scraping With ImportHTML
Finally, we will discuss ImportHTML. This function imports a table or list from a web page. For instance, what if you want to scrape data from a site that contains stock prices? We will use http://www.nasdaq.com/symbol/snap/real-time. There is a table on this page that has the stock prices from the past few days.

Similar to the previous functions, you need the URL. On top of the URL, you will have to specify whether you want a table or a list, and which one on the page you want to grab, counting from the top. An example would be:

=ImportHTML("http://www.nasdaq.com/symbol/snap/real-time", "table", 6)

This will scrape the stock prices from the link above. In our video above, we also show how we combined the stock data scraped here with news about the stock ticker on that day. This could be used in a much more complex manner: a team could create an algorithm that uses past stock prices, news articles, and Twitter data to choose whether to buy or sell stocks.

Do you have any good ideas of what you could do with web scraping? Do you need help with your web scraping project? Let us know!

Other great reads about data science:
What is A Decision Tree
How Algorithms Can Become Unethical and Biased
Intro To Data Analysis For Everyone Part 1
Why Invest In A Data Warehouse?
Recently, our team of data consultants had an awesome opportunity to present to a class of future data scientists at Galvanize Seattle. One student who came to hear our talk was Rebecca Njeri. Below, she shares tips on how to design a Data Science project.

To Begin, Brainstorm Data Project Ideas

To begin your data science project, you will need an idea to work on. To get started, brainstorm possible ideas that might interest you. During this process, go as wide and as crazy as you can; don't censor yourself. Once you have a few ideas, you can narrow down to the most feasible or interesting one. You could brainstorm ideas around these prompts:

Questions To Help You Think Of Your Next Data Science Projects
Write a Proposal

Write a proposal following the Cross Industry Standard Process for Data Mining (CRISP-DM), which has the following steps:

Business Understanding
What are the business needs you are trying to address? What are the objectives of the data science project? For example, if you are at a telecommunications company that needs to retain its customers, can you build a model that predicts churn? Maybe you are interested in using live data to better predict which coupons to offer which customers at the grocery store.

Data Understanding
What kind of data is available to you? Is it stored in a relational or NoSQL database? How large is your data? Can it be stored and processed on your hard drive, or will you need cloud services? Are there any confidentiality issues or NDAs involved if you are working in partnership with a company or organization? Can you find a new data set online that you could merge in to increase your insights?

Data Preparation
This stage involves doing a little exploratory data analysis and thinking about how your data will fit into the model that you have. Is the data in data types that are compatible with the model? Are there missing values or outliers? Are these naturally occurring discrepancies or errors that should be corrected before fitting the data to a model? Do you need to create dummy variables for categorical variables? Will you need all the variables in the data set, or are some dependent on each other?

Modeling
Choose a model and tune the parameters before fitting it to your training set of data. Python's scikit-learn library is a good place to get model algorithms. With larger data, consider using Spark ML.

Evaluation
Withhold a test set of data to evaluate the model's performance. Data Science Central has a great post on different metrics that can be used to measure model performance. The confusion matrix can help with considering the cost-benefit implications of the model's performance.

Deployment/Prototyping
Deployment and implementation are some of the key components of any data-driven project. You have to get past the theory and algorithms and actually integrate your data science solution into the larger environment. Flask and Bootstrap are great tools to help you deploy your data science project to the world.

Planning Your Data Science Projects
Keep a timeline with To Do, In Progress, Completed, and Parking sections. Have a self-scrum (lol) each morning to see what you accomplished the previous day and set a goal for the new day. It could also help to get a friend with whom to scrum and help you keep track of your metrics. Goals and metrics can help you hold yourself accountable and ensure that you actually follow through and get your project done.

Track Your Progress

Create a GitHub repo for your project. Your proposal can be incorporated as the readme. Commit your work at whatever frequency is comfortable for you, and keep track of how much progress you are making on your metrics. A repo will also make it easier to show your code to friends and mentors for a code review.

Knowing When to Stop Your Project

It may be good to work on your project with a minimum viable product in mind. You may not get everything on your To Do list accomplished, but having an MVP can help you know when to stop. When you have learned as much as you can from a project, even if you don't have the perfect classification algorithm, it may be more worthwhile to invest in a new project.

Some Examples Of Data Driven Projects

Below are some links to GitHub repos of data science capstones:
Mememoji
Predicting Change in Rental Price Units in NYC
Bass Generator

All the best with your new data science project! Feel free to reach out if you need someone to help you plan your new project. Want to be further inspired for your next data-driven project? Check out some of our other data science and machine learning articles. You never know what might inspire you.
Practical Data Science Tips
Creatively Classify Your Data
25 Tips To Gain New Customers
How To Grow Your Data Science Or Analytics Practice

Recently, our team of data consultants had an awesome opportunity to present to a class of future data scientists at Galvanize Seattle. It was a lot of fun and we met a lot of ex-software developers and IT specialists. One student who had come to hear our talk was named Rebecca Njeri. She did not have a background in software engineering. However, she was clearly well adapted to the new world. In fact, for one of her projects she used company data to create a recidivism prediction model for former inmates using supervised learning models.

How do Machine Learning Algorithms Learn Bias?

There are funny mishaps that result from imperfectly trained machine learning algorithms. Like my friend's iPhone classifying his dog as a cat. Or these two guys stuck on a voice-activated elevator that doesn't understand their accent. Or Amazon's Alexa trying to order hundreds of dollhouses because it confuses a news anchor's report for a request from its owner. There are also the memes on the Amazon Whole Foods purchase, which are truly in the spirit of defective algorithms. "Bezos: "Alexa, buy me something from Whole Foods." Alexa: "Buying Whole Foods." Bezos: "Wait, what?""

The Data Science Capstone

For my final capstone for the Galvanize Data Science Immersive, I spent a lot of time exploring the concept of algorithmic bias. I had partnered with an organization that helps former inmates go back to school, and consequently lowers their probability of recidivating. My task was to help them figure out the total cost of incarceration, i.e. both the explicit and implicit costs of someone being incarcerated. While researching this concept, I stumbled upon ProPublica's Machine Bias essay, which discusses how risk assessment algorithms contain racial bias.
I learnt that an algorithm that returns disproportionate false positives for African Americans is being used to sentence them to longer prison terms and deny them parole, that tax dollars are being spent on incarcerating people who would otherwise be out in society being productive members of the community, and that children whose parents shouldn't be in prison are in the foster care system. An algorithm with disparate impact is causing people to lose jobs and their social networks, and ensuring the worst cold start problem once someone has been released from prison. At the same time, people likely to commit crimes in the future are let go free because the algorithm is blind to their criminality. How do these false positives and negatives occur, and does it matter? To begin with, let us define three concepts related to the confusion matrix: precision, recall, and accuracy.

Precision
Precision is the percentage of positive predictions that are actually true positives. High precision means few false positives: most of what the model flags really belongs in the positive class. For example, if a security breach from one of your employees is pending, you'd like a precise model to predict who the culprit will be, to ensure that a) you stop the breach, and b) you have minimal interruptions to your staff while trying to find this person.

Recall
Recall, on the other hand, is the percentage of relevant elements that are actually returned; high recall means few false negatives. A medical diagnostic tool, for instance, should have high recall, because failing to catch an illness (a false negative) allows it to worsen. Similarly, if you search for Harry Potter books on Google, recall is the number of Harry Potter titles returned divided by seven; ideally we will have a recall of 1. Poor results are also a nuisance and a terrible user experience, and if a user does not see relevant results, they will likely not make any purchases, which eventually hurts the bottom line.

Accuracy
Accuracy is a measure of all the correct predictions as a percentage of the total predictions. Accuracy does poorly as a measure of model performance, especially where you have unbalanced classes. For precision, recall, accuracy, and confusion matrices to make sense in the first place, the training data should be representative of the population, so that the model learns how to classify correctly.

Confusion Matrices
Confusion matrices are the basis of cost-benefit matrices, aka the bottom line. For a business, the bottom line is easy to understand through profit and loss analysis. I suppose it's a lot more complex to determine the bottom line where discrimination against protected classes is involved. And yet, perhaps it is more urgent and necessary to do this work. There is increased scrutiny on the products we are creating, and the biases will be visible and have consequences for our companies.

Machine Learning Bias Caused By Source Data
The largest proportion of machine learning work is collecting and cleaning the data that is fed to a model. Data munging is not fun, and thinking about sampling, outliers, and the population distributions of the training set can be boring, tedious work. Indeed, machines learn bias from the oversights that occur during data munging. With 2.5 exabytes of data generated every day, there is no shortage of data on which to train our models.
There are faces of different colors, with and without glasses, wide eyes and narrow eyes, brown eyes and green eyes. There are male and female voices, and voices with different accents. Not being culturally aware of the structure of the data set can result in models that are blind or deaf to certain demographics, marginalizing part of our user groups. Like when Google mistakenly tagged black faces as an album of gorillas. Or when airbags meant to protect passengers put women at risk of death during an accident. These false positives, i.e. the conclusion that you will be safe when you are actually at risk, cost people's lives.

Human Bias

Earlier this year, one of my friends, a software engineer, asked a career adviser if it would be better to use her gender-neutral middle name on her resume and LinkedIn to make her job search easier. Her fear isn't baseless; there are insurmountable conscious and unconscious gender biases in the workplace. There was even a case where a man and woman switched emails for a short period and saw drastic differences in the way they were treated.

How to Reduce Machine Learning Bias

However, if we are to teach machines to crawl LinkedIn and resumes, we have the opportunity to scientifically remove the discrimination we humans are unable to overcome. Biased risk assessment algorithms result from models being trained on data that is historically biased. It is possible to intervene and address the historical biases contained in the data such that the model remains aware of gender, age, and race without discriminating against or penalizing any protected classes. The data that seeds a reinforcement learning model can lead to drastically excellent or terrible results. Exponential improvement or exponential depreciation could mean increasingly better-performing self-driving cars that improve with each new ride, or it could convince a man of the truth of a non-existent sex trafficking ring in D.C.

How do machines learn bias? We teach machines bias through biased training data. If you enjoyed this piece on data science and machine learning, feel free to check out some of our other work!
Why Data Science Projects Fail
When Data Science Implementation Goes Wrong
Data Science Consulting Process

Data science projects fail all the time! Why is that? Our team of data science consultants has seen many good intentions go wrong because of failure to empower data science teams, locking away access to data, focusing on the wrong problem, and many other problems that could be avoided! We have written up 32 of the reasons we have seen data science projects fail. We are sure there are more and would love to get comments on what your teams have seen! What makes a data science project team succeed?

1. The data scientists aren't given a voice
Data science and strategy can play very nicely together when allowed! Data scientists are more than just over-glorified analysts! They have access to possibly all the data a company owns. That means they know every movement the company has made, with every outcome (if the data was stored correctly). However, they are often left in the basement with the rest of the tech teams, forced to push out reports like any other report developer. There is a reason companies like Amazon and Google continue to do so well! It is because the people with the data have a voice!

2. Starting with the wrong questions
Let's face it. Most technology people often focus more on how cool a project is, not how much money it will save the company.
This can sometimes lead to the wrong business questions being answered! This will lead to a team quickly either failing or losing value inside of the company. The goal should be to do as much as possible to hit high-value business targets. That is what keeps data science projects from failing, or at least from going unnoticed.

3. Not addressing the root cause, just trying to improve the effect of a process
One of the most dubious problems, and hard to spot until it is too late, is not realizing a data science team wasn't even looking at the actual cause of the problem. When our data science team comes in, one of the things we assess is how a data science team develops their hypotheses. How far do they dig in the data? How many false hypotheses do they think of? How about other causes that could produce a similar output? An outcome can have a very deep root.

4. Weak stakeholder buy-in
Any project, whether data science, machine learning, construction, or any other department's, will fail without stakeholder buy-in! There needs to be an executive who owns the project. This gives a team acknowledgement for their hard work and it also ensures that there will be funding! Without funding, a project will come to a dead halt.

5. Lack of access to data
Slightly attached to the previous point: locking data scientists away from data or tools is just a waste of time. If a data scientist is forced to spend all day begging DBAs for access, don't expect projects to finish any time soon!

6. Using Faulty/Bad Data
Any data specialist (data engineer, analyst, scientist, architect) will tell all managers the cliche saying: garbage in, garbage out! If the data science team trains a machine learning model on bad data, then it will get bad results. There is no way around it! Even if an algorithm works with 100% accuracy, if all of the data classification is incorrect, then so are the predictions. This will lead to a failed project and executives no longer trusting the data science team.

7. Relying on Excel as the main data storage…or Access
As data science consultants, our team members have come across plenty of analytics and data science projects. Oftentimes, because of lack of support, data scientists and analysts have to construct makeshift storage centers because they are not given a sandbox or server to work on. Excel and Access both have their purposes. One of them is not managing large sets of data for analytics purposes. Don't do that to a data scientist. This will just produce poorly designed systems and high turnover!

8. Having a data scientist build their own ETLs
We have seen ETL systems built in R because, instead of getting an expert ETL developer, a company let the poor data scientists have a crack at it. Don't get us wrong, data scientists are smart people. However, you would much rather have them focus on algorithms and machine learning implementations instead of spending all day engineering their own data warehouses.

9. Lack of diverse Subject Matter Experts
Data scientists are great with data and often with a few subjects that revolve around the data they have worked with. However, data and businesses are very different. Sometimes this means a company needs to partner the data science experts with subject matter experts. Otherwise, they won't have the context to understand complex subjects like manufacturing, pharmaceuticals, and avionics.
10. Poorly assessing a team's skills and knowledge of data science tools
If a data science team doesn't have the skills to work with Hadoop, why would you set up a cluster? It is always good to be aware of a team's skill set first. Otherwise they won't be able to produce products and solutions at the highest level. Data science tools vary, so make sure you look around before you make any solid decisions.

11. Using technologies because they are cool and not useful
Just because you can use certain tools for a problem doesn't mean they are always the best option. We wouldn't recommend R for every problem. It is great for research-type problems that don't need to be implemented. If you want a project to get implemented into a larger system, then Python or even C++ might be better (depending on the system). The same thing goes for Hadoop, or MySQL, or Tableau and Power BI. They all have a place. Don't let a team do something just because they can.

12. Lacking an experienced data science leader
Data science is still a new field. That doesn't mean you don't need a leader who has some experience working on a data science team. Without one who has a basic understanding of good data science practices, a data science team could struggle to bring projects to fruition. They won't have a roadmap for success, they will have bad processes, and this will just lead to a slew of other problems.

13. Hiring a scientist with limited business understanding
Technology and business are two very different disciplines, and sometimes this leads to employees knowing one subject really well and failing to know the other at all. This is ok if only a small percentage of the data science team is made up of purely research-based employees. It is important to note that some of them should still be very knowledgeable about how to act in a business. If you want to help them get up to speed quickly, check out this list of "How To Survive Corporate Politics as a Data Scientist".

14. A boss read one of our blog posts and now thinks he can solve world hunger
Algorithms can't solve every problem, at least not easily! If this were true, a lot more problems would be solved by now. Having a boss who simply went to a data science conference and now believes he or she can push the data science team to solve every business gap is not reasonable. Limited resources, complexity of subjects, and unstable processes can quickly destroy any project.

15. The solutions are too complex
One mistake executives and data scientists make is thinking their data science models should be complex. It makes sense, right? Data science is a complex, statistics-based subject. This is not true all the time! The simpler you can build a model or integrate a machine learning solution, the easier time a data team will have maintaining the algorithm in the future.

16. Failing to document
Most technology specialists dislike documentation. It takes time, and it isn't building new solutions. However, without good documentation, they will never remember what they did a month ago, let alone a year ago. This means tracking bugs, tracking how programs work, common fixes, playbooks, the whole nine yards. Just because data science teams aren't technically software engineering teams doesn't mean they can step away from documenting how their algorithms work and how they came to their conclusions.

17. The data science team went with every new request from stakeholders (scope creep)
As with any project, data science teams are susceptible to scope creep.
Their stakeholders demand new features every week. They add new data points and dashboard modules. Suddenly, the data science project seems like it can never be finished. You have half a team focused on a project that managers can't make their minds up on. Then it will never succeed.

18. Poorly designed models that are not robust or maintainable
Even well-documented bad systems lead to quick failures. Data science projects have lots of moving pieces: data flowing through ETLs, dashboards, websites, automated reports, QA suites, and so on. Any piece of these can take a while to develop, and if developed badly, even longer to fix! Nothing is worse than spending an entire FTE on maintaining systems that should be able to run automatically. So spend enough time planning up front that you are not stuck with terrible legacy code.

19. Disagreement on enterprise strategy
When it comes down to it, data science offers a huge advantage for corporate strategy when implemented well. That also means the projects being done by the more experienced data scientists need to closely align with the directors' and executives' strategy. Strategies change, so these projects need to come out fast and be focused on maximizing the decision making of executives. If you are producing a dashboard focused on growth, but the executive team is trying to focus on rebranding, you are wasting time and money!

20. Big data silos or vendor-owned data!
You know what is terrible? When data is owned by a vendor. This makes it so hard for data science teams to actually analyze their company's data, especially if the vendor offers a bad API, none at all, or worse, charges you just to use it. To get a company's own data! Imagine a poor data science budget going to buy back the data! Similarly, if all the data is in silos, it is almost impossible for a data scientist to bring it all together. There are rarely crosswalks or data standards, so they are often stuck hopelessly staring at lots of manual work to make data relate.

21. Problem avoidance (ignoring the elephant in the room!)
We have all done it! Even data scientists! We know the company has a major problem, it's the elephant in the room, and it could be solved. However, it might be part of company culture, or a problem that no one discusses because it is like the emperor with new clothes. This is sometimes the best place for a data science team to focus.

22. The data science team hasn't built trust with stakeholders
Let's be honest. Even if a team develops a 100% accurate algorithm with accurate data, if the team has not been working to build executive trust the entire time, then the project will fail. Why? Because every actionable insight the project provides will be questioned and never implemented.

23. Failing to communicate the value of the data science project
One of the problems our data science consulting team has seen is teams failing to explain the value of a project. This requires...data! You have to use financial numbers, resources saved, competitive advantage gained, etc., to prove to the executives why the project is worth it! Data scientists, use that data to help prove your point!

24. Lack of a standardized data science process
No matter how good the data scientists are, without some form of standardization, a team will eventually fail. This may be because a team has to scale and can't, or because a team member leaves. All of this will cause a once-working machine to fail.

25. If You Failed To Plan, Plan to Fail
When it comes down to it,
there needs to be some amount of planning in your data science projects. You can't just attempt to find some data sources, make assumptions, and implement some new piece of software without first analyzing the situation! This might take a few weeks, and the executives should give you this time if they really want a sustainable piece of software.

26. The data science team competes with other departments (rather than working together)
For some reason or another, office politics exist. Data scientists can often accidentally walk over every other department because they are placed in a position to help develop strategies and dashboards for the entire company. This might take away jobs from other analysts completely. In turn, this might start fights. So make sure the data science team shares and shows how their projects are helping rather than hurting!

27. Allowing company bias to form conclusions before the data scientists start
Data bias does exist! As a data scientist you can sometimes make algorithms and data say whatever you want them to. However, that doesn't make it true. Make sure you don't go into the project with a biased hypothesis that will push you towards early conclusions that might be incorrect.

28. Trying to take on too large of a first project
Reading the news about what Google and Facebook are doing with their algorithms may tempt the data science team to take on too large a project for their first one. This will not lead to success. You might be lucky and succeed. However, you are taking a huge risk!

29. Manually classifying data
One part of data science that not everyone talks about is data classification. Not just using SVM and KNN algorithms. Nope, we mean actually labeling what the data represents. A human has to do that first. Otherwise, the computer will never know how to. If you don't have a plan on how to classify the data before it gets to the data science team, then someone will have to do it manually. That is one quick way to lose data scientists and have projects fail.

30. Failing to understand what went wrong
Data science projects don't always succeed. The data science team needs to be able to explain why. As long as it wasn't a huge drop in the capital budget, executives should understand. After all, projects do fail; it is natural. That doesn't give you an excuse to not know why.

31. Waiting to seek outside help until it is too late
Sometimes the data science team is short on staff; other times you just need new insight. Whatever it might be, the data science team needs to make sure it seeks outside help sooner rather than later. Putting off asking for help when you know you need it will just lead to awkward conversations with management. They might not want to spend the money, but they also want the project to succeed.

32. Failing to provide actionable insights and opinions
Finally, the data science team's project needs to provide actual insight, something actionable. Simply providing a correlation doesn't do any good. Executives need decisions, or data to make decisions. If you don't give them that, you might as well not have a data science team.

If you have any questions, please feel free to comment below! Let us know how we can help!

As more companies turn towards data science and data consulting to help gain competitive advantage, we are getting more people asking us: why should we start a data science team? We wanted to take a moment and write down some reasons why your company needs to start a data science team.
It is not just a fad anymore; it is becoming a need!

Maintain Competitive Advantage
As more and more companies start integrating data science and better analytics into their day-to-day strategy, it will no longer be a competitive advantage to make data-driven decisions. It will be a necessity. Executives will have to rely on accurate data and sustainable metrics to consistently make correct decisions. If your company moves based on the actual why, versus speculation and surface-level problems, then it can make a greater and more effective impact internally and externally.

Better Understanding of Current and Possible New Customers
Data science gives executives and managers a deeper understanding of why customers make the choices they do. Google has had the greatest opportunity here, as it has almost become a third lobe of some people's brains. We tell it when we are happy, sad, looking for love, studying, etc. It knows what our questions are. However, Google is not the only one; other companies are beginning to realize that their customers have been telling them their opinions on multiple social platforms and blogs. They just need to go look and see how to better hear from their customers' data. This is a great opportunity for corporations to find out what their customers are feeling about their products, about their company's image, and what other things they are purchasing. Sometimes this may require purchasing third-party data. Even then, it might be worth it depending on the projects being done.

Better Understanding of Internal Operations
Whether it is HR, finance, accounting, or operations, data science is helping tie all these fields together and paint a better picture of what is happening in a company. Why are people leaving, why was there so much overtime, how can a company be better at utilizing resources, and so on. These questions can be answered by taking high-quality data and blending it to find out the whys. This in turn will provide a better workplace environment and increase resource efficiency.

Increases Performance Through Data Driven Decisions
Data science is still a relatively new field. Many companies are still figuring out how to properly implement data science solutions. However, those that do are seeing amazing results. Look no further than Amazon or Uber. These companies are changing, and have changed, the way we view certain industries. Why? Because they know what their customers want, they know what the industry is doing wrong, and they know how to charge the customer less but give them more. Overall, increasing data science and analytics proficiency allows your executives to trust their data more and make clearer decisions. Consider looking into a team today!
Unstructured Data, and How to Analyze it!
Content creation and promotion can play a huge role in a company's success at getting its product out there. Think about Star Wars and Marvel. Both of these franchises are just as much commercials for their merchandise as they are plain high-quality content. Companies post blogs, make movies, even run Pinterest accounts. All of this produces customer responses and network reactions that can be analyzed, melded with current data sets, and run through various predictive models to help a company better target users, produce promotional content, and alter products and services to be more in tune with the customer. Developing a machine learning model comes down to finding value and relationships in all the different forms of data your content produces, segmenting your users and responders, and melding all your data together. In turn, your company can gain a lot more information beyond the standard balance sheet data.

Change Words to Numbers

Machine learning has created a host of libraries that can simplify the way your team performs data analysis. In fact, Python has several libraries that give programmers with a high-level knowledge of data science and machine learning application design the ability to produce fast and meaningful analysis. One great Python library that can take content data like blog posts, news articles, and social media posts is TextBlob. TextBlob has some great functions, like part-of-speech tagging, noun phrase extraction, and sentiment analysis.
Take the sentence "Scary Monsters love to eat tasty, sweet apples". You can use the lines below to pull out the nouns and the words used to describe those nouns.

How to use TextBlob to Analyze Text Data
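The original code screenshot is not reproduced here, but a minimal sketch along those lines, assuming the TextBlob library is installed (pip install textblob, plus python -m textblob.download_corpora for the corpora it needs), might look like this:

from textblob import TextBlob

blob = TextBlob("Scary Monsters love to eat tasty, sweet apples")

# Part-of-speech tags: a list of (word, tag) pairs; nouns show up as NN/NNS, adjectives as JJ
print(blob.tags)

# Noun phrases pulled out of the sentence
print(blob.noun_phrases)

# Sentiment: polarity (-1 to 1) and subjectivity (0 to 1)
print(blob.sentiment)

From the tags you can keep the nouns and the adjectives that describe them.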
This takes data that is very unstructured and hard to analyze and begins to create a more analysis-friendly data set. Other great uses of this library include projects such as chatbots.
From here, you can combine polarity, positivity, shares, and topic focus to see what types of social media posts, blog posts, etc., become the most viral. Another library worth checking out is word2vec, which has implementations in Python, R, Java, and more. For instance, check out deeplearning4j.
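As a quick illustration of the word2vec idea in Python, here is a minimal sketch using the gensim library (our choice of library here is an assumption; parameter names follow recent gensim releases) on a made-up toy corpus:

from gensim.models import Word2Vec

# Toy corpus: in practice this would be tokenized blog posts, tweets, reviews, etc.
sentences = [
    ["customers", "love", "fast", "shipping"],
    ["customers", "dislike", "slow", "support"],
    ["fans", "love", "new", "content"],
]

# Train a small word2vec model: each word gets a dense numeric vector
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, workers=1)

# Similar words end up with similar vectors
print(model.wv["customers"])
print(model.wv.most_similar("love"))

On a corpus this small the similarities are meaningless, but the same pattern scales up to your real content data.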
Marketing Segmentation with Data Science
Social media makes once hard-to-get data, such as people's opinions on products, their likes, dislikes, gender, location, and job, much more accessible. Sometimes you may have to purchase it; other times, some sites are kind enough to let you take it freely. In either case, this gives companies an open door to segmenting markets in much finer detail. This isn't based on small surveys of only 1,000 people; we are talking about millions and billions of people. Yes, there is a lot more data scrubbing required, but there is an opportunity to segment individuals and use their networks to support your company's products.

One example is a tweet we once passed off to SQL Server. They quickly responded. Now, based on the fact that we interacted with SQL Server and talk so much about data science and data, you can probably assume we are into technology, databases, etc. This is basically what Twitter, Facebook, Google, etc. do to place the right ads in front of you. They also combine cookies and other data sources like geolocation. If you worked for Oracle, perhaps you would want us to see some posts about the benefits of switching to Oracle, or ask for our opinion on why someone prefers SQL Server over Oracle (we personally have very little preference, as we have used both and find both useful). Whatever it may be, there are opportunities to swing customers.

Now what if your content was already placed in front of the right people? Maybe you tag a user, or ask them to help you out or join your campaign! Involve them, see how you can help them. For instance, bloggers are always looking for ways to get their content out there. If your company involves them, or partners with them in a transparent way, your product now has access to a specific network. Again, this is another great place where data science and basic statistics come into play. If you haven't tried tools like NodeXL, it is a great example of a model built to find strong influencers in specific networks. This tool is pretty nifty. However, it is limited. So you might want to make some of your own.
Utilizing the data gathered from various sites, and algorithms like k-nearest neighbors, PCA, and clustering, you can find the words used in profiles, posts, and shares, and the companies your customers interact with, and then group those users into segments, score influence, and more. A rough sketch of one way to start is shown below.
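This sketch in Python uses scikit-learn; the profile text and the number of segments are made-up placeholders, and k-means clustering stands in for whichever segmentation algorithm your team prefers:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Made-up profile/post text for a handful of users
profiles = [
    "data science sql server analytics dashboards",
    "databases sql server oracle data warehousing",
    "marketing content social media campaigns",
    "social media influencer marketing branding",
]

# Turn the words each user posts into numeric features
vectors = TfidfVectorizer().fit_transform(profiles)

# Group users into segments (2 clusters is an arbitrary choice for this example)
segments = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(segments)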
The list of possibilities goes on. It may be better to start with NodeXL, just to see what you are looking for. Now, what is the value of doing all this analysis, data melding, and analytics?

ROI Of Content

At the end of the day, you have plenty of questions to answer.
These aren't the easiest questions to answer. However, here is where you can help turn the data from your social presence into value for your company. Typical predictive analytics uses standard business data (balance sheet, payroll, CRM, and operational data). This limits companies to the "what" happened, and not the why. Managers will ask: why did the company see that spike in Q2? Or that dip in Q3? It is difficult to paint a picture when you are only looking at data that has very little insight into the why. Simply doing a running average isn't always great, and putting in seasonal factors is limited by domain knowledge. However, data has grown, and now having access to the "why" is much more plausible. Everything from social media to CRMs to online news provides much better insight into why your customers are coming or going!

Automation

This data has a lot of noise, and it wouldn't really be worth it for humans to go through it all. This is where having an automated exploratory system will help out a lot. Finding correlations between content, historical news, and internal company data would take analysts years. By the time they found any value, the moment would have passed. Instead, an automated correlation discovery system will save your company time and be much better at finding value. You can use this system to find those small correlating factors that have a big effect. Maybe your customers are telling you what is wrong with your product, and you just aren't listening. Maybe you find a new product idea. In the Acheron Analytics process, this is part of our second and third phases. We always look for as many possible correlations as we can, and then develop hypotheses and prototypes that lead to company value. This process lets data help define a company's next steps and provides managers with data-defended plans, ones they can take to their own managers with confidence. When it comes to analyzing your company's content and marketing investments, techniques like machine learning, sentiment analysis, and segmentation can help develop data-driven marketing strategies. We hope this inspired some ideas on how to meld your company's data! Let us know if you have any questions.
In the era of data science and AI, it is easy to skip over crucial steps such as data cleansing. However, this can cause major problems in your applications further down the data pipeline. The promise of magic-like data science solutions can overshadow the necessary steps required to get to the best final product. One such step is cleaning and engineering your data before it even gets placed into your system. Truthfully, this is not limited to data science. Whether you are doing data analytics, data science, machine learning, or just old-fashioned statistics, data is never whole and pure before refining. Just like putting bad, unprocessed petroleum into your car, putting unprocessed data into your company's systems will either immediately or eventually wreak havoc (here are some examples). Whether that means actually causing software to fail or giving executives bad information, both are unacceptable.
We at Acheron Analytics wanted to share a few tips to ensure that whatever data science or analytics projects you are taking on, you and your team are successful. This post will go over some brief examples in R, Python, and SQL; feel free to reach out with any questions.

Duplicate Data

Duplicate data is the scourge of any analyst, whether you are using Excel, MySQL, or Hadoop. Making sure your systems don't produce duplicate data is key. There are several sources of duplicate data. The first is when the data is input into your company's data storage system. There is a chance that the same data may try to sneak its way in. This could be due to end-user error, a glitch in the system, a bad ETL, etc. All of this should be managed by your data system. Most people still use an RDBMS, and thus using a unique key will stop duplicates from being inserted. Sometimes this may require a combination of fields to check whether the data being input is a duplicate. For instance, if you are looking at a vendor invoice line item, you probably shouldn't have the same line item number and header id twice. This can become more complicated when line items change (but even that can be accounted for). If you are analyzing social media post data, each snapshot you take may have the same post id but altered social interaction data (likes, retweets, shares, etc.). This touches on slowly changing dimensions, which is another great topic for another time. Feel free to read up more on the topic here. In both cases, your systems should be calibrated to safely throw out the duplicate data and store the errors in an error table. All of this will save your team time and confusion later.

Besides the source data itself having duplicates, the other common duplication occurs because of an analyst's query. If, by chance, they accidentally don't have a 1:1 or 1:many relationship on the key they are joining on, they may find themselves with several times the amount of data they started with. The fix could be as simple as restructuring your team's query to make sure it properly creates 1:1 relationships, or...you may have to completely restructure your database. It is more likely the former.

How to Get Rid of Duplicate Data in SQL
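The SQL snippet this heading refers to was originally a screenshot and isn't reproduced here. As an illustration of the same idea sketched in Python with pandas instead (the column names are hypothetical), dropping rows that repeat a line item number within the same header id looks like this:

import pandas as pd

invoices = pd.DataFrame({
    "header_id": [1, 1, 1, 2],
    "line_item": [10, 10, 20, 10],   # the first two rows are duplicates
    "amount":    [100.0, 100.0, 50.0, 75.0],
})

# Keep only the first occurrence of each (header_id, line_item) pair
deduped = invoices.drop_duplicates(subset=["header_id", "line_item"], keep="first")
print(deduped)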
Missing Data

Has your company ever purchased data from a data aggregator and found it filled with holes? Missing data is common across every industry. Sometimes it is due to system upgrades and new features being added in, sometimes just bad data gathering. Whatever it might be, it can really skew a data science project's results. What are your options, then? You could ignore rows with missing data, but that might cost your company valuable insight, while including the gaps will produce incorrect conclusions. So how do you win?

There are a few different schools of thought on this. One is to simply put a random but reasonable number in place of nothing. This doesn't really make sense, as it becomes difficult to tell what is being driven by which feature. A more common and reasonable practice is using the data set average. However, even this can be misleading. For instance, on one project we were analyzing a large population of users and their sociometric data (income, neighborhood trends, shopping habits). About 15% of the data purchased from a credit card carrier was missing, so throwing it away was not in our best interest. Instead, because we had each person's zip code, we were able to aggregate at a local level. This was a judgement call, and a good one in this case. We compared this to averaging the entire data set, and we got a much clearer picture of our population's features. The problem with a general average over several hundred thousand people is that you will eventually get some odd sways. Take income: if your data set is a good distribution, you will end up with your average income being, well, average. Then, suddenly, people who may have lived in richer neighborhoods may create their own classification. The difference between 400k and 50k (even when normalized) can drastically alter the rest of the features. Does it really make sense for someone making 50k a year to be purchasing over 100k of products a year? In the end, we would get a strange cluster of large spenders who made average income. When your focus is socio-economic factors, this can cause some major discrepancies.

How to Handle Missing Data with SQL
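Again, the SQL example referenced by the heading above was originally a screenshot. As a sketch of the same group-level imputation idea in Python with pandas (column names are hypothetical), filling missing income with the average income of each person's zip code rather than the overall average looks like this:

import pandas as pd

people = pd.DataFrame({
    "zipcode": ["98101", "98101", "98052", "98052"],
    "income":  [52000.0, None, 410000.0, None],
})

# Fill each missing income with the mean income of that person's zip code
zip_means = people.groupby("zipcode")["income"].transform("mean")
people["income"] = people["income"].fillna(zip_means)
print(people)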
Data Normalization
Data normalization is one of the first critical steps to making sure your data is sensible to run through most algorithms. Simply trying to feed in variables that could be anything from age to income to computer usage time creates the hassle of comparing apples to oranges. Trying to weigh 400k in income against 40 years of age will create bad outputs; the numbers just don't scale. Instead, normalization makes your data more comparable. It takes the max and min of a data set and maps them to 1 and 0 of a scale, so the rest of the numbers can be scaled between them: (x - min) / (max - min). Working on a 0-1 range lets your data science teams meld the data more smoothly. They are no longer trying to compare scales that don't match. This is a necessary step in most cases to ensure success.

R Programming Normalization
Python Normalization (this can also depend on whether you are using NumPy, Pandas, etc.)
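The original snippet isn't reproduced here, but a minimal min-max normalization sketch in plain pandas (the column names are made up; scikit-learn's MinMaxScaler accomplishes the same thing) would be:

import pandas as pd

df = pd.DataFrame({"income": [40000, 85000, 400000], "age": [40, 29, 61]})

# Scale every column to the 0-1 range: (x - min) / (max - min)
normalized = (df - df.min()) / (df.max() - df.min())
print(normalized)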
Final Thoughts

Data preparation can be one of the longer steps in a data science project. However, once the data is cleaned, checked, and properly shaped, it is much easier to pull out features and create accurate insights. Preparation is half the battle. Once the data is organized, it becomes several times easier to mold. Good luck with your future data science projects, and feel free to give us a ring here in Seattle if you have more questions.

Future Learning And Other Data Transformations

We wanted to supply some more tools to help you learn how to transform and engineer your data. Here is a great video that covers several data transforms. This particular video relies on the R programming language.