Acheron Analytics

Web Scraping With Google Sheets

10/21/2017

Web scraping and utilizing various APIs are great ways to collect data from websites and applications that can later be used in data analytics. One company well known for web scraping is HiQ. HiQ crawls various "public" websites to collect data and provide analytics for companies on their employees. It helps companies find top talent by mining data from LinkedIn and other public sources to feed its algorithms.

However, HiQ ran into legal trouble when LinkedIn sent a cease-and-desist letter and put technical measures in place to slow down HiQ's web crawlers. HiQ subsequently sued LinkedIn and won! The judge ruled that as long as the data was public, it could be scraped.

This was quite the win for web scrapers in general.

So how can your company take advantage of public online data, especially when your team might not have a programming background?
Web scraping typically requires an understanding of HTTP requests, faked headers, complex regex statements, HTML parsers, and database management.

Some programming languages make this much easier. Python, for instance, offers libraries like Scrapy and BeautifulSoup that make scraping and parsing HTML far easier than old-school web scrapers.
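To give a sense of what that looks like in code, here is a minimal sketch using only Python's standard-library html.parser (the HTML snippet is an illustrative stand-in for a fetched page; libraries like BeautifulSoup wrap this same machinery in a friendlier API):

```python
from html.parser import HTMLParser

# Collect the text of every <h2> element from an HTML document.
class HeadingParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2:
            self.headings.append(data.strip())

# Illustrative page content standing in for a real HTTP response.
html = "<html><body><h2>First Post</h2><p>text</p><h2>Second Post</h2></body></html>"
parser = HeadingParser()
parser.feed(html)
print(parser.headings)  # ['First Post', 'Second Post']
```

Even this tiny example shows why a non-programming option is attractive: every tag you want requires hand-written state tracking.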

However, it still requires proper design and a decent understanding of programming and website architecture. 

Let's say your team does not have programming skills. That is OK! One of our team members recently gave a webinar at Loyola University demonstrating how to scrape web pages without programming. Instead of code, Google Sheets offers several useful functions that can help scrape web data. If you would like to see the video of our webinar, it is below. If not, read on to learn how to use Google Sheets to scrape websites.

The functions you can use for web scraping with Google Sheets are:
  • ImportXML
  • ImportHTML
  • ImportFeed
  • ImportData

All of these functions will scrape websites based on the parameters provided to them.


Web Scraping With ImportFeed

The ImportFeed Google Sheets function is one of the easier ones to use. It only requires access to Google Sheets and the URL of an RSS feed, the kind of feed typically associated with a blog.

For instance, you could use our RSS feed, http://www.acheronanalytics.com/2/feed.

How do you use this function? An example is given below.

=ImportFeed("http://www.acheronanalytics.com/2/feed")

That is all that is needed! There are some other tips and tricks that can help clean up the data feed as you will get more than just one column of information. For now, this is a great start at web scraping.
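Under the hood, ImportFeed is just parsing RSS XML into rows. For teams that later outgrow Sheets, the same extraction in Python's standard library looks roughly like this (the feed XML here is an inline sample standing in for a live fetch of a feed like ours):

```python
import xml.etree.ElementTree as ET

# A small inline RSS sample standing in for a live feed such as
# http://www.acheronanalytics.com/2/feed
rss = """<rss version="2.0"><channel>
  <title>Acheron Blog</title>
  <item><title>Web Scraping With Google Sheets</title>
        <link>http://example.com/post-1</link></item>
  <item><title>What Is A Decision Tree</title>
        <link>http://example.com/post-2</link></item>
</channel></rss>"""

root = ET.fromstring(rss)
# Each <item> corresponds to one row that ImportFeed would return.
posts = [(item.findtext("title"), item.findtext("link"))
         for item in root.iter("item")]
print(posts)
```

Each tuple maps onto one spreadsheet row, which is why the Sheets function returns several columns of feed metadata rather than a single value.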


Do The Google Sheet Import Functions Update?

All of these import functions automatically update data every 2 hours. A trigger function can be set to increase the update cadence, but that requires more programming.

That is all there is to it! From here, it is all about how your team uses it. Make sure you engineer a solid data scraping system.

(Picture: an example of using the ImportFeed function.)

Web Scraping With ImportXML

The ImportXML function in Google Sheets pulls out specific data points using XPath queries against HTML tags, IDs, and classes. This requires some understanding of HTML and XPath, which can be a little frustrating, so we created a step-by-step guide to web scraping HTML.

Here are some examples from an EventBrite page.
  1. Go to https://www.eventbrite.com/d/wa--everett/events/
  2. Right Click Inspect Element
  3. Find the HTML tag you are interested in
  4. We are looking for <div class="list-card__body">Some Text Here</div>

    So this is the tricky part. The first thing you need to pull out of this HTML tag is the type, e.g. <div>, <a>, <img> or <span>. The type is called out using "//" followed by the tag name, such as "//div", "//a" or "//span".

    Now, if you actually want to get the "Some Text Here" you will need to call out the class.

    That is done in the method shown in step 5. You will notice it combines "//div" with "[@class='class name here']".

  5. The XPath string is "//div[@class='list-card__body']"

    There is another data value you might want to get. 

  6. We want to get all the URLs

    This case involves pulling a value out of the HTML tag itself. For instance, in <a href="https://www.google.com">Click here</a>, the URL lives in the href attribute.

    Then it would be like step 7.

  7. The XPath string is "//a/@href"
  8. The general form is ImportXML(URL, XPath string)
  9. =ImportXML("https://www.eventbrite.com/d/wa--everett/events/", "//div[@class='list-card__body']")
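The two XPath patterns from the steps above can also be tried outside of Sheets. Python's standard xml.etree module supports a useful subset of XPath; the snippet below is illustrative markup shaped like the EventBrite example, not the live page:

```python
import xml.etree.ElementTree as ET

# Illustrative snippet mirroring the structure described in the steps above.
html = """<div>
  <div class="list-card__body">Event One</div>
  <div class="list-card__body">Event Two</div>
  <a href="https://www.google.com">Click here</a>
</div>"""

root = ET.fromstring(html)

# Equivalent of ImportXML's "//div[@class='list-card__body']"
bodies = [el.text for el in root.findall(".//div[@class='list-card__body']")]

# Equivalent of "//a/@href": ElementTree returns elements, so read the attribute.
hrefs = [a.get("href") for a in root.findall(".//a")]

print(bodies)  # ['Event One', 'Event Two']
print(hrefs)   # ['https://www.google.com']
```

Testing your XPath strings on a small sample like this before pasting them into a sheet can save a lot of trial and error.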

The truth about using this function is that it requires a lot of time. It takes planning and a well-designed Google Sheet to get the maximum benefit from it. Otherwise, your team will end up spending time maintaining the sheet rather than working on new things.


Web Scraping With ImportHTML

Finally, we will discuss ImportHTML. This function imports a table or list from a web page. For instance, what if you want to scrape data from a site that contains stock prices?

We will use http://www.nasdaq.com/symbol/snap/real-time. There is a table on this page with the stock prices from the past few days.


Similar to the previous functions, you need to supply the URL. In addition, you tell it what kind of element to import ("table" or "list") and which one on the page you want, by its index: the first table is 1, the second is 2, and so on.

An example would be =ImportHTML("http://www.nasdaq.com/symbol/snap/real-time", "table", 6). This will scrape the stock prices from the link above.
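For comparison, pulling a table into rows is also straightforward in Python's standard library. This sketch parses an illustrative stock table (an inline stand-in for the NASDAQ page) the way ImportHTML's "table" mode does:

```python
from html.parser import HTMLParser

# Parse an HTML <table> into a list of rows, roughly what ImportHTML returns.
class TableParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows = []
        self.row = None
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th"):
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self.row is not None:
            self.rows.append(self.row)
            self.row = None
        elif tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.row.append(data.strip())

# Illustrative table standing in for the live stock-price page.
html = """<table>
  <tr><th>Date</th><th>Close</th></tr>
  <tr><td>10/20/2017</td><td>15.12</td></tr>
  <tr><td>10/19/2017</td><td>15.45</td></tr>
</table>"""

parser = TableParser()
parser.feed(html)
print(parser.rows)
```

Each inner list corresponds to one spreadsheet row, with the header row first, just as the Sheets function lays it out.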

In our video above, we also show how we combined the scraped stock data with news about the stock ticker from that day. This could be taken much further: a team could create an algorithm that uses past stock prices, news articles, and Twitter data to decide whether to buy or sell stocks.

Do you have any good ideas of what you could do with web scraping? Do you need help with your web scraping project? Let us know!


Other great reads about data science:

What is A Decision Tree

How Algorithms Can Become Unethical and Biased

Intro To Data Analysis For Everyone Part 1

Why Invest In A Data Warehouse?

