Web scraping (or data scraping) is the process of collecting, transforming and storing content from the internet for later analysis. Think of it as copying and pasting a few paragraphs from a website into Word, but on a bigger scale.

As the volume of data stored on the internet expands, so do the potential uses of web-scraped data. From running sentiment analysis on comments sections to scraping your competitors’ pricing information, web scraping is an essential tool in any good analyst’s arsenal.

I recently used the web scraping tools native to Alteryx to download and transform 11 years’ worth of global university rankings from the Center for World University Rankings into the running bar chart below.

Historically, that would likely have involved hours spent copying, pasting and reformatting information into various Excel spreadsheets, but with Alteryx it only took a few clicks and I was away.

Learning to web scrape well takes time, but thankfully my colleagues have put together a series of great tutorials on the Data School blog that demonstrate all the key steps and components you’ll need to know to start parsing like a pro.

I’ve divided these guides into three broad sections:

  • Beginner – The basics of web scraping using Alteryx, what the tools are and how to use them.
  • Intermediate – How to code using RegEx, the language you’ll use to transform HTML into actionable data.
  • Advanced – Using advanced macros in Alteryx, and other tools, to supercharge your web scraping.

Web Scraping: Beginner

Web Scraping HTML Tables, an Alteryx workflow and R script example
A step-by-step walkthrough on web scraping HTML tables using Alteryx and RStudio.

From HTML to Alteryx
A comprehensive step-by-step guide from Jasna Dishlieska-Mitova on using Alteryx to web scrape online datasets.

Slide into my DLs (downloads) – A blog on the Alteryx Download tool
A great guide from Andrew Banh on the many features of Alteryx’s web download tool – the cornerstone of web scraping.

Web Scraping The Data School Blogs in Alteryx
A cool demonstration from Sebastian van Gerwen on using Alteryx web scraping to analyse The Data School’s blog.
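For readers who want to see the core idea behind these beginner guides outside Alteryx, here is a minimal Python sketch of turning an HTML table into rows of text. It is not taken from any of the linked tutorials: the TableParser class, the parse_table helper and the sample table are all hypothetical, invented for illustration, and it only uses Python's standard library.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being built, or None
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_table(html):
    """Return the table in `html` as a list of rows (lists of strings)."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample data, standing in for a downloaded page.
sample = """
<table>
  <tr><th>Rank</th><th>Institution</th></tr>
  <tr><td>1</td><td>Example University A</td></tr>
  <tr><td>2</td><td>Example University B</td></tr>
</table>
"""
table = parse_table(sample)
```

Here table holds one list per row, with the header row first, which is roughly the shape you would then clean up and load into an analysis tool.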

 

Regex: Intermediate

Alteryx Regex – From ‘zero’ to ‘hero’
A great introductory guide from Thang Nguyen on Regex, the language of web scraping. A must-read if you’re starting your web scraping journey.

How to learn RegEx (the painless way)
Grace Murphy’s six steps to learning Regex.

Regex for Data Parsing
Another Regex guide from Ivy Yin.

Parsing data with RegEx
Another great Regex guide from Anders Wold.

Splitting to columns with Regex_replace in Alteryx
If you already understand Regex, this advanced guide from Grace Murphy shows how to use it to perform more complex data manipulations.
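To give a flavour of what splitting a field into columns with a regex replace looks like, here is a rough Python analogue using re.sub. This is a sketch only, not the method from Grace Murphy's guide and not Alteryx syntax: the pattern, the split_columns helper, the column names and the sample value are all assumptions made up for this example.

```python
import re

# Hypothetical field layout: REGION-YEAR-PRODUCT, e.g. "EMEA-2021-Widgets".
PATTERN = re.compile(r"^(\w+)-(\d{4})-(\w+)$")


def split_columns(value):
    """Split one combined field into three named columns.

    Each re.sub call replaces the whole match with a single capture
    group, so each call produces the value for one output column.
    """
    return {
        "region": PATTERN.sub(r"\1", value),
        "year": PATTERN.sub(r"\2", value),
        "product": PATTERN.sub(r"\3", value),
    }


columns = split_columns("EMEA-2021-Widgets")
```

The same idea carries over to any tool with a regex replace function: write one pattern with a capture group per column, then keep a different group for each new column.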

 

Advanced: Custom Alteryx Macros and Advanced Web Scraping Tools

I Created an HTML/XML Parser Tool in Alteryx Using Python SDK
Laszlo Dobiasz’s amazing custom macro for HTML/XML web scraping. A must-add to your Alteryx tool palette.

Web scraping made easy: import HTML tables or lists using Google Sheets and Excel
An interesting tutorial on web scraping tables with minimal code or programming involved.

Web-Scraping using Batch and Iterative Macros
Jonathan Waerner shows how combining web scraping and batch macros can produce powerful results – downloading data from multiple websites with the click of a button.
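The batch idea, running the same download step once per website, can be sketched in a few lines of Python. This is only an analogy for the concept, not how Alteryx macros work internally; the fetch_page and scrape_many names are hypothetical, and the fetch parameter exists purely so the loop can be tried without touching the network.

```python
from urllib.request import urlopen


def fetch_page(url, timeout=10):
    """Download one page and return its body as text."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def scrape_many(urls, fetch=fetch_page):
    """Apply the same fetch step to every URL, one at a time.

    Returns a dict mapping each URL to its page text, or to None when
    the download failed, so one bad site does not stop the whole batch.
    """
    results = {}
    for url in urls:
        try:
            results[url] = fetch(url)
        except OSError:  # urllib's URLError and HTTPError are OSError subclasses
            results[url] = None
    return results
```

Calling scrape_many with a list of page addresses returns one entry per address; passing your own function as fetch lets you swap in a cached or fake downloader while testing.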

Author: Kieran Adair