A couple of weeks ago we connected to our blog website’s Google Analytics data and had a great time comparing who the top bloggers are, both between and within Cohorts. This week, I had a go at web-scraping our blog pages to get a URL for each blog, along with its tags, text and publish date. I found that parsing the useful stuff out of an HTML download has to be done one page at a time, so I needed macros in Alteryx to repeat the process.

I broke the scraping process into two steps: the first scrapes all the author names from the website, and the second scrapes every blog URL for each author URL. Here’s a screenshot of my workflow.

[Screenshot: the two-step scraping workflow]

The first macro I made is a batch macro: it takes the URL ‘https://www.thedataschool.com.au/blog/page/1/’ and replaces the 1 with 2, then 3, and so on all the way up to 25 (fed in from a Generate Rows tool).
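Outside of Alteryx, the logic of that batch macro looks roughly like the Python sketch below. The hard-coded page count of 25 mirrors the Generate Rows setup; the `requests` call and the function name are mine, for illustration only.

```python
import requests

BASE_URL = "https://www.thedataschool.com.au/blog/page/{}/"

def download_blog_pages(last_page=25):
    """Rough equivalent of the first batch macro: swap the page number
    into the URL and download each blog listing page in turn."""
    pages = []
    for page_num in range(1, last_page + 1):
        resp = requests.get(BASE_URL.format(page_num), timeout=30)
        pages.append((page_num, resp.text))  # raw HTML, parsed downstream
    return pages
```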

The second macro takes each Author URL from the first macro’s output and finds all blog URLs and publish dates. This was another batch macro, and it went up to 4 pages for each author; I went with 4 pages because that’s how many pages the author with the most blogs has.
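Here’s a similar sketch of the second macro’s logic. The author-page URL pattern and the HTML tags I select (`article`, `a`, `time`) are assumptions made for illustration; the real workflow just downloads each page and parses out whatever structure the site actually uses.

```python
import requests
from bs4 import BeautifulSoup

def scrape_author_blogs(author_url, max_pages=4):
    """Rough equivalent of the second batch macro: walk up to four listing
    pages for one author and pull out blog URLs and publish dates."""
    blogs = []
    for page_num in range(1, max_pages + 1):
        resp = requests.get(f"{author_url}page/{page_num}/", timeout=30)
        if resp.status_code != 200:
            break  # this author has fewer than max_pages pages of blogs
        soup = BeautifulSoup(resp.text, "html.parser")
        for post in soup.select("article"):  # assumed tag for each blog card
            link = post.find("a")
            date = post.find("time")
            if link and link.get("href"):
                blogs.append({
                    "blog_url": link["href"],
                    "published": date.get_text(strip=True) if date else None,
                })
    return blogs
```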

For the first macro, I had to manually check how many pages of blogs there are to determine which pages to download. I could have put in a sufficiently large number, but that would download pages with no information and waste time. Hard-coding the number-of-pages limit was fine for initially building the workflow, but it probably isn’t great given that we have blogs going up constantly, so it may eventually stop capturing everything we’ve got up there.

To overcome this problem, I created another macro at the beginning of my workflow. This macro iteratively downloads the blog pages until the URL download results in a “Page Not Found” error.
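A rough Python version of that iterative macro is below. Checking the response text for “Page Not Found” is an assumption about what the error page contains; checking for an HTTP 404 status amounts to the same thing.

```python
import requests

BASE_URL = "https://www.thedataschool.com.au/blog/page/{}/"

def find_last_page(start_page=1):
    """Rough equivalent of the iterative macro: keep requesting the next
    listing page until the site returns its "Page Not Found" error, then
    report the last page number that actually existed."""
    page_num = start_page
    while True:
        resp = requests.get(BASE_URL.format(page_num), timeout=30)
        if resp.status_code == 404 or "Page Not Found" in resp.text:
            return page_num - 1  # the previous page was the last real one
        page_num += 1
```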


When the macro stops iterating, it outputs the final page number, which I use as part of the condition expression in the Generate Rows tool.


To avoid downloading pages multiple times and to reduce the number of iterations, I started the macro at page 26 and counted up. (I know there are 26 pages at this point in time, so there’s no need to check pages 1-25.)
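In terms of the Python sketches above, that just means starting the page finder higher up and feeding its result into the batch download instead of the hard-coded 25:

```python
# Hypothetical tie-together of the earlier sketches: skip the pages we
# already know exist, then only download up to the real last page.
last_page = find_last_page(start_page=26)
pages = download_blog_pages(last_page=last_page)
```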

We used this workflow to see a timeline of blog posts, and to scrape further information from each blog (blog tags and content).

Click here to download a CSV of all our blog posts!

Any questions or feedback welcome!