In today’s blog, I want to briefly summarise  a general scheme for parsing data on websites. To do this, I use 2 different templates depending on the website content. Information on websites are generally displayed using HTML or JSON (And sometimes both!).

 

Parsing HTML

In some websites, documents designed for viewing on web browsers use HTML as the standard markup language. This usually looks like this:

HTML code on website

 

In Alteryx, this will appear as a very long string upon initial download. I find it useful to use a Browse tool to visualise the truncated string and determine an optimal strategy for parsing what I need from the HTML code. A common approach is to split the Downloaded Data into rows using the ” \n ” delimiter via the Text to Columns tool (i.e. so that it appears as in the above picture). Next, a Filter tool can be used to only show rows which contain the required data. As an example, a formula to only parse for links to csv files could be ” Contains([DownloadData],”csv”) ”. I have include a sample workflow below with annotations:

standard workflow for HTML parsing

 

Parsing JSON

I first encountered JSON data when downloading data through Zomato’s API (link here). JSON file format contains data objects consisting of attribute–value pairs. An example of this is:

Sample JSON code

 

In contrast to HTML, note that the string data is in a single line. Alteryx has conveniently included a JSON Parse tool to transform JSON data into rows of data with “name” and corresponding “value” fields:

 

In the above example, the numbers under the “JSON_Name” field are unique identifiers (they function as a unique row number). To convert this data into a useful format, we have to first separate this numeric identifier from the Attributes attached to them using a Text to Columns tool (using “.” as the delimiter). Next, we have to transform the data using the Cross Tab tool to get the following type of output:

 

I have also included a sample workflow in Alteryx for parsing JSON data:

 


As a final tip, if you ever find a JSON file format inside HTML, there is a quick way to parse for the JSON data.

🙂 Hint: look for the unique brackets enclosing JSON data  [{ ……………. }]

Hope this blog was useful for you and have a great weekend!

 

 

Alex Chan
Author: Alex Chan