1. The objective of the Alteryx Weekly Challenge #13
The following description comes from the Alteryx Weekly Challenge page. The objective of this challenge is essentially web scraping with Alteryx: extracting a table’s contents from a web page.
The use case:
We have HTML data in a single field, and that HTML contains an HTML table.
The input contains a series of name/value pairs within the description field. The description field has an HTML table that contains 14 name/value pairs held within <td> tags. Each pair sits on its own row (designated by the <tr> tag).
The objective is to produce a table containing the 14 name/value pairs.
The contents of the HTML and the actual table in it are shown here:
2. Solution using the Regex Tool
- Use the Regex Tool to do the Web Scraping
As an exercise in our second week of Alteryx training, we worked out a Regex expression that grabs the contents of the table.
First let’s take a look at the Regex expression we created:
The expression matches all the possible patterns of the name/value pairs in the HTML. Here is a detailed breakdown of the expression:
- <td> and <\/td>: Every name and every value sits inside a <td>…</td> element.
- The ( ): A capturing group that grabs the needed contents one match at a time.
- {*<* and >*}*: The contents may begin with { or < and end with > or }; the * means "zero or more", so each of these characters is optional.
- The first \w+: After any leading < or {, there is a sequence of word characters.
- The [ ]*: After the first character sequence, there may be a repeating pattern of a “-”, a space, or a “.”, followed by another character sequence.
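The pieces listed above can be put together into a working pattern. The following is a minimal Python sketch, not the expression used in the workflow: the pattern is a hypothetical reconstruction from the breakdown, the names `CELL_PATTERN` and `extract_cells` are illustrative, and the sample HTML is made up rather than taken from the challenge data.

```python
import re

# Hypothetical pattern assembled from the pieces in the breakdown;
# the expression used in the actual workflow may differ in its details.
CELL_PATTERN = re.compile(
    r"<td>"              # opening cell tag
    r"("                 # start of the capturing group
    r"\{*<*"             # optional leading { or <
    r"\w+"               # first run of word characters
    r"(?:[-. ]+\w+)*"    # separators (-, ., space) followed by another word run, repeated
    r">*\}*"             # optional trailing > or }
    r")"                 # end of the capturing group
    r"<\/td>"            # closing cell tag, with the slash escaped
)

# Made-up sample input, not the challenge data.
SAMPLE_HTML = (
    "<table>"
    "<tr><td>Name</td><td>Main St. Bridge</td></tr>"
    "<tr><td>Status</td><td>{Open}</td></tr>"
    "<tr><td>Code</td><td><A-12></td></tr>"
    "</table>"
)


def extract_cells(html: str) -> list[str]:
    """Return the captured content of every <td> cell that fits the pattern."""
    return CELL_PATTERN.findall(html)


if __name__ == "__main__":
    for cell in extract_cells(SAMPLE_HTML):
        print(cell)
```

Cells whose content falls outside these patterns (for example, an empty cell) are skipped rather than reported, so it is worth checking the match count against the expected number of cells.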
Checking the result of the Regex expression, we can see that every name and value inside a <td>…</td> pair is singled out, so the expression is ready to be used in a Regex tool in Alteryx.
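To get from individual cells to the name/value table the challenge asks for, the cells still have to be paired up by row. The sketch below shows one way to do that in Python, outside Alteryx. It assumes each <tr> holds exactly two <td> cells (name first, value second), swaps in a simpler non-greedy cell pattern for brevity, and uses made-up sample HTML; `name_value_pairs` is an illustrative name, not part of any workflow.

```python
import re

# Simplified patterns for illustration only: non-greedy matches between tags.
ROW_RE = re.compile(r"<tr>(.*?)</tr>", re.DOTALL)
CELL_RE = re.compile(r"<td>(.*?)</td>", re.DOTALL)


def name_value_pairs(html: str) -> list[tuple[str, str]]:
    """Pair the two cells of each row as (name, value)."""
    pairs = []
    for row in ROW_RE.findall(html):
        cells = CELL_RE.findall(row)
        # Assumption: a valid row has exactly two cells; other rows are skipped.
        if len(cells) == 2:
            pairs.append((cells[0].strip(), cells[1].strip()))
    return pairs


if __name__ == "__main__":
    # Made-up sample input, not the challenge data.
    sample = (
        "<table>"
        "<tr><td>Name</td><td>Main St. Bridge</td></tr>"
        "<tr><td>Status</td><td>{Open}</td></tr>"
        "</table>"
    )
    for name, value in name_value_pairs(sample):
        print(f"{name}: {value}")
```

Grouping by row first keeps a missing or malformed cell from shifting every later name onto the wrong value, which can happen if the matches are simply paired off in sequence.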