During the training at The Data School we had many projects that involved scraping websites. Although there are ways to scrape websites in Alteryx, I always missed having a tool dedicated to the task. In this post I will show how I created a brand new Alteryx tool using the Python SDK and a little HTML, CSS, JavaScript and Python knowledge, and what the tool does.

What is the Python SDK?

The Alteryx Python SDK is a Python extension module that provides users the ability to create custom Alteryx tools using Python. The SDK allows users to access core elements of the Alteryx Engine. To create a custom tool the user needs to be familiar with Python and file management.

Creating a tool also involves using the HTML GUI SDK which is a library of extensions used to create the graphical user interface (GUI) for the configuration panel of the tool.
To work with the HTML GUI SDK the user needs to be familiar with HTML, CSS and JavaScript.

To learn more about building custom tools in Alteryx, click here

Why did I not just create a macro?

That is exactly what I did first. I used the Python tool and created a macro. The problem was that the Python tool cannot handle the metadata in Alteryx, so when I put the macro on the canvas I got a metadata error message.

The error message comes from the Python tool inside the macro. See image below.

The error message itself was something I could probably have lived with, but what it really meant was that the tool didn’t pass on the metadata to the next tool, so the following tool couldn’t recognize any fields.

This was very inconvenient. Although the error disappeared and the metadata reached the Select tool after running the workflow, every time I changed anything in the configuration window of my macro the error appeared again, and the workflow had to be run once more to populate the metadata.

There was another thing I didn’t like about the macro. The Python tool seemed to be very slow. A simple parsing task took way too much time to run.

Creating the Custom Tool

I wanted to create a tool which would allow the user to select parts of an HTML document by using XPath (XML Path Language) expressions. I knew that Beautiful Soup is a popular Python library for parsing HTML, but Beautiful Soup doesn’t support XPath expressions, so I decided to use the lxml library instead.

I won’t go into detail about how I created the tool. It took some time to understand how to use the Python and HTML GUI SDKs. This article helped me a lot: https://community.alteryx.com/t5/Data-Science-Blog/Levelling-Up-A-Beginner-s-Guide-to-the-Python-SDK-in-Alteryx/ba-p/159440
This is another useful article about managing metadata: https://community.alteryx.com/t5/Engine-Works-Blog/Managing-Metadata-Made-Easy-with-the-Python-SDK/ba-p/190046

I investigated these example custom tools to understand how they come together: https://github.com/alteryx/python-sdk-samples

How to Use the Tool

The tool uses XPath (XML Path Language) to select elements and contents in an HTML or XML document. The language is easy to learn and well documented.

On the tool’s configuration window, the first field (Source Field) selects the column which holds the HTML or XML document as a string. The second field (XPath Expression) is self-explanatory: it takes the XPath expression which selects the required part of the document. The third field sets the name of the output field. By default, the generated output field is named ‘XPath Result’.
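To make the three fields concrete, here is a minimal sketch in plain Python with lxml. It is not the tool’s actual source code: the function name apply_xpath, the list-of-dicts record format, and the choice to emit one output row per match are all assumptions made for illustration.

```python
from lxml import html


def apply_xpath(records, source_field, xpath_expression, output_field="XPath Result"):
    """Hypothetical helper: run one XPath expression against every record.

    Assumption: each match becomes its own output row, with the original
    fields copied alongside the new output field.
    """
    results = []
    for record in records:
        tree = html.fromstring(record[source_field])
        matches = tree.xpath(xpath_expression)
        # Expressions such as count(//a) return a single value, not a list.
        if not isinstance(matches, list):
            matches = [matches]
        for match in matches:
            if isinstance(match, html.HtmlElement):
                # Element matches are serialised back to markup.
                value = html.tostring(match, encoding="unicode")
            else:
                # Text, attribute and numeric matches are converted to strings.
                value = str(match)
            results.append({**record, output_field: value})
    return results


rows = [{"Page": "<html><body><a href='/one'>One</a><a href='/two'>Two</a></body></html>"}]
for row in apply_xpath(rows, "Page", "//a/@href"):
    print(row["XPath Result"])
```

Here "Page" plays the role of the Source Field, "//a/@href" is the XPath Expression, and the default output field name matches the one described above.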

Here are some example expressions.

//article                         find all <article> tags
//article/h1                      find all <h1> tags directly below an <article>
//article//h1                     find all <h1> tags any level below an <article>
//article//h1/text()              get the content of all <h1> tags anywhere below an <article>
//a/@href                         get the href attribute of all anchors
//a/text()                        get the text of all anchors
(//table)[3]                      find the 3rd table of the document
(//table)[3]//tr                  find every row in the 3rd table of the document
(//table)[3]//tr//td[1]/text()    get the content of the first column in each row of the 3rd table of the document

The parentheses matter: without them, //table[3] selects every <table> that is the 3rd <table> child of its own parent, which is only the 3rd table of the document when the tables are siblings.
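If you want to try expressions like these outside Alteryx first, the following sketch runs a few of them with lxml, the library mentioned earlier. The sample page is made up for the illustration. It also shows the difference between //table[3], which matches tables that are the 3rd <table> child of their parent, and (//table)[3], which matches the 3rd table in document order.

```python
from lxml import html

# Made-up sample page: each table sits in its own <div>.
PAGE = """
<html><body>
  <article><h1>First</h1><div><h1>Nested</h1></div></article>
  <div><table><tr><td>a</td></tr></table></div>
  <div><table><tr><td>b</td></tr></table></div>
  <div><table><tr><td>c1</td><td>c2</td></tr><tr><td>d1</td><td>d2</td></tr></table></div>
</body></html>
"""

tree = html.fromstring(PAGE)

# <h1> directly below <article> versus at any depth below it.
direct_h1 = tree.xpath("//article/h1/text()")
any_h1 = tree.xpath("//article//h1/text()")

# Position is counted per parent unless the node set is parenthesised first.
per_parent = tree.xpath("//table[3]")
third_in_doc = tree.xpath("(//table)[3]")

# First cell of every row in the 3rd table of the document.
first_cells = tree.xpath("(//table)[3]//tr/td[1]/text()")

print(direct_h1)
print(any_h1)
print(len(per_parent), len(third_in_doc))
print(first_cells)
```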

For more information on how to use XPath visit these websites:
GitHub XPath Cheatsheet
Devhints.io XPath Cheatsheet
Izone.de XPath Cheatsheet
XPath Cheatsheet PDF
w3schools XPath Tutorial
Tutorialspoint XPath Tutorial

Download

You can download the XPathParser.yxi installer from here. After you run the installer, it will put the HTML / XML XPath Parser tool on the Parse palette. This is what the icon looks like:


You can also find an example workflow which scrapes the weekly challenges page from the Alteryx Community website.

I hope it will make web scraping easier for everyone and you will enjoy using this tool.

Update!

If you get errors using the tool after updating Alteryx from 2020.2 to 2020.3, you need to do the following:

  1. Delete the XPathParser folder from the following directory locations:
    %APPDATA%\Alteryx\Tools or %ALLUSERSPROFILE%\Alteryx\Tools if it’s an Admin install
  2. Re-install the tool
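
If you would rather check for the folder from a script before deleting it by hand, here is a small sketch that only uses the Python standard library. The function name xpath_parser_install_dirs is made up for this example, and the deletion itself is left commented out so nothing is removed by accident.

```python
import os
import shutil  # only needed if you uncomment the rmtree call below
from pathlib import Path


def xpath_parser_install_dirs(env=os.environ):
    """Hypothetical helper: build the two candidate XPathParser folder paths."""
    candidates = []
    for variable in ("APPDATA", "ALLUSERSPROFILE"):
        base = env.get(variable)
        if base:
            candidates.append(Path(base) / "Alteryx" / "Tools" / "XPathParser")
    return candidates


for folder in xpath_parser_install_dirs():
    status = "found" if folder.is_dir() else "not present"
    print(f"{folder}: {status}")
    # After checking the printed path, uncomment to delete the folder:
    # shutil.rmtree(folder)
```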
Author: Laszlo Dobiasz