In this blog, I am going to show you, how you can extract data from a pdf file. I’ll introduce you to three new tools in Alteryx and get you to a point where the data is ready for cleaning.

Important: The three new tools we’ll be using are Image Input, Image to Text, and Image Template. These tools are found in the Computer Vision pallet and are part of the Alteryx Intelligence Suite. Intelligence Suite requires a separate license and add-on installer to Designer.

Let’s say I’ve downloaded my bank statement as a pdf file and want to extract the data so I can prepare it for a dashboard in tableau. Let’s imagine this is a list of transactions on a bank statement, you could just connect to it directly using Tableau, but quite often, bank statements for one, are not this simple. There’s a lot of other data on there that you don’t need, so using these tools in Alteryx will help you grab just want you need.

Sample Data in a PDF

Step 1: Drag the Directory tool from the In/Out pallet to your canvas, Specify the Directory and File name.

Step 2: Connect an Image input tool to the Directory tool and set the config as in the image below, referencing the field name ‘Fullpath’ that is an output from the directory tool, and leave the name of the BLOB column as ‘image’ for the next step

Step 3: Connect an Image to Text tool to the Image input tool via the D anchor, here is where we need to reference the field ‘image’ as the Image and in our case, we will leave the default language as English.

Step 4: Connect the output of an Image Template tool to the T anchor of the Image input tool and configure as follows,

  1. Select ‘Open File’ and load the image of the pdf file
  2. Highlight the areas to export, in this case, I Click and drag to highlight table
  3. I give it the name ‘Transactions’ and set the type as ‘Table’ from the dropdown list.

 

At the end of this step, after I run the workflow, and look at the Results window for the Image to text output, I found the ‘Transactions’ field all the way at the end. This is the only field we need. Because it contains all the rows of data that were in the table.

Step 5: Use a Select tool to filter out the rest of the fields, keeping just the field named ‘Transactions’

Step 6: Transpose the data using a transpose tool. A Record ID tool is useful if I had highlighted more than one area in Step 4, this can be used as the ‘Key Column’ in the transpose config, in this example, I only have one row of data, so it wasn’t necessary.

Step 7: Add the Text to Column tool from the Parse pallet and split to rows using the \n (New line) as the delimiter. And that’s it! I can see the transactions it’s BAU when it comes to cleaning and preparing the data.

The power of Alteryx never ceases to amaze me, I hope you enjoyed this as much as I did and thank you for your time.

Jude Shu
Author: Jude Shu