Before the Christmas break I did Alteryx Challenge 188 on parsing data. In the challenge, we are asked to extract relevant tax data from a text file and present it as shown in the screenshot below. This looks like a daunting task, but with a bit of RegEx magic and a plan of attack, it is not that difficult.
Below, you can see the workflow I created to solve the challenge. In this blog post I will walk you through how I parsed and extracted the relevant data using RegEx (this part of the process is boxed in blue below). I will focus on this part as it was the hardest part of the Alteryx challenge and the following steps can be used to solve similar parsing problems.
Step 1. Locate the relevant data
The first step is to locate the relevant data in the text file and find a pattern if possible. In the images below I have colour-coded the relevant data that we need to extract to solve the challenge. We have the first VAT number at row 47, followed by a header row and then the tax data on the following row(s). This pattern repeats for the rest of the text file. For the rest of this blog post, I will call the rows that contain VAT numbers “VAT rows” and rows with the tax data “Tax data rows”.
Step 2. Filter out all data rows we don’t need
Now that we have located the rows with the relevant data, we can use the filter tool combined with the REGEX_MATCH function to get rid of the rows of data we don’t need. In the screenshot below you can see the Alteryx workflow where I filter the VAT rows and Tax data rows separately. I used the RegEx-tool on Regex101.com to find the regular expressions that would match the VAT rows’ and Tax data-rows’ string patterns. In the screenshot from Regex101.com below you can see the regular expression for the VAT rows outlined in red and the successful matches highlighted in blue.
Step 3. Extract the relevant data from each row using the RegEx parsing tool
The next step is to parse the rows by extracting the VAT numbers from the VAT rows and extract the tax data (Doc number, Type, Tax amt) from the Tax data rows. This is done by using the RegEx parsing tool as seen in the workflow below. I used Regex101 again to find the regular expressions that would group and extract the relevant data fields.In the screenshot, you can see the regular expression for the Tax data from Regex101 outlined in red and the grouped data fields highlighted in different colours.
Step 4. Bringing the relevant data together
Now that we have extracted the data we need, the next step is to bring the relevant data back together into one data set. I am not going to go into detail about this process as it is not the focus of this blog post, but I will quickly explain the steps in the workflow below. First, we bring the VAT and Tax data into one data set using the union tool. Then, we use the Multirow tool to add the correct VAT number to the corresponding Tax Data rows. After this, we use the filter tool to get rid of null-rows and only keep rows with Tax Code P1 and P9 specified in the challenge description. Finally, we sort the columns and rows and use Auto field to set the smallest possible field type and size for the data fields.