Parsing unstructured text is a nightmare.

As Data Analysts, clean data is a bit of a unicorn. If you’re lucky you might get data in a nice tabular dataset with columns and rows, but this isn’t always the case.

Sometimes you’ll receive completely unstructured text that was mined from somewhere random in a completely unpredictable format. The fields may be different in each record, and they probably won’t be labelled consistently (if at all).

I faced a problem similar to this in my CRM project where I had a bunch of records copied from email signatures of clients into a free text field in our project management software. There was a lot of rich information hidden in the data but it was so inconsistent that brute forcing it with regex just wasn’t feasible at all.

I thought about using named entity recognition to identify different parts of the text with the hope of extracting the import parts from it but NER is fairly limited in the out-of-the-box entities that it can recognise without developing and training a bespoke model.

Using ChatGPT to parse unstructured text.

In my previous blog post on this subject (https://www.thedataschool.com.au/daniel-lawson/using-chatgpt-to-parse-unstructured-text/), I outlined how I tried using ChatGPT to help me solve this problem. With that test being successful I looked towards OpenAI’s APIs to do this job programmatically. 

Whilst the code I’ll show you below can be scaled to do many rows I’m going to show you a simple example using one fake record.

Here’s the input data:

 

Drum roll please…

Here’s the output that was generated by the API. We’ll go through it in a second.

Let’s unpack this response. 

  • First of all, it made easy work of the client_id field, but to be honest that’s pretty easy to parse by hand.
  • It was able to identify which text was my name and it knew how to split that into first and last names.
  • The API was able to identify the different parts of the address, and named the different parts correctly. However, with a bit of tweaking it’s possible to get better results than this.
  • It was good at identifying my title and company from the text with no issue.
  • Phone number parsing is probably my favourite part of this exercise. With a bit of context, it was able to take the phone numbers, label them correctly, add the correct country code as well as the right area code for the landline. Finally, it formatted the numbers in the proper E164 format.
  • Email addresses are pretty easy so it gets a meh for this one, website as well.
  • Social media was extracted really nicely, and it was even able to expand the abbreviations to Instagram, Twitter and GitHub correctly.
  • I designed the prompt to give me a catch-all other info field in order to collect any data that it could identify but that I didn’t plan to be there. In the past ABNs have often found themselves there.
  • I added my degrees to the text to test this functionality and happily it was able to figure out that these were my qualifications.

The code.

A Python tutorial is out of the scope of this blog post but I will share with you the code below.

If you know Python at all then it should be pretty straightforward. 

Basically, it prepares a request to the OpenAI API with a prompt similar to what you’d put into ChatGPT.

You specify the fields that you expect and want to extract and finish with a line specifying what the output should be.

I asked for “valid json output” so that I could parse it back in Python and use it in other areas of my workflow.

For now, take a look at the code that created the output above and hopefully it will open your eyes to what’s possible with the amazing new technology we’re seeing.

 

Thanks for stopping by.

If you found this interesting please don’t hesitate to reach out on LinkedIn or you can find me on my website (https://danlsn.com.au). 

I’d love to write more about Python which is my data analytics lingua franca and something I love writing in as much as I can.

The possibilities are endless if you know how to code and every day I learn more I unlock more and more possibilities.

Until next time.

Love,
Dan

Daniel Lawson
Author: Daniel Lawson

Right off the bat I can tell you that I’m not your average data analyst. I’ve spent most of my career running my own business as a photographer and videographer, with a sprinkling of Web Development and SEO work as well. My approach to life and work is very T-shaped, in that I have a small set of specific skills complemented by a very broad range of interests; I like to think of myself as a dedicated non-specialist. Data Analytics, and Programming, started as a hobby that quickly grew into a passion. The more I learned the more I looked for opportunities to pull, manipulate, and join data from disparate sources in my life. I learned to interact with REST APIs for services I used, personal data from services I use like Spotify, and health data captured by my devices. I learned SQL to create and query databases, as well as analyse SQLite files containing my iMessages and Photos data on my Mac. Every technique I learned opened up more possibilities; now I’m hooked and there’s no turning back. Learn More About Me: https://danlsn.com.au