Hey ChatGPT, can you summarise this article for me?
Sure thing! The article discusses the problem of parsing unstructured data and the challenge it poses for data analysis. Dan proposes a solution using natural language processing (NLP) and specifically ChatGPT API, to classify parts of the text into different categories and extract the text from each category. Dan demonstrates the effectiveness of ChatGPT API through a practical example, showcasing its ability to extract specific information from an unstructured text field, ultimately leading to a viable solution for the parsing problem.

The Problem

Unstructured data is information that does not conform to a particular structure or format. It is often the result of raw input from a user, or data gathered in the wild. Unstructured data can make analysis difficult because most tools are designed to work with structured data in the form of rows and columns.

To use an example from a project I’m currently working on: we use a project management system that tracks clients and their projects. For clients, the system only provides a field for the client’s name and a field for their address. However, the address field is often used to store more than just the address. For example, the field may contain information about the primary contact, their contact details and some other notes.

This is a problem because we want to be able to extract the address from the field, but we also want to be able to extract the contact information. We could use a regular expression to extract the address, but this would be difficult to maintain as the format of the data varies from client to client. Regular expressions are really only practical when working with regular data; it’s in the name, after all.
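To make that brittleness concrete, here’s a sketch of the kind of pattern you’d end up writing. The regex below (a hypothetical example, not from the real project) handles one common Australian “City, STATE postcode” layout, and fails the moment a client writes the same information in a different order:

```python
import re

# Matches one specific layout: "<city>, <state abbreviation> <4-digit postcode>"
AU_ADDRESS = re.compile(
    r"(?P<city>[A-Za-z ]+),\s*"
    r"(?P<state>VIC|NSW|QLD|SA|WA|TAS|NT|ACT)\s+"
    r"(?P<postcode>\d{4})"
)

# Works for the layout it was written for...
m = AU_ADDRESS.search("Level 12, 500 Collins Street, Melbourne, VIC 3000")
print(m.group("city").strip(), m.group("postcode"))  # Melbourne 3000

# ...but the same information in a different layout doesn't match at all.
print(AU_ADDRESS.search("Melbourne 3000, Victoria"))  # None
```

Every new layout means another pattern to write and maintain, which is exactly why this approach doesn’t scale to free-text fields.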

A Possible Solution

When thinking about how to solve this, I knew a solution had to be out there; surely this is a common problem.

So let’s break down what we need to do:

  • We need to classify parts of the text into different categories
  • Then we need to be able to extract the text from each category

What does that sound like? It sounds like a natural language processing problem of course.

My first attempt was using the built-in NLP capabilities of Alteryx’s Intelligence Suite to do Named Entity Recognition (NER). This worked for the most part, but it was not perfect. It also required a bit of wrangling and logic to get the results I wanted.

On a whim I thought I’d try ChatGPT because I’d already been able to achieve some magic solving other problems. After a bit of tinkering I was able to get it to work using a prompt similar to this:

The Prompt

 

 

Asking ChatGPT to Parse Unstructured Text

Here’s the prompt I used:
You are a helpful data quality assistant that is tasked with extracting contact information from unstructured data provided by the sales team in our CRM. From the JSON Object below, please extract any of the following fields that you find.

Desired Fields:
- Account ID
- Full Name
- Position
- Company
- Address Parts (Lines, Street Address, City, State, Postcode and Country.)
- Phone Numbers (E164 Formatted, with phone number type)
- Email Address
- Website
- Social Media Profiles (Full URL for the Platform)
- Other Information

JSON with Unstructured Text Field:
{
  "account_id": 270270270,
  "text": "Daniel Lawson, Data Analytics Consultant (The Data School Down Under) \nLevel 12, 500 Collins Street,
  Melbourne, VIC 3000, Australia. Main: (02)92600750.\nExt. 1234 M: 0499 555 555 Free Call: 1800 737 126 E: applications@thedataschool.com.au.\nW: thedataschool.com.au 
  LinkedIn: www.linkedin.com/in/danlsn Twitter: @_danlsn Insta: @_danlsn ABN: 36 625 088 726" 
}

Valid JSON Object with Snake Case Field Names:

Let’s break it down a bit. The first part gives the model context for the role it is playing. This leads into the second part, which lists the fields we want to extract, based on the kind of data I know is present in some of the records. In the field definitions we can also specify how we want the output to be formatted.

For example, I want the address broken down into its component parts, an array of phone numbers with the type of each number included, and the social media profiles returned as full URLs. I also want the phone numbers in E164 format, a standard for international phone numbers that is widely supported by other systems.
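For a rough sense of what E164 normalisation involves, here’s a simplified sketch for Australian numbers. This is a hypothetical helper, not part of the actual workflow; real-world numbers have enough edge cases that a dedicated library such as `phonenumbers` would be the safer choice:

```python
import re

def to_e164_au(raw: str) -> str:
    """Normalise an Australian phone number to E.164 (+61...).

    Simplified sketch: assumes the input is a valid AU number and
    ignores edge cases a real library would handle.
    """
    digits = re.sub(r"\D", "", raw)  # strip spaces, brackets, dashes, '+'
    if digits.startswith("61"):      # country code already present
        return "+" + digits
    if digits.startswith("0"):       # domestic trunk prefix -> +61
        return "+61" + digits[1:]
    return "+61" + digits

print(to_e164_au("(02) 9260 0750"))  # +61292600750
print(to_e164_au("0499 555 555"))    # +61499555555
```

The interesting part is that the model performs this kind of normalisation itself, straight from the prompt, without any of this code.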

Finally, the last two sections include a JSON Object with the record we want to parse, and finishes with a prompt for the model to output a valid JSON object with snake case field names. This is important because it means that the output can be easily converted to a Python dictionary later on.
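That conversion step is a single call once the model cooperates. Assuming the reply lands in a `response_text` string (the values below are just illustrative, drawn from the example record in the prompt), loading it into a dictionary looks like:

```python
import json

# Illustrative reply: a snake_case JSON object like the one the prompt requests.
response_text = (
    '{"account_id": 270270270, "full_name": "Daniel Lawson", '
    '"company": "The Data School Down Under"}'
)

try:
    record = json.loads(response_text)  # JSON object -> Python dict
except json.JSONDecodeError:
    record = None  # models occasionally wrap the JSON in extra prose

print(record["full_name"])  # Daniel Lawson
```

The `try`/`except` guard is worth keeping: the model usually returns clean JSON when prompted this way, but not always, so it pays to handle the failure case.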

The Response

Here’s what I got in response: a valid JSON object with the fields I asked for. I thought that was pretty amazing, and I knew it was a viable solution to my problem.

The Response from ChatGPT

Let’s look at the output a bit more closely. The first thing to note is that the model was able to extract the account id from the JSON object. This was included in the prompt and will make it easier to join the record back into our system later on.

The next thing to note is that the model was able to extract the full name, position and company from the text. It inferred this without any additional hints in the prompt, because the model was trained to recognise these types of entities in a variety of contexts. The model also parsed and broke out the address field accurately, which will be really helpful when updating the CRM later on.

What amazed me was the phone number parsing. It was able to correctly identify the phone numbers and the type of phone number they were. It was also able to correctly format the phone numbers into E164 format despite not explicitly providing the country code in the prompt. I expect it was able to achieve this because of the Australian address as well as the structure of the phone numbers, but I was still pretty excited about it.

Email extraction is probably the least interesting part of the output, but it’s still useful, and the same goes for the website extraction. The model sidesteps the need to write a regex parser, and the less regex you have to write the better (I still love regex though).

The social media handle extraction was really cool too! The model correctly identified each platform from the text and, with its knowledge of the internet, constructed the correct URL for each profile.

Finally, because I added a catch-all field for any other information the model might be able to extract, it correctly identified and formatted the ABN (Australian Business Number). This is really useful because it means we don’t miss out on information we didn’t explicitly ask for but that the model was still able to extract.

 

Next Steps

I’ve gone on long enough for one blog post but there’s still so much to share. Because I needed to do this on a larger scale programmatically in Python, I needed an API to interact with. When I started exploring this OpenAI didn’t have a ChatGPT API, but they did have a GPT-3 API. I was able to use the GPT-3 API to achieve similar results after fine-tuning the prompt.
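As a sketch of what that programmatic use might look like, here’s one way the same prompt could be assembled into a Completions-style request payload. The helper name, field layout and the `text-davinci-003` model name are illustrative assumptions; the actual client call and working code are for the follow-up post:

```python
import json

def build_completion_payload(record: dict) -> dict:
    """Assemble a GPT-3 Completions-style request for one CRM record.

    Illustrative sketch only: the real request would be sent via the
    `openai` client library, with the full field list from the prompt.
    """
    prompt = (
        "You are a helpful data quality assistant that is tasked with "
        "extracting contact information from unstructured data provided "
        "by the sales team in our CRM.\n\n"
        "JSON with Unstructured Text Field:\n"
        + json.dumps(record)
        + "\n\nValid JSON Object with Snake Case Field Names:\n"
    )
    return {
        "model": "text-davinci-003",  # assumed GPT-3 model name
        "prompt": prompt,
        "max_tokens": 500,
        "temperature": 0,  # deterministic output suits extraction tasks
    }

payload = build_completion_payload({"account_id": 270270270, "text": "..."})
print(payload["model"])  # text-davinci-003
```

Setting the temperature to zero is the one design choice worth flagging: for extraction you want the most likely parse every time, not creative variation.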

However, on March 1st, 2023, OpenAI released their ChatGPT API, built on their enhanced GPT-3.5 language model. I was able to use this API to achieve even better results, and the cost per token is a tenth of that of the GPT-3 API despite the model being much more capable.

I’ll be writing a follow-up post to share my experience using the ChatGPT API, including some code snippets so you too can use it in your own projects. I’m so excited about the potential of this technology and I can’t wait to see what can be achieved with it.

Until next time.

With love, Dan

 

Author: Daniel Lawson

Right off the bat I can tell you that I’m not your average data analyst. I’ve spent most of my career running my own business as a photographer and videographer, with a sprinkling of Web Development and SEO work as well. My approach to life and work is very T-shaped: I have a small set of specific skills complemented by a very broad range of interests, and I like to think of myself as a dedicated non-specialist. Data Analytics and Programming started as a hobby that quickly grew into a passion. The more I learned, the more I looked for opportunities to pull, manipulate, and join data from disparate sources in my life. I learned to interact with REST APIs, to work with personal data from services I use like Spotify, and to explore health data captured by my devices. I learned SQL to create and query databases, and to analyse the SQLite files containing my iMessages and Photos data on my Mac. Every technique I learned opened up more possibilities; now I’m hooked and there’s no turning back. Learn More About Me: https://danlsn.com.au