Welcome to Part II of retrieving data from the Disney character API. In this blog, we will be looking at cleaning and parsing (aka splitting the data into columns) the data.

Previously in Part I

As a brief summary what we basically did was create URLs with the page numbers and downloaded the data from each page. You can find a more in depth look in Part I.

Getting our headers

The next step in data wrangling adventure with the Disney character data is to check the data for our headers. This will involve some steps and tools.

1. Filtering out unneeded headers

The data we receive after the JSON parsing tool will have our headers in the JSON_Name column and our data in the JSON_ValueString. But not everything in the JSON_Name column will be a header. When we inspect the data closely, we will see the following data in the column:

As the picture suggests, the rows that have useful data actually have the headers in the JSON_Name with “data” as its prefix. By filtering any values under the JSON_Name column with “data” through the “True” we can can filter out the useless information such as “count” and “totalPages” (and these will appear under the “False” output)

2. Parsing for headers

From here we will need to take the desired headers from the “data.[number]” string.

Underlined are our desired headers

This is where the Text-To Columns tool comes in handy. We can divide the JSON_NAME column into 4 columns through the “.”. Your result should end up like this:

From parsing the columns there is one thing we will need to do: We will need to create a Unique ID for each character. This will allow us to join another table later on (to be announced in Part III).

Remember in Part I we used the Rowcount column to add pages to a URL? By using a similar concept we can use the Rowcount column to label the rows that belong to a specific character. But how do we know which rows belong to which character?

Accounting for individual Disney characters

Let’s have a look at the values in JSON_Name column again:

The circled numbers relate to a specific character and this format continues to repeat throughout the data. In knowing that this number relates to a specific character, which column after the parse had that number?

JSON_Name2 does!

Alrighty, almost there! Let’s use a formula to create a new column that combines the RowCount and the character number to create an ID for the character:

This is the way

Now that our headers column is almost ready, we can remove the columns we don’t need (JSON_Name, JSON_Name1 and JSON_Name 2) and also rename the columns to easier identify them in the future.

And that’s it for Part II!

In Part III we will be filtering and transposing the character data!

The Data School
Author: The Data School