Here’s a story of how I extracted hidden data from an Excel template and saved hours of manual work. What I learned might just help you someday, so come along for the journey and learn about XML files.

The Hidden Life of XML Files

An XML file by any other name would smell as sweet.

If you’re as curious as I am, you might have noticed how many proprietary file formats are really just plain XML files.

What’s an XML file? XML stands for eXtensible Markup Language, and is basically a text-based file format designed to be both human-readable and machine-readable. These days it’s a little less popular than JSON, but it serves largely the same purpose: transporting data from one place to another.

Here’s a super simple example:

<?xml version="1.0" encoding="UTF-8"?>
<book>
  <title>My Book Title</title>
  <author>John Doe</author>
  <published>2022</published>
</book>

You might notice that the structure is very similar to HTML.
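
Because it’s just plain text, XML is also easy to read programmatically. As a quick illustration (not part of the original workflow), here’s a minimal sketch that parses the book example above with Python’s built-in ElementTree module:

import xml.etree.ElementTree as ET

# The sample "book" document from above, as a bytes literal so the
# encoding declaration is accepted by the parser
xml_text = b"""<?xml version="1.0" encoding="UTF-8"?>
<book>
  <title>My Book Title</title>
  <author>John Doe</author>
  <published>2022</published>
</book>"""

# Parse the document and read individual elements by tag name
book = ET.fromstring(xml_text)
print(book.find("title").text)   # My Book Title
print(book.find("author").text)  # John Doe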

Here are a few examples of file formats that are actually XML:

  1. Tableau Workbooks
  2. Alteryx Workflows
  3. Microsoft Office Documents (Word, Excel, PowerPoint, etc.)

That last one is the one I’d like to focus on for the rest of this blog post.

The Problem: Dynamics CRM Import Templates

For the last couple of months I’ve been working on a massive CRM upgrade project that involves extracting, cleansing, and transforming data to match import templates for uploading to the new CRM.

Because we were migrating from an old instance of Dynamics to an up-to-date version, you’d think that it would be fairly simple to populate the new templates, right? Well, no.

  1. The first issue is that the column names in the CRM database differ from the column names in the template: the former use the “logical name” and the latter use the “display name.”
  2. Similarly, the name of the template file uses the display name, which does not always match the table name in the database.
  3. Finally, there is data validation logic in the Excel file that contains information about required fields, as well as drop-down lists containing the possible values a field can take.

Front-end Data Validation in Excel

These Excel templates contain all of the validations that the data needs to pass in order to be uploaded into the system:

  1. Data Format and Length Constraints: Whether the field is text, an integer, or a decimal number, the template will tell you what kind of data it is expecting. It will also give you a maximum length, or a numerical range that the value is allowed to fall within.
  2. Option-set Drop-down Lists: Some fields have a set of options that are allowed to be selected for that record. Any text that does not match one of these options will fail on import. These drop-downs are allowed to have up to 500 items, so copying them by hand is not an option.
  3. Mandatory Fields (and Lookup Fields): Mandatory fields have tooltips that contain “(required)”. Lookups are references to other tables, and any lookup values must already be present or the import will fail. Note that “Owner” is not a table; it actually refers to the “systemuser” table (not that you would know that by looking at it).
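
If you’d rather pull these rules out with code than read them off by eye, openpyxl exposes them. Here’s a minimal sketch, assuming a template saved as “Account.xlsx” (the file name is just a placeholder):

from openpyxl import load_workbook

wb = load_workbook("Account.xlsx")
ws = wb.active  # the data-entry sheet

# Each worksheet keeps its validation rules as a list of DataValidation objects
for dv in ws.data_validations.dataValidation:
    # dv.type is e.g. "list", "whole", "decimal" or "textLength";
    # dv.formula1 holds the constraint (for drop-downs, a reference to the list of options);
    # dv.prompt holds the tooltip text, which is where "(required)" should appear
    print(dv.sqref, dv.type, dv.formula1, dv.prompt)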

A Problem, Intensified

To map the source data to the import templates I needed to match each of the columns in the template to the raw database extract.

If it were just one template with a few fields, then I’d probably map it by hand. However…

  1. Because the new CRM was in development, fields in the template could change as the application was updated.
  2. I had 15 templates to fill out, with the Contacts template alone having 90 columns.
  3. Some of the columns in the template were new and custom, meaning there was no match to the old data.

The first time I tried to fill out these templates, mapping the columns by hand, it took me about three hours, and that didn’t include any data validation.

A Problem, Solved

There had to be a better way. I just did not have the time to spend three hours every time a template changed. So I looked to the XML to see what I could find.

Without going into too much detail, an Excel file is just a zipped folder of XML files containing all the data in the worksheets. Basically, everything you can see when you use Excel is in those XML files somewhere.
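
You don’t have to take my word for it; rename a copy of the file to .zip and open it up, or list the contents straight from Python (here “Account.xlsx” is just a placeholder file name):

import zipfile

# An .xlsx file is a ZIP archive of XML "parts"
with zipfile.ZipFile("Account.xlsx") as xlsx:
    for name in xlsx.namelist():
        print(name)  # e.g. xl/workbook.xml, xl/worksheets/sheet1.xml, xl/sharedStrings.xml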

The screenshot above is from the sheet.xml file for a sheet named “hiddenDataSheet”, and if you look in the workbook.xml file you will see that its state is “veryHidden”, which is apt because you can’t see its contents any other way.
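
As an aside, you don’t need to read workbook.xml by hand to spot sheets like this; openpyxl will report each sheet’s visibility state for you. A minimal sketch, again with a placeholder file name:

from openpyxl import load_workbook

wb = load_workbook("Account.xlsx")
for ws in wb.worksheets:
    # sheet_state is "visible", "hidden" or "veryHidden";
    # "veryHidden" sheets cannot be un-hidden from the Excel interface at all
    print(ws.title, ws.sheet_state)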

Let’s look at this really helpful pile of nonsense:

If you take that text string and URL-decode it, then after some re-formatting you will end up with something similar to what you see below:


What you’re seeing is:

  • Table Logical Name = account
  • Some kind of base64 checksum
  • Logical to Display Name Mapping for every field
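
The URL-decoding step, by the way, is a one-liner with Python’s standard library. A minimal sketch (the encoded string below is a made-up stand-in, not the real hidden value):

from urllib.parse import unquote

encoded = "account%2C%7B%22name%22%3A%22Account%20Name%22%7D"  # placeholder, not the actual template string
print(unquote(encoded))  # account,{"name":"Account Name"}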

Can you skip to the good part?

This post is already far too long, so I’m going to wrap it up here by showing you a snippet of the JSON file I created, which I used as part of my Python workflow to extract, map, and validate every single column of every single row of the data in minutes.

The best part is that my workflow downloads the current template, so if any of the columns change it will automatically take that into account.

 

Deep Dive Done

Fair to say I’ve lived up to my Data School nickname of “Deep Dive Dan” on this one. I hope you enjoyed getting a glimpse of what’s possible if you peek behind the curtain a little bit.

There’s so much information out there in plain sight, you just have to know what to look for.

Until next time.

Love, Dan

Author: Daniel Lawson

Right off the bat I can tell you that I’m not your average data analyst. I’ve spent most of my career running my own business as a photographer and videographer, with a sprinkling of Web Development and SEO work as well. My approach to life and work is very T-shaped, in that I have a small set of specific skills complemented by a very broad range of interests; I like to think of myself as a dedicated non-specialist. Data Analytics and Programming started as a hobby that quickly grew into a passion. The more I learned, the more I looked for opportunities to pull, manipulate, and join data from disparate sources in my life. I learned to interact with REST APIs for services I used, with personal data from services like Spotify, and with health data captured by my devices. I learned SQL to create and query databases, as well as analyse SQLite files containing my iMessages and Photos data on my Mac. Every technique I learned opened up more possibilities; now I’m hooked and there’s no turning back. Learn More About Me: https://danlsn.com.au