Parsing of PDFs is never an easy job. While working on my last placement with the client, I got an interesting task to convert PDF forms filled by subcontractors into the database. Adding to the challenge some of the PDF forms were poorly scanned, handwritten or tilted. Also, a lot of responses in the forms were checkbox questions which were not easy to parse.
I tried various options to convert the PDF forms to the database. As per my explorations, some of the options which can be helpful for a user are as follows –
- Using Run Command Tool – Initially, I used the approach to use the DocToText app triggered by the Run Command Tool to parse the text from PDF documents. Using the open-source program called DocToText. This program can be run at the command line to convert these file types to plain text, which Alteryx can read with no issue.
More information about the same can be found at the link:
This approach works fine but it is not always optimal. Also, it did not work for me with low quality or handwritten scans.
- Using PyPDF2 – As the next step, I used Python Code tool with PyPDF2 package to extract text from PDF. Once the text is extracted from the PDF, standard tools from Alteryx Designer can be used to further analyze the text and parse it.
You may explore the following link to explore more information on it:
Using Python script with PyPDF2 in Alteryx, I was not able to convert the low-quality scans properly.
- Using Google Tesseract OCR– Tesseract is an OCR engine with support for Unicode and the ability to recognize more than 100 languages out of the box. It can be trained to recognize other languages. Using Google Tesseract with Python script in Alteryx, I was able to convert most of the PDFs in a text file.
More information about the same can be found at the link:
https://community.alteryx.com/t5/Alteryx-Designer-Discussions/Alteryx-OCR-Tools/td-p/459695
But this also was not able to give me reliable results for checkbox responses.
- Using Google Cloud Vision API –The Vision API can detect and transcribe text from PDF and TIFF files stored in Cloud Storage. The Vision API accepts PDF/TIFF files up to 2000 pages. Larger files will return an error. Currently, PDF/TIFF document detection is only available for files stored in Cloud Storage buckets. Response JSON files are similarly saved to a Cloud Storage bucket. Using Cloud Vision API via Python script, I converted the documents into text files which can be further parsed out using Alteryx.
This video gives more details on how to set up the Google Vision API in Python to Detect Text in Files (PDF/TIFF):
https://www.youtube.com/watch?v=HMaoUdJQEgY&list=PL3JVwFmb_BnSLFyVThMfEavAEZYHBpWEd&index=5
But the results were still different and especially unreliable for checkbox questions.
- Using Google Cloud Vision OCR API – The Vision API can detect and extract text from images. Using Cloud Vision OCR API via Python script, I converted the documents into text files which can further be parsed out using Alteryx.
This video gives more details on how to set up the Google Vision API in Python to Detect and Extract Text (Image):
https://www.youtube.com/watch?v=zOKz0e8flTw&list=PL3JVwFmb_BnSLFyVThMfEavAEZYHBpWEd&index=3
The repose gives the coordinates of each polygon containing text.
But since the coordinates were different in my PDFs so the results did not match.
In the next steps of my exploration, I also examined a few software available to convert PDFs to excel format/ database directly. Few of them which can be beneficial to the user are as follows-
- PDF Element Software tool – PDFelement is a powerful, yet easy to use PDF editing system where you can create, organize, and secure your files. This tool directly converts the PDF to excel file, but it was not able to convert the checkbox questions. As most of the questions in the PDF are checkbox, So I had to look for other software.
- Docparser Tool– Docparser is a document data capture solution built for today’s modern cloud stack. It automatically fetches PDFs or scanned documents from various sources and extract specific data fields or tables. The Docparser tool provides more functionality for converting the PDF to excel file. We can set up the Parsing rules by following the various options. I used the options to pre-process the PDFs by Auto Zoom the content and fix the tilt/alignment of the documents. Then rules can be fixed to parse the required field based on another field. But still, the documents are not produced the same results. There is an option to parse the checkbox results based on pixel coordinates, but since the coordinates even in similar PDFs are different, so it was not able to produce consistent results.
- Adobe Acrobat Pro and Alteryx– Adobe Acrobat Pro is an optical character recognition (OCR) system. It is used to convert scanned files, PDF files, and image files into editable/searchable documents. The Tools tab organizes Acrobat Pro DC’s variety of tools in a single pane, organized by function: “Create & Edit,” “Share & Review” “Forms & Signatures,” “Protect & Standardize,” and “Customize.” Using Adobe Acrobat Pro, I converted the PDFs to excel format and then used Alteryx to parse the excel files and then write to the database.
Alteryx workflow to convert a batch of excel files to right format-
Macro to parse the excel file converted by Acrobat pro-
The powerful combination of Acrobat and Alteryx worked for me to convert the PDFs to the right format and write it to the database.
Although the other options mentioned in the blog are also strong still it depends on kind of PDF and your required database to be parsed from PDF that which tool works best for you.
Thanks for reading my blog, if you have any suggestions or comments feel free to connect with me via LinkedIn!