Data comes in all shapes and sizes. But what happens if the data you’re wrangling is in the millions of records?
Having worked with data with several million rows of data in the past few client projects, there’s been times where running a single Alteryx workflow can take several hours — especially if spatial data is involved. This begs the question, “How can I optimise this workflow performance?” One solution is to use the fastest database tools available on Alteryx, the Calgary tools.
What Are The Calgary Tools?
Designed for use for large scale databases with millions of records, the Calgary category includes tools for counting data and performing data analysis. With upper of 2 billion records, the Calgary tools are designed for that sweet spot of 100 – 300 million records, using an indexing methodology to sort and analyse data.
Want to use Calgary databases? It all starts here. To be able to use a Calgary file, you must first output the file you’re working with as a Calgary database file (.cydb) through the Calgary Loader. Under the standard indexing, you’ll have the option to manually select or deselect your data and index fields. Fields chosen in the Data column will be available for retrieval and fields that are chosen in the Index column will be available for queries. Taking the Fines for NSW as a sample dataset, I have created a Calgary database with all default settings.
A great use case for the Calgary Input is to query or sub-query a specific field. In this example, I have specified offences involving bicycles and value as not NULL as my query and my output has populated accordingly. While there are 5 dropdown options for query searches, its behaviour depends on whether the field is a string, date or numeric field. A string field would work better with a list setting and a numeric field with a range setting.
Asides from loading quickly, another core strength of the Calgary Input is its built-in count output functionality. Rather than using a formula tool and summarise tool to isolate and count the number of records matching a certain query (e.g. Offences involving bicycles), we can directly run a query in the Calgary Input and check the ‘Output Count Only’ option. This would return the count of records matching the query.
The Calgary Join tool is a specialised join tool for joining an input to a Calgary database, whether that be a Calgary input or a regular input. Note, this tool is an extension of the Calgary Input tool by allowing dynamic inputs to be queries. I could, continuing the NSW Fines example, use a join on a set of offence codes to bring in corresponding offence records from Calgary, giving me all the data on offences involving bicycles.