Last week we were working with a large data set of sales information, which equated to >200 million rows of data. The first couple of hours the team was fairly stuck in the mud, when trying to explore the data in Alteryx. In order to speed up the reading and writing parts of our Alteryx workflows, we made Calgary files containing the data and used these files instead of the yxdb’s.
A lot of the information in this blog came from the Alteryx help site. There is much more further information on Calgary there, including information on Tools I do not cover here.
What is it? How does it Work?
The Calgary file translates the data into an indexed database, and records the index mapping as separate tables within the same local directory.
In the Calgary load tool, you can configure which fields are indexed and how. You can also just let Alteryx decide this automatically too (I did, and it worked well). Leaving the index type as automatic, Alteryx will assess how many unique values are in that field. If there are more than 550, it treats that field as High Selectivity, otherwise it is treated as Low Selectivity. You can read here for more information on those index types.
Note that Alteryx will only make the decision on Index type on the first 1 million rows, so it may lead to errors if there are subsequent unique values after the first million rows. Manually configuring a Calgary Load tool (correctly) will speed up the Calgary Loader.
Is it Fast?
After loading to a .cydb file, you’ll notice that reading these large data sets can be read vastly quicker in Alteryx. Here’s some results of my comparison between yxdb, a csv and a cydb
In the above screenshot you can see the same data containing just less than 25,000 rows, stored in 3 different formats. In the results pane you can see the csv took the longest to load in (no surprises) in at 142.44 ms, the yxdb took 36.41 ms and the Calgary file with 21.92 ms. The gap in run time widens even further with larger files.
An interesting thing I learnt about Calgary is that it has a size limit, so any bigger than roughly 2 billion rows will cause the loads to fail. Alteryx describes the best use of Calgary to have a ‘sweet spot’ data size of 100 – 300 million.
In summary I think it is good to explore different options of file formats, depending on your needs. For us this week, Calgary was an easy solution to our load time problem.