What is CRISP-DM?

CRISP-DM stands for Cross-Industry Standard Process for Data Mining. It has long been a standard process model for data mining, and it appears to hold up for data projects in general, and even for data science projects, albeit with some imperfections.

Using process ad nauseam?

Many of us who are working in data analytics, or just starting out, seem to abhor long-winded processes that we are required to follow. This aversion is akin to the way developers avoid documentation and process. Even so, the evidence suggests that process and documentation are necessary to deliver great products. At Data School Down Under, as part of our work, we deliver client projects time-boxed to a week. I would like to share some examples from those projects that map onto this methodology.

What does it contain?

CRISP-DM is organised into four levels: phases, generic tasks, specialized tasks and process instances. Phases and generic tasks are the more abstract levels, whereas specialized tasks and process instances relate to the specific project being implemented.
Its documentation has two major parts. The first is the Reference Model, which describes the generic phases and tasks. The second is the User Guide, which provides checklists, questionnaires, tools and techniques that help in the actual project work.
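The four levels can be pictured as a nested structure. The following is a minimal sketch in Python and is not part of CRISP-DM itself: the class names, the helper count_instances and the example entries are hypothetical, chosen only to show how a phase breaks down into a process instance.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SpecializedTask:
    # How a generic task is carried out in one particular situation.
    description: str
    # Process instances: records of what was actually done in a project.
    process_instances: list[str] = field(default_factory=list)


@dataclass
class GenericTask:
    name: str
    specialized_tasks: list[SpecializedTask] = field(default_factory=list)


@dataclass
class Phase:
    name: str
    generic_tasks: list[GenericTask] = field(default_factory=list)


# Hypothetical example: one phase traced down to a single process instance.
data_preparation = Phase(
    name="Data preparation",
    generic_tasks=[
        GenericTask(
            name="Clean data",
            specialized_tasks=[
                SpecializedTask(
                    description="Handle missing values in the customer table",
                    process_instances=["Dropped rows with a null customer_id"],
                )
            ],
        )
    ],
)


def count_instances(phase: Phase) -> int:
    """Count the process instances recorded under a phase."""
    return sum(
        len(task.process_instances)
        for generic in phase.generic_tasks
        for task in generic.specialized_tasks
    )
```

The two upper levels stay the same from project to project, while the two lower levels are filled in afresh for each engagement.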

Major phases of CRISP-DM

I would like to describe the phases of CRISP-DM with some commentary from client projects that my cohort has executed at the Data School.

Business understanding: Determine, in technical terms, what type of problem we are solving (for example, data classification). Understand the success criteria of the project as defined by the customer. A project plan is a requirement. During our client project weeks, we have used client interviews, questionnaires and an initial review of documentation and data to understand the business problem and to determine the success criteria for the project.

Data understanding: This phase covers data extraction and data exploration, including statistical analysis, to understand the data. In a client project context, understanding or building a partial data model, extracting the data into a tool like Alteryx and using its data profiling and data investigation tools are a few ways to achieve a good understanding of the data. One can also quickly plot some visualisations in Tableau to understand the data better.
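Where a dedicated profiling tool is not at hand, a first pass at data understanding can be approximated in a few lines. This is a minimal sketch in plain Python, not a substitute for the Alteryx or Tableau tooling mentioned above. The function profile_column, the column names and the sample rows are all made up for illustration.

```python
import statistics


def profile_column(rows, column):
    """Summarise one column: row count, missing values, distinct values, and
    basic statistics when every non-missing value is numeric."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v not in (None, "")]
    summary = {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
    }
    if present and all(isinstance(v, (int, float)) for v in present):
        summary["min"] = min(present)
        summary["max"] = max(present)
        summary["mean"] = statistics.mean(present)
    return summary


# Made-up sample data for illustration only.
orders = [
    {"order_id": 1, "region": "NSW", "amount": 120.0},
    {"order_id": 2, "region": "VIC", "amount": None},
    {"order_id": 3, "region": "NSW", "amount": 80.0},
]

amount_profile = profile_column(orders, "amount")
region_profile = profile_column(orders, "region")
```

Counting missing and distinct values per column is often enough to raise the first questions for the client, such as why an amount is blank.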

Data preparation: It is important to define inclusion and exclusion criteria as part of data preparation. This step needs discussion with the client to avoid erroneous inferences or results. Any derived attribute, based on the model or on our understanding of the data, also needs to be discussed. In data cleaning tasks, understand which transformations (removing nulls, changing data types, joins that exclude rows) can exclude information. These exclusions need to be ratified as well.
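One way to make exclusions visible, so that they can be discussed and ratified, is to log how many rows each preparation step removes. The sketch below assumes the steps can be expressed as simple row filters. The function apply_steps, the exclusion rules and the sample data are hypothetical.

```python
def apply_steps(rows, steps):
    """Apply each (name, keep_fn) filter in order and log how many rows it drops."""
    log = []
    for name, keep in steps:
        before = len(rows)
        rows = [row for row in rows if keep(row)]
        log.append(
            {"step": name, "rows_in": before, "rows_dropped": before - len(rows)}
        )
    return rows, log


# Made-up sample data and hypothetical inclusion/exclusion rules.
customers = [
    {"id": 1, "age": 34, "country": "AU"},
    {"id": 2, "age": None, "country": "AU"},
    {"id": 3, "age": 51, "country": "NZ"},
    {"id": 4, "age": 17, "country": "AU"},
]

steps = [
    ("remove null age", lambda r: r["age"] is not None),
    ("include adults only", lambda r: r["age"] >= 18),
    ("include AU only", lambda r: r["country"] == "AU"),
]

prepared, audit_log = apply_steps(customers, steps)
for entry in audit_log:
    print(entry)
```

The order of the steps matters here: the null filter runs first so that the age comparison never sees a missing value. The resulting log gives the client a concrete list of what was excluded and why.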

Modelling: The modelling phase consists of selecting the modelling technique, designing the test and building the model. The details are specific to the type of problem. As an example, in predictive analytics projects, this means building the model, creating the test set and deciding on evaluation criteria such as precision, accuracy or recall. Even in an analytics project, this could translate into building initial reports or visualisations and matching them against existing reports to ensure that parameters, calculations and measures are right.
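For a binary classification problem, the three evaluation criteria named above can be computed directly from the counts of correct and incorrect predictions. This is a minimal sketch assuming labels are encoded as 1 (positive) and 0 (negative). The function name and the sample labels are made up, and returning 0.0 when a denominator is zero is a convention chosen here, not a fixed rule.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision and recall for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    total = tp + tn + fp + fn
    return {
        # Share of all predictions that were correct.
        "accuracy": (tp + tn) / total if total else 0.0,
        # Share of predicted positives that were truly positive.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Share of actual positives that the model found.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up test-set labels and predictions for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]

metrics = classification_metrics(y_true, y_pred)
```

The three numbers can disagree noticeably on the same predictions, which is why the criterion should be agreed with the client before the model is built.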

Evaluation: In the evaluation phase, the results are checked against the defined business objectives. In live projects, some of this is covered by client demos, which allow early validation and interpretation of the results in business terms. Modelling and evaluation are iterative.

Deployment: The user guide describes the deployment phase in general terms. The deliverable could be a final report or a software component. According to the user guide, the deployment phase consists of planning the deployment, monitoring and maintenance. In a typical project, we could apply this process when we finally build the components and documentation for handover to the client’s ops or data team.

Conclusion

CRISP-DM is a useful asset in a data analyst’s toolkit. Above, I have indicated the various areas where it can be applied. It can help not only beginner and intermediate data analysts but also experienced practitioners avoid the ever-present data rabbit holes.

 

Reference:

Christoph Schröer, Felix Kruse, Jorge Marx Gómez, A Systematic Literature Review on Applying CRISP-DM Process Model (Procedia Computer Science 181, 2021), 526–53.