Machine learning may seem intimidating at first, but with the help of modern technology and an abundance of information available online, anyone can learn this exciting field of computer science!

In essence, machine learning is the process of teaching computers to learn and make decisions based on data analysis, not unlike how humans learn. To illustrate, let’s say you want to teach your computer to recognize images of animals. You can start by showing the computer various pictures of different animals, each labelled with the animal it shows. From this information, the computer can identify patterns in the images and learn how to identify each animal. Once the computer has learned to recognize the animals, you can put it to the test by giving it a new image and asking it to identify which animal appears in the picture. The computer will analyse the new image based on its previous learning and make an informed guess about which animal is present. The applications of machine learning are vast and varied. One of them is detecting fraudulent activity in financial transactions, which we will explore in this blog post.

Let’s look at a dataset from Kaggle that can be used for building a fraud detection machine learning model (https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud). As with every model, we first have to give the machine a training dataset to learn from, just like in the animal pictures example. In this case, the machine will look at fraudulent transactions and standard transactions and learn how to tell them apart! However, the quality and the structure of the data play a significant role in the effectiveness of our machine learning model. The Kaggle data is heavily skewed towards non-fraud transactions: according to the dataset description, only 492 of its 284,807 transactions (about 0.17%) are fraud.

The problem lies with the “laziness” of the model. Because our dataset is so “imbalanced”, the model can skip learning what fraud looks like and simply say that every new record in our test dataset is not fraud. It would still be more than 99% accurate, because more than 99% of the transactions really are not fraudulent, yet it would not catch a single fraud. In machine learning, and particularly in classification, we use a confusion matrix to see what the predictions look like: it counts how many records of each actual class were assigned to each predicted class.
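To make the “lazy model” problem concrete, here is a minimal sketch in plain Python. The labels are made up for illustration (990 normal records and 10 fraud records, not the Kaggle data), and `confusion_matrix` is a hypothetical helper written for this post rather than a library function.

```python
def confusion_matrix(actual, predicted):
    """Count outcomes for a binary problem where 1 = fraud and 0 = not fraud."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["tp"] += 1  # fraud correctly flagged
        elif a == 0 and p == 1:
            counts["fp"] += 1  # normal transaction wrongly flagged
        elif a == 0 and p == 0:
            counts["tn"] += 1  # normal transaction correctly passed
        else:
            counts["fn"] += 1  # fraud that slipped through
    return counts


# Illustrative, made-up labels: 990 normal transactions and 10 fraudulent ones.
actual = [0] * 990 + [1] * 10

# The "lazy" model: it predicts "not fraud" for every record.
lazy_predictions = [0] * len(actual)

cm = confusion_matrix(actual, lazy_predictions)
accuracy = (cm["tp"] + cm["tn"]) / len(actual)
recall = cm["tp"] / (cm["tp"] + cm["fn"])  # share of real fraud that was caught

print(cm)
print(f"accuracy = {accuracy:.1%}, fraud recall = {recall:.1%}")
```

The accuracy looks excellent, but the confusion matrix shows that every fraud record lands in the "fn" cell, which is why accuracy alone is a poor measure on imbalanced data.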

There are a few methods that we can use to fix an imbalanced dataset.

Undersampling. This means you try to fix the imbalance by removing some of the instances from the majority class (non-fraud transactions) so that the two classes are more balanced.

Oversampling. This means you try to fix the imbalance by adding more instances of the minority class (fraud transactions) to the dataset so that the two classes are more balanced. This can be done by replicating existing fraud records or by generating synthetic data with SMOTE (Synthetic Minority Over-sampling Technique). In simple words, instead of copying existing observations from the minority class, SMOTE overcomes the imbalance by generating artificial ones.
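The resampling ideas above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names (`undersample`, `random_oversample`, `smote_like`) and the toy data are invented for this post, and `smote_like` is a simplified take on the SMOTE idea (interpolating between a minority record and one of its nearest minority neighbours), not the reference implementation.

```python
import random


def undersample(labels, majority_label=0, seed=42):
    """Return row indices: all minority rows plus an equal-sized random
    subset of the majority rows."""
    rng = random.Random(seed)
    majority = [i for i, y in enumerate(labels) if y == majority_label]
    minority = [i for i, y in enumerate(labels) if y != majority_label]
    kept = rng.sample(majority, k=min(len(majority), len(minority)))
    return sorted(kept + minority)


def random_oversample(minority_rows, target_size, seed=42):
    """Grow the minority class to target_size by copying random existing rows."""
    rng = random.Random(seed)
    n_extra = max(0, target_size - len(minority_rows))
    copies = [list(rng.choice(minority_rows)) for _ in range(n_extra)]
    return [list(row) for row in minority_rows] + copies


def smote_like(minority_rows, n_new, k=3, seed=42):
    """Create n_new artificial rows, each placed at a random point on the line
    between a minority row and one of its k nearest minority neighbours.
    Needs at least two minority rows."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base_idx = rng.randrange(len(minority_rows))
        base = minority_rows[base_idx]
        # Sort the other minority rows by squared Euclidean distance to base.
        others = [r for i, r in enumerate(minority_rows) if i != base_idx]
        others.sort(key=lambda r: sum((a - b) ** 2 for a, b in zip(base, r)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # how far to move from base towards the neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Illustrative, made-up data: 990 normal labels and 10 fraud labels.
labels = [0] * 990 + [1] * 10
kept_indices = undersample(labels)
balanced_labels = [labels[i] for i in kept_indices]

# Four made-up fraud records with two numeric features each.
fraud_rows = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
replicated = random_oversample(fraud_rows, target_size=10)
synthetic = smote_like(fraud_rows, n_new=6)

print(balanced_labels.count(0), balanced_labels.count(1))
print(len(replicated), len(synthetic))
```

In practice, resampling is normally applied to the training data only, so that the test set still reflects the real class balance. For real projects in Python, the imbalanced-learn library provides a tested SMOTE implementation (`imblearn.over_sampling.SMOTE`), which is a safer choice than a hand-rolled version like this one.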

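Both undersampling and oversampling have their advantages and disadvantages: undersampling throws away potentially useful records from the majority class, while oversampling by replication can encourage the model to memorise the few minority records it keeps seeing. The choice between them depends on the specific dataset and machine learning problem. The goal is to create a more balanced dataset that can be used to train a machine learning model that will not be biased towards one of the classes.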

Luckily, Alteryx can apply undersampling or oversampling based on the balance of the data and its size. Although SMOTE is not included in the ML tools of Alteryx, there are community macros that allow you to use this method in Alteryx.

Author: Veronika Varaksina

Meet Veronika, a dynamic and adaptable individual with a diverse background in economics, accounting, finance, and data analytics. Veronika pursued a Bachelor’s degree in Economics and gained valuable experience in financial analysis, budgeting, and forecasting while working for five years in accounting and finance. However, she soon realized her passion for data analytics and decided to pursue a postgraduate degree in Analytics at Victoria University. Throughout her academic journey, Veronika honed her skills in data visualization, statistical modeling, and machine learning. Her expertise earned her a spot in the highly competitive Data School program, where she continues to expand her skills in data analysis.