The sheer number of algorithms and tuneable parameters in machine learning can be daunting to any ML beginner. To create an optimized model requires an understanding of how these algorithms work and the role of each parameter. I will not explain the mathematics behind each model here – it can get very complicated. Instead, I will provide an intuitive guide to some of the most important parameters in some of the most popular models, and how these affect your overall predictions.
This article examines the decision boundaries of a number of popular algorithms, tested against differently distributed data. The decision boundary of a classification algorithm is the boundary in feature space at which that algorithm changes its predicted class. In the simplest case of binary classification, it is the line on one side of which the model predicts a ‘0’ and on the other a ‘1’.
Below you will see visualizations for four popular models – a logistic regression, support vector machine, decision tree and random forest. Each chart will feature a plot of the distribution of the data (in two dimensions), coloured by its class. In this case, the class refers to the target outcome – for binary classification that is a ‘1’ or ‘0’. By shading the decision boundaries in the background we can see how the different models are actually classifying.
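The shaded backgrounds in these charts can be produced by evaluating a fitted model over a dense grid of points. A minimal sketch, assuming scikit-learn and a logistic regression as the example classifier (the article's charts may have been generated differently):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression

# Fit any classifier, then evaluate it over a dense grid of points;
# colouring the grid by the predicted class shades the decision boundary.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
model = LogisticRegression().fit(X, y)

xx, yy = np.meshgrid(
    np.linspace(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5, 200),
    np.linspace(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5, 200),
)
Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
# plt.contourf(xx, yy, Z, alpha=0.3) would then shade the background,
# with plt.scatter(X[:, 0], X[:, 1], c=y) plotting the points on top.
```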
By looking at these predictions we can see the distinct advantages and disadvantages of each model against different distributions. Each row of charts shows a new distribution of data. The first row is the simplest – a generated dataset of two normally distributed 2D blobs, where the prediction target is the blob each point belongs to. Some overlap between the blobs makes a few points difficult to classify, but most are straightforward. The second row shows a similar distribution but with some correlation between the two dimensions. The third row shows an “opposing moons” shape, where the classifier must predict which moon each point belongs to – a significantly more challenging task. The fourth row is a concentric circles distribution where the target is the circle each point belongs to, which is also challenging. The final row comes from a real dataset – the famous iris dataset, where the X and Y axes correspond to petal length and width respectively, and the prediction target is which of two particular species of flower the observation belongs to.
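Distributions like these can be generated with scikit-learn's dataset helpers. A sketch – the exact sample sizes, noise levels, and the correlation transform are assumptions, not the article's actual settings:

```python
import numpy as np
from sklearn.datasets import load_iris, make_blobs, make_circles, make_moons

seed = 42  # fixed seed so the datasets are reproducible

# Row 1: two normally distributed blobs with some overlap
X_blobs, y_blobs = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=seed)

# Row 2: the same blobs with correlation introduced via a linear transform
X_corr = X_blobs @ np.array([[1.0, 0.6], [0.6, 1.0]])
y_corr = y_blobs

# Row 3: opposing moons
X_moons, y_moons = make_moons(n_samples=200, noise=0.2, random_state=seed)

# Row 4: concentric circles
X_circles, y_circles = make_circles(n_samples=200, noise=0.1, factor=0.5, random_state=seed)

# Row 5: iris, restricted to two classes and the petal length/width columns
iris = load_iris()
mask = iris.target < 2
X_iris, y_iris = iris.data[mask][:, 2:4], iris.target[mask]
```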
Each distribution exposes the limitations of particular algorithms. The logistic regression, for example, creates a linear decision boundary in every case. This works well for some distributions – such as the first, second and fifth rows. The third and fourth rows, however, show its limitations: the linear boundary has a difficult time separating the opposing arcs of the third row, and on the concentric circles of the fourth row the logistic regression is clearly unable to fit the data at all.
Support Vector Machine
In contrast to the logistic regression, the support vector machine performs well on most of the distributions. It fits a linear boundary for rows 1, 2 and 5, and a circular or curved boundary for rows 3 and 4. This is thanks to the kernel trick – a piece of mathematics far too complicated for this blog – which implicitly projects the data into a higher-dimensional space where the classes can be separated by a linear boundary. Do not worry if that explanation does not make sense – SVMs are tricky to understand at a low level. At a high level, the most important insight is that the SVM is able to fit a smooth decision boundary to many non-linear distributions.
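The difference is easy to demonstrate on the concentric circles. A sketch assuming scikit-learn, with both models scored on their own training data (the circle radii and noise level are assumptions):

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, noise=0.1, factor=0.5, random_state=0)

# A linear boundary cannot separate concentric circles...
linear_acc = LogisticRegression().fit(X, y).score(X, y)

# ...but an SVM with a Gaussian (RBF) kernel fits a smooth circular boundary.
svm_acc = SVC(kernel="rbf").fit(X, y).score(X, y)

print(f"logistic regression: {linear_acc:.2f}, RBF SVM: {svm_acc:.2f}")
```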
Decision Tree
The decision tree seems to fit well to each distribution, but there are two important limitations that should be noted. Firstly, the decision tree can only fit decision boundaries that are orthogonal to the axes. Because of this it has trouble fitting diagonal and curved relationships. You can see this in its classification of the concentric circles distribution: even though the true distribution is circular, the decision function fits a right-angled polygon shape.
Secondly, the decision tree has a tendency to overfit. For example, its classification of the opposing moons distribution features two thin strips in its boundary – each fitted to a single training instance (a point in the chart). If the same decision tree were given another dataset drawn from the same distribution, these thin strips would almost certainly not match the new data. In this way we can see that the decision tree is overfitting.
Random Forest
The decision boundary for the random forest looks roughly like a smoothed decision tree. That is because a random forest is just an ensemble of decision trees (with some introduced randomness). These decision trees are trained on random subsets of the data, and so each learns a slightly different decision boundary. Their predictions are combined to form the random forest (this is a slight simplification, but it is a good enough intuition). Because of this, the random forest generalizes much better than a single decision tree. Although in these charts it still looks like it is overfitting, some tweaking of parameters can improve it significantly.
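The "train trees on random subsets, then combine" idea can be made concrete with a hand-rolled sketch. This is only the bagging part of the intuition – scikit-learn's actual RandomForestClassifier also randomizes which features are considered at each split:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)
rng = np.random.default_rng(0)

# Hand-rolled mini "forest": each tree sees a bootstrap sample of the data,
# and the ensemble predicts by majority vote over the trees.
trees = []
for i in range(25):
    idx = rng.integers(0, len(X), size=len(X))  # sample with replacement
    trees.append(DecisionTreeClassifier(random_state=i).fit(X[idx], y[idx]))

votes = np.mean([t.predict(X) for t in trees], axis=0)
forest_pred = (votes >= 0.5).astype(int)
```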
Now that I have explained the models’ strengths and weaknesses, I will show some ways to remedy them with parameter tuning. Hopefully this will give an intuitive understanding of how those parameters affect predictions.
Logistic Regression Parameters
As explained above, a significant weakness of the logistic regression is that it can only create linear decision boundaries. One way of remedying this is by transforming the data. Creating polynomial features, for example, can allow the logistic regression to fit non-linear relationships. As a demonstration, if we had two variables A and B, a degree-2 polynomial transform would produce the variables A, B, A², AB and B². For higher-order polynomials this expansion becomes very long. While this is not a model parameter, it is a common method of improving logistic regressions. A logistic regression on polynomial features is visualised below.
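In scikit-learn this transform-then-fit pattern is usually expressed as a pipeline. A sketch on the opposing moons data (degree and noise level are assumptions), comparing training accuracy with and without the polynomial expansion:

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

# Plain logistic regression: a linear boundary only.
plain = LogisticRegression().fit(X, y)

# A degree-3 polynomial expansion (A, B -> A, B, A², AB, B², A³, ...)
# lets the same linear model trace a curved boundary.
poly = make_pipeline(PolynomialFeatures(degree=3), LogisticRegression()).fit(X, y)

print(plain.score(X, y), poly.score(X, y))
```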
We can see from this visualisation that higher-degree polynomials are better able to represent the arc shape of the opposing moons distribution. A degree-3 polynomial fits the distribution well. However, it is often difficult to choose the right degree for the polynomial transform. As shown in the bottom right, a degree of 20 will likely overfit the distribution – that is, the decision boundary fits too specifically to the training data and will not generalize to new data from the same distribution.
The above visualisations also introduce C, a regularization parameter. Regularization is a way of combating overfitting. Here, a lower C means a more regularized model (one that fits less closely). Another way of explaining this is via the bias-variance trade-off: regularization increases bias but lowers variance, so the model overfits less (but potentially underfits). As you can see in the charts above, a good value of C helps correct the model even when a high polynomial degree is chosen.
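One way to see regularization at work, without plotting anything, is to look at the size of the learned coefficients: in scikit-learn's LogisticRegression, a lower C applies a stronger penalty and shrinks them towards zero, which smooths the boundary. A sketch (degree, noise and C values are assumptions):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_moons(n_samples=200, noise=0.3, random_state=0)

# Lower C = stronger regularization = smaller coefficients = smoother boundary.
coef_norms = {}
for C in (0.01, 1.0, 100.0):
    model = make_pipeline(
        PolynomialFeatures(degree=5),
        LogisticRegression(C=C, max_iter=5000),
    ).fit(X, y)
    coef_norms[C] = np.abs(model.named_steps["logisticregression"].coef_).sum()
```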
Support Vector Machine Parameters
The most important parameter choice for an SVM is the kernel function. The charts above used a Gaussian (RBF) kernel – the most popular choice, and the main reason why SVMs are so widely used. Again, the mathematics is too involved to explain here, but I encourage you to do more research and become acquainted with it. The other parameter that will be visualised is the SVM’s C. This is an SVM regularization parameter that works in a similar way to C for logistic regression (though the similarity is in effect only, not in mathematical implementation).
As we can see above, the Gaussian kernel is able to fit to the shape of the data. It fits particularly well with a medium amount of regularization – C=0.01 clearly underfits, whereas C=100 might be overfitting slightly. C=1 is just right! In this case, the polynomial and sigmoid kernels do not fit the shape of the data well. However, using a polynomial or sigmoid kernel introduces other parameters that aren’t explored here (such as a polynomial degree and constant coefficient), so these models could be fitted better to the data with more parameter tuning.
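A grid like the one visualised above can be reproduced by looping over kernels and C values with scikit-learn's SVC. A sketch scored on training accuracy (the data settings are assumptions):

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

# Same data, three kernels and three regularization strengths.
scores = {}
for kernel in ("rbf", "poly", "sigmoid"):
    for C in (0.01, 1.0, 100.0):
        scores[(kernel, C)] = SVC(kernel=kernel, C=C).fit(X, y).score(X, y)

for (kernel, C), acc in sorted(scores.items()):
    print(f"kernel={kernel:8s} C={C:<7} accuracy={acc:.2f}")
```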
Decision Tree Parameters
The only parameter of the decision tree that I will explore here is the maximum depth of the tree. A decision tree works by repeatedly splitting the data into leaves, each of which contains a portion of the data. The decision tree algorithm aims to make each leaf as ‘pure’ as possible – that is, containing the highest possible proportion of a single class. Different algorithms do this in different ways. The maximum depth of the tree limits the number of successive splits along any path from the root to a leaf. This is shown below (note that max_depth=None means the tree has no constraint on its depth).
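The effect is easy to check in code, assuming scikit-learn's DecisionTreeClassifier. An unconstrained tree keeps splitting until its leaves are pure, so it fits its own training set perfectly; a shallow tree cannot:

```python
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)

# max_depth caps how many successive splits the tree may make.
shallow = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
deep = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X, y)

print(shallow.get_depth(), deep.get_depth())
# The unconstrained tree fits the training set perfectly (a sign of overfitting);
# the depth-2 tree fits it less closely.
print(shallow.score(X, y), deep.score(X, y))
```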
As we can see, smaller values of maximum depth regularize the decision tree. However, the decision tree tends to underfit when it is not overfitting. No parameters can fix the limitation of its decision boundary being necessarily orthogonal to the axes. Perhaps a better way to fix this is to use a random forest!
Random Forest Parameters
Random forests possess all the parameters of decision trees and more. This is because they are ensembles of decision trees and so the decision tree parameters apply to the trees that will be combined into the random forest. Below I visualise the same max_depth parameter applied to a random forest. I also visualise the number of estimators parameter – this controls how many trees should be ensembled to make the random forest.
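In scikit-learn both kinds of parameter are set directly on the forest, which passes the tree parameters down to its member trees. A sketch (the specific values are assumptions, chosen only to contrast a small constrained forest with a large unconstrained one):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)

# max_depth regularizes every tree in the ensemble;
# n_estimators controls how many trees are combined.
small = RandomForestClassifier(n_estimators=5, max_depth=3, random_state=0).fit(X, y)
large = RandomForestClassifier(n_estimators=200, max_depth=None, random_state=0).fit(X, y)

print(len(small.estimators_), len(large.estimators_))
```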
While it is a bit more subtle than for the decision tree, the max_depth parameter also helps to regularize the whole random forest – where max_depth=None there is a small brown section in the upper-left that is likely overfitting, and when max_depth=3 this section disappears. The number of trees has a subtler effect: it does not change the shape of the boundary much, but averaging over more trees creates shapes that look slightly more “rounded”. In this way the random forest can produce a decision boundary that looks curved in places, even though it is built from a combination of orthogonal boundaries! As a note, other tree-ensemble algorithms are also very popular and produce decision functions of similar shapes – these include AdaBoost and gradient boosting, which are considered some of the most powerful machine learning algorithms.
That is all for these visualisations. I hope these charts helped give you an intuitive understanding of how machine learning models classify (or at least showed some interesting shapes).