What is classification?

In machine learning, classification models play a major role in data analytics. A classification model learns from observed, labeled examples and uses what it has learned to assign new observations to a category. For example, given a set of measurements about an item, the model can predict whether it is a fruit or a vegetable.

What are the different models?

Decision tree

A decision-tree classification algorithm trains a model by repeatedly splitting the data into branches and assigning class probabilities to the resulting leaves. It performs best when the association between the features and the target classes is nonlinear. The method is computationally efficient but prone to overfitting. A common use case for the decision tree is churn analysis.
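
To make the branching idea concrete, here is a minimal sketch of how a single split might be chosen. It assumes Gini impurity as the split criterion (one common choice, not necessarily what any particular library uses by default), and the function names and the toy fruit/vegetable data are hypothetical, made up for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 for a pure group, larger for a mixed group."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) of the best binary split."""
    best = None
    n = len(rows)
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue  # a split must send data down both branches
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (feature, threshold, score)
    return best


def leaf_probabilities(labels):
    """Class probabilities a leaf would assign, taken from its training labels."""
    n = len(labels)
    return {label: count / n for label, count in Counter(labels).items()}


# Hypothetical toy data: [sweetness, crunchiness], each on a 1-10 scale.
rows = [[8, 3], [9, 2], [7, 4], [2, 8], [3, 9], [1, 7]]
labels = ["fruit", "fruit", "fruit", "vegetable", "vegetable", "vegetable"]

feature, threshold, impurity = best_split(rows, labels)
left_labels = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
right_labels = [y for row, y in zip(rows, labels) if row[feature] > threshold]
print("split on feature", feature, "at", threshold, "impurity", impurity)
print("left leaf:", leaf_probabilities(left_labels))
print("right leaf:", leaf_probabilities(right_labels))
```

A full tree would apply the same search recursively to each branch until a stopping rule is met. Growing the tree until every leaf is pure is one way the overfitting mentioned above can arise.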

Pros
  • Easy to interpret.
  • Built-in feature selection.
Cons
  • Favors stronger features, ignoring more subtle features.
  • Performs poorly on classification problems where the target classes are imbalanced.

Random forest

A random-forest classification algorithm trains a model by combining the results of an ensemble of randomly generated decision trees. It performs best when the association between the features and the target classes is nonlinear. The ensemble method helps avoid problems of overfitting and underfitting but is computationally expensive. Common use cases include direct-marketing campaign response, customer contract renewal, sales lead scoring, loan default risk, and product/alternative choice.
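
The ensemble idea can be sketched in a few lines. The example below is a deliberate simplification: each "tree" is only a one-split stump trained on a bootstrap sample and a single randomly chosen feature, whereas a real random forest grows deeper trees and samples features at every split. The function names, parameters, and toy data are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a group of labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def fit_stump(rows, labels, feature):
    """Fit a one-split tree on one feature and return a predict function."""
    majority = Counter(labels).most_common(1)[0][0]
    best = None
    n = len(rows)
    for threshold in sorted({row[feature] for row in rows}):
        left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
        right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
        if not left or not right:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if best is None or score < best[0]:
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            best = (score, threshold, left_label, right_label)
    if best is None:
        # The sample had a single value for this feature: predict the majority.
        return lambda row: majority
    _, threshold, left_label, right_label = best
    return lambda row: left_label if row[feature] <= threshold else right_label


def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and a random feature."""
    rng = random.Random(seed)
    n = len(rows)
    trees = []
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]  # draw with replacement
        feature = rng.randrange(len(rows[0]))
        trees.append(
            fit_stump([rows[i] for i in sample], [labels[i] for i in sample], feature)
        )
    return trees


def predict_forest(trees, row):
    """Majority vote across all trees."""
    votes = Counter(tree(row) for tree in trees)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: [sweetness, crunchiness], each on a 1-10 scale.
rows = [[8, 3], [9, 2], [7, 4], [2, 8], [3, 9], [1, 7]]
labels = ["fruit", "fruit", "fruit", "vegetable", "vegetable", "vegetable"]

forest = fit_forest(rows, labels)
print(predict_forest(forest, [9, 2]))
print(predict_forest(forest, [1, 9]))
```

Because every tree sees a different sample and a different feature, an individual tree's mistake tends to be outvoted by the others. The cost is that many trees must be trained and queried instead of one.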

Pros
  • Better than a single decision tree at handling imbalanced targets.
  • Better than a single decision tree at capturing the effects of subtle features.
Cons
  • Results are more difficult to interpret.
  • Training takes longer than for a single decision tree.

XGBoost

XGBoost classification is an ensemble method that builds many decision trees to model the association between features and a target. Because the algorithm uses boosting, a method in which each new decision tree corrects the errors of the trees before it, it is less susceptible to overfitting and underfitting. It is useful when you train the classifier on many different features. Common use cases are customer contract renewal, sales lead scoring, loan default risk, and product/alternative choice.
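
The way boosted trees build on each other can be illustrated with a small sketch. This is not XGBoost itself: it is a simplified gradient-boosting loop with squared-error loss and one-split regression stumps, without the regularization and other refinements the real library adds. All names, the `learning_rate` parameter, and the toy data are assumptions made for illustration.

```python
def fit_stump(xs, residuals):
    """Best one-split regression stump on a 1-D feature: (threshold, left_mean, right_mean).

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the largest so the right side is non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1], best[2], best[3]


def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosted(xs, ys, n_rounds=20, learning_rate=0.5):
    """Each round fits a stump to the residual errors of the ensemble so far."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps, errors = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x) for p, x in zip(preds, xs)]
        errors.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return {"base": base, "stumps": stumps, "learning_rate": learning_rate, "errors": errors}


def predict_class(model, x):
    """Sum the base score and all scaled stump outputs, then threshold at 0.5."""
    score = model["base"] + model["learning_rate"] * sum(
        stump_predict(stump, x) for stump in model["stumps"]
    )
    return 1 if score >= 0.5 else 0


# Hypothetical toy data: class 1 only in the middle of the range,
# which a single split cannot separate.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 1, 1, 0, 0, 0]

model = fit_boosted(xs, ys)
print("training error per round:", [round(e, 4) for e in model["errors"]])
print("predictions:", [predict_class(model, x) for x in xs])
```

On this toy data a single stump cannot isolate the middle band, but a second stump fitted to the first one's residuals supplies the missing boundary. That is the sense in which the trees improve each other.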

Pros
  • Models nonlinear associations.
  • Is less prone to overfitting and underfitting (even compared to random forest).
Cons
  • Only approximates linear associations.
  • Is computationally expensive (even compared to random forest).

 

Author: Jason Lu