This week's topic is data modelling. Feature engineering is one of the most crucial parts of building a high-performance model. In particular, most models only accept numeric data, so how do we choose the most suitable encoding method for each type of feature? Today I will use the Salary data set as an example to explain how to encode features.

The data set contains the following features:

Clearly we need to encode all the string features, including Workclass, Education, Martial_Status and so on. Let us go through them one by one.

The first encoding method is binary encoding, which is one of the most common encoding methods. If a feature contains only two values, or can be described with ‘yes’ and ‘no’, then binary encoding is an ideal choice. In this case, binary encoding is suitable for Target and Sex, as they both have only two values: <50K and >50K for Target, Male and Female for Sex. After encoding, these features are turned into the format below.
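To make the idea concrete, here is a minimal sketch in plain Python. The sample rows are made up for illustration, and the choice of which value becomes 1 and which becomes 0 is my own assumption; either direction works as long as it is applied consistently.

```python
# Hypothetical sample rows: column names follow the article, values are illustrative.
rows = [
    {"Sex": "Male", "Target": "<50K"},
    {"Sex": "Female", "Target": ">50K"},
    {"Sex": "Male", "Target": ">50K"},
]

# Assumed mappings: which value gets 1 is an arbitrary choice.
SEX_MAP = {"Male": 1, "Female": 0}
TARGET_MAP = {"<50K": 0, ">50K": 1}


def binary_encode(rows, column, mapping):
    """Return new rows with `column` replaced by its 0/1 code."""
    return [{**row, column: mapping[row[column]]} for row in rows]


binary_encoded = binary_encode(rows, "Sex", SEX_MAP)
binary_encoded = binary_encode(binary_encoded, "Target", TARGET_MAP)
print(binary_encoded)
```

Using an explicit dictionary means an unexpected value raises a KeyError instead of being silently encoded, which is usually what you want while cleaning data.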

Then let us move on to Race. The encoding method applied to this feature is label encoding: we assign a different integer to each label in sequence. In this case we use the numbers 0 to 4 to represent the five races.
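A small sketch of label encoding in plain Python might look like this. The race labels below are placeholders rather than the actual values in the data set, and numbering the labels in sorted order is an assumption on my part (scikit-learn's LabelEncoder also numbers classes in sorted order, but any fixed order would do).

```python
def fit_label_codes(values):
    """Map each distinct label to an integer 0, 1, 2, ... in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(values)))}


# Hypothetical labels, five distinct values so the codes run from 0 to 4.
races = ["White", "Black", "Asian", "White", "Other", "Indigenous"]

race_codes = fit_label_codes(races)
encoded_races = [race_codes[race] for race in races]

print(race_codes)
print(encoded_races)
```

Keeping the `race_codes` dictionary around is useful, because the same mapping must be reused when encoding new data for prediction.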

The last feature that needs to be encoded is Country, which is the trickiest one. Analysis shows that Country has many distinct labels and that there is no ordering among them. In this case it is not wise to simply encode them as sequential numbers, because many ML algorithms treat those numbers as meaningful magnitudes. To be more specific, they will treat a higher number as greater than a lower number, even though the labels have no such order.

One-Hot encoding, also known as dummy encoding, is the most suitable encoding method for this kind of feature. It uses N-bit status registers to encode N states: each state has its own register bit, and only one bit is set at any time. Take Country as an example: each label (US, Thailand, Australia and so on) will have its own feature whose value is 1 or 0, standing for yes or no.
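Here is a minimal sketch of One-Hot encoding in plain Python, assuming a short made-up list of countries. The `Country_` column prefix and the helper name are my own choices for illustration; in practice a library function such as pandas `get_dummies` or scikit-learn `OneHotEncoder` does the same job.

```python
def one_hot_encode(values, prefix):
    """Turn each value into a dict with one 0/1 column per distinct category."""
    categories = sorted(set(values))
    return [
        {f"{prefix}_{category}": int(value == category) for category in categories}
        for value in values
    ]


# Hypothetical sample of the Country feature.
countries = ["US", "Thailand", "Australia", "US"]

country_one_hot = one_hot_encode(countries, "Country")
for original, encoded_row in zip(countries, country_one_hot):
    print(original, encoded_row)
```

Each row contains exactly one 1, so no country looks "larger" than another to the model. The cost is one extra column per distinct label, which can get wide for a feature with many countries.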

In conclusion, binary encoding, label encoding and One-Hot encoding are three mainstream encoding methods. Applying each of them to suitable features can help your model reach a higher accuracy.

Author: Chen Zhang