During our training at The Data School, we have done a variety of data analysis and visualisation exercises, trying to solve a range of practical business problems. However, the task that perhaps challenged us the most was using predictive modelling to solve an HR problem. Nevertheless, the work was exciting, and we were all set to give it our best. Our task was to predict salaries for a given set of employees based on their demographic data, such as age, education, marital status, occupation and sex. We were given two files: the first was the training dataset, which included salary data alongside all the other attributes, while the second had exactly the same attribute structure but did not contain any salary data. The values for the salary attribute were to be predicted using predictive analytics algorithms.
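As a rough sketch of that setup, loaded with pandas: the column names and rows below are made up for illustration, and the salary bands are an assumption based on our use of classification models and accuracy later on.

```python
from io import StringIO
import pandas as pd

# Toy stand-ins for the two files we were given; the column names,
# values and salary bands here are all hypothetical.
train_csv = """age,education,marital_status,occupation,sex,salary
39,Bachelors,Married,Sales,Male,<=50K
50,Masters,Single,Exec-managerial,Female,>50K
"""
test_csv = """age,education,marital_status,occupation,sex
28,HS-grad,Single,Craft-repair,Male
"""

train = pd.read_csv(StringIO(train_csv))
test = pd.read_csv(StringIO(test_csv))

# The second file has exactly the same attribute structure, minus the target.
assert list(test.columns) == [c for c in train.columns if c != "salary"]
```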
We started the work by taking a detailed look at the training data and doing some preliminary analysis. We then cleaned up the data and performed statistical analysis to identify missing values and the means, modes, medians, standard deviations, variances, etc. of some of the numerical attributes. We also tried to identify patterns and basic correlations between two or more data attributes and their impact on the salary. After that, we normalised the data and replaced missing values with appropriate alternatives. We also performed binarisation, also known as one-hot encoding, on a few attributes to prepare the data. To make sure the two files stayed absolutely aligned, we applied exactly the same transformations to the testing dataset.
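The preprocessing steps above can be sketched roughly like this with pandas, using toy data and hypothetical column names. The key point is that fill values and normalisation ranges are computed from the training data only, and the one-hot columns are aligned across both files:

```python
import pandas as pd

# Toy frames standing in for the real files (columns are hypothetical).
train = pd.DataFrame({
    "age": [39, 50, 28, None],
    "occupation": ["Sales", "Exec", None, "Craft"],
    "salary": ["<=50K", ">50K", "<=50K", ">50K"],
})
test = pd.DataFrame({"age": [45, None], "occupation": ["Sales", "Exec"]})

# Replace missing values: median for numeric, mode for categorical,
# using statistics computed on the *training* data only.
age_median = train["age"].median()
occ_mode = train["occupation"].mode()[0]
for df in (train, test):
    df["age"] = df["age"].fillna(age_median)
    df["occupation"] = df["occupation"].fillna(occ_mode)

# Min-max normalise the numeric attribute with the training range.
lo, hi = train["age"].min(), train["age"].max()
for df in (train, test):
    df["age"] = (df["age"] - lo) / (hi - lo)

# One-hot encode ("binarise") the categorical attribute, then align the
# test columns to the training columns so the two files stay in sync.
train_X = pd.get_dummies(train.drop(columns="salary"))
test_X = pd.get_dummies(test).reindex(columns=train_X.columns, fill_value=0)
```

The `reindex(..., fill_value=0)` step is what keeps the two files structurally identical even when a category appears in one file but not the other.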
The fun began once the data was ready for further analysis. We applied several types of machine learning models to the training dataset to get the prediction going. We wanted to utilise multiple methods and then select a model based on comparative analysis to obtain the best outcome possible. A few of the models we used were Random Forest, Boosted Trees, Decision Tree and Logistic Regression. We ran multiple simulations of the models, selecting various input parameters and also filtering out columns or attributes that we felt had a low correlation with the salary attribute being predicted. After multiple iterations we finally found that the Random Forest model provided the best outcome, with an approximate accuracy level of 87%, which was not bad after all the effort we put in. We applied the model to the test dataset and obtained the missing salary values to complete the task.
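A minimal sketch of that model bake-off, using scikit-learn with a small synthetic dataset in place of the real HR data; the model choices mirror the ones we tried, but the parameters here are illustrative, not the ones we actually tuned:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the preprocessed training data.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Candidate models, compared on held-out accuracy.
models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Boosted Trees": GradientBoostingClassifier(random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_val, model.predict(X_val))

best = max(scores, key=scores.get)
print(f"Best model: {best} ({scores[best]:.3f})")
```

In practice we repeated this loop many times with different parameter settings and attribute subsets before settling on the winning model.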
The learning through the exercise was incredible, although we knew that the knowledge we gained was just a drop in the ocean. One key thing we learnt while using machine learning for predictive modelling is that the selection of the model is as important as the parameter settings. Also, there is no set formula or one-size-fits-all approach that works for every problem and situation. The solution, model and approach will depend on the problem, the context and the data available to solve it. Outcomes will also depend heavily on the preprocessing and fine-tuning of the models. There is also no doubt that understanding the fundamentals of how different models work, and which model might be suitable for which classification or regression task, is quite important to applying machine learning to predict outcomes!