In this article, I demonstrate a use case where text analytics can be applied to project cost control. I built an Alteryx workflow that groups project activities by their descriptions: the descriptions are vectorized using the TF-IDF (term frequency-inverse document frequency) statistical measure and then clustered using the K-Means algorithm.
Why I believe text analytics can be a powerful asset for cost control
I first came into close contact with the project world, and more specifically the cost control function, when I was working as a project accountant for a major LNG project on the eastern seaboard. Projects were broken down into smaller deliverables or work packages, creating a hierarchical structure called a work breakdown structure (WBS). The lowest level is an activity or task. Even with a sophisticated project controls system, the volume of project activity line items can be overwhelming for project managers and cost controllers.
When similar project activities across different projects and business streams are not linked in the project system via the WBS or any other coding mechanism, it is not feasible to group thousands of project activity line items manually. This is where text analytics comes to the rescue, revealing cost-related insights that can in turn be used to improve the cost profile of projects.
This article is not a comprehensive guide but more of a teaser. The approach that I have taken is relatively simple compared to the complex nature of project structures in practice. Nevertheless, the key principles remain valid, and they are my focus in this article.
The Workflow (Part 1)
Steps 1 and 2 are fairly self-explanatory, so I will jump straight to step 3. I selected a sample of 50 records using the ‘Random Sample’ tool so that I could test and iterate on my Alteryx workflow and Python code quickly. In part 2 of this article, I will show the results over the full dataset.
The Python Code
Below is the code that I have used to tokenize and vectorize (using TF-IDF) and cluster using K-Means:
Below I explain my Python workflow. Let’s jump straight to step 4 where I explain the vectorization process.
Step 4 of Python Code: Vectorization
The text needs to be converted into numbers in a meaningful way before machine learning can be applied to it. The process of encoding words as numbers is called vectorization. There are many methods for vectorizing. In this example, I have used the TF-IDF vectorizer because it is easy to use and fits the use case.
How the TF-IDF vectorizer works
First, it counts how often each token occurs in each document of the corpus (here, each project description); this is the TF (term frequency) part. In the IDF (inverse document frequency) part, the vectorizer down-weights tokens (words) that occur in the majority of documents. As a result, words that are common across the corpus are penalised, allowing real keywords (features) to be identified. Without the IDF part, the word ‘project’, for instance, would be identified as a keyword based on frequency alone, when in fact it is not useful for grouping projects.
Learn more about TF-IDF here.
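As a rough illustration of the weighting described above, here is a minimal pure-Python sketch. The three tokenized descriptions are made up, and the smoothed IDF formula and L2 normalisation are assumptions that mirror the defaults of scikit-learn's TfidfVectorizer; the code used in the workflow may be configured differently.

```python
import math
from collections import Counter

# Hypothetical activity descriptions, already lowercased and tokenized.
docs = [
    ["project", "pipe", "welding"],
    ["project", "pipe", "insulation"],
    ["project", "scaffolding", "hire"],
]
n_docs = len(docs)

# Document frequency: in how many descriptions does each token appear?
doc_freq = Counter(token for doc in docs for token in set(doc))

def tfidf(doc):
    """Return {token: weight} for one tokenized description."""
    term_freq = Counter(doc)
    # Assumed smoothed IDF: ln((1 + n) / (1 + df)) + 1
    weights = {
        token: count * (math.log((1 + n_docs) / (1 + doc_freq[token])) + 1)
        for token, count in term_freq.items()
    }
    # L2-normalise so that long and short descriptions are comparable.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {token: w / norm for token, w in weights.items()}

first_doc_weights = tfidf(docs[0])
```

In this toy corpus ‘project’ appears in every description, so it receives the lowest weight in the first description, while ‘welding’, which appears only once in the corpus, receives the highest.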
What is tokenization
Before vectorizing, we need to tokenize the entire corpus, in our case the list of project descriptions in the dataset. Tokenization is a process whereby a string (e.g. a project description) is broken down into tokens, i.e. words. After tokenization, some pre-processing is needed before the data is fed into a machine learning model. In the code above, I removed stop words because they tend to interfere with the keyword extraction process. Stop words are words that are common in any English text, such as “here”, “there”, “to” and “from”. Depending on the corpus, it is also a good idea to remove additional words that may be falsely identified as keywords and therefore generate poor results.
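The tokenization and stop-word removal just described can be sketched in a few lines of plain Python. This is only an illustration: the stop-word list is deliberately tiny, the extra corpus-specific word and the sample description are hypothetical, and the function name tokenize is mine rather than the one used in the workflow.

```python
import re

# A small illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"here", "there", "to", "from", "the", "and", "of", "for", "a", "in"}

# Hypothetical corpus-specific words that would otherwise look like keywords.
EXTRA_STOP_WORDS = {"project"}

def tokenize(description):
    """Lowercase a description, split it into word tokens and drop stop words."""
    tokens = re.findall(r"[a-z]+", description.lower())
    ignored = STOP_WORDS | EXTRA_STOP_WORDS
    return [token for token in tokens if token not in ignored]

tokens = tokenize("Supply and install of pipe insulation for the LNG project")
# tokens is now ['supply', 'install', 'pipe', 'insulation', 'lng']
```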
Step 6: K-Means Clustering
In this model, observations are grouped into K clusters, with each observation assigned to the cluster whose mean (centroid) is nearest. I used the vectorized features generated in step 5 and applied the K-Means algorithm to group the data into 3 clusters. For now, I have used an arbitrary number, but in part 2 of this article I will take a more methodical approach to calculating the optimum K value.
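To make the assign-then-update loop behind K-Means concrete, here is a minimal pure-Python sketch. It is not the code from the workflow: the 2-D vectors are made-up stand-ins for TF-IDF features, and the deterministic farthest-point initialisation is an assumption chosen to keep the example reproducible (libraries typically use randomised initialisation).

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, n_iter=100):
    """Group points into k clusters; returns one cluster label per point."""
    # Assumed initialisation: start from the first point, then repeatedly
    # add the point that is farthest from the centroids chosen so far.
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(farthest))

    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical 2-D feature vectors forming three loose groups.
vectors = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0], [9.0, 0.1], [9.1, 0.0]]
labels = kmeans(vectors, k=3)
```

Each pair of neighbouring vectors ends up with the same label. Real TF-IDF vectors have one dimension per token, but the loop is the same.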
Step 7 and beyond – Analysis
After adding the cluster results to the dataset, I can write the output directly to my local drive using the to_csv() method in pandas, without needing an output tool.
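A minimal sketch of this step, assuming pandas is installed. The column names, the sample rows, the cluster labels and the file name clustered_activities.csv are all hypothetical placeholders rather than the ones used in the workflow.

```python
import pandas as pd

# Hypothetical data standing in for the records passed in from Alteryx.
activities = pd.DataFrame({
    "description": ["pipe welding", "pipe insulation", "scaffolding hire"],
    "unit_cost": [120.0, 95.0, 40.0],
})

# Hypothetical cluster labels, one per row, as produced by the clustering step.
cluster_labels = [0, 0, 1]

# Attach the labels as a new column and write the result to disk.
activities["cluster"] = cluster_labels
activities.to_csv("clustered_activities.csv", index=False)
```

Passing index=False leaves the pandas row index out of the file, which keeps the CSV limited to the original columns plus the cluster label.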
Insights (Part 1)
I will only go through the insights from the sample. Make sure to check out part 2 (coming soon), where we will take a look at the results from the full dataset.
To allow for comparison across project activities, I have used the unit cost on the y-axis, while the x-axis has been jittered to prevent overlapping data points. The data points you see below belong to one of the clusters that has done a great job of grouping project activities. Even so, subject matter expert input is required for final validation.
But let’s assume for now that these project activities belong in the same cluster. The chart below provides quick insights into the cost profile of similar project activities. Half of the project activities in this cluster fall outside ±1.25 times the median value. With this insight, we could investigate the data points falling outside this range, possibly giving us the opportunity to renegotiate rates and optimise our cost profile for future activities.
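Flagging the activities that fall outside such a band takes only a few lines. The sketch below is one possible reading of the range, assuming it runs from the median divided by 1.25 to the median multiplied by 1.25. The unit costs are made up and are not taken from the dataset.

```python
from statistics import median

# Hypothetical unit costs for the activities in one cluster.
unit_costs = [80.0, 95.0, 100.0, 105.0, 130.0, 60.0]

FACTOR = 1.25  # band factor from the text; how it is applied here is an assumption

cluster_median = median(unit_costs)
lower = cluster_median / FACTOR
upper = cluster_median * FACTOR

# Activities priced outside the band are candidates for a closer look.
to_investigate = [cost for cost in unit_costs if cost < lower or cost > upper]
```

With these numbers the median is 97.5, so the band runs from 78.0 to 121.875 and the activities costing 130.0 and 60.0 are flagged.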
In part 2, I will be sharing my results on the full dataset. A few more additions to my workflow will be:
- Add lemmatization in the pre-processing pipeline for more accurate results.
- Topic modelling using the LDA (Latent Dirichlet Allocation) algorithm.
- Compare results from topic modelling with TF-IDF and find out which one is better at labelling project activity descriptions.