
This blog demonstrates how to enrich your analysis and answer your business questions with web scraping, that is, the extraction of data from websites. Traditional databases often do not contain all the data you need to make an informed decision, and that is where web scraping comes in handy: since the internet is so abundant with data, it is a great method for finding information that does not exist in your local database.

As a case study, we will try to answer the question “What Tools Need to Be Learned to Become a Data Analyst?”. We currently have no data on this topic at all, so we are going to answer this question using web scraping! First, we will discuss business understanding and data understanding, and why it is essential to know the structure of the data you are trying to extract. Second, we will demonstrate how to collect and extract web data in Alteryx. Third, we will show the steps necessary to turn raw data into usable information using tools in Alteryx. Finally, we will visualize our results in Tableau to gain insights and answer our question.


Content

  1. Business Understanding and Data Understanding
  2. Data Collection
  3. Data Preparation and Outputting Data 
  4. Visualizing Results in Tableau


1. Business Understanding and Data Understanding

1.1 What are we doing?

Before we start the data collection process, it is very important to understand the business question and goal. Remember, our question was:

What Tools Need to Be Learned to Become a Data Analyst?

What is this question really about? What data or information do we need to be able to answer this question?

Of course, there are many different approaches to answering this question. In this blog, we share a specific thinking process based on the following assumptions:

  1. Data Analyst is a well-defined job role that organizations actively hire for.
  2. Organizations post job ads that describe the skills or tools needed to be hired as a Data Analyst.
  3. The skills or tools can be captured by particular keywords or key phrases, such as “Tableau”.

Therefore, the goal is to:

Find Data Analyst job ads, extract keywords regarding skills or tools, and compute the relevant statistics to answer our question.


1.2 How do we do it?

In order to achieve the goal, we will need to understand where the data is located and what its structure looks like.

In this blog, we are going to use job search websites as our data source. If we go to any job search website, we are likely to see the following structure:

  1. On each page, there will be multiple (usually around 10–20) job ads from different companies. If we want to see more ads, we will need to move on to the next page.
  2. Each job ad here is just a very short summary containing the Job Title, the Hiring Company, the Location and maybe some key descriptions. It is not likely to contain the full list of skills required for the job.
  3. In order to access the full description, we will need to click on the job ad and go to that ad’s detailed page.

Of course, we could perform all these tasks manually… But we can be so much more efficient by performing web scraping using Alteryx!


2. Data Collection

Overview of the Data Collection Workflow


Step 1: Adding the Job Search URL as Input Data
  1. From the In/Out Palette, drag the Text Input Tool onto the Canvas.
  2. Copy the job search URL for data analyst jobs, and paste the link into the Text Input Tool.


Step 2: Creating New Rows to Paginate the URL

Remember that each job search page only contains a limited number of job ads (around 10–20 depending on the website). In order to extract more ads, we need to move on to the next page when we finish with the current page.

  1. From the Preparation Palette, drag the Generate Rows Tool onto the Canvas.
  2. Create a New Field and name it “Page”, and set the Condition Expression to “Page <= 25”. This will give us page numbers up to 25 (i.e. clicking through from the 1st page to the 25th page). I feel this should be enough to give us a good sample of results.
  3. Set the Loop Expression to “Page + 1” (see the configuration sketch below).
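Putting it together, the Generate Rows configuration boils down to these expressions (the Initialization Expression of 1 is the tool’s default):

    Initialization Expression: 1
    Condition Expression:      Page <= 25
    Loop Expression:           Page + 1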

The results are as follows:


Step 3: Updating the URLs so That They Direct to Different Pages

The results from Step 2 show that we have the new Page field. However, the page number in the URL still hasn’t been updated; all the rows are still pointing to the first page (page = 1).

We can easily update them by replacing the “1” with the Page field’s value in each row.

  1. From the Preparation Palette, drag the Formula Tool onto the Canvas.
  2. Using the RegEx function REGEX_REPLACE, we can replace the “1” in the url field with the value from the Page field (see the sketch below).
  3. Although not absolutely necessary in this scenario, we could also make sure the Data type and Size are suitable for our purposes.
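As a minimal sketch, assuming the page number appears in the URL as a query parameter named page (the actual pattern depends on your job site’s URL structure), the Formula Tool expression could look like this:

    paginated_url = REGEX_REPLACE([url], "page=1", "page=" + ToString([Page]))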

Comparing the url field with the updated paginated_url field, we can see that we have correctly updated the URLs!


Step 4: Downloading the Job Search Pages
  1. From the Developer Palette, drag the Download Tool onto the Canvas.
  2. Make sure the URL Field is set to paginated_url rather than url.

Although not strictly necessary, it is often good practice to Cache and Run Workflow at the Download step: when we have large amounts of data to download, we wouldn’t want to be redownloading all of it each time we run the workflow.

The results show that we have successfully downloaded the HTML data from the paginated URLs (from the 1st page to the 25th page).

We can now move onto Data Preparation!


3. Data Preparation and Outputting Data

Overview of the Data Preparation Workflow


Step 1: Extracting the Individual Job Page ID
  1. From the Parse Palette, drag the RegEx Tool onto the Canvas, and make sure it is connected with the previous Download Tool.
  2. Set the Column to Parse to DownloadData.
  3. Set the Format to Convert to (/job/\d{8}). This will find “/job/” followed by 8 digits, which is the individual job ID that we need (see the example below).
  4. Set the Output Method to Tokenize and Split to Rows.
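For illustration, if the downloaded HTML contained a hypothetical link like the one below, the pattern would capture just the partial job URL:

    <a href="/job/12345678?ref=search">Data Analyst</a>   →   /job/12345678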


Step 2: Keeping Only the Useful Fields
  1. It is good practice to drop fields that we no longer need, as this will reduce the size of our data and make the workflow more efficient. We can do so using the Select Tool.


Step 3: Removing Duplicate Job Ads

Sometimes the same job ad may appear multiple times, so we will remove jobs that share the same ID.

  1. From the Preparation Palette, drag the Unique Tool onto the Canvas.
  2. Make sure the DownloadData field is checked, because that’s the column to find unique values from.

Here is what our data now looks like. We can see that we now have the partial URL containing each individual job ID.


Step 4: Creating the Complete Job Page URL
  1. From the Preparation Palette, drag the Formula Tool onto the Canvas.
  2. Concatenate the job site’s base URL with the DownloadData field to create the complete job page URL, as sketched below.
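A minimal sketch of the formula, using a hypothetical base URL (substitute the actual domain of the job site you are scraping):

    full_job_url = "https://www.examplejobsite.com" + [DownloadData]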

Here is what our data now looks like.


Step 5: Downloading the Individual Job Pages
  1. Drag the Download Tool onto the Canvas.
  2. Make sure the URL Field is set to full_job_url.

The results show that our job pages have been downloaded as HTML.


Step 6: Checking if Job Ads Contain Certain Skills

If a job requires a certain skill, then that skill will be listed in the job ad. We use the Contains function to check whether a job ad mentions certain skills.

  1. From the Preparation Palette, drag the Formula Tool onto the Canvas.
  2. Create a new field for each skill in the Formula Tool, as sketched below. Each new field will indicate whether the job ad in the same row mentions that skill.
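A minimal sketch of the per-skill formulas (the skill list and field names are illustrative; returning Null() when a skill is absent is what makes Step 7’s Count Non Null work):

    Tableau = IIF(Contains([DownloadData], "Tableau"), 1, Null())
    Python  = IIF(Contains([DownloadData], "Python"), 1, Null())
    SQL     = IIF(Contains([DownloadData], "SQL"), 1, Null())
    Excel   = IIF(Contains([DownloadData], "Excel"), 1, Null())
    PowerBI = IIF(Contains([DownloadData], "Power BI"), 1, Null())

Note that a plain substring match is imperfect: searching for “SQL”, for example, will also count mentions of “MySQL” or “PostgreSQL”.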


Step 7: Counting How Many Ads Each Skill Appeared In
  1. From the Transform Palette, drag in the Summarize Tool.
  2. Select all the individual job skill fields, and perform a Count Non Null on each.
  3. To make the format easier for computers to read, we will need to transpose the table. Drag the Transpose Tool onto the Canvas, and make sure all the Data Columns have been selected.

The results will look like this. The Name field refers to the skill required for the job, and the Value field refers to how many job ads specified that particular skill.
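Schematically, the Transpose Tool reshapes the single row of counts into one row per skill (layout only, no real values shown):

    Before:  Tableau | Python | SQL | ...    (one count column per skill)
    After:   Name | Value                    (one row per skill, with its count)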


Step 8: Calculating the Percentage of Appearance
  1. To calculate the percentage of job ads in which a skill appears, we can simply divide its appearance count by the total number of ads we sampled. We can use the Formula Tool to implement this calculation, as sketched below.
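A minimal sketch, assuming a hypothetical total_ads value holding the number of unique ads we downloaded:

    percentage = [Value] / [total_ads] * 100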

The results are shown below.


Step 9: Outputting the Data 

Since we are going to visualize our results in Tableau, it would be useful to save our data in the appropriate format.

  1. From the In/Out Palette, drag the Output Data Tool onto the Canvas.
  2. Set the File Format to Tableau’s .hyper.


4. Visualizing Results in Tableau

After loading the outputted .hyper file into Tableau and applying some very quick formatting, we have our answer! Apart from the traditional Excel, the most important skills to learn are SQL, PowerBI, Python and Tableau!

Although Alteryx is not currently among the top skills, I believe this blog has strongly demonstrated the power and efficiency of Alteryx! Starting with a question we had no data on, we were able to collect, prepare and output the necessary data, and eventually answer our question in just a few steps!


Author: Martin Ding