What is Web Scraping?

Web scraping is the process of extracting data from a website for purposes such as research and business intelligence. Pulling data from the internet gives us more information to support informed decisions. There are many ways to do it, and Alteryx is one of them.

 

How can we use Alteryx for Web Scraping?

The answer is Regex. We can use the Regex tool in Alteryx to extract the desired data based on the HTML code of the website we want to scrape. In this blog, I’m going to use IMDB’s Top 250 Movies as an example to show how to apply Alteryx tools to web scraping.

 

Steps to scrape IMDB’s Top 250 Movies of All Time

a) Extract the HTML code of the website you want to scrape

IMDB Website: https://www.imdb.com/chart/top/

Add the above URL to a TEXT INPUT tool, then connect a DOWNLOAD tool to retrieve the HTML code of the page.
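For readers who want to compare this step with a scripted approach, here is a minimal Python sketch of what the Text Input and Download tools do together: hold a URL and fetch the HTML of the page. The function names and the User-Agent value are illustrative assumptions of mine, not part of the Alteryx workflow.

```python
import urllib.request

IMDB_TOP_URL = "https://www.imdb.com/chart/top/"


def build_request(url: str) -> urllib.request.Request:
    """Build a GET request; the User-Agent string is an arbitrary placeholder."""
    return urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})


def download_html(url: str) -> str:
    """Fetch the page and return its body decoded as text."""
    with urllib.request.urlopen(build_request(url), timeout=30) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `download_html(IMDB_TOP_URL)` would return the page source as one long string, which is roughly what the Download tool hands to the next step.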

Figure 1. Input Data

 

b) Inspect the desired table

Go to the website, right-click the page, and choose Inspect. You can see that the table we want is wrapped in a <tbody></tbody> element.

Figure 2. Inspect Table

 

c) Extract the whole table

Regular Expression: <tbody class="lister-list">(.*?)</tbody>

After applying the above expression, you retrieve all the HTML code inside <tbody>.
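To see what this expression captures, here is a small Python sketch that runs the same pattern on a made-up, heavily shortened page. The sample HTML and the function name are hypothetical. Note that Python needs the `re.DOTALL` flag so that `.` also matches line breaks; this is a detail of Python's `re` module, not a statement about how Alteryx behaves.

```python
import re

# Hypothetical, heavily shortened stand-in for the downloaded page.
SAMPLE_HTML = """<table>
<thead><tr><th>Rank and Title</th></tr></thead>
<tbody class="lister-list">
<tr><td class="titleColumn"><a href="/title/tt0000001/">Film One</a></td></tr>
<tr><td class="titleColumn"><a href="/title/tt0000002/">Film Two</a></td></tr>
</tbody>
</table>"""

# Same expression as above; the lazy (.*?) stops at the first closing </tbody>.
TABLE_PATTERN = re.compile(r'<tbody class="lister-list">(.*?)</tbody>', re.DOTALL)


def extract_table_body(html: str) -> str:
    """Return everything between the <tbody> tags, or "" if there is no match."""
    match = TABLE_PATTERN.search(html)
    return match.group(1) if match else ""
```

Here `extract_table_body(SAMPLE_HTML)` keeps the two film rows and drops the `<thead>` section, since only the capture group is returned.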

Figure 3. Data Extraction by Regex – Step 1

 

d) Tokenize each film and Split them into rows

In the GIF below, we can see that each film sits in its own <tr> element.

GIF 1. Find the location of each movie

 

Regular Expression: <tr>(.*?)</tr>

We can then use this expression to tokenize the table, extracting each movie and splitting the results into rows so that every film has its own row.
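Tokenizing to rows is close to what `re.findall` does in Python: every match of the capture group becomes one item. The sketch below uses a made-up table body and a hypothetical function name to illustrate the idea.

```python
import re
from typing import List

# Hypothetical table body, as it might look after the previous step.
TABLE_BODY = """
<tr><td class="titleColumn"><a href="/title/tt0000001/">Film One</a></td></tr>
<tr><td class="titleColumn"><a href="/title/tt0000002/">Film Two</a></td></tr>
"""

# Lazy match so that each <tr>...</tr> pair is captured separately.
ROW_PATTERN = re.compile(r"<tr>(.*?)</tr>", re.DOTALL)


def split_into_rows(table_body: str) -> List[str]:
    """Return one string per film, holding the HTML between <tr> and </tr>."""
    return ROW_PATTERN.findall(table_body)
```

With this sample, `split_into_rows(TABLE_BODY)` gives a list of two strings, one per film. Without the `?` the pattern would be greedy and return a single item spanning from the first `<tr>` to the last `</tr>`.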

Figure 4. Data Extraction by Regex – Step 2

 

e) Retrieve the Title, Rating, and Votes

After tokenization, we can see that all the desired information is in the <tbody> column.

GIF 2. Find the required information

Regular Expression: <a href.*>(.*)</a>

Utilizing the above expression, we can get the titles.
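As a rough illustration of how this expression behaves, the Python sketch below applies it to a single made-up row. The row content and the function name are hypothetical, and the real page markup may differ. Because both `.*` parts are greedy, the pattern backtracks to the last `>` that is still followed by text ending in `</a>`, which here leaves the link text in the capture group.

```python
import re
from typing import Optional

# Hypothetical single row after tokenization (kept on one line).
SAMPLE_ROW = (
    '<td class="titleColumn">1. '
    '<a href="/title/tt0000001/" title="Some Director (dir.)">Film One</a> '
    '<span class="secondaryInfo">(1994)</span></td>'
)

# Same expression as above.
TITLE_PATTERN = re.compile(r"<a href.*>(.*)</a>")


def extract_title(row_html: str) -> Optional[str]:
    """Return the link text of the row, or None when nothing matches."""
    match = TITLE_PATTERN.search(row_html)
    return match.group(1) if match else None
```

For this sample, `extract_title(SAMPLE_ROW)` returns `"Film One"`. A row containing more than one link would need a more specific pattern.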

Figure 5. Data Extraction by Regex – Step 3

 

Regular Expression: <strong.*"(.*) based on (.*) user.*</strong>

After applying this expression, we get each movie’s rating and number of votes.
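The two capture groups can be tried out in Python as well. In the sketch below, the row and its numbers are invented for illustration, and the function name is hypothetical. The first group picks up the rating and the second group picks up the vote count, both taken from the text inside the quoted attribute.

```python
import re
from typing import Optional, Tuple

# Hypothetical rating cell; the figures are made up for illustration.
SAMPLE_ROW = (
    '<td class="ratingColumn imdbRating">'
    '<strong title="9.9 based on 1,234,567 user ratings">9.9</strong></td>'
)

# Same expression as above, with a straight double quote.
RATING_PATTERN = re.compile(r'<strong.*"(.*) based on (.*) user.*</strong>')


def extract_rating_and_votes(row_html: str) -> Optional[Tuple[str, str]]:
    """Return (rating, votes) as raw strings, or None when nothing matches."""
    match = RATING_PATTERN.search(row_html)
    return (match.group(1), match.group(2)) if match else None
```

For this sample the function returns `("9.9", "1,234,567")`. Both values are still text at this point, and the votes still contain commas.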

Figure 6. Data Extraction by Regex – Step 4

 

f) Data Cleansing

Finally, remove the punctuation (such as thousands separators) from the votes column and select only the required fields.
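The cleansing step could look like this in Python. It is a minimal sketch under my own assumptions: the field names (`Title`, `Rating`, `Votes`) and the sample record are hypothetical, and the Alteryx workflow may use different tools or names to reach the same result.

```python
import re
from typing import Dict, Union


def clean_votes(raw_votes: str) -> int:
    """Strip every non-digit character (for example commas) and convert to int."""
    return int(re.sub(r"\D", "", raw_votes))


def clean_record(raw: Dict[str, str]) -> Dict[str, Union[str, float, int]]:
    """Keep only the required fields and give the numbers a numeric type."""
    return {
        "Title": raw["Title"].strip(),
        "Rating": float(raw["Rating"]),
        "Votes": clean_votes(raw["Votes"]),
    }


# Hypothetical parsed row, including a leftover field we no longer need.
RAW_RECORD = {
    "Title": " Film One ",
    "Rating": "9.9",
    "Votes": "1,234,567",
    "RowHtml": "<td>...</td>",
}
```

Calling `clean_record(RAW_RECORD)` drops the leftover HTML field and turns the votes into the integer 1234567, so the column can be sorted or aggregated as a number.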

Figure 7. Data Cleansing

 

g) Entire Workflow

Figure 8. Entire Workflow

 

If you spot any problems, please feel free to point them out. You can also reach out to me on LinkedIn, and I will do my best to answer your questions about Tableau or Alteryx.

 

Author: Joe Chan

Joe has an IT background and a master's degree from UNSW, where he majored in AI and Data Science. During his studies, he realized that data is one of the most valuable assets a business can have and that it can have a tremendous impact on long-term success. After graduation, his desire to level up his data analytics skills led him to join The Data School. He is interested in Data Wrangling, Data Visualization, and Machine Learning, and he is eager to become a great Data Analyst who helps businesses grow.