In our daily work, we often run into data collection tasks, such as gathering user reviews for operational activities or collecting data from other vendors for competitive product analysis. Sometimes the required information has to be scraped from the Internet. However, most web scraping tutorials rely on Python, which can be tricky for those who are not familiar with programming. Today, I will use the Alteryx Download Tool to call the Domain Developer API to download Sydney suburb real estate data and then use Tableau to build a dashboard that displays the results.
Two main approaches to web scraping
- Scraping through the API provided by the website
This method is the simplest and most direct: we get the required data by calling the API. The disadvantage is that some websites limit the number and frequency of API calls, and users may need to pay for a higher tier to get more flexible access.
- HTML-based data capture
Unlike an API, this approach is not subject to API call quotas. We access the HTML code of the web page and grab the data from the required nodes. However, the disadvantage of this method is that once even a small structural change occurs on the web page, the scraping code may need to be rewritten.
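To make the fragility of HTML-based capture concrete, here is a minimal Python sketch that uses only the standard library. The page fragment, the `price` class name, and the `PriceParser` helper are all hypothetical and only illustrate the idea; a real page will have a different structure.

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collect the text of every <span class="price"> node."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "span" and ("class", "price") in attrs:
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.prices.append(data.strip())


def extract_prices(html):
    """Return the text of all price nodes found in the HTML string."""
    parser = PriceParser()
    parser.feed(html)
    return parser.prices


# Hypothetical page fragment, not taken from any real website.
SAMPLE_HTML = """
<ul>
  <li><span class="suburb">Newtown</span> <span class="price">$1,500,000</span></li>
  <li><span class="suburb">Parramatta</span> <span class="price">$1,100,000</span></li>
</ul>
"""

print(extract_prices(SAMPLE_HTML))
```

If the site renamed the class from `price` to, say, `listing-price`, the same code would silently return an empty list, which is exactly the maintenance risk described above.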
OAuth sets up an authorization layer between the “client” and the “service provider.” The “client” cannot log in to the “service provider” directly; it can only log in to the authorization layer, which keeps the user and the client separate. The token that the “client” uses to log in to the authorization layer is different from the user’s password. The user can specify the scope and validity period of the token when granting authorization. After the “client” logs in to the authorization layer, the “service provider” opens the user’s stored data to the “client” according to the scope and validity period of the token. The OAuth 2.0 flow runs as follows:
- (A) The client asks the user for authorization.
- (B) The user agrees to authorize the client.
- (C) The client uses the authorization obtained in the previous step to request a token from the authorization server.
- (D) The authorization server authenticates the client, confirms that the authorization is valid, and issues the token.
- (E) The client uses the token to apply to the resource server for resources.
- (F) The resource server confirms that the token is valid and opens the resource to the client.
Of the six steps above, step B is the key: how the user authorizes the client. With this authorization, the client can obtain a token and then use the token to get resources. OAuth 2.0 defines four authorization grant types: authorization code, implicit, resource owner password credentials, and client credentials.
Domain Developer API Introduction
The Domain Developer API uses the client credentials grant for authentication. According to the API documentation (https://developer.domain.com.au/docs/v2/getting-started), I need to register an account in advance to obtain client credentials. After registering and receiving the credentials, I can use them to complete the OAuth 2.0 authentication process and then call the API to download data. The client credentials flow has two steps:
- (A) The client authenticates to the authorization server and requests an access token.
- (B) After the authorization server confirms that the credentials are valid, it provides the client with an access token.
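The two steps above correspond roughly to what the Alteryx Download Tool sends on the wire. The Python sketch below is for illustration only and is not the author's workflow. The token URL and scope name are assumptions that should be checked against the Domain documentation, and `build_token_request` and `fetch_access_token` are hypothetical helper names.

```python
import base64
import json
import urllib.parse
import urllib.request

# Assumptions: confirm both values in the Domain Developer documentation.
TOKEN_URL = "https://auth.domain.com.au/v1/connect/token"
SCOPE = "api_suburbperformance_read"


def build_token_request(client_id, client_secret, token_url=TOKEN_URL, scope=SCOPE):
    """Step (A): build the POST request that asks for an access token."""
    # The client credentials travel in an HTTP Basic Authorization header.
    pair = f"{client_id}:{client_secret}".encode("utf-8")
    basic = base64.b64encode(pair).decode("ascii")
    # The grant type tells the server which OAuth 2.0 flow is being used.
    body = urllib.parse.urlencode(
        {"grant_type": "client_credentials", "scope": scope}
    ).encode("ascii")
    return urllib.request.Request(
        token_url,
        data=body,
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


def fetch_access_token(request):
    """Step (B): send the request and read the token from the JSON reply."""
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    return payload["access_token"]


# Usage (needs real credentials and network access, so it is left commented out):
# token = fetch_access_token(build_token_request("my-client-id", "my-client-secret"))
```

Keeping the request construction separate from the network call makes it possible to inspect the headers and body before anything is sent.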
Based on the API documentation (https://developer.domain.com.au/docs/v2/apis/pkg_properties_locations/references/suburbperformance_get_bynamedsuburb), I decided to use the suburb performance API described there to obtain suburb property sales data.
- After registering for the Domain Developer API, I receive the client credentials and use them to construct the HTTP header for the token request.
- Once I obtain the token, I construct the Authorization field, whose value is the word Bearer followed by a space and the token, and put this field in the HTTP header. The following figure shows a complete API download workflow. Wrapping the workflow in a macro lets users enter the client credentials and filter the date range of historical real estate data.
- Batch download the Sydney suburb real estate sales data.
- Then I built my dashboard based on the data downloaded from the API. The dashboard lets users drill down into Sydney suburb real estate performance and compare historical performance across multiple dimensions (dashboard link: https://public.tableau.com/app/profile/gary.li.j.h./viz/SydneyPropertyInsightv1_0/Dashboard).
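For readers curious about what the header construction and batch download steps above amount to outside Alteryx, here is a small Python sketch. The base URL and path layout are assumptions inferred from the name of the documented endpoint, the suburb list is illustrative, the query parameter in the usage comment is a placeholder, and the helper names are hypothetical. Check all of these against the Domain documentation before relying on them.

```python
import urllib.parse
import urllib.request

# Assumption: base URL and path layout (state/suburb/postcode); verify in the API docs.
API_BASE = "https://api.domain.com.au/v2/suburbPerformanceStatistics"

# Illustrative list of (state, suburb, postcode) tuples to download.
SUBURBS = [("NSW", "Newtown", "2042"), ("NSW", "Parramatta", "2150")]


def bearer_header(token):
    """Build the Authorization header: the word Bearer, a space, then the token."""
    return {"Authorization": f"Bearer {token}"}


def build_suburb_request(token, state, suburb, postcode, **params):
    """Build one GET request for a single suburb, without sending it."""
    # Percent-encode each path segment so names such as "Surry Hills" stay valid.
    path = "/".join(
        urllib.parse.quote(part, safe="") for part in (state, suburb, postcode)
    )
    query = urllib.parse.urlencode(params)
    url = f"{API_BASE}/{path}" + (f"?{query}" if query else "")
    return urllib.request.Request(url, headers=bearer_header(token))


def build_batch(token, suburbs, **params):
    """Build one request per suburb; the caller decides how and when to send them."""
    return [build_suburb_request(token, *suburb, **params) for suburb in suburbs]


# Usage (needs a real token and network access, so it is left commented out;
# the query parameter name is a placeholder):
# for request in build_batch(token, SUBURBS, propertyCategory="house"):
#     with urllib.request.urlopen(request, timeout=30) as response:
#         print(response.read()[:200])
```

Since some APIs limit call frequency, as noted earlier, a real batch run would probably also pause between requests.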
Mastering the Alteryx web scraping technique can cover many of the data scraping needs in daily work. Compared with Python web scraping, its lower learning cost saves time for users who are not familiar with programming and can help improve overall work efficiency.