This week was my first introduction to Power BI. It’s quite different to Tableau, with some things coming surprisingly easy like filter actions and other things harder, such as the unfamiliar syntax’s DAX and M.
However what I was impressed by was the ease of its webscraping function. Webscraping is when you use a tool to collect data from a webpage. This can save time on manual copying and pasting.
What I did
Using Power BIs webscraping tool I was able to load table data from wikipedia in seconds. You simply go to new source>web and type in the URL and it automatically has all the tables from the webpage organised for you. You just choose which is the one you wanted. Mind you power BI picked up a tonne of tables in my example, but I could see mine was the 2nd table.
Once I imported the table it just needed to be ‘Filled down’ to get rid of all the nulls which were merged cells on the wiki page and clean up a few column. Then I was ready to create a report.
What about non-tables?
While my data from above was easy to scrape, I wanted to see what would happen if I tried something outside of a table. So I had a go at google search results for ‘whale watching sydney’. What I found was that Power BI’s web tool made suggestions for a number of tables, the first of which had all but one of the rows matching my regular google search (I logged out of my gmail account), and in the right order.
I did however find some irregularities, such as power BI picking up a few of the links from the ‘people also search for’ section on google but not others.
This left me with a few options:
- Try the ‘Add Table Using Examples’ option
- Look at the HTML code and try parsing
- It could just be that my search results are augmented by my search history, I found some information online suggesting this may be the case
So I looked up the documentation and found a ‘Add Table using Examples‘ option. This allows you to type in examples of what your looking for and an algorithm picks up the rest. It was somewhat helpful. I could see that the page BI was looking at was slightly different to my view, which explained most of why their originally suggested table was different.
With option 3, you can import the raw HTML in Power BI and ordinarily I could parse this with regex in Alteryx to find out what results show up. However there is no native regex function in BI, requiring python or R scripts to function, neither of which I’m familiar with.
To summarise, I found Power BI’s webscraping tool is very fast and great for tables, and was pretty good for google search results. There may be shortcomings with regards to parsing out the raw HTML code, but you likely won’t need to do this, because the automatic tools are good enough in themselves.