Recently our team has learnt about web scraping using Alteryx. Alteryx allows us to send a request to a webpage's URL, download the HTML that is sent back, and then parse that page to get the data we need. There are a few things to know about how HTML works that will help you web scrape with ease. If you already know the basics of HTML, you're probably good and this won't be much help.

Before I go into what to look for in HTML structures, it's best to first describe how a web page is loaded and why some features might not show up in your scraped result. Most modern web pages use JavaScript to some extent, which makes them flexible: the page can change depending on what the user is doing or on data it receives. Because the download returns the HTML as the server sent it, before any JavaScript has run, you can end up with HTML that is missing some of the elements you want. That is unfortunate, and you might need to get that data from elsewhere or through different URLs.
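To make the JavaScript point concrete, here is a minimal Python sketch (not Alteryx, just an illustration using the standard library). The page source, the id "price" and the TextById class are all made up for this example. The script in the page would fill in the price in a browser, but a scraper only sees the empty element.

```python
from html.parser import HTMLParser

# Hypothetical page source as a scraper would receive it: the price is
# filled in by JavaScript, so the raw HTML only contains an empty <span>.
RAW_HTML = """
<html>
  <body>
    <h1>Example product</h1>
    <span id="price"></span>
    <script>
      document.getElementById("price").textContent = "9.99";
    </script>
  </body>
</html>
"""


class TextById(HTMLParser):
    """Collect text between the start tag with a given id and the next end tag."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.inside = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("id") == self.target_id:
            self.inside = True

    def handle_endtag(self, tag):
        # Simplification: any end tag stops the capture.
        self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.text += data


parser = TextById("price")
parser.feed(RAW_HTML)
price_text = parser.text.strip()
print(repr(price_text))  # empty, because the script was never executed
```

If the value you need behaves like this, it usually means the data arrives through a separate request that the script makes, which is why a different URL can be the answer.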

 

HTML almost always follows a structure of start tags and end tags that describe the content placed between them. For example, a paragraph starts with <p> and ends with </p>. So the basics are that an element opens with a tag such as <p> and almost always closes with a matching tag marked by a slash, </p>. The exceptions are a few elements such as <br> and <img>, which have no closing tag.
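A small Python sketch shows how a parser sees this start tag and end tag structure. It uses the standard library's html.parser module; the TagLogger class and the sample snippet are my own names for illustration.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record every start tag, end tag and piece of text in document order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only text so the log stays readable.
        if data.strip():
            self.events.append(("text", data.strip()))


logger = TagLogger()
logger.feed("<p>Hello, <b>world</b></p>")
for event in logger.events:
    print(event)
```

Each start event is eventually matched by an end event for the same tag, and the inner <b> element closes before the outer <p> does.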

An HTML page is structured with a head element at the top followed by a body element. Inside the head element you will find metadata about the page: anything that describes the website to a web crawler for display on search engines, as well as the title shown on browser tabs. Sometimes this data is useful, and if you're looking for it make sure to target the head element. Next comes the body, which holds most of the website's visible content and is what most web scraping targets, though it's best to target specific elements inside it rather than the whole body.
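As a rough sketch of targeting the head element, the Python below pulls the title and any named meta tags out of a page. The page content and the HeadReader class are invented for the example, and it only handles meta tags that use a name attribute.

```python
from html.parser import HTMLParser

# Hypothetical page used only for this illustration.
PAGE = """
<html>
  <head>
    <title>Example Shop</title>
    <meta name="description" content="A made-up shop used for illustration.">
  </head>
  <body>
    <p>Welcome!</p>
  </body>
</html>
"""


class HeadReader(HTMLParser):
    """Collect the <title> text and named <meta> tags found inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.in_title = False
        self.title = ""
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "head":
            self.in_head = True
        elif self.in_head and tag == "title":
            self.in_title = True
        elif self.in_head and tag == "meta" and "name" in attrs:
            self.meta[attrs["name"]] = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False
        elif tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


reader = HeadReader()
reader.feed(PAGE)
print(reader.title)
print(reader.meta)
```

Nothing from the body ends up in the result, because the reader stops looking once the head element closes.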

When it comes to web scraping, most of the time there are specific elements in a page that we want, and we want to ignore the rest of the page. For most modern websites we can piggyback off the CSS system to achieve this. CSS (Cascading Style Sheets) is used to format and style a webpage, and it targets HTML blocks through identifiers and classes.

  • An identifier, or id, marks a unique HTML block, so there should only ever be one id="exampleid" on a page. A website can use many ids to identify different blocks, but each one should be unique.
  • The other way to identify HTML blocks is the class attribute. This is written as class="exampleclass" and can appear inside any start tag. It is not unique and will be added to every block that needs the same styling.
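The difference between the two shows up clearly if you index a page by its attributes. This is a minimal Python sketch with a made-up snippet; the AttributeIndex class and the id and class names are assumptions for illustration only.

```python
from html.parser import HTMLParser

# Hypothetical snippet: one block with an id, several sharing a class.
PAGE = """
<div id="main-table">
  <p class="price">10</p>
  <p class="price">12</p>
  <p class="note">Prices in GBP</p>
</div>
"""


class AttributeIndex(HTMLParser):
    """Index the tags on a page by their id and class attributes."""

    def __init__(self):
        super().__init__()
        self.ids = {}      # id value -> tag name (one tag per id)
        self.classes = {}  # class value -> list of tag names (many per class)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("id"):
            self.ids[attrs["id"]] = tag
        # A class attribute can hold several space-separated class names.
        for name in (attrs.get("class") or "").split():
            self.classes.setdefault(name, []).append(tag)


index = AttributeIndex()
index.feed(PAGE)
print(index.ids)
print(index.classes)
```

The id maps to a single tag, while the "price" class maps to two, which is exactly the one versus many distinction described above.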

 

When parsing in Alteryx, I find it's good to use the class and id system already in place to target the data you want. Either target one specific element by its id or target multiple elements that share a class. Either way, it's a good thing to know when parsing so you can target specific blocks within an HTML file.
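Alteryx is a visual tool, so the following is not Alteryx code. It is a Python sketch of the same idea using regular expressions, assuming a made-up page and made-up id and class names. Regular expressions are fragile on real-world HTML (attribute order, quoting and nesting all vary), so treat the patterns as a starting point rather than something robust.

```python
import re

# Hypothetical page: one element with a unique id, several sharing a class.
PAGE = """
<h1 id="page-title">Daily prices</h1>
<ul>
  <li class="item">Apples</li>
  <li class="item">Pears</li>
  <li class="other">Ignore me</li>
</ul>
"""

# One element: match the tag carrying the unique id and capture its text.
title_match = re.search(
    r'<h1[^>]*\bid="page-title"[^>]*>(.*?)</h1>', PAGE, re.S
)
title = title_match.group(1) if title_match else None

# Many elements: capture the text of every <li> that has class="item".
items = re.findall(r'<li[^>]*\bclass="item"[^>]*>(.*?)</li>', PAGE, re.S)

print(title)
print(items)
```

The id pattern returns a single value and the class pattern returns a list, and the element with a different class is left out.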