HTML is built from elements, and most of them have a start tag and an end tag that wrap the content in between. For example, a paragraph starts with <p> and normally ends with </p>. So the basic pattern is an opening tag such as <p>, then the content, then a closing tag marked with a slash, </p>. A few elements, such as <br> and <img>, have no closing tag at all.
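To make the start tag / end tag pattern concrete, here is a minimal sketch in Python using the standard library's html.parser module. The class name TagLogger and the helper log_tags are made up for illustration, and this is only one way to walk through tags, not what any particular scraping tool does internally.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Records start tags, text, and end tags in the order they appear."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_data(self, data):
        # Skip whitespace-only text between tags.
        if data.strip():
            self.events.append(("text", data.strip()))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def log_tags(html):
    """Hypothetical helper: parse a snippet and return the recorded events."""
    parser = TagLogger()
    parser.feed(html)
    parser.close()
    return parser.events


for event in log_tags("<p>Hello world</p>"):
    print(event)
```

The paragraph comes back as three events: the opening p tag, the text inside it, and the closing p tag.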
An HTML page has a head element at the top followed by a body element. The head holds metadata about the page: anything that describes the site to a web crawler for display in search engines, as well as the title shown on browser tabs. Sometimes this data is useful, and if you're looking for it, make sure to target the head element. Next comes the body, which holds most of the page's visible content and is what most web scraping targets, though it's best to target specific elements inside it rather than the whole body.
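As a rough illustration of pulling metadata out of the head, the sketch below reads the title and any named meta tags from a small made-up page. SAMPLE_PAGE, HeadReader and read_head are all hypothetical names for this example, and it again assumes Python's built-in html.parser rather than anything Alteryx-specific.

```python
from html.parser import HTMLParser

# A made-up page used only for this example.
SAMPLE_PAGE = """<html>
<head>
<title>Example Store</title>
<meta name="description" content="A made-up shop used for illustration">
</head>
<body><p>Welcome</p></body>
</html>"""


class HeadReader(HTMLParser):
    """Collects the title and named meta tags found inside the head."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.in_title = False
        self.title = ""
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif self.in_head and tag == "title":
            self.in_title = True
        elif self.in_head and tag == "meta":
            attrs = dict(attrs)
            if "name" in attrs and "content" in attrs:
                self.meta[attrs["name"]] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False
        elif tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def read_head(html):
    """Hypothetical helper: return (title, meta dict) for a page."""
    reader = HeadReader()
    reader.feed(html)
    reader.close()
    return reader.title.strip(), reader.meta


title, meta = read_head(SAMPLE_PAGE)
print(title)
print(meta)
```

Anything in the body, such as the Welcome paragraph, is ignored because the reader only records tags seen between the opening and closing head tags.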
When it comes to web scraping, most of the time there are specific elements in a page that we want, and we want to ignore the rest. For most modern websites we can piggyback off the CSS system to achieve this. CSS (Cascading Style Sheets) is used to format and style a webpage, and it targets HTML blocks through identifiers and classes.
- An identifier, or id, marks a unique HTML block, so a value such as id="exampleid" should appear only once on a page. A page can have many ids to identify different blocks, but each one is unique.
- The other way to identify HTML blocks on a website is the class attribute. It is written as class="exampleclass" inside any start tag. It is not unique and is added to every block that needs the same styling.
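The difference between the two can be seen by indexing a small snippet, as in the sketch below. The snippet, the AttributeIndex class and the index_attributes helper are invented for illustration, and the code assumes Python's standard html.parser.

```python
from html.parser import HTMLParser

# A made-up snippet: one id, and one class shared by two blocks.
SNIPPET = (
    '<div id="exampleid">'
    '<p class="exampleclass">One</p>'
    '<span class="exampleclass other">Two</span>'
    "</div>"
)


class AttributeIndex(HTMLParser):
    """Maps each id to its tag, and each class to every tag that uses it."""

    def __init__(self):
        super().__init__()
        self.by_id = {}
        self.by_class = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("id"):
            self.by_id[attrs["id"]] = tag
        # A class attribute can list several class names separated by spaces.
        for name in (attrs.get("class") or "").split():
            self.by_class.setdefault(name, []).append(tag)


def index_attributes(html):
    """Hypothetical helper: return (ids, classes) found in the snippet."""
    index = AttributeIndex()
    index.feed(html)
    index.close()
    return index.by_id, index.by_class


ids, classes = index_attributes(SNIPPET)
print(ids)
print(classes)
```

The id points at exactly one block, while exampleclass is attached to both the p and the span.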
When parsing in Alteryx, I find it's good to use the class and id system already in place to target the data you want. Either target one specific block that has an id, or target multiple blocks that share a class. Either way, it's worth knowing how to target specific blocks within an HTML file when parsing.
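One common way to do this kind of targeting is with a regular expression. The sketch below shows the idea in Python rather than in an Alteryx workflow, so treat the patterns as a starting point only. SAMPLE, text_by_id and text_by_class are hypothetical names, and the patterns assume double-quoted attributes and no nested elements of the same tag inside the target.

```python
import re

# A made-up fragment: one block with an id, several sharing a class.
SAMPLE = '''<div id="price-total">42.00</div>
<ul>
<li class="item">Apples</li>
<li class="item">Pears</li>
<li class="other">Ignored</li>
</ul>'''


def text_by_id(html, element_id):
    """Return the inner text of the first element with this id, or None."""
    pattern = (
        r'<(\w+)[^>]*\bid="' + re.escape(element_id) + r'"[^>]*>'
        r'(.*?)</\1>'  # \1 requires the closing tag to match the opening one
    )
    match = re.search(pattern, html, re.DOTALL)
    return match.group(2).strip() if match else None


def text_by_class(html, class_name):
    """Return the inner text of every element carrying this class."""
    pattern = (
        r'<(\w+)[^>]*\bclass="[^"]*\b' + re.escape(class_name) + r'\b[^"]*"[^>]*>'
        r'(.*?)</\1>'
    )
    return [m.group(2).strip() for m in re.finditer(pattern, html, re.DOTALL)]


print(text_by_id(SAMPLE, "price-total"))
print(text_by_class(SAMPLE, "item"))
```

The id lookup returns a single value and the class lookup returns a list, which mirrors the choice between targeting one unique block and targeting a group of similar ones.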