HTML Files

HTML (HyperText Markup Language) files are text files that contain, in addition to the text itself, information for rendering it, such as font type, font size, and foreground and background colors. An HTML document may also describe or reference images, tables and other objects. Typically, HTML files are interpreted and rendered by web browsers. Almost all websites consist of HTML files.

In data science, knowing some basic HTML is important for web scraping, that is, for automatically extracting information from websites.

HTML fundamentals

A very basic HTML file looks like this:

<html>
    <head>
        <title>Title of webpage</title>
    </head>
    <body>
        <h1>Some heading</h1>
        <p>Text and text and more text in a paragraph.
        Here comes a <a href="http://some.where">link to somewhere</a>.</p>
    </body>
</html>

The file starts with <html> and ends with </html>. In between there are a head and a body. The head contains auxiliary information like the webpage’s title, which is often shown in the browser window’s title bar or tab. The body contains the contents of the page.

There are many different HTML tags that influence how the contents are rendered. Some common ones:

Headings from large to small: h1, h2, h3, h4, h5, h6.

Paragraph: p.

Link: a with attribute href.

Table: table, tr (row inside table), td (cell inside row), and some more.

Image: img with attribute src (the URL of the image).

Invisible elements for layout control: span (inline element), div (block element, that is, a box).

All tags accept the attributes style (for specifying font size, colors and so on), id (a unique identifier for advanced style control and scripting), and class (an identifier shared by several elements for advanced layout control).
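To make the tags and attributes listed above more concrete, here is a minimal sketch that walks through a small page and prints each opening tag together with its attributes. It uses the html.parser module from Python's standard library. The sample page and the class name TagLister are made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical sample page using several of the tags and attributes listed above.
SAMPLE = '''\
<html>
    <body>
        <h1 id="top">Some heading</h1>
        <div class="box">
            <p class="note" style="color: red;">A <span class="note">highlighted</span> word.</p>
            <img src="picture.png">
            <table>
                <tr><td>cell 1</td><td>cell 2</td></tr>
            </table>
        </div>
    </body>
</html>
'''


class TagLister(HTMLParser):
    """Collect (tag name, attribute dict) pairs for every opening tag."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples
        self.tags.append((tag, dict(attrs)))


lister = TagLister()
lister.feed(SAMPLE)

# print every opening tag in document order, with its attributes
for name, attributes in lister.tags:
    print(name, attributes)
```

Note that the class value note appears on two elements, whereas the id value top appears only once. This matches the intended use of class and id.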

Have a look at the HTML documentation for details.

Modern browsers have tools to help understand an HTML file’s structure. In Firefox or Chromium, right-click some element of the webpage and click ‘Inspect’ in the pop-up menu. Then navigate through the HTML source. To see the whole HTML source code, right-click and choose ‘View Page Source’.

Parsing HTML files with Python

There are several modules available for parsing HTML files in Python. Here, parsing means to convert the textual representation into more structured Python objects. One such module is Beautiful Soup, which is not part of Python’s standard library, but has to be installed manually.

The package to install is named beautifulsoup4 (for example, with pip install beautifulsoup4). The module to import is named bs4.

import bs4

We have to create a BeautifulSoup object, whose constructor takes a string or an opened file object as argument. The BeautifulSoup object then provides methods to find HTML tags by specifying tag name, id attribute, class attribute or one of several other properties. We do not have to write code for parsing HTML files. Instead, we can search the file with BeautifulSoup’s methods.

html = '''\
<html>
    <head>
        <title>Title of webpage</title>
    </head>
    <body>
        <h1>Some heading</h1>
        <p>Text and text and more text in a paragraph.
        Here comes a <a href="http://some.where">link to somewhere</a>.</p>
    </body>
</html>
'''

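# name the parser explicitly; otherwise Beautiful Soup guesses one and issues a warning
soup = bs4.BeautifulSoup(html, 'html.parser')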
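The constructor also accepts an opened file object instead of a string. The following sketch writes the markup to a temporary file, which stands in for a saved webpage, and parses it from there. The file name page.html and the variable names are chosen for illustration only. The second argument 'html.parser' selects the parser from Python's standard library.

```python
import os
import tempfile

import bs4

html = '''\
<html>
    <head>
        <title>Title of webpage</title>
    </head>
    <body>
        <h1>Some heading</h1>
        <p>Text and text and more text in a paragraph.
        Here comes a <a href="http://some.where">link to somewhere</a>.</p>
    </body>
</html>
'''

with tempfile.TemporaryDirectory() as folder:
    # hypothetical file name; stands in for a webpage saved to disk
    path = os.path.join(folder, 'page.html')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(html)

    # pass the opened file object instead of a string
    with open(path, encoding='utf-8') as f:
        soup_from_file = bs4.BeautifulSoup(f, 'html.parser')

# the parsed document is fully in memory, so the file may be closed by now
print(soup_from_file.title.string)
print(soup_from_file.h1.string)
```

Attribute access like soup_from_file.title is a shortcut for finding the first tag with that name.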

The find_all method returns a list of objects representing subsets of the HTML file matching the arguments passed to find_all. In the following code snippet we search for a tags, that is, for links. But we could also search for certain attribute values and other criteria. There is also a find method which returns the first occurrence only.

The objects returned by find_all and find provide the same search methods themselves, which allows us to refine a search step by step.

# find all links
links = soup.find_all('a')

print('#links:', len(links))
print('last link:', links[-1])
#links: 1
last link: <a href="http://some.where">link to somewhere</a>
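The other search criteria mentioned above (id attribute, class attribute, attribute values) and the refinement of a search result can be sketched as follows. The markup, the attribute values and the variable names are invented for this example and are not part of the page used above.

```python
import bs4

# hypothetical markup with id and class attributes
markup = '''\
<html>
    <body>
        <div id="main">
            <p class="intro">First paragraph with a <a href="http://some.where">link</a>.</p>
            <p class="note">Second paragraph with <a href="http://else.where">another link</a>.</p>
        </div>
        <p class="note">A note outside the main box, <a>an anchor without target</a>.</p>
    </body>
</html>
'''

demo_soup = bs4.BeautifulSoup(markup, 'html.parser')

# search by id attribute (ids are unique, so find is sufficient)
main = demo_soup.find(id='main')

# search by tag name and class attribute
# (class is a reserved word in Python, hence the argument name class_)
notes = demo_soup.find_all('p', class_='note')
print('#notes:', len(notes))

# search by attribute: href=True matches only a tags that have an href attribute
targets = [a['href'] for a in demo_soup.find_all('a', href=True)]
print('targets:', targets)

# refine the search: look for links inside the main box only
main_links = main.find_all('a')
print('texts:', [a.get_text() for a in main_links])
```

Attribute values of a tag are available via dictionary-style indexing (a['href']), and get_text returns the text enclosed by the tag.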

See Beautiful Soup’s documentation for details.