Web Access#

Today’s primary source of data is the World Wide Web. In the simplest case we can download a data set as a single file. Many data providers instead offer an API (application programming interface) for accessing and downloading data. In the worst case we have to scrape data from a website’s HTML and other files.

Server, Client, Browser#

Websites and other web services are hosted on a server somewhere in the world. If we type a URL (web address) into a browser’s address bar, the browser connects to the corresponding server and asks it to send the desired file to the user’s computer. This process is referred to as requesting a file or sending a request. Our computer is the client, asking the server for some service (sending a file). It’s important to understand that we cannot simply collect a file from a remote server. We may only send a request to the server, asking it to send the file to us. The server may fulfill our request, send an error message, or not answer at all.

The technology behind this is much more involved than one might think: How do we find the correct server? Which language do we speak with the server? What do we do if the server does not answer the request? And so on. If you are interested in some background details, use DNS and HTTP as entry points.
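To make the questions above a bit more concrete, here is a minimal sketch that splits a URL into its parts with the standard library’s urllib.parse module. Using this module here is our choice for illustration. The URL is the one requested later in this chapter.

```python
from urllib.parse import urlsplit

url = 'https://www.fh-zwickau.de/~jef19jdw/index.html'
parts = urlsplit(url)

print(parts.scheme)      # language (protocol) spoken with the server
print(parts.hostname)    # server name, translated to an IP address via DNS
print(parts.path)        # file we ask the server to send
```

The scheme tells the client which protocol to use (here HTTP over an encrypted connection), the host name is what DNS resolves to a server address, and the path identifies the requested file on that server.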

The Python interpreter may take the role of the browser and request files from servers.

Downloading Files with Python#

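To download a webpage or some other file from the web we may use the requests module. It is not part of the Python standard library, but a widely used third-party package that has to be installed separately (with pip install requests, for instance).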

The module provides a function get, which takes a URL and returns a Response object. The Response object contains information about the server’s answer to our request. If the request has been successful, the content attribute contains the requested file as a bytes object.

import requests

response = requests.get('https://www.fh-zwickau.de/~jef19jdw/index.html')

print(response.content.decode())

<!DOCTYPE html>
<html lang="en">
  <head>
	<meta name="generator" content="Hugo 0.80.0" />
    <meta charset="utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <link rel="stylesheet" href="https://wwwstud.fh-zwickau.de/jef19jdw/mystyle.css">
    <title>
Contact
</title>
  </head>
  <body>
    <header>
      <div class="head_col">
          <p>Prof. Dr. rer. nat. habil.</p>
          <p style="font-size: 1.8em; font-weight: bold; padding-top: 0.1em;">Jens Flemming</p>
      </div>
      <div class="head_col">
          <p style="font-weight: bold;">Inverse Problems, Regularization</p>
          <p style="padding-top: 0.5em; font-weight: bold;">Data Science</p>
      </div>
      <div class="head_col">
          <p style="text-align: right;"><a href="https://www.fh-zwickau.de"><img id="logo_whz" src="https://wwwstud.fh-zwickau.de/jef19jdw/logo_whz.svg" alt="WHZ logo"/></a></p>
      </div>
      <div id="path">
          <a href="https://www.fh-zwickau.de">Westsächsische Hochschule Zwickau</a> &gt;
          <a href="https://www.fh-zwickau.de/pti">Faculty of Physical Engineering/Computer Sciences</a> &gt;
          <a href="https://www.fh-zwickau.de/pti/organisation/fachgruppe-mathematik">Mathematics Group</a> &gt;
          Jens Flemming
      </div>
      <nav>
        <ul>
            <li class="active"><a href="https://wwwstud.fh-zwickau.de/jef19jdw/index.html">Who and Where</a></li>
            <li ><a href="https://wwwstud.fh-zwickau.de/jef19jdw/teaching/index.html">Teaching (in German)</a></li>
            <li ><a href="https://wwwstud.fh-zwickau.de/jef19jdw/research/index.html">Research</a></li>
            <li ><a href="https://wwwstud.fh-zwickau.de/jef19jdw/blog/index.html">Blog</a></li>
        </ul>
      </nav>
    </header>

    <aside>
      <nav>
        <ul>
          
          
          <li class="active"><a href="https://wwwstud.fh-zwickau.de/jef19jdw/index.html">Contact</a></li>
          <li ><a href="https://wwwstud.fh-zwickau.de/jef19jdw/cv.html">CV</a></li>
          
          
          
          
          
        </ul>
      </nav>
    </aside>
    
    <article>
<h1 id="contact">Contact</h1>
<p>Professorship of Mathematics at <a href="https://www.fh-zwickau.de">Westsächsische Hochschule Zwickau</a></p>
<p>Associate member of <a href="https://www.tu-chemnitz.de/mathematik/inverse_probleme/index.php?en=1">Research Group Regularization</a> at <a href="https://www.tu-chemnitz.de">TU Chemnitz</a></p>
<p>Course director/Studiengangleiter <a href="https://datascience.fh-zwickau.de">Bachelor Data Science</a></p>
<p><img src="flemming.jpg" alt="Jens Flemming"></p>
<p><strong>Email:</strong></p>
<p>       
jens . flemming [at] fh-zwickau . de   (without spaces and with [at] replaced properly)</p>
<p>       
for end-to-end encryption: <a href="smimecert.pem">S/MIME certificate</a>, <a href="pgpkey.asc">PGP key</a></p>
<p><strong>Office:</strong></p>
<p>       
PKB 365   (Kornmarkt 1, Paul Kirchhoff Building, room 365)</p>
<p><strong>Office hours:</strong></p>
<p>       
by prior arrangement</p>
<p><strong>Phone:</strong></p>
<p>       
0375 536 1380   (international: +49 375 536 1380)</p>
<p><strong>Secretary:</strong></p>
<p>       
<a href="https://www.fh-zwickau.de/pti/organisation/fachgruppe-mathematik/personen/elke-roeder">Elke Röder</a></p>
<p><strong>Postal address:</strong></p>
<p>       
Westsächsische Hochschule Zwickau<br>
       
Fakultät Physikalische Technik/Informatik<br>
       
Fachgruppe Mathematik<br>
       
Jens Flemming<br>
       
Postfach 201037<br>
       
D-08012 Zwickau, Germany</p>
 
</article>
        
    <footer>
        <p>
            <a href="https://wwwstud.fh-zwickau.de/jef19jdw/index.html">Contact</a> |
            <a href="https://www.fh-zwickau.de/service/impressum">Imprint</a> |
            <a href="https://www.fh-zwickau.de/service/datenschutz">Data privacy</a> |
            <a href="https://www.fh-zwickau.de/service/barrierefreiheitserklaerung">Accessibility</a> |
            last modified: 14. August 2022
        </p>
    </footer>
  </body>
</html>
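
The request above assumes that everything goes well. As noted earlier, a server may also send an error message or not answer at all. The following is a minimal sketch of how one might handle these cases. The function name fetch_text and the timeout of 10 seconds are our own choices for illustration, not part of requests.

```python
import requests


def fetch_text(url, timeout=10):
    """Return the body of url as a string, or None if the request fails."""
    try:
        # timeout (in seconds) stops us from waiting forever for a silent server
        response = requests.get(url, timeout=timeout)
        # turn HTTP error codes such as 404 or 500 into an exception
        response.raise_for_status()
    except requests.exceptions.RequestException as error:
        print(f'Request failed: {error}')
        return None
    # response.text is response.content decoded to a string, using the
    # encoding requests determined for this response
    return response.text
```

Calling fetch_text('https://www.fh-zwickau.de/~jef19jdw/index.html') should then yield the HTML document as a string if the server answers properly, and None otherwise.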

Web APIs#

Many webpages are to some extent dynamic. Their content can be influenced by passing parameters to them. Different techniques exist for this purpose. Most common are so-called ‘GET’ and ‘POST’. We only consider the first method here.

Passing arguments via ‘GET’ is very simple. We just add them to the URL. If the webpage processes arguments with names arg1, arg2, arg3 and if we want to pass corresponding values value1, value2, value3, we may request the URL

http://some.where/some_page.html?arg1=value1&arg2=value2&arg3=value3

The requests.get function accepts the keyword argument params, which improves readability. Instead of composing a long URL string by hand we may write:

url = 'http://some.where/some_page.html'
params = {'arg1': 'value1',
          'arg2': 'value2',
          'arg3': 'value3'}
response = requests.get(url, params=params)
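
To see how such a dictionary turns into a URL without sending any request, we may compose the query string ourselves. The sketch below uses urlencode from the standard library’s urllib.parse module as a stand-in. It is meant to illustrate the idea, not to describe what requests does internally. The second dictionary is made up for demonstration.

```python
from urllib.parse import urlencode

url = 'http://some.where/some_page.html'
params = {'arg1': 'value1',
          'arg2': 'value2',
          'arg3': 'value3'}

# join names and values with '=' and '&', then append to the URL after '?'
full_url = f'{url}?{urlencode(params)}'
print(full_url)

# characters with a special meaning in URLs get escaped automatically
special = urlencode({'q': 'data science', 'lang': 'de&en'})
print(special)
```

Automatic escaping is a second reason, besides readability, to pass a dictionary instead of building the URL string by hand.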

Most web services for data retrieval do not return HTML documents, but more machine-readable formats like CSV, JSON, or YAML. There are Python modules for parsing all common formats.
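
As a small illustration, the following sketch parses a JSON and a CSV answer with the standard library modules json and csv. Both answers are made-up strings standing in for what a web service might return. With requests the corresponding string would typically be available as response.text.

```python
import csv
import io
import json

# hypothetical JSON answer of a web service
json_text = '{"station": "Zwickau", "temperatures": [18.5, 21.0, 19.5]}'
data = json.loads(json_text)    # nested dicts and lists
print(data['station'], max(data['temperatures']))

# hypothetical CSV answer of a web service
csv_text = 'date,temperature\n2020-06-23,18.5\n2020-06-24,21.0\n'
rows = list(csv.DictReader(io.StringIO(csv_text)))    # one dict per data row
print(rows[0]['date'], float(rows[0]['temperature']))
```

Note that csv yields all values as strings, so numbers have to be converted explicitly.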

Web Scraping#

Sometimes the data we want to analyze is scattered over a website and no direct connection to the underlying database is available. Thus, we have to find ways to extract data from websites automatically. The process of extracting data from websites is referred to as web scraping.

Legal considerations#

There is no law which directly prohibits web scraping. But a website or part of it may be protected by copyright law. Almost all large websites have terms of use, which have to be respected by the user. Some websites explicitly prohibit automated data extraction. Some only prohibit commercial use of the provided data. Before starting a scraping project read the terms of use!

When in doubt ask the website provider for written permission to scrape data from the site or ask a lawyer!

Another issue is the web traffic caused by scrapers. A scraping project might require several thousand requests to a server within a very short time. This may hurt the provider’s infrastructure. A common attack for taking down a website is to send thousands of requests fast enough to prevent the server from answering requests from other users (DoS attack, denial-of-service attack). We don’t want to be attackers. Thus, whenever you start a scraping project, tell your script to wait a few seconds between consecutive requests to a server!
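
A minimal sketch of such a waiting loop might look as follows. The function name fetch_politely, its parameters, and the default delay of 2 seconds are our own choices for illustration. The function doing the actual request is passed in as an argument.

```python
import time


def fetch_politely(urls, fetch, delay=2.0):
    """Call fetch(url) for each URL, pausing delay seconds between calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay)    # give the server a break before the next request
        results.append(fetch(url))
    return results
```

With the requests module one might call fetch_politely(urls, requests.get, delay=3), where urls is a list of the pages to download.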

Scraping Media Files from Websites#

Downloading a webpage via requests.get only yields the HTML document. Images and other media usually are not contained in HTML files. To download all images of a webpage we would have to find all img tags in the HTML file, then extract the URLs from the corresponding src attributes, and then download each URL separately.
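
The first two steps can be sketched with the standard library alone. We use html.parser here instead of a dedicated scraping package to keep the example self-contained. The class name ImageCollector is made up, and the HTML snippet is an excerpt of the page downloaded above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collect the src attribute of every img tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            for name, value in attrs:
                if name == 'src' and value:
                    self.sources.append(value)


page_url = 'https://www.fh-zwickau.de/~jef19jdw/index.html'
html = '''
<p><img id="logo_whz" src="https://wwwstud.fh-zwickau.de/jef19jdw/logo_whz.svg" alt="WHZ logo"/></p>
<p><img src="flemming.jpg" alt="Jens Flemming"></p>
'''

collector = ImageCollector()
collector.feed(html)

# relative URLs like 'flemming.jpg' have to be resolved against the page URL
image_urls = [urljoin(page_url, src) for src in collector.sources]
print(image_urls)
```

Each URL in image_urls could then be requested separately, writing the content of the response to a file opened in binary mode.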

Useful Little Helpers#

Scraping data from websites often is tedious work and each scraping project requires different techniques for data extraction. Knowing some little helpers may save the day.

Regular Expressions#

There is a mini language to describe query strings for text search. Corresponding search strings are called regular expressions. They can be used, for instance, in conjunction with Beautiful Soup. We do not go into the details here. Just an example:

import re    # Python's support for regular expressions

some_string = 'banana, apple, cucumber, orange'
pattern = r'[aeiou].[aeiou]'    # vowel, any single character, vowel

result = re.findall(pattern, some_string)

print(result)

['ana', 'ucu', 'ora']

For details and more examples see the documentation of the re module.
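
A pattern closer to typical scraping tasks is sketched below. It pulls dates written as day.month.year out of a made-up piece of text. Parentheses in the pattern form groups, and re.findall then returns one tuple of groups per match.

```python
import re

text = 'created 23.06.2020, last modified 07.07.2020'
# two digits, dot, two digits, dot, four digits
pattern = r'(\d{2})\.(\d{2})\.(\d{4})'

dates = re.findall(pattern, text)
for day, month, year in dates:
    print(year, month, day)
```

The backslash in front of each dot is needed because an unescaped dot matches any character.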

Dates and Times#

Most data contains time stamps. Python ships with the modules datetime and time for handling dates and times. The former provides tools for carrying out calculations with dates and times. The latter provides various other time-related functionality.

datetime provides objects expressing a point in time (date, time, datetime) and objects expressing a duration (timedelta).

import datetime

some_date = datetime.date(2020, 6, 23)
some_delta = datetime.timedelta(weeks=2)

new_date = some_date + some_delta

print(f'It\'s {new_date.day:02}.{new_date.month:02}.{new_date.year}.')

It's 07.07.2020.

For details see the documentation of the datetime module.
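
Time stamps in downloaded data usually arrive as strings. A minimal sketch of converting between strings and datetime objects with strptime and strftime follows. The time stamp and its format are made up for illustration.

```python
import datetime

raw = '23.06.2020 14:30'
# format codes: day, month, four-digit year, hour, minute
stamp = datetime.datetime.strptime(raw, '%d.%m.%Y %H:%M')
later = stamp + datetime.timedelta(weeks=2)

print(later.strftime('%d.%m.%Y %H:%M'))
print(later.isoformat())
```

The format string passed to strptime has to match the layout of the time stamps in the data at hand.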

From the time module we might use time.sleep to add a delay between subsequent requests to a server.

import time

print('Have a break...')
time.sleep(5)    # seconds
print('...now I\'m back.')

Have a break...
...now I'm back.

For details see the documentation of the time module.
