Get Data and Set Up the Environment#

In this project we download public transport data and install several Python packages for processing it. Basic knowledge of Python programming is required for this project.

Download Timetable Data#

Timetable data for public transport operators in Germany is available in GTFS format.

Task: Go to gtfs.de. Find available GTFS feeds. What types of transport are contained in each feed? What time periods are covered by the data? Are we allowed to use the data?

Solution:

# your answers

Task: Download all available data from gtfs.de. Note download URLs and terminal commands (if you use the terminal).

Hint

To download via the terminal on Linux use

curl URL -o DESTINATION_FILE_NAME

Solution:

# your notes

Download OpenStreetMap Data#

To compute walking distances between neighboring public transport stops we’ll use data from OpenStreetMap (OSM). The OSM website offers downloads of (too) small regions or of the whole planet (about 60 GB). Geofabrik GmbH provides regional downloads.

Task: Check OSM licence information. Then download OSM data for Europe in PBF format (Germany alone is not enough, because GTFS feeds may contain stops in neighboring countries if German trains cross borders). Note the download URL and terminal commands.

Solution:

# your notes

Extract Region of Interest from OSM Data#

Extracting walking distances from OSM data requires a lot of memory. Memory consumption grows with the size of the region under consideration. Thus, we should extract our region of interest from Europe’s OSM file.

Task: Find minimum and maximum latitude and longitude of your region of interest (go to OSM and look at the coordinates of some object on the border of your region of interest).
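
Hint

Once you have the four numbers, it helps to record them in a small, machine-readable form. A minimal sketch with hypothetical example coordinates (roughly the Dresden area); replace them with the values you read off the OSM map:

```python
# Bounding box of the region of interest.
# The coordinates below are hypothetical example values (roughly the
# Dresden area); replace them with the ones you read off the OSM map.
bbox = {"left": 13.5,    # minimum longitude (west)
        "right": 14.1,   # maximum longitude (east)
        "bottom": 50.9,  # minimum latitude (south)
        "top": 51.2}     # maximum latitude (north)

# sanity checks: west must lie west of east, south must lie south of north
assert bbox["left"] < bbox["right"]
assert bbox["bottom"] < bbox["top"]
print(bbox)
```

The keys are named after Osmosis’ bounding-box options, so the values can be pasted directly into the extraction command below.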

Solution:

# your answer

There exist many tools for processing OSM data. A very handy one is Osmosis. You may use it as a Python package or in the terminal. The terminal command for data extraction is

osmosis --rb file=SOURCE_FILE --bb left=... right=... top=... bottom=... --wb file=DESTINATION_FILE

Task: Extract your region of interest with Osmosis. Note the full terminal command.

Solution:

# your notes

Conda Environment for GTFS Processing#

We want to use the gtfspy Python package. It has been unmaintained since 2019 (at least). Thus, installation is tricky due to outdated dependencies. But it’s a nice package that includes fast public transport routing. It was developed for creating “A collection of public transport network data sets for 25 cities” (also see the corresponding GitHub repo).

To avoid messing up your everyday Conda environment with failed installations and broken dependencies, create a new Conda environment for this project.

Task: Create a new Conda environment gtfs. If working on Gauss, don’t forget to create a corresponding ipykernel for Jupyter and to switch your notebook’s kernel to the new one.

Solution:

# your notes

Install osmread#

The gtfspy package depends on the osmread package. But osmread isn’t available via Conda, and via PyPI (that is, pip) we get an old version with outdated (unsatisfiable) dependencies. Thus, we have to install osmread from source.

Task: Find out what the following commands do. For each line write a short comment. Then run the commands (works on Linux, macOS, and similar systems; for Windows minor modifications may be required).

conda activate gtfs
pip install argparse lxml protobuf==3.20.1
git clone https://github.com/dezhin/osmread.git
cd osmread
python setup.py install
cd ..
rm -r osmread

Solution:

# your notes

Install gtfspy#

The gtfspy package comes with outdated dependencies and several programming errors. Thus, we install it from source as a local package in our working directory. This way we may easily fix issues when they pop up.

Task: Find out what the following commands do. Why do we need the mv commands? For each line write a short comment. Then run the commands (works on Linux, macOS, and similar systems; for Windows minor modifications may be required).

pip install pandas networkx pyshp nose Cython shapely pyproj smopy geoindex geojson matplotlib-scalebar
git clone https://github.com/CxAalto/gtfspy.git
mv gtfspy gtfspy_gitrepo
mv gtfspy_gitrepo/gtfspy gtfspy
rm -r gtfspy_gitrepo

Solution:

# your notes

Patch gtfspy#

The gtfspy package uses several outdated library functions (mainly from the networkx package) and contains some programming errors. Some patching is in order…

Task: Implement the modifications listed below and think about why they could be necessary (make short notes).

Solution:

# your notes

in gtfspy/osm_transfer.py:

  • replace (line 91)

      network_nodes = walk_network.nodes(data="true")
    

    by

      network_nodes = walk_network.nodes(data=True)
    
  • replace (line 139)

          walk_network.add_path(way.nodes)
    

    by

          networkx.add_path(walk_network, way.nodes)
    
  • replace (line 143-145)

      for node, degree in walk_network.degree().items():
          if degree is 0:
              walk_network.remove_node(node)
    

    by

      nodes_to_remove = []
      good_nodes = networkx.get_node_attributes(walk_network, 'lat').keys()
      for node, degree in walk_network.degree():
          if degree == 0:
              nodes_to_remove.append(node)
          elif node not in good_nodes:
              nodes_to_remove.append(node)
      for node in nodes_to_remove:
          walk_network.remove_node(node)
    

(good_nodes contains all nodes with lat/lon data; nodes without data presumably belong to ways crossing the map’s border (some nodes are dropped by Osmosis, but the way is not shortened); this prevents index errors when computing edge lengths a few lines below)

in gtfspy/networks.py:

  • replace (lines 267-270):

      events_df.drop('to_seq', 1, inplace=True)
      events_df.drop('shape_id', 1, inplace=True)
      events_df.drop('duration', 1, inplace=True)
      events_df.drop('route_id', 1, inplace=True)
    

    by

      events_df.drop('to_seq', axis=1, inplace=True)
      events_df.drop('shape_id', axis=1, inplace=True)
      events_df.drop('duration', axis=1, inplace=True)
      events_df.drop('route_id', axis=1, inplace=True)
    

gtfspy/routing/node_profile_multiobjective.py (line 78):

  • replace

              assert dep_time_index is 0, "first dep_time index should be zero (ensuring that all connections are properly handled)"
    

    by

              assert dep_time_index == 0, "first dep_time index should be zero (ensuring that all connections are properly handled)"
    

Create GTFS Data Base#

To speed up routing, gtfspy stores all data in an SQLite data base. That’s an ordinary file with the extension sqlite. The first step in working with gtfspy is to create the data base containing all relevant GTFS feeds.

Task: Have a look at the import_gtfs function in gtfspy’s import_gtfs module. Use this function to transfer the GTFS feeds of interest to you into an SQLite data base.
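
Hint

A minimal sketch of such an import might look as follows. The file names are hypothetical, and the exact signature of import_gtfs (list of sources, output path, further keyword arguments) should be verified against gtfspy’s source:

```python
# Sketch only (assumptions: file names and the call signature; check
# gtfspy's import_gtfs module for the actual interface).
from gtfspy.import_gtfs import import_gtfs

feeds = ["germany_latest.zip"]             # hypothetical GTFS zip file(s)
import_gtfs(feeds, "gtfs_germany.sqlite")  # writes the SQLite data base
```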

Solution:

# your solution

Extract Region from GTFS Data Base#

If the imported GTFS data covers a much larger region than the region you are interested in, you should filter the created data base by region. Else, routing becomes too expensive (in terms of computation time). The gtfspy package provides such filtering, but it’s expensive, too. Thus, filtering should only be used if it reduces the data base’s size significantly.

Filtering requires three steps:

  1. Open the data base to filter by creating a GTFS object, defined in gtfspy’s gtfs module.

  2. Create a FilterExtract object, defined in gtfspy’s filter module.

  3. Call the FilterExtract object’s filter method.
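
The three steps above might be sketched as follows; the constructor parameters and the exact name of the filtering method are assumptions and must be checked against the gtfspy source:

```python
# Sketch only: parameter and method names are assumptions; verify them
# in gtfspy's gtfs and filter modules.
from gtfspy.gtfs import GTFS
from gtfspy.filter import FilterExtract

g = GTFS("gtfs_germany.sqlite")              # step 1: open the data base
fe = FilterExtract(g, "gtfs_region.sqlite",  # step 2: set up the extract
                   buffer_lat=51.05,         # hypothetical center (latitude)
                   buffer_lon=13.74,         # hypothetical center (longitude)
                   buffer_distance_km=30)    # radius of the 'buffer zone'
fe.create_filtered_copy()                    # step 3: run the filtering
```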

Task: Have a look at gtfspy’s source to learn how to use the above-mentioned objects and functions. Then filter the data base by region (hint: the ‘buffer zone’ in gtfspy's source is the region of interest).

Solution:

# your solution

Add OSM Walking Distances to Data Base#

To get more realistic walking times between neighboring stops we may extract walking distances from OpenStreetMap. This step is optional. It requires a lot of memory and computation time, because the whole walk network (all walkable paths and streets) is extracted from the OSM file. Use OSM walking distances for small regions only. Without OSM data, Euclidean distances are used.

Task: Have a look at add_walk_distances_to_db_python in gtfspy’s osm_transfer module. Then use this function to get OSM walking distances. If your region is too large, have a look at the hint below this task.
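
Hint

A sketch of this step, assuming the function takes an open GTFS object, the path to the extracted OSM file, and a cutoff distance in meters (argument names to be verified in the source):

```python
# Sketch only: argument names are assumptions; check gtfspy's
# osm_transfer module for the actual signature.
from gtfspy.gtfs import GTFS
from gtfspy.osm_transfer import add_walk_distances_to_db_python

g = GTFS("gtfs_region.sqlite")                         # open the data base
add_walk_distances_to_db_python(g, "region.osm.pbf",   # extracted OSM file
                                cutoff_distance_m=1000)  # max walking distance
```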

Solution:

# your solution

Hint

Without OSM walking distances the routing algorithm will complain about a missing key d_walk in a dictionary. That’s presumably a bug. Workaround: Whenever you use your data base (without OSM distances) for routing, add the following lines to your code:

for u, v, data in walk_network.edges(data=True):
    data['d_walk'] = data['d']

Here walk_network is an object representing the walk network stored in the data base. It will be created as a preparatory step for routing and then passed to the routing algorithm. Place the code between creation of the walk network and passing the walk network to the routing algorithm.

If you use these two lines of code with OSM distances, the OSM distances will be overwritten with Euclidean distances.

Use the Data Base#

To use the SQLite data base we have to create a GTFS object, defined in gtfspy’s gtfs module. This object then provides lots of methods for accessing the data.

Task: Have a look at a GTFS object’s stops, get_min_date, and get_max_date methods. Call them to get a list of all stops and the date range covered by the GTFS data.
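
Hint

A sketch of these calls (the data base file name is hypothetical; the return types are assumptions to be checked against gtfspy’s gtfs module):

```python
# Sketch only: file name is hypothetical; method names follow the task above.
from gtfspy.gtfs import GTFS

g = GTFS("gtfs_region.sqlite")  # open the data base

stops = g.stops()               # all stops (presumably a pandas DataFrame)
print(stops.head())

print(g.get_min_date())         # first date covered by the data
print(g.get_max_date())         # last date covered by the data
```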

Solution:

# your solution