# Get Data and Set Up the Environment
In this project we download public transport data and install several Python packages for processing it. Some basic knowledge of Python programming is required.
## Download Timetable Data
Timetable data for public transport operators in Germany is available in GTFS format.
Task: Go to gtfs.de. Find available GTFS feeds. What types of transport are contained in each feed? What time periods are covered by the data? Are we allowed to use the data?
Solution:
# your answers
Task: Download all available data from gtfs.de. Note download URLs and terminal commands (if you use the terminal).
Hint
For download via terminal on Linux use

```shell
curl URL -o DESTINATION_FILE_NAME
```
Solution:
# your notes
## Download OpenStreetMap Data
To compute walking distances between neighboring public transport stops we'll use data from OpenStreetMap (OSM). The OSM website offers downloads of (too) small regions or of the whole planet (about 60 GB). Geofabrik GmbH provides regional downloads.
Task: Check the OSM licence information. Then download OSM data for Europe in PBF format (Germany alone is not enough, because GTFS data may contain stops in neighboring countries if German trains cross borders). Note the download URL and terminal commands.
Solution:
# your notes
## Extract Region of Interest from OSM Data
Extracting walking distances from OSM data requires a lot of memory, and memory consumption grows with the size of the region under consideration. Thus, we should extract our region of interest from Europe's OSM file.
Task: Find minimum and maximum latitude and longitude of your region of interest (go to OSM and look at the coordinates of some object on the border of your region of interest).
Solution:
# your answer
There exist many tools for processing OSM data. A very handy one is Osmosis. You may use it as a Python package or in the terminal. The terminal command for data extraction is

```shell
osmosis --rb file=SOURCE_FILE --bb left=... right=... top=... bottom=... --wb file=DESTINATION_FILE
```
Task: Extract your region of interest with Osmosis. Note the full terminal command.
Solution:
# your notes
## Conda Environment for GTFS Processing
We want to use the `gtfspy` Python package. It has been unmaintained since (at least) 2019, so installation is tricky due to outdated dependencies. But it's a nice package that includes fast public transport routing. It was developed for creating "A collection of public transport network data sets for 25 cities" (also see the corresponding GitHub repo).
To avoid messing up your everyday Conda environment with failed installations and broken dependencies, create a new Conda environment for this project.
Task: Create a new Conda environment `gtfs`. If working on Gauss, don't forget to create a corresponding ipykernel for Jupyter and to switch your notebook's kernel to the new one.
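Hint

A minimal sketch of the commands, assuming the environment name `gtfs` from the task and an example Python version; adjust as needed:

```shell
# create a fresh environment for this project (Python version is an example)
conda create -n gtfs python=3.8 ipykernel

# activate it
conda activate gtfs

# register a Jupyter kernel for the new environment (needed on Gauss),
# then switch your notebook's kernel to "Python (gtfs)"
python -m ipykernel install --user --name gtfs --display-name "Python (gtfs)"
```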
Solution:
# your notes
## Install osmread
The `gtfspy` package depends on the `osmread` package. But `osmread` isn't available via Conda, and via PyPI (that is, `pip`) we only get an older version with outdated (unsatisfiable) dependencies. Thus, we have to install `osmread` from source.
Task: Find out what the following commands do. For each line, write a short comment. Then run the commands (works on Linux, macOS, and the like; for Windows, minor modifications may be required).
```shell
conda activate gtfs
pip install argparse lxml protobuf==3.20.1
git clone https://github.com/dezhin/osmread.git
cd osmread
python setup.py install
cd ..
rm -r osmread
```
Solution:
# your notes
## Install gtfspy
The `gtfspy` package comes with outdated dependencies and several programming errors. Thus, we install it from source as a local package in our working directory. This way we can easily fix issues when they pop up.
Task: Find out what the following commands do. Why do we need the `mv` commands? For each line, write a short comment. Then run the commands (works on Linux, macOS, and the like; for Windows, minor modifications may be required).
```shell
pip install pandas networkx pyshp nose Cython shapely pyproj smopy geoindex geojson matplotlib-scalebar
git clone https://github.com/CxAalto/gtfspy.git
mv gtfspy gtfspy_gitrepo
mv gtfspy_gitrepo/gtfspy gtfspy
rm -r gtfspy_gitrepo
```
Solution:
# your notes
## Patch gtfspy
The `gtfspy` package uses several outdated library functions (mainly from the `networkx` package) and contains some programming errors. Some patching is in order…
Task: Implement the modifications listed below and think about why they could be necessary (make short notes).
Solution:
# your notes
in `gtfspy/osm_transfers.py`:
replace (line 91)

```python
network_nodes = walk_network.nodes(data="true")
```

by

```python
network_nodes = walk_network.nodes(data=True)
```
replace (line 139)

```python
walk_network.add_path(way.nodes)
```

by

```python
networkx.add_path(walk_network, way.nodes)
```
replace (lines 143-145)

```python
for node, degree in walk_network.degree().items():
    if degree is 0:
        walk_network.remove_node(node)
```

by

```python
nodes_to_remove = []
good_nodes = networkx.get_node_attributes(walk_network, 'lat').keys()
for node, degree in walk_network.degree():
    if degree == 0:
        nodes_to_remove.append(node)
    elif node not in good_nodes:
        nodes_to_remove.append(node)
for node in nodes_to_remove:
    walk_network.remove_node(node)
```
(`good_nodes` contains all nodes with lat/lon data; nodes without data presumably belong to ways crossing the map's border, where Osmosis dropped some nodes but did not shorten the way; the patch prevents index errors when computing edge lengths a few lines below)
in `gtfspy/networks.py`:
replace (lines 267-270)

```python
events_df.drop('to_seq', 1, inplace=True)
events_df.drop('shape_id', 1, inplace=True)
events_df.drop('duration', 1, inplace=True)
events_df.drop('route_id', 1, inplace=True)
```

by

```python
events_df.drop('to_seq', axis=1, inplace=True)
events_df.drop('shape_id', axis=1, inplace=True)
events_df.drop('duration', axis=1, inplace=True)
events_df.drop('route_id', axis=1, inplace=True)
```
in `gtfspy/routing/node_profile_multiobjective.py`:

replace (line 78)

```python
assert dep_time_index is 0, "first dep_time index should be zero (ensuring that all connections are properly handled)"
```

by

```python
assert dep_time_index == 0, "first dep_time index should be zero (ensuring that all connections are properly handled)"
```
## Create GTFS Database
To speed up routing, `gtfspy` stores all data in an SQLite database, which is a regular file with extension `sqlite`. The first step in working with `gtfspy` is to create the database containing all relevant GTFS feeds.
Task: Have a look at the `import_gtfs` function in `gtfspy`'s `import_gtfs` module. Use this function to transfer the GTFS feeds of interest to you to an SQLite database.
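Hint

A minimal sketch of such a call; the file paths are examples, and the exact signature may differ between versions, so check the function's source:

```python
from gtfspy import import_gtfs

# list all GTFS zip archives that should go into one database (example paths)
feeds = ["data/gtfs_regional.zip", "data/gtfs_long_distance.zip"]

# create the SQLite database from the feeds
import_gtfs.import_gtfs(feeds, "data/gtfs_germany.sqlite")
```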
Solution:
# your solution
## Extract Region from GTFS Database
If the imported GTFS data covers a much larger region than the one you are interested in, you should filter the created database by region. Otherwise, routing becomes too expensive in terms of computation time. The `gtfspy` package provides such filtering, but it's expensive, too. Thus, filtering should only be used if it reduces the database's size significantly.
Filtering requires three steps:
1. Open the database to filter by creating a `GTFS` object, defined in `gtfspy`'s `gtfs` module.
2. Create a `FilterExtract` object, defined in `gtfspy`'s `filter` module.
3. Call the `FilterExtract` object's `filter` method.
Task: Have a look at `gtfspy`'s source to learn how to use the above-mentioned objects and methods. Then filter the database by region (hint: the 'buffer zone' in `gtfspy`'s source is the region of interest).
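Hint

A minimal sketch of the three steps; the file names, the buffer values, and the parameter and method names are assumptions that differ between gtfspy versions, so check the `filter` module's source:

```python
from gtfspy.gtfs import GTFS
from gtfspy.filter import FilterExtract

# step 1: open the database to be filtered
g = GTFS("data/gtfs_germany.sqlite")

# step 2: create the filter object; the 'buffer zone' (a circle around a
# center point) is the region of interest (parameter names may differ)
extract = FilterExtract(g, "data/gtfs_region.sqlite",
                        buffer_lat=51.05, buffer_lon=13.74,
                        buffer_distance_km=30)

# step 3: write the filtered copy (this method is named
# create_filtered_copy in some versions of gtfspy)
extract.create_filtered_copy()
```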
Solution:
# your solution
## Add OSM Walking Distances to Database
To get more realistic walking times between neighboring stops we may extract walking distances from OpenStreetMap. This step is optional. It requires a lot of memory and computation time, because the whole walk network (all walkable paths and streets) is extracted from the OSM file. Use OSM walking distances for small regions only. Without OSM data, Euclidean distances are used.
Task: Have a look at `add_walk_distances_to_db_python` in `gtfspy`'s `osm_transfers` module. Then use this function to get OSM walking distances. If your region is too large, have a look at the hint below this task.
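Hint

A minimal sketch with example paths; the module path and parameter names may differ in your copy of gtfspy, so check the source:

```python
from gtfspy.gtfs import GTFS
from gtfspy.osm_transfers import add_walk_distances_to_db_python

# open the (possibly already filtered) database
g = GTFS("data/gtfs_region.sqlite")

# compute walking distances along the OSM walk network and store them
# in the database (PBF path is an example)
add_walk_distances_to_db_python(g, "data/region.osm.pbf")
```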
Solution:
# your solution
Hint
Without OSM walking distances, the routing algorithm will complain about the missing key `d_walk` in a dictionary. That's presumably a bug. Workaround: whenever you use your database (without OSM distances) for routing, add the following lines to your code:
```python
for u, v, data in walk_network.edges(data=True):
    data['d_walk'] = data['d']
```
Here `walk_network` is an object representing the walk network stored in the database. It is created as a preparatory step for routing and then passed to the routing algorithm. Place the code between the creation of the walk network and passing it to the routing algorithm. If you use these two lines of code with OSM distances, the OSM distances will be overwritten with Euclidean distances.
## Use the Database
To use the SQLite database we have to create a `GTFS` object, defined in `gtfspy`'s `gtfs` module. This object then provides lots of methods for accessing the data.
Task: Have a look at a `GTFS` object's `stops`, `get_min_date`, and `get_max_date` methods. Call them to get a list of all stops and the date range covered by the GTFS data.
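Hint

A minimal sketch with an example database path:

```python
from gtfspy.gtfs import GTFS

g = GTFS("data/gtfs_region.sqlite")

stops = g.stops()          # all stops stored in the database
print(stops)

print(g.get_min_date())    # first date covered by the data
print(g.get_max_date())    # last date covered by the data
```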
Solution:
# your solution