Saving and Loading Non-Standard Data#

Python modules exist for almost all standard file formats. Readers and writers for several formats are also included in larger packages like matplotlib, opencv, and pandas. To share data with others, always use a standard file format (PNG or JPEG for images, CSV for tabulated data, and so on).

For storing temporary data like interim results, NumPy and the pickle module from Python’s standard library provide very convenient quick-and-dirty functions. Besides those functions, this chapter also discusses how to read custom binary file formats.

Related projects:

MNIST Character Recognition
- The xMNIST Family of Data Sets
- Load QMNIST

Saving and Loading NumPy Arrays#

NumPy provides functions for saving arrays to files and for loading arrays from files.

import numpy as np

One Array per File#

With np.save we can write one array to a file.

a = np.array([1, 2, 3])

np.save('some_array.npy', a)

The np.load function reads an array from a file written with np.save:

a = np.load('some_array.npy')

print(a)

[1 2 3]

Multiple Arrays#

To save multiple arrays to one file, use np.savez and provide each array as a keyword argument. The result is the same as calling np.save for each array and collecting the resulting files in an uncompressed (!) ZIP archive. File names in the ZIP archive correspond to the keyword argument names.

a = np.array([1, 2, 3])
b = np.array([4, 5])

np.savez('many_arrays.npz', a=a, b=b)

Use np.load to load multiple arrays written with savez. The returned object is dict-like, that is, it behaves like a dictionary, but isn’t of type dict. Conversion to dict works as expected.

with np.load('many_arrays.npz') as data:    # data is dict-like
    a = data['a']
    b = data['b']
    
print(a)
print(b)

[1 2 3]
[4 5]

To get a compressed ZIP archive use np.savez_compressed.
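A minimal sketch comparing both variants. The file names, the temporary directory, and the test arrays are chosen for illustration only. It also shows the files attribute of the loaded object and the conversion to a regular dict.

```python
import os
import tempfile

import numpy as np

a = np.zeros(10_000)    # highly compressible data (illustrative choice)
b = np.arange(5)

# write to a temporary directory so that no files are left behind
with tempfile.TemporaryDirectory() as tmp_dir:
    plain_path = os.path.join(tmp_dir, 'plain.npz')
    packed_path = os.path.join(tmp_dir, 'packed.npz')

    np.savez(plain_path, a=a, b=b)
    np.savez_compressed(packed_path, a=a, b=b)

    plain_size = os.path.getsize(plain_path)
    packed_size = os.path.getsize(packed_path)

    # loading works the same way for both variants
    with np.load(packed_path) as data:
        names = sorted(data.files)    # names of the stored arrays
        arrays = dict(data)           # conversion to a regular dict

print(names)
print(packed_size < plain_size)
```

How much space compression saves depends on the data. An array of zeros compresses very well, random data hardly at all.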

Saving and Loading Arbitrary Python Objects#

The pickle module provides functions for pickling (saving) and unpickling (loading) almost arbitrary Python objects to and from files, respectively. For details on which objects are picklable, see the documentation of the pickle module.

import pickle

There are two interfaces: either use the functions dump and load, or create a Pickler and an Unpickler object. Here we only discuss the former variant. For the latter, see the pickle module in Python’s documentation.

Pickling#

Steps for pickling are:

Open a file for writing in binary mode.
Call dump for each object to pickle.
Close the file.

some_object = [1, 2, 3, 4]
another_object = 'I\'m a string.'

with open('test.pkl', 'wb') as f:
    pickle.dump(some_object, f)
    pickle.dump(another_object, f)

Unpickling#

Steps for unpickling are:

Open the file for reading in binary mode.
Call load for each object to unpickle.
Close the file.

with open('test.pkl', 'rb') as f:
    some_object = pickle.load(f)
    another_object = pickle.load(f)

print(some_object)
print(another_object)

[1, 2, 3, 4]
I'm a string.

Unpickling objects from unknown sources is a security risk. See pickle’s documentation.

(Un)Pickling many Objects#

If you have many objects to pickle, create a list of all objects and pickle the list. The advantage is that for unpickling you do not have to remember how many objects you have pickled. Simply unpickle the list and look at its length.
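The list approach could look like the following sketch. The objects and the file name are made up for illustration, and a temporary directory is used to keep the example free of side effects.

```python
import os
import pickle
import tempfile

# illustrative objects of different types
objects = [[1, 2, 3, 4], "I'm a string.", {'key': 3.14}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'objects.pkl')

    # one dump call for the whole list
    with open(path, 'wb') as f:
        pickle.dump(objects, f)

    # one load call returns the whole list
    with open(path, 'rb') as f:
        loaded = pickle.load(f)

print(len(loaded))
for obj in loaded:
    print(obj)
```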

Reading Custom Binary File Formats#

Sometimes data comes in custom binary formats for which no library functions exist. To read data from binary files we have to know how to interpret the data. Which bytes represent text? Which bytes represent numbers? And so on. Without a format specification, binary files are almost useless.

Viewing Binary Files#

To view binary files use a hex editor. A hex editor shows a file byte by byte, where each byte is shown as two hexadecimal digits. If you do not have a hex editor installed, try wxHexEditor.

Fig. 26 A hex editor shows file contents in hexadecimal notation and as ASCII characters (right column) together with common interpretations (lower panel).#

Most binary files are composed of strings, bit masks, integers, floats, and padding bytes. The hex editor shows common interpretations of the bytes at the current cursor position.

Reading Strings#

We already discussed decoding binary data to strings in the chapter on Text Files. The only question is how to find the end of a string. This question should be answered in the format specification. Usually string data is terminated by a byte with value 0.
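A minimal sketch for the zero-terminated case. The byte sequence and the ASCII encoding are assumptions for illustration. A real format specification has to state the encoding and the termination rule.

```python
# hypothetical file content: a zero-terminated string followed by other data
raw = b'temperature\x00\x01\x02\x03'

# position of the first byte with value 0 (raises ValueError if there is none)
end = raw.index(0)

# decode everything before the terminator (the encoding is an assumption)
name = raw[:end].decode('ascii')

# remaining data starts right after the terminator
rest = raw[end + 1:]

print(name)
print(rest)
```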

Reading Bit Masks#

Bit masks are bytes in which each bit describes a truth value. To extract a bit from a byte, most programming languages provide bitwise operators. Here we interpret a byte as a sequence of 8 bits. The following bitwise operations can be used:

a & b returns 1 at a bit position if and only if a and b are both 1 at this position (bitwise and).
a | b returns 1 at a bit position if and only if at least one of a and b is 1 at this position (bitwise or).
a ^ b returns 1 at a bit position if and only if exactly one of a and b is 1 at this position (bitwise exclusive or).
~a returns 1 at a bit position if and only if a is 0 at this position (bitwise not).

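Python implements these bitwise operators for signed integers of unlimited size, so results can be unexpected (for instance, ~ applied to a non-negative integer yields a negative one). Python has no unsigned integer type, so it is often more convenient to use NumPy’s fixed-width unsigned types such as np.uint8.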
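The following sketch illustrates the difference between Python's built-in integers and NumPy's 8-bit unsigned type for the bitwise not. The variable name mask is chosen for illustration.

```python
import numpy as np

mask = 0b00100000

# Python integers are signed and unbounded, so ~ yields a negative number
print(~mask)                   # -33
print(bin(~mask))              # -0b100001

# masking with 0xFF restricts the result to one byte
print(bin(~mask & 0xFF))       # 0b11011111

# NumPy's uint8 stays within 8 bits
print(bin(~np.uint8(mask)))    # 0b11011111
```

With plain Python integers, masking with 0xFF after each operation is an alternative to using NumPy types.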

To read the third bit (counting from the most significant bit) use & 0b00100000:

# some integer to be interpreted as bit mask (prefix 0b indicates binary notation)
bit_mask = np.uint8(0b10111100)

# get bit and convert result from int to bool
third_bit = bool(bit_mask & np.uint8(0b00100000))

third_bit

True

To set the third bit to 1 (when writing binary files) use | 0b00100000.

# some integer to be interpreted as bit mask (prefix 0b indicates binary notation)
bit_mask = np.uint8(0b10011100)

# update bit mask (set third bit without modifying others)
bit_mask = bit_mask | np.uint8(0b00100000)

bin(bit_mask)

'0b10111100'

To set the third bit to 0 (when writing binary files) use & ~0b00100000.

# some integer to be interpreted as bit mask (prefix 0b indicates binary notation)
bit_mask = np.uint8(0b10111100)

# update bit mask (clear third bit without modifying others)
bit_mask = bit_mask & ~np.uint8(0b00100000)

bin(bit_mask)

'0b10011100'

Reading Integers#

Integer values in a binary file may have different lengths, typically from 1 byte up to 8 bytes. Reading a 1-byte integer is simple: just read the byte. For 2-byte integers things become more involved. There is a first byte (closer to the beginning of the file) and a second byte, and there is no universally accepted rule for converting two bytes to an integer. Denoting the first byte by \(a\) and the second by \(b\), there are two possibilities:

\(a+256\,b\quad\) (least significant byte first, little endian, Intel format)
\(256\,a+b\quad\) (most significant byte first, big endian, Motorola format)

If we have 4-byte integers, the problem persists. With bytes \(a\), \(b\), \(c\), \(d\) we have

\(a+256\,b+256^2\,c+256^3\,d\quad\) (little endian)
\(256^3\,a+256^2\,b+256\,c+d\quad\) (big endian)

Analogously for 8-byte integers.
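As a quick check of the 4-byte formulas, the following sketch compares them with Python's built-in int.from_bytes. The byte values are arbitrary.

```python
# four arbitrary byte values in file order
a, b, c, d = 200, 3, 4, 5
data = bytes([a, b, c, d])

# formulas from above
little = a + 256 * b + 256**2 * c + 256**3 * d
big = 256**3 * a + 256**2 * b + 256 * c + d

# int.from_bytes implements both conventions for unsigned integers
print(little, int.from_bytes(data, 'little'))
print(big, int.from_bytes(data, 'big'))
```

Both printed pairs agree, so int.from_bytes is a convenient alternative for reading single integers without NumPy.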

NumPy provides the fromfile function to read integers and other numeric data from binary files. Besides offset (starting position in bytes) and count (number of items to read), it has a dtype keyword argument. The usual Python and NumPy types are allowed, but more detailed type control is possible by providing a string consisting of:

'<' (little endian) or '>' (big endian) and
'i' (signed integer) or 'u' (unsigned integer) and
length of item in bytes.

Reading unsigned 32-bit integers in little endian notation would require '<u4', for instance.
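A minimal sketch of np.fromfile with dtype, count and offset. The file layout (a 2-byte header followed by big endian 16-bit unsigned integers) and the file name are hypothetical and only serve as illustration.

```python
import os
import tempfile

import numpy as np

# hypothetical layout: 2 header bytes, then 16-bit unsigned integers
# in big endian byte order
content = bytes([0xAB, 0xCD,  0, 1,  1, 0,  200, 3])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'custom.bin')
    with open(path, 'wb') as f:
        f.write(content)

    # skip the header (offset is given in bytes) and read 3 items
    values = np.fromfile(path, dtype='>u2', count=3, offset=2)

print(values)
```

The three values are 1, 256 and 51203, that is, \(256\,a+b\) for each pair of bytes after the header.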

If data is already in memory, use frombuffer instead of fromfile.

data = bytes([200, 3, 4, 5])

# 4 unsigned 8-bit integers
a = np.frombuffer(data, 'u1')
print(a)

# 4 signed 8-bit integers
a = np.frombuffer(data, 'i1')
print(a)

# 2 unsigned 16-bit integers (little endian)
a = np.frombuffer(data, '<u2')
print(a)

# 2 unsigned 16-bit integers (big endian)
a = np.frombuffer(data, '>u2')
print(a)

# 1 unsigned 32-bit integer (big endian)
a = np.frombuffer(data, '>u4')
print(a)

# 1 signed 32-bit integer (big endian)
a = np.frombuffer(data, '>i4')
print(a)

[200   3   4   5]
[-56   3   4   5]
[ 968 1284]
[51203  1029]
[3355640837]
[-939326459]

See Byte-swapping for more details on NumPy’s support of endianness.
