Text Files#

Text files are the most basic type of files. They contain string data. Historically, there was a one-to-one mapping between byte values (0…255) and characters. Nowadays things are more complex, because representing all the world's languages requires more than 256 different characters. When reading from and writing to text files, the mapping between characters and their numerical representation in memory or on storage devices is of utmost importance.

Text files not only contain so-called printable characters like letters and digits, but also control characters like line breaks and tab stops. Related issues are discussed in this chapter, too.

Encodings#

Every kind of data has to be converted to a stream of bits; otherwise it cannot be processed by a computer. For strings we have to distinguish between their representation on screen (which symbol) and their representation in memory (which sequence of bits). Mapping from the screen representation to the memory representation is known as encoding; decoding is the mapping in the opposite direction.

[Comic: two people discussing the need for a 15th unifying standard, given that 14 standards already exist]

Fig. 25 Fortunately, the charging one has been solved now that we’ve all standardized on mini-USB. Or is it micro-USB? Shit. Source: Randall Munroe, xkcd.com/927#

ASCII#

Historically, each character of a string has been encoded as exactly one byte. A byte can hold values from 0 to 255. Thus, only 256 different characters are available, including so-called control characters like tab and newline characters.

The mapping between byte values and characters, the so-called character encoding, has to be standardized to allow exchanging text files. For a long time, the most widespread standard has been ASCII (American Standard Code for Information Interchange). But since ASCII does not contain special characters of other languages, like umlauts, several other encodings were developed. The ISO 8859 family is a very prominent set of ASCII derivatives.

The first 128 characters of almost all encodings coincide with ASCII, but the remaining 128 contain different symbols. Thus, to read text files one has to know the encoding used for saving the file. Typically, the encoding is not (!) saved in the file, but has to be guessed or communicated along with the file. Have a look at the list of encodings Python can process.
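A short experiment illustrates the problem: the same byte value stands for different symbols in different encodings. (The two encodings chosen here are just examples.)

```python
# Byte value 0xE4 is 'ä' in ISO 8859-1 (Latin alphabet),
# but 'δ' in ISO 8859-7 (Greek alphabet).
raw = bytes([0xE4])
print(raw.decode('iso-8859-1'))    # ä
print(raw.decode('iso-8859-7'))    # δ
```

Without knowing the intended encoding, there is no way to tell which of the two symbols the byte is supposed to represent.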

Unicode#

Nowadays, Unicode is the standard encoding. More precisely, Unicode defines a group of encodings. We do not go into the details here. For our purposes it suffices to know that Unicode contains well over one hundred thousand symbols and that the most important encoding of Unicode is called UTF-8. The 8 indicates that the most common characters require only 8 bits. The symbols associated with the byte values 0 to 127 coincide with ASCII. A byte value above 127 indicates a multi-byte symbol comprising two, three, or four bytes.
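We can check how many bytes UTF-8 uses per symbol by encoding single characters (the sample characters here are arbitrary):

```python
# ASCII characters need one byte, umlauts two,
# the euro sign three, and emoji four.
for symbol in 'A', 'ä', '€', '🐍':
    print(symbol, len(symbol.encode('utf-8')))
```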

Linux/Unix/macOS

Non-Windows systems (Linux, Unix, macOS) have had native UTF-8 support for decades. It's the standard encoding for websites and other internet-related applications.

Windows

Windows, even Windows 10, uses a different Unicode encoding (UTF-16) under the hood and supports UTF-8 only at the surface. Sometimes, if one has to dig deeper into the system, unexpected things may happen. Older Windows versions did not have UTF-8 support at all. Always check the encoding if you work with text data generated on a Windows system!
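To see which encodings a system uses by default, we may ask Python. The output differs between platforms; on many Windows installations the preferred encoding is a legacy code page such as cp1252 rather than UTF-8.

```python
import locale
import sys

# Default encoding for str.encode() and bytes.decode().
print(sys.getdefaultencoding())
# Default encoding for open() in text mode if none is specified.
print(locale.getpreferredencoding(False))
```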

Encodings in Python#

Python uses UTF-8 by default and strictly distinguishes between strings and their encoded representation. The string is what we see on screen, whereas the encoded form is what is written to memory and storage devices.

String objects provide the encode member function. This function returns a sequence of bytes. This sequence is of type bytes. A bytes object is immutable. In essence, it’s a tuple of integers between 0 and 255.

The other way round, bytes objects provide a member function decode to transform them back to strings.

a = 'some string with umlauts: ä, ö, ü'
b = a.encode()
print(b)
b'some string with umlauts: \xc3\xa4, \xc3\xb6, \xc3\xbc'
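Since a bytes object behaves like a tuple of small integers, indexing yields int values and slicing yields a new bytes object (the sample string is arbitrary):

```python
b = 'abc ä'.encode()
print(b[0])       # 97, the integer value of 'a'
print(b[-2:])     # b'\xc3\xa4', the two bytes encoding 'ä'
print(list(b))    # [97, 98, 99, 32, 195, 164]
```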

As we see, bytes objects can be written like strings, but prefixed with b. The only difference is that all bytes holding values above 127 or representing non-printable characters (line breaks, for instance) are replaced by their integer values in hexadecimal notation with the prefix \x, which is the escape sequence for specifying characters in hexadecimal notation. For octal notation, the escape sequence is \ooo, where ooo is a three-digit octal number.
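Hexadecimal and octal escape sequences denote the same byte values as the usual character notation:

```python
# \x41 (hex) and \101 (octal) both denote byte value 65, that is, 'A'.
print(b'\x41' == b'A' == b'\101')      # True
# Two bytes in hex notation decode to one umlaut.
print(b'\xc3\xa4'.decode('utf-8'))     # ä
```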

c = b.decode()
print(c)
some string with umlauts: ä, ö, ü

Note

The encode and decode methods accept an optional encoding parameter, which defaults to 'utf-8'.

There is also a mutable version of bytes objects: bytearray objects. They provide a decode function, too.
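In contrast to bytes, a bytearray can be modified in place:

```python
ba = bytearray('abc'.encode())
ba[0] = 122           # replace the first byte; 122 is the value of 'z'
print(ba.decode())    # zbc
```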

Reading from a file opened in text mode is equivalent to opening the file in binary mode, reading, and then calling decode. Similarly for writing. For text mode, the open function has an optional encoding parameter, which defaults to the platform's preferred encoding (usually UTF-8 on Linux/Unix/macOS, but possibly a legacy encoding on Windows).
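The following sketch demonstrates the equivalence (the file name is arbitrary; the content contains no line breaks, so newline translation plays no role):

```python
# Write a small UTF-8 encoded file.
with open('equiv_demo.txt', 'w', encoding='utf-8') as f:
    f.write('umlauts: ä, ö, ü')

# Read it in text mode...
with open('equiv_demo.txt', 'r', encoding='utf-8') as f:
    text_mode = f.read()

# ...and in binary mode followed by a call to decode.
with open('equiv_demo.txt', 'rb') as f:
    binary_mode = f.read().decode('utf-8')

print(text_mode == binary_mode)    # True
```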

Line Breaks#

Encoding line breaks in text files is done differently on different operating systems. The ASCII and Unicode standards define two symbols indicating a line break. One is symbol 10, known as line feed (LF for short). The other is symbol 13, known as carriage return (CR for short).
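Python's ord function shows the code points of the two control characters:

```python
# '\n' is the escape sequence for line feed, '\r' for carriage return.
print(ord('\n'), ord('\r'))             # 10 13
print('\n'.encode(), '\r'.encode())     # b'\n' b'\r'
```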

Historically, when typewriters were the standard text processing tools, starting a new line required two actions: advance the paper to the next line without moving the carriage (line feed), and return the carriage to its starting position (carriage return). Thus, there are two different symbols for these two actions.

Linux/Unix/macOS

Linux and other Unix-like systems (macOS, for instance) use single-byte line breaks encoded by LF. Classic Mac OS used CR, but with the switch to the Unix-based macOS Apple moved to LF.

Windows

Windows adheres to the two-step legacy from the pre-computer era. That is, on Windows line breaks in text data are encoded by the two bytes CR and LF.

Python can handle all three line break conventions (LF, CR, CR LF) and tries to hide the differences from the programmer. But be aware that writing text files may produce different results on Windows and on Linux/Unix/macOS machines.
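The newline parameter of open controls this translation. A minimal sketch (the file name is arbitrary): passing newline='\r\n' forces Windows-style line breaks on every platform, and reading in binary mode reveals the raw bytes:

```python
# Force CR LF line breaks regardless of the operating system.
with open('newline_demo.txt', 'w', newline='\r\n') as f:
    f.write('one\ntwo')

# Binary mode shows the bytes actually stored.
with open('newline_demo.txt', 'rb') as f:
    print(f.read())    # b'one\r\ntwo'
```

Conversely, passing newline='' when reading disables the translation, so the string contains the line breaks exactly as stored in the file.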

Encoding Problem Examples#

import os.path

Wrong Encoding#

If we open an ISO 8859-1 encoded text file without specifying an encoding (so that the default, here UTF-8, is used), the interpreter either fails to interpret some bytes or shows wrong symbols.

f = open(os.path.join('testdir', 'iso8859-1.txt'), 'r')
text = f.read()
f.close()

print(text)
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Input In [4], in <cell line: 2>()
      1 f = open(os.path.join('testdir', 'iso8859-1.txt'), 'r')
----> 2 text = f.read()
      3 f.close()
      5 print(text)

File ~/anaconda3/envs/ds_book/lib/python3.10/codecs.py:322, in BufferedIncrementalDecoder.decode(self, input, final)
    319 def decode(self, input, final=False):
    320     # decode input (taking the buffer into account)
    321     data = self.buffer + input
--> 322     (result, consumed) = self._buffer_decode(data, self.errors, final)
    323     # keep undecoded input until the next call
    324     self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 24: invalid continuation byte

If we open a UTF-8 encoded file with ISO 8859-1 decoding, we see garbled symbols.

f = open(os.path.join('testdir', 'utf-8.txt'), 'r', encoding='iso-8859-1')
text = f.read()
f.close()

print(text)
Some umlauts: Ã¤, Ã¶, Ã¼.
This file is UTF-8 encoded.

Writing Line Breaks#

The following code produces different files on Linux/Unix/macOS and Windows.

f = open(os.path.join('testdir', 'testwrite.txt'), 'w')
text = f.write('test\n\n\n\n\n\n\n\n\n\ntest')
f.close()

On Linux and Co. the file will have 18 bytes. On Windows it will have 28 bytes due to Windows’ 2-byte line breaks. Opening the file in binary mode shows the line break encoding:

f = open(os.path.join('testdir', 'testwrite.txt'), 'rb')
text = f.read()
f.close()

print(tuple(text))
(116, 101, 115, 116, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 116, 101, 115, 116)

Using print(text) directly would show line breaks as the escape sequence \n, which is usually what we want, but not here. So we convert the bytes object to a tuple of integers before printing.

If the file has been written on a Windows machine, it looks like this:

f = open(os.path.join('testdir', 'testwrite-windows.txt'), 'rb')
text = f.read()
f.close()

print(tuple(text))
(116, 101, 115, 116, 13, 10, 13, 10, 13, 10, 13, 10, 13, 10, 13, 10, 13, 10, 13, 10, 13, 10, 13, 10, 116, 101, 115, 116)