Text Files#
Text files are the most basic type of file. They contain string data. Historically, there was a one-to-one mapping between byte values (0…255) and characters. Nowadays things are much more complex, because representing all the world’s languages requires more than 256 different characters. When reading from and writing to text files, the mapping between characters and their numerical representation in memory or on storage devices is of utmost importance.
Text files contain not only so-called printable characters like letters and numbers, but also control characters like line breaks and tab stops. Related issues will be discussed in this chapter, too.
Encodings#
Every kind of data has to be converted to a stream of bits; otherwise it cannot be processed by a computer. For strings we have to distinguish between their representation on screen (which symbol) and their representation in memory (which sequence of bits). Mapping from screen representation to memory representation is known as encoding. Decoding is the mapping in the opposite direction.
ASCII#
Historically, each character of a string has been encoded as exactly one byte. A byte can hold values from 0 to 255. Thus, only 256 different characters are available, including so-called control characters like tabs and new line characters.
The mapping between byte values and characters, the so-called character encoding, has to be standardized to allow exchanging text files. For a long time, the most widespread standard has been ASCII (American Standard Code for Information Interchange). But since ASCII does not contain special characters like the umlauts used in other languages, several other encodings were developed. The ISO 8859 family is a very prominent set of ASCII derivatives.
The first 128 characters of almost all encodings coincide with ASCII, but the remaining 128 contain different symbols. Thus, to read text files one has to know the encoding used for saving the file. Typically, the encoding is not (!) saved in the file, but has to be guessed or communicated along with the file. Have a look at the list of encodings Python can process.
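The following sketch illustrates why the encoding has to be known. It decodes the same two bytes with two members of the ISO 8859 family (the byte values are picked here as an example): the byte below 128 yields the same letter in both, while the byte above 127 yields different symbols.

```python
import unicodedata

# One byte below 128 (0x41) and one byte above 127 (0xA4).
raw = bytes([0x41, 0xA4])

as_latin1 = raw.decode('iso-8859-1')    # 'A' followed by the currency sign
as_latin9 = raw.decode('iso-8859-15')   # 'A' followed by the euro sign

# The first character agrees, the second one does not.
print(as_latin1[0] == as_latin9[0])      # True
print(unicodedata.name(as_latin1[1]))    # CURRENCY SIGN
print(unicodedata.name(as_latin9[1]))    # EURO SIGN
```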
Unicode#
Nowadays, Unicode is the standard encoding. More precisely, Unicode defines a group of encodings. We do not go into the details here. For our purposes it suffices to know that Unicode contains well over a hundred thousand symbols and that the most important encoding of Unicode is called UTF-8. The eight refers to the 8-bit units the encoding is built from: the most common characters require only 8 bits. The symbols associated with the byte values 0 to 127 coincide with ASCII. A byte value above 127 indicates a multi-byte symbol comprising two, three, or four bytes.
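A short sketch of the variable length of UTF-8: the four sample characters below (chosen for illustration) need one, two, three, and four bytes, respectively. Only the ASCII character is encoded by a byte value below 128.

```python
# 'a', 'ä', '€', and an emoji, written as escapes to keep the source ASCII-only.
samples = ['a', '\u00e4', '\u20ac', '\U0001F600']

for ch in samples:
    encoded = ch.encode('utf-8')
    # Print the code point, the number of bytes, and the byte values.
    print(f'U+{ord(ch):04X}', len(encoded), list(encoded))

lengths = [len(ch.encode('utf-8')) for ch in samples]
print(lengths)  # [1, 2, 3, 4]
```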
Linux/Unix/macOS
Non-Windows systems (Linux, Unix, macOS) have had native UTF-8 support for decades. UTF-8 is also the standard encoding for websites and other internet-related applications.
Windows
Windows, even Windows 10, uses a different Unicode encoding (UTF-16) under the hood and supports UTF-8 only at the surface. Sometimes, if one has to dig deeper into the system, unexpected things may happen. Older Windows versions did not have UTF-8 support at all. Always check the encoding if you work with text data generated on a Windows system!
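To see that different Unicode encodings produce different bytes for the same character, here is a small sketch comparing UTF-8 with UTF-16 (the little-endian variant 'utf-16-le' is used as an example; which variant a given Windows tool writes is not specified here):

```python
text = '\u00e4'   # the umlaut ä

utf8_bytes = text.encode('utf-8')
utf16_bytes = text.encode('utf-16-le')

print(utf8_bytes)    # b'\xc3\xa4'
print(utf16_bytes)   # b'\xe4\x00'

# Both byte sequences decode to the same string if the right encoding is used.
print(utf8_bytes.decode('utf-8') == utf16_bytes.decode('utf-16-le'))  # True
```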
Encodings in Python#
Python uses UTF-8 and strictly distinguishes between strings and their encoded representation. The string is what we see on screen, whereas the encoded form is what is written to memory and storage devices.
String objects provide the encode member function. This function returns a sequence of bytes. This sequence is of type bytes. A bytes object is immutable. In essence, it’s a tuple of integers between 0 and 255.
The other way round, bytes objects provide a member function decode to transform them to strings.
a = 'some string with umlauts: ä, ö, ü'
b = a.encode()
print(b)
b'some string with umlauts: \xc3\xa4, \xc3\xb6, \xc3\xbc'
As we see, bytes objects can be specified like strings, but prefixed by b. The only difference is that bytes holding values above 127 or non-printable characters are replaced by escape sequences. Common control characters such as line breaks and tabs appear as \n and \t. All other such bytes are shown as their integer values in hexadecimal notation with the prefix \x, which is the escape sequence for specifying characters in hexadecimal notation. If we want to use octal notation, the escape sequence is \000 where 000 is to be replaced by a three-digit octal number.
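A brief sketch of these notations: the byte value 228 (chosen as an example) can be written in hexadecimal or octal notation or constructed from an integer, and all three yield the same bytes object.

```python
hex_form = b'\xe4'         # hexadecimal escape: 0xe4 == 228
octal_form = b'\344'       # octal escape: 3*64 + 4*8 + 4 == 228
from_int = bytes([228])    # constructed from a list of integers

print(hex_form == octal_form == from_int)   # True

# Printing always uses \n, \t, ... or the \x notation, never octal.
print(b'\n\000\344')       # b'\n\x00\xe4'
```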
c = b.decode()
print(c)
some string with umlauts: ä, ö, ü
Note
The encode and decode methods accept an optional encoding parameter, which defaults to 'utf-8'.
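For illustration, the following sketch passes the encoding explicitly. The same one-character string becomes two bytes in UTF-8 but only one byte in ISO 8859-1:

```python
text = '\u00e4'   # the umlaut ä

utf8_bytes = text.encode()                  # same as text.encode('utf-8')
latin1_bytes = text.encode('iso-8859-1')

print(utf8_bytes)     # b'\xc3\xa4'
print(latin1_bytes)   # b'\xe4'

# Decoding has to use the encoding that was used for encoding.
restored = latin1_bytes.decode('iso-8859-1')
print(restored == text)   # True
```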
There is also a mutable version of bytes objects: bytearray objects. They provide a decode function, too.
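A small sketch of the difference in mutability (the strings are arbitrary examples): item assignment works on a bytearray object, but raises a TypeError on a bytes object.

```python
buffer = bytearray('Bar'.encode())
buffer[0] = ord('C')          # in-place modification is allowed
result = buffer.decode()
print(result)                 # Car

frozen = 'Bar'.encode()       # a bytes object
try:
    frozen[0] = ord('C')      # not allowed: bytes objects are immutable
    bytes_is_mutable = True
except TypeError:
    bytes_is_mutable = False
print(bytes_is_mutable)       # False
```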
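Reading from a file opened in text mode is equivalent to reading after opening in binary mode followed by a call to decode. Similarly for writing. The open function has an optional encoding parameter for text mode. If it is omitted, the default depends on the platform and its locale settings: typically 'utf-8' on Linux and macOS, but often a legacy code page such as 'cp1252' on Windows. Passing the encoding explicitly avoids surprises.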
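The equivalence can be checked with a short sketch. It writes a small file to a temporary directory (the file name demo.txt is a placeholder) and reads it once in text mode and once in binary mode followed by decode. The sample text contains no line breaks, so line break handling plays no role here.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'demo.txt')

    # Write in text mode with an explicit encoding.
    with open(path, 'w', encoding='utf-8') as f:
        f.write('\u00e4, \u00f6, \u00fc')

    # Variant 1: text mode, decoding is done by the file object.
    with open(path, 'r', encoding='utf-8') as f:
        from_text_mode = f.read()

    # Variant 2: binary mode, decoding is done by hand.
    with open(path, 'rb') as f:
        from_binary_mode = f.read().decode('utf-8')

print(from_text_mode == from_binary_mode)   # True
```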
Line Breaks#
Encoding line breaks in text files is done differently on different operating systems. The ASCII and Unicode standards define two symbols indicating a line break. One is symbol 10, known as line feed (LF for short). The other is symbol 13, known as carriage return (CR for short).
Historically, when typewriters were the standard text processing tools, starting a new line required two actions: move to next line without moving the carriage, then move the carriage to its rightmost position. Thus, there are two different symbols for these two actions.
Linux/Unix/macOS
Linux and other Unix-like systems (macOS, for instance) use single-byte line breaks encoded by LF. Old versions of Mac OS (before Mac OS X) used CR, but then developers switched to LF.
Windows
Windows adheres to the two-step legacy from the pre-computer era. That is, on Windows line breaks in text data are encoded by the two bytes CR and LF.
Python can handle all three versions of line break codes (LF, CR, CR LF) and tries to hide the differences from the programmer. But be aware that writing text files may produce different results on Windows and Linux/Unix/macOS machines.
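How Python hides the differences can be sketched with the newline parameter of the open function, which is not discussed in the text above. In the sketch below (file name and content are placeholders), Windows-style line breaks are forced on writing. Reading with default settings translates them back to \n, whereas newline='' switches the translation off.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'newlines.txt')

    # Force CR LF line breaks regardless of the operating system.
    with open(path, 'w', encoding='utf-8', newline='\r\n') as f:
        f.write('one\ntwo\n')

    # Binary mode shows what is really stored in the file.
    with open(path, 'rb') as f:
        raw = f.read()

    # Default text mode: every kind of line break is translated to '\n'.
    with open(path, 'r', encoding='utf-8') as f:
        translated = f.read()

    # newline='' disables the translation.
    with open(path, 'r', encoding='utf-8', newline='') as f:
        untranslated = f.read()

print(raw)                  # b'one\r\ntwo\r\n'
print(repr(translated))     # 'one\ntwo\n'
print(repr(untranslated))   # 'one\r\ntwo\r\n'
```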
Encoding Problem Examples#
import os.path
Wrong Encoding#
If we open an ISO 8859-1 encoded text file without specifying an encoding (so that UTF-8 is used here), the interpreter either fails to interpret some bytes or shows wrong symbols.
f = open(os.path.join('testdir', 'iso8859-1.txt'), 'r')
text = f.read()
f.close()
print(text)
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
Input In [4], in <cell line: 2>()
1 f = open(os.path.join('testdir', 'iso8859-1.txt'), 'r')
----> 2 text = f.read()
3 f.close()
5 print(text)
File ~/anaconda3/envs/ds_book/lib/python3.10/codecs.py:322, in BufferedIncrementalDecoder.decode(self, input, final)
319 def decode(self, input, final=False):
320 # decode input (taking the buffer into account)
321 data = self.buffer + input
--> 322 (result, consumed) = self._buffer_decode(data, self.errors, final)
323 # keep undecoded input until the next call
324 self.buffer = data[consumed:]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 24: invalid continuation byte
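A hedged sketch of how such an error might be handled. To stay self-contained it works on an in-memory bytes object instead of the file from above. The errors parameter of decode is standard Python, but whether replacing undecodable bytes is acceptable depends on the application.

```python
# ISO 8859-1 encoded sample data: the umlaut ä becomes the single byte 0xe4.
raw = 'Some umlauts: \u00e4.'.encode('iso-8859-1')

try:
    raw.decode('utf-8')
    decoding_failed = False
except UnicodeDecodeError as err:
    decoding_failed = True
    print('decoding failed at position', err.start)

# Option 1: decode with the correct encoding, if it is known.
correct = raw.decode('iso-8859-1')

# Option 2: keep going and mark undecodable bytes with U+FFFD.
lossy = raw.decode('utf-8', errors='replace')
print(ascii(lossy))   # 'Some umlauts: \ufffd.'
```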
If we open a UTF-8 encoded file with ISO 8859-1 decoding we see garbled symbols.
f = open(os.path.join('testdir', 'utf-8.txt'), 'r', encoding='iso-8859-1')
text = f.read()
f.close()
print(text)
Some umlauts: Ã¤, Ã¶, Ã¼.
This file is UTF-8 encoded.
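Text garbled this way can often be repaired by reversing the wrong step, that is, encoding with the wrongly assumed encoding and decoding with the correct one. The following sketch reproduces the problem in memory and repairs it. It only works as long as no bytes were lost or altered in between.

```python
original = 'Some umlauts: \u00e4, \u00f6, \u00fc.'

# Reproduce the mistake: UTF-8 bytes interpreted as ISO 8859-1.
garbled = original.encode('utf-8').decode('iso-8859-1')
# Each umlaut now shows up as two characters, for instance Ã¤ instead of ä.
print(len(original), len(garbled))   # 22 25

# Undo the mistake by going the same way back.
repaired = garbled.encode('iso-8859-1').decode('utf-8')
print(repaired == original)          # True
```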
Writing Line Breaks#
The following code produces different files on Linux/Unix/macOS and Windows.
f = open(os.path.join('testdir', 'testwrite.txt'), 'w')
f.write('test\n\n\n\n\n\n\n\n\n\ntest')
f.close()
On Linux, Unix, and macOS the file will have 18 bytes. On Windows it will have 28 bytes due to Windows’ 2-byte line breaks. Opening the file in binary mode shows the line break encoding:
f = open(os.path.join('testdir', 'testwrite.txt'), 'rb')
text = f.read()
f.close()
print(tuple(text))
(116, 101, 115, 116, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 116, 101, 115, 116)
Using print(text) directly shows line breaks as \n, which is nice almost always, but not here. So we convert the bytes object to a tuple of integers before printing.
If the file has been written on a Windows machine, it looks like this:
f = open(os.path.join('testdir', 'testwrite-windows.txt'), 'rb')
text = f.read()
f.close()
print(tuple(text))
(116, 101, 115, 116, 13, 10, 13, 10, 13, 10, 13, 10, 13, 10, 13, 10, 13, 10, 13, 10, 13, 10, 13, 10, 116, 101, 115, 116)
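If Windows-style line breaks are not wanted, they can be converted at byte level. The following sketch rebuilds the 28 bytes shown above in memory, instead of reading the file, and replaces each CR LF pair by a single LF:

```python
# Same content as the file written on Windows: 'test', ten CR LF pairs, 'test'.
windows_bytes = b'test' + b'\r\n' * 10 + b'test'

# Replace each two-byte line break by a single LF.
unix_bytes = windows_bytes.replace(b'\r\n', b'\n')

print(len(windows_bytes), len(unix_bytes))   # 28 18
print(13 in unix_bytes)                      # False, no CR left
```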