Special Characters#

Strings may contain characters not available on some or all keyboards, like umlauts ä, ö, ü on US keyboards. To use such special characters in string literals Python provides escape sequences \x, \u, and \U.

Character Sets#

The relation between characters in strings and their numerical representation in memory will be considered in detail in the chapter on Text Files. For the moment we content ourselves with the observation that there are several such mappings, the most prominent ones known as ASCII, the ISO 8859 family and several Unicode variants.

ASCII knows 128 different characters, the ISO 8859 family some hundred, and Unicode several million ones. Nowadays, Unicode in its UTF-8 variant is the standard mapping for numerical representations of characters. Even Windows is adopting UTF-8 more and more after backing the wrong horse for two decades.

There are lots of searchable lists of Unicode characters in the web. Wikipedia’s List of Unicode characters is a good starting point.

Unicode Characters in String Literals#

To use Unicode characters in a string literal we either may type or copy them to the source code file or we may specify the character numerically. If the numercial representation has two hexadecimal digits, use \x followed by the two digits. In case of 4 digits use \u and for 8 digit characters use \U.

print('umlauts: \xe4, \xf6, \xfc')
print('Greek: \u03b1, \u03b2, \u03b3')
print('Chinese (?): \U0002070e, \U00028cd2')
umlauts: ä, ö, ü
Greek: α, β, γ
Chinese (?): 𠜎, 𨳒

Note

Not all Unicode characters are available in each font.

Some Unicode codes do not represent concrete characters, but control reading direction (left to right or right to left) and other properties like spacing.

line drawing showing two persons discussing while reading direction flips due to unicode character

Fig. 24 Collaborative editing can quickly become a textual rap battle fought with increasingly convoluted invocations of U+202a to U+202e. Source: Randall Munroe, xkcd.com/1137#