Computers and Programming#

Computers are the main tool for data science and artificial intelligence. In this chapter we answer some basic questions:

  • What is a computer?

  • What are bits, bytes, kilobytes,…?

  • What is software?

  • What is programming and what are programming languages?

Although we don’t have to know detailed answers to these questions, we should have some rudimentary understanding of what happens inside a computer.

Related exercises: Computer Basics.

CPU, Memory, IO#

Each modern computer consists of three components: a central processing unit (CPU), memory, and input/output (IO) devices. These components are connected by many wires, which, together with some auxiliary circuitry, are organized on the computer’s mainboard.

scheme showing common IO devices and data flow between them and CPU

Fig. 17 There’s a tight and fast data connection between CPU and memory. IO devices are connected to the CPU and in some cases also directly to memory.#

IO devices are all parts of the computer that provide an interface to humans, such as the screen, keyboard, printer, and scanner. Mass storage devices (hard disk drives, SSDs, DVD drives, card readers, and so on) are IO devices, too, as are network adapters for Ethernet, Wi-Fi, Bluetooth, and the like. The common feature of all IO devices is that they produce and/or consume streams of binary data. ‘Binary’ means that there are only two different values, usually denoted by 0 and 1. Electrically, 0 might stand for low voltage and 1 for high voltage.

Memory can store streams of binary data. In some sense it is similar to mass storage IO devices, but it is used in a very different way. Most storage devices are comparatively slow, and access times for reading and writing data depend on the position of the data on the device. In contrast, memory access is very fast and access times are independent of the data’s concrete location. Whenever data has to be stored for a short time only, memory is used. For technological reasons memory loses all data when power is turned off, whereas data on mass storage devices persists.

The CPU is a highly integrated circuit which processes streams of binary data. ‘Processing’ means that incoming data from memory and/or IO devices is transformed and then sent to memory and/or IO devices. If a binary stream from memory is interpreted as instructions by the CPU, then we say that the stream contains code. Data in the stricter sense refers to parts of binary streams that are processed by the CPU, but which do not tell the CPU what to do. Memory can contain code and data, whereas IO devices only produce non-code data.

Bits and Bytes#

A bit is a piece of binary information. It either holds a one or a zero. Less information than a bit is no information. With a sequence of \(k\) bits we can express \(2^k\) different values, for example the numbers \(0,1,\ldots,2^k-1\).

tree of 3 bit values with 3 undefined bits as root and 8 different combinations of 3 bits as leaves

Fig. 18 With 3 bits we may represent 8 different values. Each additional bit doubles the number of possible values.#

| bits | number of values | usual interpretation |
|------|------------------|----------------------|
| 1    | 2                | 0, 1                 |
| 2    | 4                | 0, 1, 2, 3           |
| 3    | 8                | 0 … 7                |
| 4    | 16               | 0 … 15               |
| 5    | 32               | 0 … 31               |
| 6    | 64               | 0 … 63               |
| 7    | 128              | 0 … 127              |
| 8    | 256              | 0 … 255              |
| 16   | 65 536           | 0 … 65 535           |
| 24   | 16 777 216       | 0 … 16 777 215       |
| 32   | 4 294 967 296    | 0 … 4 294 967 295    |
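
The entries of the table are easy to reproduce in Python, since the number of values for \(k\) bits is just \(2^k\) (a minimal sketch; the output formatting is arbitrary):

for k in [1, 2, 3, 4, 5, 6, 7, 8, 16, 24, 32]:
    values = 2 ** k                        # number of representable values
    print(k, 'bits:', values, 'values, usual interpretation 0 ...', values - 1)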

By convention, binary data in modern computers is organized in groups of 8 bits. A sequence of 8 bits is called a byte.

Following the metric system, there are kilobytes (1000 bytes), megabytes (1000 kilobytes), gigabytes (1000 megabytes), and so on with prefixes tera, peta, exa, zetta, yotta. Corresponding symbols are kB or KB, MB, GB, TB, PB, EB, ZB, YB.

In some hardware-oriented fields of computer science it is common practice to use the factor \(1024=2^{10}\) instead of 1000. Thus, the size of a kilobyte may be 1000 or 1024 bytes. As a rule of thumb, 1000 is used for data transmission and 1024 for memory- and storage-related quantities (except in ads for storage devices, because 1024 would yield a lower number of gigabytes). Sometimes the prefixes kibi, mebi, gibi, tebi, pebi, exbi, zebi, yobi with corresponding symbols KiB, MiB, GiB, TiB, PiB, EiB, ZiB, YiB are used for the factor 1024. One kibibyte, for instance, has 1024 bytes.

| factor | name      | symbol   | bytes                             |
|--------|-----------|----------|-----------------------------------|
| 1000   | kilobyte  | kB or KB | 1 000                             |
| 1024   | kibibyte  | KiB      | 1 024                             |
| 1000   | megabyte  | MB       | 1 000 000                         |
| 1024   | mebibyte  | MiB      | 1 048 576                         |
| 1000   | gigabyte  | GB       | 1 000 000 000                     |
| 1024   | gibibyte  | GiB      | 1 073 741 824                     |
| 1000   | terabyte  | TB       | 1 000 000 000 000                 |
| 1024   | tebibyte  | TiB      | 1 099 511 627 776                 |
| 1000   | petabyte  | PB       | 1 000 000 000 000 000             |
| 1024   | pebibyte  | PiB      | 1 125 899 906 842 624             |
| 1000   | exabyte   | EB       | 1 000 000 000 000 000 000         |
| 1024   | exbibyte  | EiB      | 1 152 921 504 606 846 976         |
| 1000   | zettabyte | ZB       | 1 000 000 000 000 000 000 000     |
| 1024   | zebibyte  | ZiB      | 1 180 591 620 717 411 303 424     |
| 1000   | yottabyte | YB       | 1 000 000 000 000 000 000 000 000 |
| 1024   | yobibyte  | YiB      | 1 208 925 819 614 629 174 706 176 |
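
The difference between the two factors matters in practice. A small Python sketch (the drive size of 32 GB is just a made-up example) shows why a storage device advertised in decimal gigabytes appears smaller when the operating system reports binary gibibytes:

n_bytes = 32 * 1000 ** 3      # hypothetical '32 GB' drive as advertised (factor 1000)
print(n_bytes / 1000 ** 3)    # 32.0 gigabytes
print(n_bytes / 1024 ** 3)    # about 29.8 gibibytes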

Important

Computers only work with binary data. Everything has to be represented as sequences of zeros and ones. For integers, like 123, this is quite simple (see below). Rational numbers, like 0.123, may be represented by two integers, a numerator 123 and a denominator 1000 for instance. But what about text data? Or images?

Data which cannot be represented as a sequence of zeros and ones cannot be processed by a computer. We’ll come back to this representation issue several times in this book.

Representation of Numbers#

Numbers may have a name, like one, two, three, four, five, six, seven, eight, nine, ten. There are even more named numbers: eleven, twelve and zero, for instance. Obviously, not all numbers can have an individual name. We need a system for automatically naming numbers and also for writing them down. At this point it is important to distinguish between numbers, which can be used for counting and computations, and their representation in spoken and written language.

In everyday life we use the decimal system based on 10 different digit symbols because we have 10 fingers. An octopus surely would invent a numbering system with only 8 digits. The Maya civilization used a system with 20 digits (fingers plus toes). Computers would invent number systems based on 2 digits, because those are representable by 1 bit, or on 4 digits (2 bits), 8 digits (3 bits), or 16 digits (4 bits).

Maya numerals with their decimal representation

Fig. 19 Maya numerals on a page of a Maya book known as Dresden Codex. Source: Sylvanus Morley via Wikimedia Commons, modified by the author.#

Positional Notation of Numbers#

There are many systems for writing down (and naming) numbers. Today the most widely used ones are positional. An example of a non-positional system is the Roman numeral system.

Fix a number \(b\), the basis, and take \(b\) symbols to denote the first \(b\) numbers. Here, we interpret zero as the first number, followed by one as the second number, and so on. If \(b\) is at most ten, we may use the first \(b\) of the usual symbols \(0,1,\ldots,9\) for the numbers from zero to \(b-1\). Every number \(c\) has a unique representation of the form

\[\begin{equation*} c=a_n\,b^n+\cdots+a_2\,b^2+a_1\,b^1+a_0\,b^0, \end{equation*}\]

where \(a_0,a_1,\ldots,a_n\in\{0,1,\ldots,b-1\}\) are the digits and \(n+1\) is the number of digits required to express the number \(c\) with respect to the basis \(b\). With this unique representation at hand, we may write the number \(c\) as a list of its digits: \(c=a_n\,\ldots\,a_0\). Keep in mind that the basis \(b\) has to be known to interpret a list of digits, although \(b\) often is not written down explicitly.

Example: If we take, for instance, the number twelve, then with ten as base \(b\) we would have

\[\begin{equation*} \text{twelve}=1\cdot b^1+2\cdot b^0=12. \end{equation*}\]

Numbers given in base ten are denoted as decimal numbers. More exactly, one should say ‘a number in decimal representation’ since the number itself does not care about how we write it down.

To avoid confusion, each number we write down without explicitly specifying a basis is to be understood as a decimal number. Numbers in a basis other than ten will always come with some hint about the basis.
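
To make the positional representation concrete, here is a short Python sketch (the helper function digits is ours, not a built-in) computing the digit list \(a_n,\ldots,a_0\) of a number \(c\) with respect to a basis \(b\):

def digits(c, b):
    # return the list of digits [a_n, ..., a_1, a_0] of c with respect to basis b
    if c == 0:
        return [0]
    result = []
    while c > 0:
        result.append(c % b)    # lowest-order digit first
        c = c // b
    return list(reversed(result))

print(digits(12, 10))    # [1, 2]       -> twelve in decimal representation
print(digits(12, 2))     # [1, 1, 0, 0] -> twelve in binary representation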

Binary Numbers#

Numbers in positional representation with base 2 are called binary numbers. They frequently appear in computer engineering. Symbols for the two digits are 0, 1 and sometimes the letters O, I.

Example: Number twelve in binary representation is

\[\begin{equation*} \text{twelve}=1\cdot 2^3+1\cdot 2^2+0\cdot 2^1+0\cdot 2^0=1100\quad\text{(binary)}. \end{equation*}\]
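Python has built-in support for binary representation, so the example can be checked directly:

print(bin(12))           # '0b1100' -> binary representation of twelve
print(int('1100', 2))    # 12       -> binary string back to a number
print(0b1100)            # 12       -> binary literal in source code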

Octal Numbers#

Base 8 yields octal numbers. For octal numbers the usual digits 0 to 7 are used.

Example:

\[\begin{align*} 12&=1\cdot 8^1+4\cdot 8^0=14\quad\text{(octal)}. \end{align*}\]

Octal numbers occur, for instance, in file access permissions on Unix-like systems, because access is controlled by 3 sets (owner, group, all) of 3 bits (read, write, execute). Thus, all possible combinations can be conveniently expressed by three-digit octal numbers.

Example: Access right 750 (which is 111 101 000 in binary notation) says that the file’s owner may read, write and execute the file. The owner’s group is not allowed to write to the file (only read and execute). All other users do not have any access right.
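
In Python, octal literals are written with the prefix 0o, which makes the relation between the permission value and its bit pattern easy to inspect:

print(oct(12))       # '0o14' -> octal representation of twelve
print(0o750)         # 488    -> decimal value of octal 750
print(bin(0o750))    # '0b111101000' -> the three permission triples rwx r-x ---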

Hexadecimal Numbers#

Base 16 yields hexadecimal numbers. For hexadecimal numbers we use 0 to 9 followed by the symbols a, b, c, d, e, f to denote the digits.

Examples:

\[\begin{align*} 12&=12\cdot 16^0=\mathrm{c}\quad\text{(hexadecimal)},\\ 125&=7\cdot 16^1+13\cdot 16^0=7\mathrm{d}\quad\text{(hexadecimal)}. \end{align*}\]

Note that letters a to f might be digits of a hexadecimal number as well as variable names. Have a look at the context to get the correct meaning. Sometimes capital letters A, B, C, D, E, F are used.

Hexadecimal numbers occur in many different situations because the range 0 to 255 of a byte value maps exactly to the set of all two-digit hexadecimal numbers: 00 to ff. We will meet this notation when specifying colors.

Example: The color value ff c0 60 yields a light orange (100% red intensity, 75% green intensity, 38% blue intensity).

screenshot of color selection window with highlighted hexadecimal color value

Fig. 20 Professional graphics programs show hexadecimal color values, often denoted as ‘HTML notation’, because hexadecimal color values frequently occur in HTML code for websites.#
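
The color example can be checked directly in Python; a small sketch splitting the hexadecimal string 'ffc060' into its three byte values and rounded percent intensities:

color = 'ffc060'                                  # red, green, blue as two hex digits each
r, g, b = (int(color[i:i + 2], 16) for i in (0, 2, 4))
print(r, g, b)                                    # 255 192 96
print([round(100 * v / 255) for v in (r, g, b)])  # [100, 75, 38] percent intensity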

Software and Programming Languages#

Software is a stream of binary data to be read and processed by the CPU. The task of a software developer is to generate streams of binary data which make the CPU do what the software developer wants it to do.

Hint

‘Binary data’ has at least two different meanings, depending on the context.

  • In programming contexts, where we have to distinguish between computer and human readable data, data is considered ‘binary’ if it has no useful interpretation as text.

  • In more general contexts, data is considered ‘binary’ if it is or can be represented as a sequence of zeros and ones. In this sense, a picture is not binary data, but a digital copy consisting of pixels instead of brushstrokes is binary data.

Modern software has a size of several megabytes or even gigabytes. It is practically impossible for humans to generate such large and complex amounts of binary data by hand. Instead, the process of software development has been automated step by step, beginning with directly coding zeros and ones in the 1950s and leading up to today’s high-level programming languages.

Assemblers#

A first step of automation was the invention of assemblers. These are computer programs which translate a set of (to some extent) human-readable code words into a sequence of zeros and ones processable by a CPU. Here is an example:

mov 120, eax    ; copy 4 bytes from memory address 120 into register eax
mov 124, ebx    ; copy 4 bytes from memory address 124 into register ebx
add ebx, eax    ; add ebx to eax; the result is kept in eax
mov eax, 128    ; write the content of eax to memory address 128

The first line tells the CPU to read 4 bytes from memory address 120 and to store them in one of its registers (a kind of internal memory). The second line does the same, but with memory address 124 and a second CPU register. Then the CPU is told to add both values. The CPU stores the result in its eax register. The last line makes the CPU write the result of the addition to memory address 128.

Writing computer programs in assembler code made software development much easier. But due to the very limited instruction set, which reflects one-to-one the instruction set of the CPU, programs are hard to read and tightly bound to the hardware they were designed for. The only advantages of assembler code compared to modern programming languages are its speed of execution and the small size of the resulting binary code. The first initialization routine of modern operating systems is still written in assembler code, because it has to fit into a small predefined portion of a storage device called the boot sector.

Structured Programming#

A further step in the evolution of programming languages is structured programming. Examples of such languages are C, BASIC, and Pascal. Here the hardware is almost completely abstracted away, and a relatively complex program, the compiler, is needed to translate the source code written by the software developer into binary code for the CPU. Here is a snippet of a C program:

int a, b;
a = 5;
b = 10 * a + 7;
printf("result is %i", b);

The first line tells the compiler that we need two places in memory for storing integer values. The second line makes the CPU move the value 5 to the place in memory referenced by a. The third line makes the CPU do some calculations and store the result in memory referenced by b. Finally, the result shall be printed on screen. Writing this in assembler code would require some hundred lines of code, and we would have to take care of memory organization (where is free space?) and of the instruction set of the CPU. Both are handled by the compiler. The C language in particular is still of great importance. It is used, for example, for large parts of Linux and Windows.

Object-Oriented Programming#

A further layer of abstraction is object-oriented programming. Instead of handling hundreds of variables and hundreds of functions for processing them, everything is organized in a well-structured way reflecting the structure of the real world. Examples of programming languages allowing for object-oriented programming are C++, Java, and Python.
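
As a minimal illustration (a made-up example, not taken from any particular library), a Python class bundles data and the functions operating on that data into one object:

class Rectangle:
    # a rectangle knows its data (width, height) and its behavior (area)
    def __init__(self, width, height):
        self.width = width
        self.height = height

    def area(self):
        return self.width * self.height

r = Rectangle(3, 4)
print(r.area())    # 12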

Compiler vs. Interpreter#

Source code of a computer program either is compiled or interpreted. Compiling means that the source code is translated to binary code and after finishing this translation it can be executed, that is, fed to the CPU. Interpretation means that the source code is translated line by line and each translated line is sent immediately to the processor.

Compiled programs run much faster than interpreted ones. But interpreted programs allow for simpler debugging and more intuitive elements in the programming language. Sometimes interpreted programs are called scripts and the corresponding languages are denoted as scripting languages. C and C++ are compiled languages, whereas Python is interpreted. Java is somewhere in between: Java source code is compiled to an intermediate bytecode, which then is interpreted.