NumPy Arrays#

NumPy provides two fundamental tools for data science purposes:

  • a data type for storing tabular numerical data,

  • very efficient functions for computations on large amounts of numerical data.

NumPy’s basic data type is called ndarray (n-dimensional array), often simply called a NumPy array.

NumPy’s standard abbreviation is np.

import numpy as np

Python Lists versus NumPy Arrays#

From mathematics we know vectors and matrices. A vector is a (one-dimensional) list of numbers. A matrix is a (two-dimensional) grid of numbers. Vectors could be represented by lists in Python, whereas a matrix would be a list of lists (a list of rows or a list of columns).

Using Python lists for representing large vectors and matrices is very inefficient. Each item of a Python list has its own location somewhere in memory. When reading a whole list, to multiply a vector by some number, for instance, Python reads the first list item, then looks for the memory location of the second, then reads the second, and so on. A lot of memory management is involved.

To significantly improve performance, NumPy provides the ndarray data type. The most important property of an ndarray is its number of dimensions. A one-dimensional array stores a vector. A two-dimensional array stores a matrix. A zero-dimensional array stores a single number and is a valid Python object, too. Arrays with more than two dimensions are harder to visualize. A three-dimensional array can be pictured as a cuboid of numbers, each number addressed by three indices (row, column, depth level). We will meet arrays with three or more dimensions almost every day when diving into machine learning. One example is color images: two dimensions for pixel positions, one dimension for color channels (red, green, blue, transparency).
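
The role of the dimensions can be illustrated with a small sketch. The variable names and the image size of 480 by 640 pixels are made up for illustration; np.zeros, ndim and shape are introduced further below.

```python
import numpy as np

vector = np.zeros(3)              # one dimension
matrix = np.zeros((2, 3))         # two dimensions: rows, columns
image = np.zeros((480, 640, 4))   # rows, columns, color channels (RGBA)

print(vector.ndim, matrix.ndim, image.ndim)
print(image.shape)

# the four channel values of the pixel in row 10, column 20
pixel = image[10, 20]
print(pixel)
```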

Why are NumPy arrays more efficient?

  • All items of a NumPy array must have the same data type, usually float or integer. This saves the time and memory otherwise needed for handling different types and type conversions.

  • All items of a NumPy array are stored in one well-structured contiguous block of memory. Finding the next item or copying a whole array or part of it requires far fewer memory management operations.

  • NumPy provides optimized mathematical operations for vectors and matrices. Instead of processing arrays item by item, NumPy functions take the whole array and process it in compiled C code. Thus, the item-by-item part is not done by the (slow) Python interpreter, but by (very fast) compiled code.
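
The effect of these points can be made visible with a rough timing experiment. The following is a minimal sketch, not a rigorous benchmark: the names values_list and values_array and the number of items are chosen for illustration only, and the measured times depend on the machine.

```python
import timeit

import numpy as np

n = 100_000
values_list = list(range(n))     # Python list of separate int objects
values_array = np.arange(n)      # NumPy array of n integers

# multiply every item by 2: item by item in Python vs. in compiled code
doubled_list = [2 * x for x in values_list]
doubled_array = 2 * values_array

# both approaches yield the same numbers
print(np.array_equal(doubled_array, np.array(doubled_list)))

# rough timings in seconds for 10 repetitions each
print(timeit.timeit(lambda: [2 * x for x in values_list], number=10))
print(timeit.timeit(lambda: 2 * values_array, number=10))
```

On typical hardware the NumPy variant should be considerably faster, but the exact factor varies.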

Creating NumPy Arrays#

Converting Python Lists to Arrays#

There are several ways to create NumPy arrays. We start with conversion of Python lists or tuples by NumPy’s array function.

Passing a flat list or tuple to array yields a one-dimensional ndarray. NumPy chooses the simplest data type that can hold all items of the list or tuple.

a = np.array([23, 42, 7, 4, -2])

print(a)
print(a.dtype)
[23 42  7  4 -2]
int64

The member variable ndarray.dtype contains the array’s data type. Here NumPy decided to use int64, that is, integers of length 8 bytes (the default integer type may differ on some platforms). Available types are discussed below. An example with floats:

b = np.array([2.3, 4.2, 7, 4, -2])

print(b)
print(b.dtype)
[ 2.3  4.2  7.   4.  -2. ]
float64

Important

NumPy ships with its own data types for numbers to allow for more efficient storage and computations. Python’s int type allows for arbitrarily large numbers, whereas NumPy has different types for integers with different (and finite!) numerical ranges. NumPy also knows several types of floats differing in precision (number of decimal places) and range. Wherever possible conversion between Python types and NumPy types is done automatically.
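
The finite ranges and the different precisions can be inspected with np.iinfo and np.finfo. The following sketch only queries this information; the variable names are chosen for illustration.

```python
import numpy as np

# Python's int grows as needed
big = 2 ** 100
print(big)

# NumPy integer types have fixed, finite ranges
int16_info = np.iinfo(np.int16)
int64_info = np.iinfo(np.int64)
print(int16_info.min, int16_info.max)
print(int64_info.min, int64_info.max)
print(big > int64_info.max)     # True: too large for np.int64

# NumPy float types differ in precision
# (approximate number of reliable decimal digits)
print(np.finfo(np.float32).precision)
print(np.finfo(np.float64).precision)
```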

Higher-Dimensional Arrays from Lists#

To get higher-dimensional arrays use nested lists:

c = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print(c)
[[1 2 3]
 [4 5 6]
 [7 8 9]]
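
The same pattern extends to more than two dimensions by adding another level of nesting. A small sketch with made-up numbers; note that all inner lists on the same level need the same length to obtain a regular array.

```python
import numpy as np

# two 2x3 matrices stacked along a new first dimension
t = np.array([[[1, 2, 3], [4, 5, 6]],
              [[7, 8, 9], [10, 11, 12]]])

print(t.ndim)     # 3
print(t.shape)    # (2, 2, 3)
```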

New Arrays as Return Values#

Besides explicit creation, NumPy arrays may be the result of mathematical operations:

d = a + b

print(type(d))
<class 'numpy.ndarray'>

To see that d is indeed a new ndarray and not an in-place modified a or b, we might look at the object ids, which are all different:

print(id(a), id(b), id(d))
140549050909872 140549364309744 140549050912656
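
For comparison, augmented assignments like += modify an existing array instead of creating a new one. A minimal sketch with illustrative variable names:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = x + 1              # creates a new array, x is unchanged

print(y is x)          # False
print(x)

x_id = id(x)
x += 1                 # in-place: modifies the existing array
print(id(x) == x_id)   # True
print(x)
```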

Functions for Creating Special Arrays#

A third way of creating NumPy arrays is to call specific NumPy functions returning new arrays. From np.zeros we get an array of zeros. From np.ones we get an array of ones. There are many more functions like zeros and ones, see Array creation routines in NumPy’s documentation.

a = np.zeros(5)
b = np.ones((2, 3))

print(a, '\n')
print(b)
[0. 0. 0. 0. 0.] 

[[1. 1. 1.]
 [1. 1. 1.]]

NumPy almost always defaults to floats if no data type is explicitly provided.
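
A few more creation functions, sketched here as a small selection rather than a complete overview. Note that np.arange is one of the exceptions to the float default: with integer arguments it returns an integer array.

```python
import numpy as np

e = np.full((2, 2), 7.5)     # 2x2 array filled with 7.5
f = np.eye(3)                # 3x3 identity matrix
g = np.arange(0, 10, 2)      # like Python's range: 0, 2, 4, 6, 8
h = np.linspace(0, 1, 5)     # 5 equally spaced values from 0 to 1 (inclusive)

print(e, '\n')
print(f, '\n')
print(g, g.dtype, '\n')
print(h)
```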

Properties of NumPy Arrays#

Objects of type ndarray have several member variables containing important information about the array:

  • ndim: number of dimensions,

  • shape: tuple of length ndim with array size in each dimension,

  • size: total number of elements,

  • nbytes: number of bytes occupied by the array elements,

  • dtype: the array’s data type.

a = np.zeros((4, 3))

print(a.ndim)
print(a.shape)
print(a.size)     # 4 * 3
print(a.nbytes)   # 4 * 3 * 8
print(a.dtype)
2
(4, 3)
12
96
float64
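
These properties are related to each other. The following sketch checks the relations; it additionally uses the member variable itemsize (bytes per element), which is not listed above.

```python
import numpy as np

a = np.zeros((4, 3))

print(a.itemsize)                        # bytes per float64 element: 8
print(a.size * a.itemsize == a.nbytes)   # True
print(len(a.shape) == a.ndim)            # True

# a three-dimensional example with 2-byte integers
t = np.zeros((2, 3, 4), dtype=np.int16)
print(t.ndim, t.shape, t.size, t.nbytes)
```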

It’s important to know that shape matters. In mathematics we almost always identify vectors with matrices having only one column. But in NumPy these are two different things. A vector has shape (n,), that is, ndim is 1, whereas a one-column matrix has shape (n, 1) with ndim of 2. Consequently, a vector is neither a row nor a column in NumPy. It’s simply a list of numbers, nothing more.

a = np.zeros(5)
b = np.zeros((5, 1))
c = np.zeros((1, 5))

print(a, '\n')
print(b, '\n')
print(c)
[0. 0. 0. 0. 0.] 

[[0.]
 [0.]
 [0.]
 [0.]
 [0.]] 

[[0. 0. 0. 0. 0.]]
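
If a row or a column is needed, a vector can be converted explicitly. The sketch below uses the reshape method and the T (transpose) member, which are not covered in this section; variable names are illustrative.

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0])    # vector, shape (3,)

column = v.reshape(3, 1)         # one-column matrix, shape (3, 1)
row = v.reshape(1, 3)            # one-row matrix, shape (1, 3)

print(v.shape, column.shape, row.shape)

# transposing swaps the two dimensions of a matrix ...
print(column.T.shape)            # (1, 3)

# ... but leaves a one-dimensional array unchanged
print(v.T.shape)                 # (3,)
```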

List-Like Indexing#

Elements of NumPy arrays can be accessed similarly to items of Python lists. That is, the first item in a one-dimensional ndarray has index 0 and the last one has index ndarray.size - 1. Slicing is allowed, too.

a = np.array([23, 42, -7, 3, 10])

print(a[0], '\n')
print(a[1:3], '\n')
print(a[::2])
23 

[42 -7] 

[23 -7 10]
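
As with Python lists, negative indices count from the end of the array. A short sketch using the same array:

```python
import numpy as np

a = np.array([23, 42, -7, 3, 10])

print(a[-1])      # last item
print(a[-2:])     # last two items
print(a[::-1])    # all items in reverse order
```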

In case of multi-dimensional arrays we have to provide an index for each dimension. Slicing is done per dimension.

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(a, '\n')

print(a[0, 1], '\n')
print(a[1:3, 0:2], '\n')
print(a[::2, ::2], '\n')
print(a[1, :])
[[1 2 3]
 [4 5 6]
 [7 8 9]] 

2 

[[4 5]
 [7 8]] 

[[1 3]
 [7 9]] 

[4 5 6]

Here : stands for ‘all indices of the dimension’.

Note

Selecting all elements in the trailing dimensions, as in a[1, :], can be abbreviated to a[1]. The same holds for higher dimensions: a[1, 3, :, :] is equivalent to a[1, 3]. The drawback is that the array’s dimensionality is no longer visible at a glance.
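
The equivalence can be checked directly. In this sketch the four-dimensional array t and its shape are made up for illustration.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print(a[1])                              # same as a[1, :]
print(np.array_equal(a[1], a[1, :]))     # True

# the abbreviation only works for trailing dimensions;
# selecting a column still needs the explicit colon
print(a[:, 1])

t = np.zeros((2, 4, 3, 5))
print(t[1, 3].shape)                     # (3, 5), same as t[1, 3, :, :]
```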

Data Types#

NumPy knows many different numerical data types. Often we do not have to care about types (NumPy will choose suitable ones), but sometimes we have to specify data types explicitly (see examples below).

Most NumPy functions that create arrays accept the keyword argument dtype to specify the data type of the function’s return value. Either pass a string with the desired type’s name or pass a type object. Passing Python types like int makes NumPy choose the most appropriate NumPy type (here, np.int64 or the string 'int64').

a = np.zeros((2, 3))
b = np.zeros((2, 3), dtype=np.int64)

print(a, '\n')
print(b)
[[0. 0. 0.]
 [0. 0. 0.]] 

[[0 0 0]
 [0 0 0]]
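
The different ways of passing a type can be compared side by side. A small sketch; the NumPy type chosen for Python’s int may depend on the platform, so the comment on that line is an assumption about typical systems.

```python
import numpy as np

a = np.zeros(3, dtype='int64')     # type name as string
b = np.zeros(3, dtype=np.int64)    # NumPy type object
c = np.zeros(3, dtype=int)         # Python type, mapped to a NumPy integer type
d = np.zeros(3, dtype=float)       # Python type, mapped to np.float64

print(a.dtype == b.dtype)          # True
print(c.dtype)                     # int64 on most systems
print(d.dtype)                     # float64
```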

NumPy types for integers:

  • np.int8, np.int16, np.int32, np.int64 (signed integers of different range),

  • np.uint8, np.uint16, np.uint32, np.uint64 (unsigned integers of different range).

NumPy types for floats:

  • np.float16, np.float32, np.float64 (different precision and range)

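For booleans there is np.bool_ (the older alias np.bool8 is deprecated and no longer available in recent NumPy versions), which behaves much like Python’s bool. Each element of a boolean array occupies one byte, that is, 8 times as much memory as strictly required.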

Types for complex numbers are available, too. See Built-in scalar types in NumPy’s documentation for details.
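
Memory consumption per element and the difference between signed and unsigned ranges can be queried at runtime. A minimal sketch for a selection of types; the names types and sizes are illustrative.

```python
import numpy as np

types = [np.int8, np.uint8, np.int64, np.float16, np.float32,
         np.float64, np.complex128, np.bool_]

# bytes per element for each type
sizes = {np.dtype(t).name: np.dtype(t).itemsize for t in types}
for name, size in sizes.items():
    print(f'{name}: {size} byte(s)')

# same size, different range: signed vs. unsigned 8-bit integers
print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)
print(np.iinfo(np.uint8).min, np.iinfo(np.uint8).max)
```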

Hint

The dtype member of ndarrays and the dtype argument to NumPy functions carry more information than the bare type (e.g., ‘signed integer of length 64 bits’). They also contain information about how data is organized in memory. This is important for efficient import of data from external sources. Details will be discussed in Saving and Loading Non-Standard Data.

Example: Saving Memory by Manually Choosing Types#

Working with large NumPy arrays we have to save memory wherever possible. One important ingredient for memory efficiency is choosing small types, that is, types with small range. Often we work with arrays of zeros and ones or of small integers only. Then we should choose the smallest integer type:

a = np.ones(1000)    # defaults to np.float64
b = np.ones(1000, dtype=np.int8)

print(f'a has {a.nbytes} bytes')
print(f'b has {b.nbytes} bytes')
a has 8000 bytes
b has 1000 bytes

For a data set with one billion numbers, the choice of type decides whether 1 GB or 8 GB of memory is required!
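
Existing arrays can be converted to a smaller type with the astype method, which returns a converted copy. The sketch below also reproduces the memory estimate for one billion elements, assuming 1 GB = 10**9 bytes; the names gigabytes and n are illustrative.

```python
import numpy as np

a = np.ones(1000, dtype=np.int64)    # 8 bytes per element
b = a.astype(np.int8)                # converted copy, 1 byte per element

print(a.nbytes, b.nbytes)

# memory required for one billion elements
n = 10 ** 9
gigabytes = {}
for t in [np.int8, np.int64]:
    gigabytes[np.dtype(t).name] = n * np.dtype(t).itemsize / 10 ** 9
print(gigabytes)
```

Before converting, it is worth checking that all values fit into the range of the smaller type.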

Example: Unsuitable Default Type#

Creating an array without explicitly providing a data type makes NumPy choose np.int64 or np.float64 depending on the presence of floats. This may lead to hard-to-find errors:

a = np.array([2, 4, 6, 1])    # defaults to np.int64
a[0] = 1.23
a[3] = 0.99

print(a)
[1 4 6 0]

Assigning values to an integer array converts them to the array’s data type, even if information is lost (here the fractional part is cut off). To avoid such errors, always specify the type explicitly when working with floats!
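
A possible fix, sketched under the assumption that np.float64 is the desired type: pass dtype at creation time, or convert an existing integer array with astype, which returns a new array.

```python
import numpy as np

# float type given explicitly at creation
a = np.array([2, 4, 6, 1], dtype=np.float64)
a[0] = 1.23
a[3] = 0.99
print(a)

# converting an existing integer array
b = np.array([2, 4, 6, 1])
c = b.astype(np.float64)
c[0] = 1.23
print(c)
print(b)     # the original integer array is unchanged
```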