Series#

Pandas Series is one of two fundamental Pandas data types (the other is DataFrame). A Series object holds one-dimensional data, like a list, but with more powerful indexing capabilities. Data is stored in an underlying one-dimensional NumPy array. Thus, most operations are much more efficient than with lists.

import pandas as pd

Creation of Series Objects#

A Series object can be created from a Python list or a dictionary, for instance. See Series constructor in Pandas’ documentation.

s = pd.Series([23, 45, 67, 78, 90])
s
0    23
1    45
2    67
3    78
4    90
dtype: int64

s = pd.Series({'a': 12, 'b': 23, 'c': 45, 'd': 67})
s
a    12
b    23
c    45
d    67
dtype: int64

A Series consists of an index (first column printed) and its data (second column printed). All data items share one data type (dtype); items of mixed types are stored with the generic object dtype. The length of a Series is provided by the size attribute (you may also use Python’s built-in function len).

s.size
4
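
As a small sketch of the points above, the constructor also accepts an explicit index argument, size agrees with len, and items of mixed types end up with the generic object dtype. The labels x, y, z and the variable names are made up for illustration.

```python
import pandas as pd

# explicit labels via the index argument (labels chosen arbitrarily)
s = pd.Series([23, 45, 67], index=['x', 'y', 'z'])

print(s.size, len(s))    # both give the number of items
print(s.dtype)           # one common dtype for all items

# items of different types are stored with the generic object dtype
mixed = pd.Series([1, 'a', 2.5])
print(mixed.dtype)
```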

Data Alignment#

Data in a Series behaves like a one-dimensional ndarray, but Pandas’ indexing mechanisms make things different from NumPy. Pandas implements automatic data alignment. That is, data items do not have fixed positions like in a NumPy array. Instead, only the (possibly non-integer) index matters. Here is a first example:

a = pd.Series({'a': 2, 'b': 4, 'c': 3, 'd': 6})
b = pd.Series({'a': 1, 'b': 5, 'd': 7, 'e': 9})
print(a, '\n')
print(b, '\n')
print(a + b)
a    2
b    4
c    3
d    6
dtype: int64 

a    1
b    5
d    7
e    9
dtype: int64 

a     3.0
b     9.0
c     NaN
d    13.0
e     NaN
dtype: float64

Both series contain the labels a, b, and d, so the sum is defined for these labels. The labels c and e, however, appear in only one of the series. For them there is nothing to add, and the result is NaN (not a number).

Important

Note that the data type is now float although every number is an integer. The reason is that an integer type cannot represent NaN, which is a floating-point value. Thus, Pandas has to change the data type of the result. We will come back to such NaN problems later on.

If we had used NumPy, then the result would be the sum of two vectors:

import numpy as np
a = np.array([2, 4, 3, 6])
b = np.array([1, 5, 7, 9])

a + b
array([ 3,  9, 10, 15])
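
Returning to the two Pandas series from the beginning of this section: if a missing label should count as zero instead of producing NaN, one option is the add method with its fill_value parameter. The following is only a sketch of that option; whether zero is a sensible replacement depends on the data.

```python
import pandas as pd

a = pd.Series({'a': 2, 'b': 4, 'c': 3, 'd': 6})
b = pd.Series({'a': 1, 'b': 5, 'd': 7, 'e': 9})

plain = a + b                       # NaN for the labels c and e
filled = a.add(b, fill_value=0)     # a missing label counts as 0

print(plain, '\n')
print(filled)
```

The result of add is still of float type, because alignment is carried out before the fill value is inserted.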

Underlying Data Structures#

Index and data are accessible via index and array members of Series objects:

s = pd.Series([23, 45, 67, 78, 90])

print(s.index, '\n')
print(s.array, '\n')
print(type(s.index), '\n')
print(type(s.array))
RangeIndex(start=0, stop=5, step=1) 

<PandasArray>
[23, 45, 67, 78, 90]
Length: 5, dtype: int64 

<class 'pandas.core.indexes.range.RangeIndex'> 

<class 'pandas.core.arrays.numpy_.PandasArray'>

The index member is one of several index types. Index objects will be discussed later on. The array member is an array type defined by Pandas. If we want to have a NumPy array, we should call to_numpy():

a = s.to_numpy()

print(a, '\n')
print(type(a))
[23 45 67 78 90] 

<class 'numpy.ndarray'>

Indexing#

Accessing single items or subsets of a series works more or less the same way as for lists or dictionaries or NumPy arrays.

The flexibility of Pandas’ multiple-items indexing mechanisms sometimes leads to confusion and unexpected errors. In addition, some features are not well documented, and a transition to more predictable and more clearly structured indexing behavior is in progress.

Overview#

There exist four widely used indexing mechanisms (here s is some series):

  • s[...]: Python style indexing

  • s.ix[...]: old Pandas style indexing (removed from Pandas in January 2020)

  • s.loc[...] and s.iloc[...]: new Pandas style indexing

  • s.at[...] and s.iat[...]: new Pandas style indexing for more efficient access to single items

Deprecated Indexing#

Python style indexing and old Pandas style indexing (the ix indexer) allow for position based indexing and label based indexing. Position based means that, like for NumPy arrays, we refer to an item by its position in the series. The first item has position 0. Thus, the series’ index object is completely ignored. Providing an item of the series’ index member as index is referred to as label based indexing.

Both [...] and ix[...] behave slightly differently when using slicing. A major problem is that sometimes it is not clear whether positional or label based indexing shall be used. Consider a series with an index made of id numbers, that is, integers:

s = pd.Series({123: 3, 45: 4, 542: 7, 2: 19})
print(s, '\n')

print(s[123], '\n')    # label based
print(s[2], '\n')      # label based
print(s[0:2])          # position based (October 2022 warning: will be label based in future)
123     3
45      4
542     7
2      19
dtype: int64 

3 

19 

123    3
45     4
dtype: int64
/tmp/ipykernel_234813/411904015.py:6: FutureWarning: The behavior of `series[i:j]` with an integer-dtype index is deprecated. In a future version, this will be treated as *label-based* indexing, consistent with e.g. `series[i]` lookups. To retain the old behavior, use `series.iloc[i:j]`. To get the future behavior, use `series.loc[i:j]`.
  print(s[0:2])          # position based (October 2022 warning: will be label based in future)

Without knowing the exact mechanism behind [...], which in fact calls the series’ __getitem__ method, code becomes hard to read. The same is true for ix. The ix indexer was removed from Pandas in version 1.0.0 (January 2020). Indexing with [...] is still available, but should be avoided, at least for series with integer labels.

New Indexing Mechanism#

Preferred indexing is via loc[...] and iloc[...], the first for label based indexing, the second for positional indexing. Positional indexing is also known as integer indexing, thus the i in iloc. Slicing and boolean indexing are supported (see below).

If only a single item shall be accessed, then loc[...] and iloc[...] might be too slow due to the implementation of complex features like slicing. For single item access one should use at[...] and iat[...] providing label based and positional indexing, respectively.
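
For the series with integer id labels from the previous section, the following brief sketch shows how the four indexers remove the ambiguity between labels and positions:

```python
import pandas as pd

s = pd.Series({123: 3, 45: 4, 542: 7, 2: 19})

print(s.loc[2])       # label 2 -> 19
print(s.iloc[2])      # position 2 (third item) -> 7
print(s.at[123])      # single item by label -> 3
print(s.iat[0])       # single item by position -> 3
print(s.iloc[0:2])    # first two items, stop position excluded
```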

Positional Indexing#

Positional indexing via iloc[...] or iat[...] works like for one-dimensional NumPy arrays.

s = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5})
print(s, '\n')

print(s.iloc[1:3], '\n')          # slicing
print(s.iloc[[3, 0, 2]], '\n')    # list of indices
print(s.iloc[[True, False, False, True, True]], '\n')    # boolean indexing
print(s.iat[3], '\n')             # efficient single element access
print(s.iloc[3])                  # less efficient single element access
a    1
b    2
c    3
d    4
e    5
dtype: int64 

b    2
c    3
dtype: int64 

d    4
a    1
c    3
dtype: int64 

a    1
d    4
e    5
dtype: int64 

4 

4

An important difference from NumPy indexing is that the result is again a series. That is, the labels of the selected items are returned, too.
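
A short sketch of this difference, comparing a positional slice of a series with the same slice of its NumPy array (variable names are illustrative):

```python
import pandas as pd

s = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5})

sub = s.iloc[1:3]              # a Series, labels included
raw = s.to_numpy()[1:3]        # a plain NumPy array, labels lost

print(list(sub.index))         # ['b', 'c']
print(sub.loc['c'])            # label based access still works on the result
print(type(raw))
```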

Label Based Indexing#

Label based indexing works like with dictionaries. But slicing is allowed.

s = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5})
print(s, '\n')

print(s.loc['b':'d'], '\n')            # slicing
print(s.loc[['d', 'a', 'c']], '\n')    # list of labels
print(s.loc[[True, False, False, True, True]], '\n')    # boolean indexing
print(s.at['d'], '\n')                 # efficient single element access
print(s.loc['d'])                      # less efficient single element access
a    1
b    2
c    3
d    4
e    5
dtype: int64 

b    2
c    3
d    4
dtype: int64 

d    4
a    1
c    3
dtype: int64 

a    1
d    4
e    5
dtype: int64 

4 

4

Important

Note that slicing with labels includes the stop item!
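
The difference is easy to check by comparing a label slice with the positional slice that starts at the same item (a small sketch using the series from above):

```python
import pandas as pd

s = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5})

by_label = s.loc['b':'d']         # stop label 'd' is included
by_position = s.iloc[1:3]         # stop position 3 is excluded

print(list(by_label.index))       # ['b', 'c', 'd']
print(list(by_position.index))    # ['b', 'c']
```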

Different items with identical labels are allowed. In that case loc[...] returns all items with the specified label. In the output below, at[...] does the same and returns a series rather than a single value.

s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'b', 'c'])
print(s, '\n')

print(s.loc['b'], '\n')
print(s.at['b'])
a    1
b    2
b    3
c    4
dtype: int64 

b    2
b    3
dtype: int64 

b    2
b    3
dtype: int64

Indexing by Callables#

Both loc[...] and iloc[...] accept a function as their argument. The function has to take a series as argument and has to return something allowed for indexing (list of indices/labels, boolean array and so on).

Scenarios justifying indexing by callables are relatively complex.
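
The following is a minimal sketch of the mechanism only, not one of those complex scenarios. The callable receives the series being indexed, which is convenient when that series is an intermediate result without a variable name. The lambda functions and the example data are chosen for illustration.

```python
import pandas as pd

s = pd.Series({'a': 5, 'b': 2, 'c': 8, 'd': 1})

# loc with a callable returning a boolean series
large = s.loc[lambda x: x > 2]

# the callable sees the sorted intermediate series, which has no variable name
top_two = s.sort_values(ascending=False).iloc[lambda x: [0, 1]]

print(large, '\n')
print(top_two)
```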

Views and Copies#

As for NumPy arrays, indexing Pandas series may return a view of the series. That is, modifying the extracted subset of items might modify the original series. If you really need a copy of the items, use the copy method of Series objects.
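
A small sketch of the safe variant: after an explicit copy, changes to the extracted items cannot reach the original series.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

part = s.iloc[1:3].copy()   # independent copy of two items
part.iloc[0] = 100          # modifies the copy only

print(part.tolist())        # [100, 3]
print(s.tolist())           # [1, 2, 3, 4]
```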

Some Useful Member Functions#

A full list of member functions for Series objects is provided in Pandas’ documentation. Here we only list a few of them.

A First Look at a Series#

If a series is read from a file we would like to get some basic information about the series.

With describe we get statistical information about a series. The function returns a Series object containing the collected information.

First and last items are returned by head and tail, respectively. Both take an optional argument specifying the number of items to return. Default is 5.

s = pd.Series([2, 4, 6, 5, 4, 3, -2, 3, 2, 5])

print(s.describe(), '\n')
print(s.head(), '\n')
print(s.tail(3))
count    10.000000
mean      3.200000
std       2.250926
min      -2.000000
25%       2.250000
50%       3.500000
75%       4.750000
max       6.000000
dtype: float64 

0    2
1    4
2    6
3    5
4    4
dtype: int64 

7    3
8    2
9    5
dtype: int64

Note that we did not specify labels explicitly. Thus, the Series constructor uses item positions as labels.
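
Since describe returns a Series, single statistics can be picked by label like any other item. A brief sketch with the series from above:

```python
import pandas as pd

s = pd.Series([2, 4, 6, 5, 4, 3, -2, 3, 2, 5])

stats = s.describe()
print(list(stats.index))                      # labels of the collected statistics
print(stats.loc['mean'])                      # same value as s.mean()
print(stats.loc['max'] - stats.loc['min'])    # range of the values
```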

Iterating Over a Series#

Iterating over the values of a series works like for Python lists:

s = pd.Series([2, 4, 6, 5, 4, 3, -2, 3, 2, 5])

for i in s:
    print(i)
2
4
6
5
4
3
-2
3
2
5

If labels are required, too, call items:

for lab, val in s.items():
    print(lab, val)
0 2
1 4
2 6
3 5
4 4
5 3
6 -2
7 3
8 2
9 5

If positional indices are required in addition to labels, wrap the call in enumerate:

for pos, (lab, val) in enumerate(s.items()):
    print(pos, lab, val)
0 0 2
1 1 4
2 2 6
3 3 5
4 4 4
5 5 3
6 6 -2
7 7 3
8 8 2
9 9 5

Vectorized Operators#

Like NumPy arrays, Pandas series implement most mathematical and comparison operators.

a = pd.Series([1, 2, 3, 4])
b = pd.Series([4, 0, 6, 3])

print(a * b, '\n')
print(a < b)
0     4
1     0
2    18
3    12
dtype: int64 

0     True
1    False
2     True
3    False
dtype: bool

Hint

Remember that Pandas uses data alignment, that is, labels matter, positions are irrelevant.
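
A short sketch of what the hint means for arithmetic operators. Both series hold the same labels, but in opposite order. The labels x, y, z are arbitrary.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['z', 'y', 'x'])

prod = a * b              # items are matched by label
print(prod.loc['x'])      # 1 * 30
print(prod.loc['z'])      # 3 * 10

# purely positional product for comparison
positional = a.to_numpy() * b.to_numpy()
print(positional)
```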

Functions all and any for boolean series are available, too.

s = pd.Series([True, True, False])

print(s.all())
print(s.any())
False
True

Removing and Adding Items#

With drop we can remove items from a series. Simply pass a list of labels to the function.

s = pd.Series([2, 4, 6, 5, 4, 3, -2, 3, 2, 5])
print(s, '\n')

t = s.drop([3, 4, 5])
print(t)
0    2
1    4
2    6
3    5
4    4
5    3
6   -2
7    3
8    2
9    5
dtype: int64 

0    2
1    4
2    6
6   -2
7    3
8    2
9    5
dtype: int64

The concat function (called as pd.concat, not as a method of Series) concatenates two or more series.

a = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4})
b = pd.Series({'d': 0, 'e': 5, 'f': 6, 'g': 7})

c = pd.concat([a, b])

print(a, '\n')
print(b, '\n')
print(c)
a    1
b    2
c    3
d    4
dtype: int64 

d    0
e    5
f    6
g    7
dtype: int64 

a    1
b    2
c    3
d    4
d    0
e    5
f    6
g    7
dtype: int64

Note that there is no check on duplicate index labels, since duplicates are no problem (see above).
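
If duplicates are not wanted, pd.concat offers two optional parameters that may help: ignore_index discards the labels and numbers the items anew, and verify_integrity raises an error on duplicate labels. A sketch with the two series from above (the flag duplicate_found is only there for illustration):

```python
import pandas as pd

a = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4})
b = pd.Series({'d': 0, 'e': 5, 'f': 6, 'g': 7})

c = pd.concat([a, b])
print(c.loc['d'], '\n')                 # both items with label d

renumbered = pd.concat([a, b], ignore_index=True)
print(renumbered.index, '\n')           # labels replaced by 0, 1, 2, ...

duplicate_found = False
try:
    pd.concat([a, b], verify_integrity=True)
except ValueError as err:
    duplicate_found = True
    print('duplicate label detected:', err)
```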

Modifying Data in a Series#

Important functions for modifying data in a series are:

  • apply (apply a function to each item or to the whole data array),

  • combine (choose items from two series to form a new one),

  • where (replace items which do not satisfy a condition),

  • mask (replace items which satisfy a condition)
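
The four functions listed above can be sketched as follows. The example data and the replacement value 0 are arbitrary choices, and each call returns a new series instead of modifying s.

```python
import pandas as pd

s = pd.Series([2, -4, 6, -5])
t = pd.Series([1, 1, 10, 10])

squared = s.apply(lambda x: x ** 2)    # function applied to each item
larger = s.combine(t, max)             # item-wise choice between s and t
kept = s.where(s >= 0, 0)              # items failing the condition become 0
masked = s.mask(s < 0, 0)              # items satisfying the condition become 0

print(squared.tolist())
print(larger.tolist())
print(kept.tolist())
print(masked.tolist())
```

Here where and mask produce the same result, because the two conditions are the negation of each other.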