Advanced Indexing#

One of Pandas’ most useful features is its powerful indexing mechanism. Here we’ll discuss several types of index objects.

import pandas as pd

Series and data frames have one or two index objects, respectively. They are accessible via Series.index or DataFrame.index and DataFrame.columns. An index object is a list-like object holding all row or column labels.

To create a new index, call the constructor of the Index class:

new_index = pd.Index(['a', 'b', 'c', 'd', 'e'])
new_index
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

Reindexing#

An index object may be replaced by another one. We have to take care whether data alignment between old and new index shall be applied or not.

With Data Alignment#

Series and DataFrame objects provide the reindex method. This method takes an index object (or a list of labels) and replaces the existing index by the new one. Data alignment is applied, that is, rows/columns with label in the intersection of old and new index remain unchanged, but rows/columns with old label not in the new index are dropped. If there are labels in the new index which aren’t in the old one, then rows/columns are filled with nan or some specified value or by some more complex filling logic.

s = pd.Series({'a': 123, 'b': 456, 'e': 789})
print(s, '\n')

new_index = pd.Index(['a', 'b', 'c', 'd', 'e'])
s = s.reindex(new_index)
s
a    123
b    456
e    789
dtype: int64 
a    123.0
b    456.0
c      NaN
d      NaN
e    789.0
dtype: float64

With fill value:

s = pd.Series({'a': 123, 'b': 456, 'e': 789})
print(s, '\n')

new_index = pd.Index(['a', 'b', 'c', 'd', 'e'])
s = s.reindex(new_index, fill_value=0)
s
a    123
b    456
e    789
dtype: int64 
a    123
b    456
c      0
d      0
e    789
dtype: int64

With filling logic:

s = pd.Series({'a': 123, 'b': 456, 'e': 789})
print(s, '\n')

new_index = pd.Index(['a', 'b', 'c', 'd', 'e'])
s = s.reindex(new_index, method='bfill')
s
a    123
b    456
e    789
dtype: int64 
a    123
b    456
c    789
d    789
e    789
dtype: int64

The align method reindexes two series/data frames such that both have the same index.

s1 = pd.Series({'a': 123, 'b': 456, 'e': 789})
s2 = pd.Series({'a': 98, 'c': 76, 'e': 54})
print(s1, '\n')
print(s2, '\n')

s1, s2 = s1.align(s2, axis=0)
print(s1, '\n')
print(s2)
a    123
b    456
e    789
dtype: int64 

a    98
c    76
e    54
dtype: int64 

a    123.0
b    456.0
c      NaN
e    789.0
dtype: float64 

a    98.0
b     NaN
c    76.0
e    54.0
dtype: float64

Without Data Alignment#

To simply replace an index without data alignment, that is to rename all the labels, there are two variants:

  • replace the index object by a new one of same length via usual assignment,

  • use an existing column as index.

s = pd.Series({'a': 123, 'b': 456, 'e': 789})
print(s, '\n')

new_index = pd.Index(['aa', 'bb', 'cc'])
s.index = new_index
s
a    123
b    456
e    789
dtype: int64 
aa    123
bb    456
cc    789
dtype: int64

To use a column of a data frame as index call the set_index method and provide the column label.

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                  index=['a', 'b','c'], columns=['A', 'B', 'C'])
display(df)

df = df.set_index('A')
df
A B C
a 1 2 3
b 4 5 6
c 7 8 9
B C
A
1 2 3
4 5 6
7 8 9

To convert the index to a usual column call reset_index. The index will be replaced by the standard index (integers starting at 0).

df = df.reset_index()
df
A B C
0 1 2 3
1 4 5 6
2 7 8 9

Index Sharing#

Index objects may be shared between several series or data frames. Simply pass the index of an existing series or data frame to the constructor of a new series or data frame or assign it directly or use reindex.

s1 = pd.Series({'a': 123, 'b': 456, 'e': 789})
s2 = pd.Series([4, 6, 8], index=s1.index)

s1.index is s2.index
True

Note

Reindexing two series/data frames with align (see above) results in a shared index.

Enlargement by Assignment#

To append data to a series or data frame we may use label based indexing:

s = pd.Series({'a': 123, 'b': 456})
print(s, '\n')

s.loc['e'] = 789
s
a    123
b    456
dtype: int64 
a    123
b    456
e    789
dtype: int64
df = pd.DataFrame([[1, 2], [3, 4]], index=['a', 'b'], columns=['A', 'B'])
display(df)

df.loc['c', 'D'] = 5
df
A B
a 1 2
b 3 4
A B D
a 1.0 2.0 NaN
b 3.0 4.0 NaN
c NaN NaN 5.0

Range Indices#

Pandas’ standard index is of type RangeIndex. It’s used whenever a series or a data frame is created without specifying an index.

index = pd.RangeIndex(5, 21, 2)
print(index, '\n')

for k in index:
    print(k)
RangeIndex(start=5, stop=21, step=2) 

5
7
9
11
13
15
17
19

Interval Indices#

The IntervalIndex class allows for imprecise indexing. Each item in a series or data frame can be accessed by any number in a specified interval.

interval_list = [pd.Interval(2, 3), pd.Interval(6, 7), pd.Interval(6.5, 9)]
print(interval_list, '\n')

s = pd.Series([23, 45, 67], index=pd.IntervalIndex(interval_list, closed='left'))
s
[Interval(2, 3, closed='right'), Interval(6, 7, closed='right'), Interval(6.5, 9, closed='right')] 
[2.0, 3.0)    23
[6.0, 7.0)    45
[6.5, 9.0)    67
dtype: int64

Indexing by concrete numbers:

print(s.loc[2], '\n')
# print(s.loc[3], '\n')    # KeyError
print(s.loc[2.5], '\n')
print(s.loc[6.7])
23 

23 

[6.0, 7.0)    45
[6.5, 9.0)    67
dtype: int64

Indexing by intervals:

# print(s.loc[pd.Interval(2, 3)])    # KeyError
print(s.loc[pd.Interval(2, 3, 'left')])    # only exact matches!
23

IntervalIndex objects provide overlaps and contains methods for more flexible indexing:

mask = s.index.overlaps(pd.Interval(2.5, 6.4))
print(mask)
s.loc[mask]
[ True  True False]
[2.0, 3.0)    23
[6.0, 7.0)    45
dtype: int64
mask = s.index.contains(6.7)
print(mask)
s.loc[mask]
[False  True  True]
[6.0, 7.0)    45
[6.5, 9.0)    67
dtype: int64

Note

In principle contains should work with intervals instead of concrete numbers, too. But in Pandas 1.5.1 NotImplementedError is raised.

Multi-Level Indexing#

Up to some details, multi-level indexing is indexing using tuples as labels. Corresponding indexing objects are of type MultiIndex. Major application of multi-level indices are indices representing several dimensions. Thus, high dimensional data can be stored in a two-dimensional data frame.

Creating a Multi-Level Index#

Let’s start with a two-level index. First level contains courses of studies provided by a university. Second level contains some lecture series. Data is the number of students from each course attending a lecture and the average rating for each lecture.

Using the MultiIndex constructor is the most general, but not very straight forward way to create a multi-level index. We have to provide lists of labels for each level. In addition, we need lists of codes for each level indicating which label to use at each position. Each level may have a name.

courses = ['Mathematics', 'Physics', 'Philosophie']
lectures = ['Computer Science', 'Mathematics', 'Epistemology']

courses_codes =  [0, 0, 0, 1, 1, 1, 2, 2, 2]
lectures_codes = [0, 1, 2, 0, 1, 2, 0, 1, 2]

index = pd.MultiIndex(levels=[courses, lectures],
                      codes=[courses_codes, lectures_codes],
                      names=['course', 'lecture'])

data = zip([10, 15, 8, 20, 17, 3, 2, 1, 89],
           [2.1, 1.3, 3.6, 3.0, 1.6, 4.7, 3.9, 4.9, 1.1])

df = pd.DataFrame(data, index=index, columns=['students', 'rating'])

df
students rating
course lecture
Mathematics Computer Science 10 2.1
Mathematics 15 1.3
Epistemology 8 3.6
Physics Computer Science 20 3.0
Mathematics 17 1.6
Epistemology 3 4.7
Philosophie Computer Science 2 3.9
Mathematics 1 4.9
Epistemology 89 1.1

Alternative creation methods are MultiIndex.from_arrays, MultiIndex.from_tuples, MultiIndex.from_product, MultiIndex.from_frame.

Hint

The mentioned creation methods are static methods. See Types for some explanation of the concept.

The above index contains each combination of items from two lists. Thus from_product is applicable:

courses = ['Mathematics', 'Physics', 'Philosophie']
lectures = ['Computer Science', 'Mathematics', 'Epistemology']

index = pd.MultiIndex.from_product([courses, lectures], names=['course', 'lecture'])

data = zip([10, 15, 8, 20, 17, 3, 2, 1, 89],
           [2.1, 1.3, 3.6, 3.0, 1.6, 4.7, 3.9, 4.9, 1.1])

df = pd.DataFrame(data, index=index, columns=['students', 'rating'])

df
students rating
course lecture
Mathematics Computer Science 10 2.1
Mathematics 15 1.3
Epistemology 8 3.6
Physics Computer Science 20 3.0
Mathematics 17 1.6
Epistemology 3 4.7
Philosophie Computer Science 2 3.9
Mathematics 1 4.9
Epistemology 89 1.1

Level information is stored in the levels member variable:

df.index.levels
FrozenList([['Mathematics', 'Philosophie', 'Physics'], ['Computer Science', 'Epistemology', 'Mathematics']])

Note, that multi-level indexing is not restricted to row indexing. Multi-level column indexing works in exactly the same manner.

Accessing Data#

Accessing data works as for other types of indices. Labels now are tuples containing one item per level. But there exist additional techniques specific to multi-level indices.

Single Tuples#

df.loc[('Physics', 'Computer Science'), :]
students    20.0
rating       3.0
Name: (Physics, Computer Science), dtype: float64
df.iloc[1, 1]
1.3
df.loc[[('Physics', 'Computer Science'), ('Mathematics', 'Epistemology')], :]
students rating
course lecture
Physics Computer Science 20 3.0
Mathematics Epistemology 8 3.6

Slicing works as usual.

Slicing Inside Tuples#

A new feature specific to multi-level indexing is slicing inside tuples. We would expect notation like

('Physics', :)

to get all rows with Physics at first level. But usual slicing syntax is not available here. Instead we have to use the built-in slice function. It takes start, stop and step values (start and step default to None) and returns a slice object. More precisely, the slice function is the constructor for slice objects. A slice object simply holds three values (start, stop, step).

df.loc[('Physics', slice(None)), :]
students rating
course lecture
Physics Computer Science 20 3.0
Mathematics 17 1.6
Epistemology 3 4.7

With slice(None) we create a slice object interpreted as all (analogously to :).

Slicing in the first level works, too.

df = df.sort_index()
df.loc[(slice('Mathematics', 'Physics'), 'Epistemology'), :]
students rating
course lecture
Mathematics Epistemology 8 3.6
Philosophie Epistemology 89 1.1
Physics Epistemology 3 4.7

Note that label based slicing above requires a sorted index. Thus, we have to call sort_index first.

An alternative to slice is creating a pd.IndexSlice object, which allows for natural slicing syntax:

df.loc[pd.IndexSlice['Mathematics':'Physics', 'Epistemology'], :]
students rating
course lecture
Mathematics Epistemology 8 3.6
Philosophie Epistemology 89 1.1
Physics Epistemology 3 4.7

Building a Mask Based on a Level#

The get_level_values method of an index object takes a level name or level index as argument and returns a simple Index object containing only the labels at the specified level. This object can then be used to create a boolean array for row indexing.

import numpy as np

no_physics_mask = df.index.get_level_values('course') != 'Physics'
no_epistemology_mask = df.index.get_level_values(1) != 'Epistemology'

mask = np.logical_and(no_physics_mask, no_epistemology_mask)

df.loc[mask, :]
students rating
course lecture
Mathematics Computer Science 10 2.1
Mathematics 15 1.3
Philosophie Computer Science 2 3.9
Mathematics 1 4.9

Comparing an index object with a single value results in a one-dimensional boolean NumPy array with same length as the index object. NumPy’s logical_and method implements elementwise logical and.

Cross-Sections#

There’s a shorthand for selecting all rows with a given label at a given level: xs. This method takes a label and a level and returns corresponding rows as data frame.

df.xs('Mathematics', level='lecture')
students rating
course
Mathematics 15 1.3
Philosophie 1 4.9
Physics 17 1.6

The level Keyword Argument#

Many Pandas functions accept a level keyword argument (like xs above or drop) to provide functionality adapted to multi-level indices.

More on Multi-Level Indexing#

There are many different styles for using multi-level indexing. Some of them are very confusing for beginners, because same syntax may have different semantics depending on the objects (row/column label, tuple, list) passed as arguments. Here we only considered save and syntactically clear variants. To get an idea of other indexing styles have a look at MultiIndex / advanced indexing in the Pandas user guide.