Advanced Indexing
Contents
Advanced Indexing#
One of Pandas’ most useful features is its powerful indexing mechanism. Here we’ll discuss several types of index objects.
import pandas as pd
Series and data frames have one or two index objects, respectively. They are accessible via Series.index
or DataFrame.index
and DataFrame.columns
. An index object is a list-like object holding all row or column labels.
To create a new index, call the constructor of the Index
class:
new_index = pd.Index(['a', 'b', 'c', 'd', 'e'])
new_index
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
Reindexing#
An index object may be replaced by another one. We have to take care whether data alignment between old and new index shall be applied or not.
With Data Alignment#
Series
and DataFrame
objects provide the reindex
method. This method takes an index object (or a list of labels) and replaces the existing index by the new one. Data alignment is applied, that is, rows/columns with label in the intersection of old and new index remain unchanged, but rows/columns with old label not in the new index are dropped. If there are labels in the new index which aren’t in the old one, then rows/columns are filled with nan
or some specified value or by some more complex filling logic.
s = pd.Series({'a': 123, 'b': 456, 'e': 789})
print(s, '\n')
new_index = pd.Index(['a', 'b', 'c', 'd', 'e'])
s = s.reindex(new_index)
s
a 123
b 456
e 789
dtype: int64
a 123.0
b 456.0
c NaN
d NaN
e 789.0
dtype: float64
With fill value:
s = pd.Series({'a': 123, 'b': 456, 'e': 789})
print(s, '\n')
new_index = pd.Index(['a', 'b', 'c', 'd', 'e'])
s = s.reindex(new_index, fill_value=0)
s
a 123
b 456
e 789
dtype: int64
a 123
b 456
c 0
d 0
e 789
dtype: int64
With filling logic:
s = pd.Series({'a': 123, 'b': 456, 'e': 789})
print(s, '\n')
new_index = pd.Index(['a', 'b', 'c', 'd', 'e'])
s = s.reindex(new_index, method='bfill')
s
a 123
b 456
e 789
dtype: int64
a 123
b 456
c 789
d 789
e 789
dtype: int64
The align
method reindexes two series/data frames such that both have the same index.
s1 = pd.Series({'a': 123, 'b': 456, 'e': 789})
s2 = pd.Series({'a': 98, 'c': 76, 'e': 54})
print(s1, '\n')
print(s2, '\n')
s1, s2 = s1.align(s2, axis=0)
print(s1, '\n')
print(s2)
a 123
b 456
e 789
dtype: int64
a 98
c 76
e 54
dtype: int64
a 123.0
b 456.0
c NaN
e 789.0
dtype: float64
a 98.0
b NaN
c 76.0
e 54.0
dtype: float64
Without Data Alignment#
To simply replace an index without data alignment, that is to rename all the labels, there are two variants:
replace the index object by a new one of same length via usual assignment,
use an existing column as index.
s = pd.Series({'a': 123, 'b': 456, 'e': 789})
print(s, '\n')
new_index = pd.Index(['aa', 'bb', 'cc'])
s.index = new_index
s
a 123
b 456
e 789
dtype: int64
aa 123
bb 456
cc 789
dtype: int64
To use a column of a data frame as index call the set_index
method and provide the column label.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
index=['a', 'b','c'], columns=['A', 'B', 'C'])
display(df)
df = df.set_index('A')
df
A | B | C | |
---|---|---|---|
a | 1 | 2 | 3 |
b | 4 | 5 | 6 |
c | 7 | 8 | 9 |
B | C | |
---|---|---|
A | ||
1 | 2 | 3 |
4 | 5 | 6 |
7 | 8 | 9 |
To convert the index to a usual column call reset_index
. The index will be replaced by the standard index (integers starting at 0).
df = df.reset_index()
df
A | B | C | |
---|---|---|---|
0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 |
2 | 7 | 8 | 9 |
Index Sharing#
Index objects may be shared between several series or data frames. Simply pass the index of an existing series or data frame to the constructor of a new series or data frame or assign it directly or use reindex
.
s1 = pd.Series({'a': 123, 'b': 456, 'e': 789})
s2 = pd.Series([4, 6, 8], index=s1.index)
s1.index is s2.index
True
Note
Reindexing two series/data frames with align
(see above) results in a shared index.
Enlargement by Assignment#
To append data to a series or data frame we may use label based indexing:
s = pd.Series({'a': 123, 'b': 456})
print(s, '\n')
s.loc['e'] = 789
s
a 123
b 456
dtype: int64
a 123
b 456
e 789
dtype: int64
df = pd.DataFrame([[1, 2], [3, 4]], index=['a', 'b'], columns=['A', 'B'])
display(df)
df.loc['c', 'D'] = 5
df
A | B | |
---|---|---|
a | 1 | 2 |
b | 3 | 4 |
A | B | D | |
---|---|---|---|
a | 1.0 | 2.0 | NaN |
b | 3.0 | 4.0 | NaN |
c | NaN | NaN | 5.0 |
Range Indices#
Pandas’ standard index is of type RangeIndex
. It’s used whenever a series or a data frame is created without specifying an index.
index = pd.RangeIndex(5, 21, 2)
print(index, '\n')
for k in index:
print(k)
RangeIndex(start=5, stop=21, step=2)
5
7
9
11
13
15
17
19
Interval Indices#
The IntervalIndex
class allows for imprecise indexing. Each item in a series or data frame can be accessed by any number in a specified interval.
interval_list = [pd.Interval(2, 3), pd.Interval(6, 7), pd.Interval(6.5, 9)]
print(interval_list, '\n')
s = pd.Series([23, 45, 67], index=pd.IntervalIndex(interval_list, closed='left'))
s
[Interval(2, 3, closed='right'), Interval(6, 7, closed='right'), Interval(6.5, 9, closed='right')]
[2.0, 3.0) 23
[6.0, 7.0) 45
[6.5, 9.0) 67
dtype: int64
Indexing by concrete numbers:
print(s.loc[2], '\n')
# print(s.loc[3], '\n') # KeyError
print(s.loc[2.5], '\n')
print(s.loc[6.7])
23
23
[6.0, 7.0) 45
[6.5, 9.0) 67
dtype: int64
Indexing by intervals:
# print(s.loc[pd.Interval(2, 3)]) # KeyError
print(s.loc[pd.Interval(2, 3, 'left')]) # only exact matches!
23
IntervalIndex
objects provide overlaps
and contains
methods for more flexible indexing:
mask = s.index.overlaps(pd.Interval(2.5, 6.4))
print(mask)
s.loc[mask]
[ True True False]
[2.0, 3.0) 23
[6.0, 7.0) 45
dtype: int64
mask = s.index.contains(6.7)
print(mask)
s.loc[mask]
[False True True]
[6.0, 7.0) 45
[6.5, 9.0) 67
dtype: int64
Note
In principle contains
should work with intervals instead of concrete numbers, too. But in Pandas 1.5.1 NotImplementedError
is raised.
Multi-Level Indexing#
Up to some details, multi-level indexing is indexing using tuples as labels. Corresponding indexing objects are of type MultiIndex
. Major application of multi-level indices are indices representing several dimensions. Thus, high dimensional data can be stored in a two-dimensional data frame.
Creating a Multi-Level Index#
Let’s start with a two-level index. First level contains courses of studies provided by a university. Second level contains some lecture series. Data is the number of students from each course attending a lecture and the average rating for each lecture.
Using the MultiIndex
constructor is the most general, but not very straight forward way to create a multi-level index. We have to provide lists of labels for each level. In addition, we need lists of codes for each level indicating which label to use at each position. Each level may have a name.
courses = ['Mathematics', 'Physics', 'Philosophie']
lectures = ['Computer Science', 'Mathematics', 'Epistemology']
courses_codes = [0, 0, 0, 1, 1, 1, 2, 2, 2]
lectures_codes = [0, 1, 2, 0, 1, 2, 0, 1, 2]
index = pd.MultiIndex(levels=[courses, lectures],
codes=[courses_codes, lectures_codes],
names=['course', 'lecture'])
data = zip([10, 15, 8, 20, 17, 3, 2, 1, 89],
[2.1, 1.3, 3.6, 3.0, 1.6, 4.7, 3.9, 4.9, 1.1])
df = pd.DataFrame(data, index=index, columns=['students', 'rating'])
df
students | rating | ||
---|---|---|---|
course | lecture | ||
Mathematics | Computer Science | 10 | 2.1 |
Mathematics | 15 | 1.3 | |
Epistemology | 8 | 3.6 | |
Physics | Computer Science | 20 | 3.0 |
Mathematics | 17 | 1.6 | |
Epistemology | 3 | 4.7 | |
Philosophie | Computer Science | 2 | 3.9 |
Mathematics | 1 | 4.9 | |
Epistemology | 89 | 1.1 |
Alternative creation methods are MultiIndex.from_arrays
, MultiIndex.from_tuples
, MultiIndex.from_product
, MultiIndex.from_frame
.
Hint
The mentioned creation methods are static
methods. See Types for some explanation of the concept.
The above index contains each combination of items from two lists. Thus from_product
is applicable:
courses = ['Mathematics', 'Physics', 'Philosophie']
lectures = ['Computer Science', 'Mathematics', 'Epistemology']
index = pd.MultiIndex.from_product([courses, lectures], names=['course', 'lecture'])
data = zip([10, 15, 8, 20, 17, 3, 2, 1, 89],
[2.1, 1.3, 3.6, 3.0, 1.6, 4.7, 3.9, 4.9, 1.1])
df = pd.DataFrame(data, index=index, columns=['students', 'rating'])
df
students | rating | ||
---|---|---|---|
course | lecture | ||
Mathematics | Computer Science | 10 | 2.1 |
Mathematics | 15 | 1.3 | |
Epistemology | 8 | 3.6 | |
Physics | Computer Science | 20 | 3.0 |
Mathematics | 17 | 1.6 | |
Epistemology | 3 | 4.7 | |
Philosophie | Computer Science | 2 | 3.9 |
Mathematics | 1 | 4.9 | |
Epistemology | 89 | 1.1 |
Level information is stored in the levels
member variable:
df.index.levels
FrozenList([['Mathematics', 'Philosophie', 'Physics'], ['Computer Science', 'Epistemology', 'Mathematics']])
Note, that multi-level indexing is not restricted to row indexing. Multi-level column indexing works in exactly the same manner.
Accessing Data#
Accessing data works as for other types of indices. Labels now are tuples containing one item per level. But there exist additional techniques specific to multi-level indices.
Single Tuples#
df.loc[('Physics', 'Computer Science'), :]
students 20.0
rating 3.0
Name: (Physics, Computer Science), dtype: float64
df.iloc[1, 1]
1.3
df.loc[[('Physics', 'Computer Science'), ('Mathematics', 'Epistemology')], :]
students | rating | ||
---|---|---|---|
course | lecture | ||
Physics | Computer Science | 20 | 3.0 |
Mathematics | Epistemology | 8 | 3.6 |
Slicing works as usual.
Slicing Inside Tuples#
A new feature specific to multi-level indexing is slicing inside tuples. We would expect notation like
('Physics', :)
to get all rows with Physics
at first level. But usual slicing syntax is not available here. Instead we have to use the built-in slice
function. It takes start, stop and step values (start and step default to None
) and returns a slice
object. More precisely, the slice
function is the constructor for slice
objects. A slice object simply holds three values (start, stop, step).
df.loc[('Physics', slice(None)), :]
students | rating | ||
---|---|---|---|
course | lecture | ||
Physics | Computer Science | 20 | 3.0 |
Mathematics | 17 | 1.6 | |
Epistemology | 3 | 4.7 |
With slice(None)
we create a slice object interpreted as all (analogously to :
).
Slicing in the first level works, too.
df = df.sort_index()
df.loc[(slice('Mathematics', 'Physics'), 'Epistemology'), :]
students | rating | ||
---|---|---|---|
course | lecture | ||
Mathematics | Epistemology | 8 | 3.6 |
Philosophie | Epistemology | 89 | 1.1 |
Physics | Epistemology | 3 | 4.7 |
Note that label based slicing above requires a sorted index. Thus, we have to call sort_index
first.
An alternative to slice
is creating a pd.IndexSlice
object, which allows for natural slicing syntax:
df.loc[pd.IndexSlice['Mathematics':'Physics', 'Epistemology'], :]
students | rating | ||
---|---|---|---|
course | lecture | ||
Mathematics | Epistemology | 8 | 3.6 |
Philosophie | Epistemology | 89 | 1.1 |
Physics | Epistemology | 3 | 4.7 |
Building a Mask Based on a Level#
The get_level_values
method of an index object takes a level name or level index as argument and returns a simple Index
object containing only the labels at the specified level. This object can then be used to create a boolean array for row indexing.
import numpy as np
no_physics_mask = df.index.get_level_values('course') != 'Physics'
no_epistemology_mask = df.index.get_level_values(1) != 'Epistemology'
mask = np.logical_and(no_physics_mask, no_epistemology_mask)
df.loc[mask, :]
students | rating | ||
---|---|---|---|
course | lecture | ||
Mathematics | Computer Science | 10 | 2.1 |
Mathematics | 15 | 1.3 | |
Philosophie | Computer Science | 2 | 3.9 |
Mathematics | 1 | 4.9 |
Comparing an index object with a single value results in a one-dimensional boolean NumPy array with same length as the index object. NumPy’s logical_and
method implements elementwise logical and.
The level
Keyword Argument#
Many Pandas functions accept a level
keyword argument (like xs
above or drop
) to provide functionality adapted to multi-level indices.
More on Multi-Level Indexing#
There are many different styles for using multi-level indexing. Some of them are very confusing for beginners, because same syntax may have different semantics depending on the objects (row/column label, tuple, list) passed as arguments. Here we only considered save and syntactically clear variants. To get an idea of other indexing styles have a look at MultiIndex / advanced indexing in the Pandas user guide.