Restructuring Data#

Restructuring and aggregation are two basic methods for extracting statistical information from data. We start with groupwise aggregation and then discuss several forms of restructuring without and with additional aggregation.

import pandas as pd

Grouping#

Grouping is the first step in the so-called split-apply-combine procedure in data processing. Data is split into groups by some criterion, then some function is applied to each group, finally results get (re-)combinded. Typical functions in the apply step are sum or mean (more general: aggregation) or any type of transform or filtering functions (drop groups containing nan items, for instance).

This chapter follows the structure of the Pandas user guide, but leaves out sections on very specific details. Feel free to have a look at those details later on.

Splitting into Groups and Basic Usage#

Grouping is done by calling the groupby method of a series or data frame. It takes a column label or a list of column labels as argument and returns a SeriesGroupBy or DataFrameGroupBy object. The returned object represents a kind of list of groups, each group being a small series or data frame. All rows in a group have identical values in the columns used for grouping.

The ...GroupBy object offers several methods for working with the determined groups. Iterating over such objects is possible, too.

Grouping by one column and subsequent aggregation yields an index with values from the column used for grouping:

df = pd.DataFrame({'age': [2, 3, 3, 2, 4, 5, 5, 5],
                   'score': [2.3, 4.5, 3.4, 2.0, 5.4, 7.2, 2.8, 3.9]})
display(df)

g = df.groupby('age')

for name, group in g:
    print('age:', name)
    display(group)

df_means  = g.mean()
df_means
age score
0 2 2.3
1 3 4.5
2 3 3.4
3 2 2.0
4 4 5.4
5 5 7.2
6 5 2.8
7 5 3.9
age: 2
age score
0 2 2.3
3 2 2.0
age: 3
age score
1 3 4.5
2 3 3.4
age: 4
age score
4 4 5.4
age: 5
age score
5 5 7.2
6 5 2.8
7 5 3.9
score
age
2 2.150000
3 3.950000
4 5.400000
5 4.633333

Grouping by two columns and subsequent aggregation yields a multi-level index:

df = pd.DataFrame({'age': [2, 3, 3, 2, 4, 5, 5, 5],
                   'answer': ['yes', 'no', 'no', 'no', 'no', 'yes', 'yes', 'no'],
                   'score': [2.3, 4.5, 3.4, 2.0, 5.4, 7.2, 2.8, 3.9]})
display(df)

g = df.groupby(['age', 'answer'])

for name, group in g:
    print('age:', name[0])
    print('answer:', name[1])
    display(group)

df_means  = g.mean()
display(df_means)
age answer score
0 2 yes 2.3
1 3 no 4.5
2 3 no 3.4
3 2 no 2.0
4 4 no 5.4
5 5 yes 7.2
6 5 yes 2.8
7 5 no 3.9
age: 2
answer: no
age answer score
3 2 no 2.0
age: 2
answer: yes
age answer score
0 2 yes 2.3
age: 3
answer: no
age answer score
1 3 no 4.5
2 3 no 3.4
age: 4
answer: no
age answer score
4 4 no 5.4
age: 5
answer: no
age answer score
7 5 no 3.9
age: 5
answer: yes
age answer score
5 5 yes 7.2
6 5 yes 2.8
score
age answer
2 no 2.00
yes 2.30
3 no 3.95
4 no 5.40
5 no 3.90
yes 5.00

Grouping by levels of a multi-level index is possible by providing the level argument to groupby.

With get_group we have access to single groups:

g.get_group((5, 'yes'))
age answer score
5 5 yes 7.2
6 5 yes 2.8

DataFrameGroupBy objects allow for column indexing:

df = pd.DataFrame({'age': [2, 3, 3, 2, 4, 5, 5, 5],
                   'answer': ['yes', 'no', 'no', 'no', 'no', 'yes', 'yes', 'no'],
                   'score': [2.3, 4.5, 3.4, 2.0, 5.4, 7.2, 2.8, 3.9]})
display(df)

g = df.groupby('age')

g['answer'].get_group(5)
age answer score
0 2 yes 2.3
1 3 no 4.5
2 3 no 3.4
3 2 no 2.0
4 4 no 5.4
5 5 yes 7.2
6 5 yes 2.8
7 5 no 3.9
5    yes
6    yes
7     no
Name: answer, dtype: object

Aggregation#

To apply a function to each column of each group use aggregate. It takes a function or a list of functions as argument. Providing a dictionary of column: function pairs allows for column specific functions.

import numpy as np

df = pd.DataFrame({'age': [2, 3, 3, 2, 4, 5, 5, 5],
                   'answer': ['yes', 'no', 'no', 'no', 'no', 'yes', 'yes', 'no'],
                   'score': [2.3, 4.5, 3.4, 2.0, 5.4, 7.2, 2.8, 3.9]})
display(df)

g = df.groupby('age')

display(g.aggregate(np.min))
display(g.aggregate([np.min, np.max]))
display(g.aggregate({'answer': np.min, 'score': np.mean}))
age answer score
0 2 yes 2.3
1 3 no 4.5
2 3 no 3.4
3 2 no 2.0
4 4 no 5.4
5 5 yes 7.2
6 5 yes 2.8
7 5 no 3.9
answer score
age
2 no 2.0
3 no 3.4
4 no 5.4
5 no 2.8
answer score
amin amax amin amax
age
2 no yes 2.0 2.3
3 no no 3.4 4.5
4 no no 5.4 5.4
5 no yes 2.8 7.2
answer score
age
2 no 2.150000
3 no 3.950000
4 no 5.400000
5 no 4.633333

With size we get group sizes.

g.size()
age
2    2
3    2
4    1
5    3
dtype: int64

Many aggregation functions are directly accessible from the ...GroupBy object. Examples are ...GroupBy.sum and ...GroupBy.mean. See Computations / descriptive stats for a complete list.

Transformation#

The transform method allows to transform rows groupwise resulting in a data frame with same shape as the original one.

df = pd.DataFrame({'age': [2, 3, 3, 2, 4, 5, 5, 5],
                   'answer': ['yes', 'no', 'no', 'no', 'no', 'yes', 'yes', 'no'],
                   'score': [2.3, 4.5, 3.4, 2.0, 5.4, 7.2, 2.8, 3.9]})
display(df)

g = df.groupby('age')

# substract the groups mean score in each age group
df['score'] = g['score'].transform(lambda score: score - score.mean())
df
age answer score
0 2 yes 2.3
1 3 no 4.5
2 3 no 3.4
3 2 no 2.0
4 4 no 5.4
5 5 yes 7.2
6 5 yes 2.8
7 5 no 3.9
age answer score
0 2 yes 0.150000
1 3 no 0.550000
2 3 no -0.550000
3 2 no -0.150000
4 4 no 0.000000
5 5 yes 2.566667
6 5 yes -1.833333
7 5 no -0.733333

Filtering#

To remove groups use filter method. It takes a function as argument and returns a data frame with rows belonging to removed groups removed. The passed function gets the group (series or data frame) and has to return True (keep group) or False (remove group).

df = pd.DataFrame({'age': [2, 3, 3, 2, 4, 5, 5, 5],
                   'answer': ['yes', 'no', 'no', 'no', 'no', 'yes', 'yes', 'no'],
                   'score': [2.3, 4.5, 3.4, 2.0, 5.4, 7.2, 2.8, 3.9]})
display(df)

g = df.groupby('age')

g.filter(lambda dfg: dfg['score'].mean() > 4)
age answer score
0 2 yes 2.3
1 3 no 4.5
2 3 no 3.4
3 2 no 2.0
4 4 no 5.4
5 5 yes 7.2
6 5 yes 2.8
7 5 no 3.9
age answer score
4 4 no 5.4
5 5 yes 7.2
6 5 yes 2.8
7 5 no 3.9

Restructuring Without Aggregation#

There are three basic techniques for restructuring data in a data frame:

  • pivot (interprets two specified columns as row and column index)

  • stack/unstack (move (level of) column index to (level of) row index and vice versa)

  • melt (create new column from some column labels)

Details and graphical illustrations of these technique may be found in Pandas’ user guide (first three sections).

Restructuring With Aggregation#

Pandas supports pivot tables via pivot_table function. Pivot tables are almost the same as pivoting with pivot but allow for multiple values per data cell, which then are aggregated to one value.

Details may be found in Pandas’ user guide.

Similar functionality is provided by crosstab. See Pandas user guide, too.