Restructuring and aggregation are two basic methods for extracting statistical information from data. We start with groupwise aggregation and then discuss several forms of restructuring without and with additional aggregation.
Grouping
Grouping is the first step in the so-called split-apply-combine procedure in data processing. Data is split into groups by some criterion, then some function is applied to each group, finally results get (re-)combinded. Typical functions in the apply step are sum or mean (more general: aggregation) or any type of transform or filtering functions (drop groups containing nan
items, for instance).
This chapter follows the structure of the Pandas user guide, but leaves out sections on very specific details. Feel free to have a look at those details later on.
Splitting into Groups and Basic Usage
Grouping is done by calling the groupby
method of a series or data frame. It takes a column label or a list of column labels as argument and returns a SeriesGroupBy
or DataFrameGroupBy
object. The returned object represents a kind of list of groups, each group being a small series or data frame. All rows in a group have identical values in the columns used for grouping.
The ...GroupBy
object offers several methods for working with the determined groups. Iterating over such objects is possible, too.
Grouping by one column and subsequent aggregation yields an index with values from the column used for grouping:
|
age |
score |
0 |
2 |
2.3 |
1 |
3 |
4.5 |
2 |
3 |
3.4 |
3 |
2 |
2.0 |
4 |
4 |
5.4 |
5 |
5 |
7.2 |
6 |
5 |
2.8 |
7 |
5 |
3.9 |
|
age |
score |
0 |
2 |
2.3 |
3 |
2 |
2.0 |
|
age |
score |
1 |
3 |
4.5 |
2 |
3 |
3.4 |
|
age |
score |
5 |
5 |
7.2 |
6 |
5 |
2.8 |
7 |
5 |
3.9 |
|
score |
age |
|
2 |
2.150000 |
3 |
3.950000 |
4 |
5.400000 |
5 |
4.633333 |
Grouping by two columns and subsequent aggregation yields a multi-level index:
|
age |
answer |
score |
0 |
2 |
yes |
2.3 |
1 |
3 |
no |
4.5 |
2 |
3 |
no |
3.4 |
3 |
2 |
no |
2.0 |
4 |
4 |
no |
5.4 |
5 |
5 |
yes |
7.2 |
6 |
5 |
yes |
2.8 |
7 |
5 |
no |
3.9 |
|
age |
answer |
score |
3 |
2 |
no |
2.0 |
|
age |
answer |
score |
0 |
2 |
yes |
2.3 |
|
age |
answer |
score |
1 |
3 |
no |
4.5 |
2 |
3 |
no |
3.4 |
|
age |
answer |
score |
4 |
4 |
no |
5.4 |
|
age |
answer |
score |
7 |
5 |
no |
3.9 |
|
age |
answer |
score |
5 |
5 |
yes |
7.2 |
6 |
5 |
yes |
2.8 |
|
|
score |
age |
answer |
|
2 |
no |
2.00 |
yes |
2.30 |
3 |
no |
3.95 |
4 |
no |
5.40 |
5 |
no |
3.90 |
yes |
5.00 |
Grouping by levels of a multi-level index is possible by providing the level
argument to groupby
.
With get_group
we have access to single groups:
|
age |
answer |
score |
5 |
5 |
yes |
7.2 |
6 |
5 |
yes |
2.8 |
DataFrameGroupBy
objects allow for column indexing:
|
age |
answer |
score |
0 |
2 |
yes |
2.3 |
1 |
3 |
no |
4.5 |
2 |
3 |
no |
3.4 |
3 |
2 |
no |
2.0 |
4 |
4 |
no |
5.4 |
5 |
5 |
yes |
7.2 |
6 |
5 |
yes |
2.8 |
7 |
5 |
no |
3.9 |
5 yes
6 yes
7 no
Name: answer, dtype: object
Aggregation
To apply a function to each column of each group use aggregate
. It takes a function or a list of functions as argument. Providing a dictionary of column: function
pairs allows for column specific functions.
|
age |
answer |
score |
0 |
2 |
yes |
2.3 |
1 |
3 |
no |
4.5 |
2 |
3 |
no |
3.4 |
3 |
2 |
no |
2.0 |
4 |
4 |
no |
5.4 |
5 |
5 |
yes |
7.2 |
6 |
5 |
yes |
2.8 |
7 |
5 |
no |
3.9 |
|
answer |
score |
age |
|
|
2 |
no |
2.0 |
3 |
no |
3.4 |
4 |
no |
5.4 |
5 |
no |
2.8 |
|
answer |
score |
|
amin |
amax |
amin |
amax |
age |
|
|
|
|
2 |
no |
yes |
2.0 |
2.3 |
3 |
no |
no |
3.4 |
4.5 |
4 |
no |
no |
5.4 |
5.4 |
5 |
no |
yes |
2.8 |
7.2 |
|
answer |
score |
age |
|
|
2 |
no |
2.150000 |
3 |
no |
3.950000 |
4 |
no |
5.400000 |
5 |
no |
4.633333 |
With size
we get group sizes.
age
2 2
3 2
4 1
5 3
dtype: int64
Many aggregation functions are directly accessible from the ...GroupBy
object. Examples are ...GroupBy.sum
and ...GroupBy.mean
. See Computations / descriptive stats for a complete list.
Filtering
To remove groups use filter
method. It takes a function as argument and returns a data frame with rows belonging to removed groups removed. The passed function gets the group (series or data frame) and has to return True
(keep group) or False
(remove group).
|
age |
answer |
score |
0 |
2 |
yes |
2.3 |
1 |
3 |
no |
4.5 |
2 |
3 |
no |
3.4 |
3 |
2 |
no |
2.0 |
4 |
4 |
no |
5.4 |
5 |
5 |
yes |
7.2 |
6 |
5 |
yes |
2.8 |
7 |
5 |
no |
3.9 |
|
age |
answer |
score |
4 |
4 |
no |
5.4 |
5 |
5 |
yes |
7.2 |
6 |
5 |
yes |
2.8 |
7 |
5 |
no |
3.9 |
Restructuring With Aggregation
Pandas supports pivot tables via pivot_table
function. Pivot tables are almost the same as pivoting with pivot
but allow for multiple values per data cell, which then are aggregated to one value.
Details may be found in Pandas’ user guide.
Similar functionality is provided by crosstab
. See Pandas user guide, too.