Categorical Data#

Besides numerical and string data, one frequently encounters categorical data, that is, data of any type that takes only finitely many different values. The admissible values are called categories. There are two kinds of categorical data:

  • nominal data (finitely many different values without any order)

  • ordinal data (finitely many different values with linear order)

Examples:

  • colors red, blue, green, yellow (nominal)

  • business days Monday, Tuesday, Wednesday, Thursday, Friday (ordinal)

Pandas provides explicit support for categorical data and indices. Major advantages of categorical data compared to string data are lower memory consumption and more meaningful source code.

import pandas as pd
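The memory advantage mentioned above can be checked with a small experiment. The following is a minimal sketch on made-up data (three labels repeated many times); the exact byte counts depend on the pandas version and platform, so only the comparison between the two numbers is meaningful.

```python
import pandas as pd

# Hypothetical sample data: three labels repeated many times.
labels = ['red', 'green', 'blue'] * 10_000

as_strings = pd.Series(labels)
as_categories = as_strings.astype('category')

# deep=True also counts the memory occupied by the string values themselves.
bytes_strings = as_strings.memory_usage(deep=True)
bytes_categories = as_categories.memory_usage(deep=True)

print('strings:   ', bytes_strings, 'bytes')
print('categories:', bytes_categories, 'bytes')
```

The categorical series stores each label only once and keeps a small integer code per item, which is why it needs less memory when labels repeat often.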

Creating Categorical Data#

Pandas has a class Categorical to hold a list of categorical data with (ordinal) or without (nominal) ordering. Such Categorical objects can be converted directly to series or to columns of a data frame. Category labels are almost always strings, but any other data type is allowed, too.

cat_data = pd.Categorical(['red', 'green', 'blue', 'green', 'green'],
                          categories=['red', 'green', 'blue'], ordered=False)

s = pd.Series(cat_data)
s
0      red
1    green
2     blue
3    green
4    green
dtype: category
Categories (3, object): ['red', 'green', 'blue']

Passing dtype='category' to series or data frame constructors works, too. Categories then are determined automatically.

s = pd.Series(['red', 'green', 'blue', 'green', 'green'], dtype='category')
s
0      red
1    green
2     blue
3    green
4    green
dtype: category
Categories (3, object): ['blue', 'green', 'red']

Or we may convert an existing series or data frame column to categorical type.

s = pd.Series(['red', 'green', 'blue', 'green', 'green'])
s = s.astype('category')
s
0      red
1    green
2     blue
3    green
4    green
dtype: category
Categories (3, object): ['blue', 'green', 'red']

Automatically determined categories always are unordered (nominal).
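Whether a series has ordered categories can be checked via the cat accessor. The sketch below also shows as_ordered, which turns the inferred categories into ordered ones using their current sequence. For string labels that sequence is alphabetical, which is rarely the intended order, so passing the categories explicitly is usually the better choice.

```python
import pandas as pd

s = pd.Series(['red', 'green', 'blue', 'green', 'green'], dtype='category')

# Inferred categories are unordered.
print(s.cat.ordered)

# as_ordered() returns a new series whose categories are ordered
# in their current sequence (here: blue < green < red).
s_ordered = s.cat.as_ordered()
print(s_ordered.cat.ordered)
print(list(s_ordered.cat.categories))
```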

An advantage of ordered categories is that the min and max functions can be used on the corresponding data.

quality = pd.Series(pd.Categorical(['poor', 'good', 'excellent', 'good', 'very good', 'poor'],
                                   categories=['very poor', 'poor', 'good', 'very good', 'excellent'],
                                   ordered=True))

print(quality.min())
print(quality.max())
poor
excellent
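Ordered categories also support comparisons and sorting. As a small sketch on the same quality data: comparing against a category label and sorting both follow the category order rather than the alphabetical order of the labels.

```python
import pandas as pd

levels = ['very poor', 'poor', 'good', 'very good', 'excellent']
quality = pd.Series(pd.Categorical(
    ['poor', 'good', 'excellent', 'good', 'very good', 'poor'],
    categories=levels, ordered=True))

# Element-wise comparison against a category label uses the category order.
at_least_good = quality >= 'good'
print(at_least_good.tolist())

# Sorting follows the category order, too.
print(quality.sort_values().tolist())
```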

Custom Categorical Types#

Instead of using the general categorical data type we may define new categorical types. Strictly speaking, categorical is not a well-defined type on its own, because the category labels have to be provided to obtain a full-fledged data type. A more natural way of using categories is to define a data type for each set of categories via CategoricalDtype.

A further advantage is that the same set of categories can be used for several series and data frames simultaneously.

colors = pd.CategoricalDtype(['red', 'green', 'blue', 'yellow'], ordered=False)

s = pd.Series(['red', 'red', 'black', 'blue'], dtype=colors)
s
0     red
1     red
2     NaN
3    blue
dtype: category
Categories (4, object): ['red', 'green', 'blue', 'yellow']

Values not covered by the categorical type are set to NaN.
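Sharing one categorical type between several series can be sketched as follows. The series names and their contents are made up for illustration. Because uncovered values silently turn into NaN, checking with isna after the conversion is a cheap way to detect unexpected labels.

```python
import pandas as pd

colors = pd.CategoricalDtype(['red', 'green', 'blue', 'yellow'], ordered=False)

# Two independent series sharing the same categorical type.
shirts = pd.Series(['red', 'blue', 'blue'], dtype=colors)
cars = pd.Series(['yellow', 'black', 'green'], dtype=colors)

# Both series have exactly the same data type.
print(shirts.dtype == cars.dtype)

# 'black' is not a category of colors, so it became a missing value.
print(cars.isna().tolist())
```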

Encoding Categorical Data for Machine Learning#

Most machine learning algorithms expect numerical input. Thus, categorical data has to be converted to numerical data first.

For ordinal data one might use the numbers 1, 2, 3, … instead of the original category labels. But for nominal data the natural ordering of integers adds artificial structure to the data, which might affect an algorithm’s behavior. Thus, one-hot encoding is usually used for converting nominal data to numerical data.

The idea is to replace a variable holding one of \(n\) categories by \(n\) boolean variables. Each new variable corresponds to one category. For each data item exactly one of these variables is set to True. Pandas supports this conversion via the get_dummies function.

colors = pd.CategoricalDtype(['red', 'green', 'blue', 'yellow'], ordered=False)

s = pd.Series(['red', 'red', 'green', 'blue'], dtype=colors)
print(s)

df = pd.get_dummies(s)
df
0      red
1      red
2    green
3     blue
dtype: category
Categories (4, object): ['red', 'green', 'blue', 'yellow']
   red  green  blue  yellow
0    1      0     0       0
1    1      0     0       0
2    0      1     0       0
3    0      0     1       0
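Both encodings discussed in this section can be sketched side by side. For ordinal data the cat.codes attribute gives the zero-based position of each value in the category list. For nominal data get_dummies is used as above. Depending on the pandas version, get_dummies returns boolean or integer columns by default; passing dtype=int requests 0/1 integers explicitly. The sample values are made up for illustration.

```python
import pandas as pd

# Ordinal data: encode each value by its position in the ordered category list.
levels = ['very poor', 'poor', 'good', 'very good', 'excellent']
quality = pd.Series(pd.Categorical(['poor', 'good', 'excellent'],
                                   categories=levels, ordered=True))
ordinal = quality.cat.codes + 1   # codes are zero-based, so shift to 1, 2, 3, ...
print(ordinal.tolist())

# Nominal data: one-hot encoding with integer instead of boolean columns.
colors = pd.CategoricalDtype(['red', 'green', 'blue', 'yellow'], ordered=False)
s = pd.Series(['red', 'red', 'green', 'blue'], dtype=colors)
one_hot = pd.get_dummies(s, dtype=int)
print(one_hot)

# Each row contains exactly one 1.
print(one_hot.sum(axis=1).tolist())
```

Note that the unused category yellow still gets its own column, because the columns are derived from the categorical type and not from the values present.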

Modifying Categories#

Series or data frame columns with categorical data have a cat member providing access to the set of categories. Some member functions are:
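A few of these member functions, sketched on a small made-up series. The selection is illustrative and not exhaustive. Each method returns a new series and leaves the original unchanged.

```python
import pandas as pd

s = pd.Series(['red', 'green', 'blue', 'green'], dtype='category')

# Current categories (an Index object).
print(list(s.cat.categories))

s2 = s.cat.rename_categories({'red': 'crimson'})   # relabel a category
s3 = s.cat.add_categories(['yellow'])              # add a (yet unused) category
s4 = s3.cat.remove_unused_categories()             # drop 'yellow' again
s5 = s.cat.remove_categories(['blue'])             # 'blue' values become NaN
s6 = s.cat.reorder_categories(['red', 'green', 'blue'], ordered=True)

print(list(s2.cat.categories))
print(list(s3.cat.categories))
print(list(s4.cat.categories))
print(s5.isna().tolist())
print(s6.min(), s6.max())
```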

Categorical Data and CSV Files#

Information about categories cannot be stored in CSV files. Instead, category labels are written to the CSV file in their native data type. When reading CSV data to a data frame, columns have to be converted to categorical types again, if desired.
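The round trip can be sketched with an in-memory buffer standing in for a CSV file. The column name color is made up for illustration. The categorical type can be restored either after reading via astype or while reading via the dtype argument of read_csv.

```python
import io
import pandas as pd

colors = pd.CategoricalDtype(['red', 'green', 'blue', 'yellow'], ordered=False)
df = pd.DataFrame({'color': pd.Series(['red', 'green', 'blue'], dtype=colors)})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Plain read: the column comes back as ordinary strings, not as categories.
buffer.seek(0)
plain = pd.read_csv(buffer)
print(plain['color'].dtype)

# Option 1: convert after reading.
restored = plain['color'].astype(colors)

# Option 2: request the categorical type while reading.
buffer.seek(0)
direct = pd.read_csv(buffer, dtype={'color': colors})

print(restored.dtype == colors, direct['color'].dtype == colors)
```

Keeping the CategoricalDtype object in the source code is what preserves the full category list (including the unused yellow) and the ordering flag, since neither is contained in the CSV data.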

Categorical Indices#

Pandas supports categorical indices via CategoricalIndex objects. Simply pass a Categorical object as index when creating a series or a data frame.

quality = pd.Categorical(['poor', 'good', 'excellent', 'good', 'very good', 'poor'],
                         categories=['very poor', 'poor', 'good', 'very good', 'excellent'],
                         ordered=True)
s = pd.Series([3, 4, 2, 23, 41, 5], index=quality)
print(s, '\n')

s = s.sort_index()
s
poor          3
good          4
excellent     2
good         23
very good    41
poor          5
dtype: int64 

poor          3
poor          5
good          4
good         23
very good    41
excellent     2
dtype: int64

Data access works as usual.

print(s.loc['poor'], '\n')
print(s.loc['poor':'very good'])
poor    3
poor    5
dtype: int64 

poor          3
poor          5
good          4
good         23
very good    41
dtype: int64

Categories by Binning#

Continuous data, or discrete data with too large a range, can be converted to categorical data by providing a list of intervals (bins) into which the items are placed. Each bin can be regarded as a category. Binning is important for machine learning tasks that require discrete data. The pd.cut function implements binning.
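A minimal sketch of binning with pd.cut. The ages, bin edges and labels are made up for illustration. By default each bin includes its right edge and excludes its left edge, and the resulting categories are ordered.

```python
import pandas as pd

# Hypothetical ages; bin edges and labels are chosen for illustration only.
ages = pd.Series([5, 18, 30, 65, 70])

# Three bins: (0, 18], (18, 65], (65, 120].
groups = pd.cut(ages, bins=[0, 18, 65, 120], labels=['child', 'adult', 'senior'])

print(groups.tolist())
print(groups.cat.ordered)
```

Without the labels argument the categories are the intervals themselves. Values outside all bins become NaN.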