Performance Issues#

As in the discussion in Efficiency Considerations for NumPy, with Pandas we have to take care of how we implement certain operations, at least if performance matters. NumPy guidelines carry over to Pandas, but some additional remarks are in order.

import pandas as pd

Vectorization#

Analogously to NumPy, in Pandas we should avoid iterating over the rows of series or data frames. Vectorization is almost always possible. For numeric columns Pandas relies on NumPy's vectorized function calls. For string and date/time data Pandas implements tailor-made vectorization techniques.
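As a small sketch (with made-up data), the following compares a Python-level loop to the equivalent vectorized operation; both yield the same series, but the vectorized version delegates the work to NumPy:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(1_000_000))

# slow: Python-level loop, one function call per item
doubled_loop = pd.Series([x * 2 for x in s])

# fast: one vectorized operation on the whole series
doubled_vec = s * 2

assert doubled_loop.equals(doubled_vec)
```

On a series of this size the vectorized variant is typically orders of magnitude faster than the loop.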

Vectorized String Operations#

Indices, series and data frame columns containing string data have a member str providing typical string operations. Calling such a method applies the operation to each data item.

s = pd.Series(['abc', 'def', 'ghijklmn'])

s.str.upper()
0         ABC
1         DEF
2    GHIJKLMN
dtype: object

See Pandas’ user guide for a list of supported string operations.

Vectorized Date/Time Operations#

Indices, series and data frame columns containing timestamp data have a member dt providing typical date/time operations. Calling such a method applies the operation to each data item.

s = pd.Series([pd.Timestamp(2022, 12, 24), pd.Timestamp(2022, 12, 25), pd.Timestamp(2022, 12, 26)])

s.dt.dayofweek
0    5
1    6
2    0
dtype: int64

See Pandas’ user guide for available methods.

Accelerating Code Execution#

Pandas has a function eval which executes Python-like code provided as a string. Thanks to optimization techniques (avoiding large intermediate arrays and making better use of CPU caches, among others), eval is faster than standard Python code for long expressions involving large data frames. The DataFrame.query method provides a simplified interface to eval for selecting rows via boolean operations on columns.

Both methods should only be used for operations on very large data frames. For small data frames they are significantly slower than standard Python. Have a look at Expression evaluation via eval() in Pandas' user guide for details.
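A minimal sketch of both methods, using a hypothetical data frame with random columns a and b; the assertions check that eval and query agree with the equivalent standard Pandas code:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.rand(100_000),
                   'b': np.random.rand(100_000)})

# eval computes an expression on columns from a string
result = df.eval('a + 2 * b')

# query selects rows via a boolean expression on columns
selected = df.query('a < 0.5 and b > 0.5')

# both agree with the equivalent standard Pandas operations
assert result.equals(df['a'] + 2 * df['b'])
assert selected.equals(df[(df['a'] < 0.5) & (df['b'] > 0.5)])
```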

Very Large Data Sets#

Sometimes data sets are too large to fit into memory as a whole. Pandas supports partial loading, and there are other Pandas-like Python libraries that support data sets larger than memory.

Partial Loading#

The pd.read_csv function supports chunking, that is, loading data in chunks. After a chunk has been processed it can be removed from memory and the next chunk read in. See Iterating through files chunk by chunk in Pandas' user guide.
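A small sketch of chunked reading; an in-memory CSV string stands in for a large file on disk, and passing chunksize makes read_csv return an iterator over data frames instead of one big data frame:

```python
import io

import pandas as pd

# in-memory CSV stands in for a large file on disk (made-up data)
csv_file = io.StringIO('x\n' + '\n'.join(str(i) for i in range(10)))

total = 0
# chunksize=4 yields data frames with at most 4 rows each
with pd.read_csv(csv_file, chunksize=4) as reader:
    for chunk in reader:
        # aggregate per chunk; only one chunk is in memory at a time
        total += chunk['x'].sum()

print(total)  # 0 + 1 + ... + 9 = 45
```

For a real file, replace the StringIO object by a file path.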

Other Libraries#

Dask is a parallel computing library with a Pandas-like API. It allows for faster processing of large data sets. Have a look at Use other libraries in Pandas' user guide for a quick introduction.