Data Frames#

A Pandas data frame (a DataFrame object) is a collection of Pandas series with common index. Each series can be interpreted as a column in a two-dimensional table. There is a second index object for indexing columns.

Data frames are the most important data structure provided by Pandas.

import pandas as pd

Creation of DataFrame Objects#

A DataFrame object can be created from lists/dictionaries of lists/dictionaries or from NumPy arrays or from lists/dictionaries of series, for instance. See DataFrame constructor in Pandas’ documentation. If necessary row labels and column labels can be provided via index and columns keyword arguments, respectively.

s1 = pd.Series({'a': 1, 'b': 2})
s2 = pd.Series({'b': 3, 'c': 4})

df = pd.DataFrame({'left': s1, 'right': s2})

df
left right
a 1.0 NaN
b 2.0 3.0
c NaN 4.0
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                  index=['top', 'middle', 'bottom'],
                  columns=['left', 'middle', 'right'])

df
left middle right
top 1 2 3
middle 4 5 6
bottom 7 8 9

Hint

JupyterLab shows data frames as graphical table. In other words, Jupyter’s display function yields output different from print. With print we get a text representation of the data frame.

The shape member contains a tuple with number of rows and columns in the data frame:

df.shape
(3, 3)

Data Alignment#

Like for series data frames implement data alignment where possible. This applies to row indexing as well as to column indexing.

df1 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                   index=['a', 'b', 'c'], columns=['A', 'B', 'C'])
df2 = pd.DataFrame([[11, 12, 13], [14, 15, 16], [17, 18, 19]],
                   index=['b', 'c', 'd'], columns=['A', 'C', 'D'])

display(df1)
display(df2)
display(df1 + df2)
A B C
a 1 2 3
b 4 5 6
c 7 8 9
A C D
b 11 12 13
c 14 15 16
d 17 18 19
A B C D
a NaN NaN NaN NaN
b 15.0 NaN 18.0 NaN
c 21.0 NaN 24.0 NaN
d NaN NaN NaN NaN

Underlying Data Structures#

The index object for row indexing is accessible via index member and the index object for column indexing is accessible via columns member.

Data frames also have a to_numpy method returning a data frame’s data as NumPy array.

Indexing#

Indexing data frames is very similar to indexing series. The major difference is that for data frames we have to specify a row and a column instead of only a row.

Overview#

There exist four widely used mechanisms (here df is some data frame):

  • df[...]: Python style indexing

  • df.ix[...]: old Pandas style indexing (removed from Pandas in January 2020)

  • df.loc[...] and df.iloc[...]: new Pandas style indexing

  • df.at[...] and df.iat[...]: new Pandas style indexing for more efficient access to single items

Python style indexing and old Pandas style indexing (the ix indexer) allow for position based indexing, label based indexing and mixed indexing (position for row, label for column, and vice versa). Both [...] and ix[...] behave slightly differently. As already discussed for series, sometimes it is not clear whether positional or label based indexing shall be used. Thus, [...] should be used with care. The ix indexer has been removed from Pandas in January 2020, but may appear in old code and documents.

Label Based Column Indexing with [...]#

Indexing with [...] is mainly used for selecting columns by label. For other purposes new Pandas style indexing should be used.

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                  index=['a', 'b', 'c'], columns=['A', 'B', 'C'])
display(df)

s = df['A']
df_part = df[['B', 'C']]

display(s)
display(df_part)
A B C
a 1 2 3
b 4 5 6
c 7 8 9
a    1
b    4
c    7
Name: A, dtype: int64
B C
a 2 3
b 5 6
c 8 9

If only one label is provided, then the result is a Series object. If a list of labels is provided, then the result is a DataFrame object. In case of integer labels, label based indexing is used when selecting columns:

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                  index=['a', 'b', 'c'], columns=[2, 23, 45])
display(df)
display(df[2])
2 23 45
a 1 2 3
b 4 5 6
c 7 8 9
a    1
b    4
c    7
Name: 2, dtype: int64

Hint

Label based column indexing can be used to create new columns: df['my_new_column'] = 0 creates a new column containing zeros, for instance.

Positional Indexing#

Positional indexing via iloc[...] or iat[...] works like for two-dimensional NumPy arrays.

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                  index=['a', 'b', 'c'], columns=['A', 'B', 'C'])
display(df)

display(df.iloc[1:3, 0:2])             # slicing
display(df.iloc[[1, 0], [2, 1, 0]])    # list of indices
display(df.iloc[[True, False, True], [False, True, True]])    # boolean indexing
display(df.iat[2, 1])                  # efficient single element access
display(df.iloc[2, 1])                 # less efficient single element access
A B C
a 1 2 3
b 4 5 6
c 7 8 9
A B
b 4 5
c 7 8
C B A
b 6 5 4
a 3 2 1
B C
a 2 3
c 8 9
8
8

Variants (slicing, boolean and so on) may be mixed for rows and columns.

Label Based Indexing#

Label based indexing works like with dictionaries. But slicing is allowed.

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                  index=['a', 'b', 'c'], columns=['A', 'B', 'C'])
display(df)

display(df.loc['c':'b':-1, 'A':'B'])       # slicing
display(df.loc[['c', 'a'], ['B', 'A']])    # list of labels
display(df.loc[[True, False, True], [False, True, True]])    # boolean indexing
display(df.at['b', 'B'])                   # efficient single element access
display(df.loc['b', 'B'])                  # less efficient single element access
A B C
a 1 2 3
b 4 5 6
c 7 8 9
A B
c 7 8
b 4 5
B A
c 8 7
a 2 1
B C
a 2 3
c 8 9
5
5

Important

Like for series label based slicing includes the stop label!

Indexing by Callables#

Both loc[...] and iloc[...] accept a function for row index and column index. The function has to take a data frame as argument and has to return something allowed for indexing (list of indices/labels, boolean array and so on).

Mixing Positional and Label Based Indexing#

Indexing with [...] allows for mixing positional and label based indexing. But [...] should be avoided as discussed for series. The only exception is column selection. With Pandas’ new indexing style mixed indexing is still possible but requires some more bytes of code:

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                  index=['a', 'b', 'c'], columns=['A', 'B', 'C'])
display(df)

display(df.loc['a':'b', df.columns[0:2]])
A B C
a 1 2 3
b 4 5 6
c 7 8 9
A B
a 1 2
b 4 5

The idea is to use loc[...] and the columns member variable. With columns we get access to the index object for column indexing. This index object allows for usual positional indexing techniques (slicing, boolean and so on). The result of indexing an index object is an index object again. But the new index object only contains the desired subset of indices. This smaller index object than is passed to loc[...]. Same is possible for rows via index member.

Views and Copies#

Like for NumPy arrays indexing Pandas data frames may return a view of the data frame. That is, modifying the extracted subset of items might modify the original data frame. If you really need a copy of the items, use the copy method of DataFrame objects.

Some Useful Member Functions#

A full list of member functions for DataFrame objects is provided in Pandas’ documentation. Here we only list a few.

A First Look at a Data Frame#

With describe we get basic statistical information about each column holding numerical values. The function returns a DataFrame object containing the collected information. Only columns with numerical data are considered by describe.

First and last rows are returned by head and tail. Both take an optional argument specifying the number of rows to return. Default is 5.

The info method prints memory usage and other useful information.

df = pd.DataFrame([[1, 2, 3, 'some'], [4, 5, 6, 'string'], [7, 8, 9, 'here']],
                  index=['a', 'b', 'c'], columns=['A', 'B', 'C', 'D'])
display(df)

display(df.describe())
display(df.head(2))
display(df.tail(2))
df.info()
A B C D
a 1 2 3 some
b 4 5 6 string
c 7 8 9 here
A B C
count 3.0 3.0 3.0
mean 4.0 5.0 6.0
std 3.0 3.0 3.0
min 1.0 2.0 3.0
25% 2.5 3.5 4.5
50% 4.0 5.0 6.0
75% 5.5 6.5 7.5
max 7.0 8.0 9.0
A B C D
a 1 2 3 some
b 4 5 6 string
A B C D
b 4 5 6 string
c 7 8 9 here
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, a to c
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   A       3 non-null      int64 
 1   B       3 non-null      int64 
 2   C       3 non-null      int64 
 3   D       3 non-null      object
dtypes: int64(3), object(1)
memory usage: 120.0+ bytes

Iterating Over a Data Frame#

To iterate over columns use items, which returns tuples containing the column label and the column’s data as Series object.

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                  index=['a', 'b', 'c'], columns=['A', 'B', 'C'])
display(df)

for label, s in df.items():
    print(label)
    print(s, '\n')
A B C
a 1 2 3
b 4 5 6
c 7 8 9
A
a    1
b    4
c    7
Name: A, dtype: int64 

B
a    2
b    5
c    8
Name: B, dtype: int64 

C
a    3
b    6
c    9
Name: C, dtype: int64 

Alternativly, iterate over the columns member variable:

for label in df.columns:
    print(label)
    print(df[label], '\n')
A
a    1
b    4
c    7
Name: A, dtype: int64 

B
a    2
b    5
c    8
Name: B, dtype: int64 

C
a    3
b    6
c    9
Name: C, dtype: int64 

Iteration over rows can be implemented via iterrows method. Analogously to items it returns tuples containing the row label and a Series object with the data.

for label, s in df.iterrows():
    print(label)
    print(s, '\n')
a
A    1
B    2
C    3
Name: a, dtype: int64 

b
A    4
B    5
C    6
Name: b, dtype: int64 

c
A    7
B    8
C    9
Name: c, dtype: int64 

Hint

Data in each column of a data frame has identical type, but types in a row may differ (from column to column). Thus, calling iterrows may involve type casting to get row data as Series object.

Important

Usually there is no need to iterate over rows, because Pandas provides much faster vectorized code for almost all operations needed for typical data science projects.

Vectorized Operators#

Mathematical operations and comparisons work in complete analogy to Series objects.

Removing and Adding Items#

Functions for removing and adding items:

  • concat (concatenate data frames vertically or horizontally),

  • drop (remove rows or columns from data frame).

Modifying Data in a Data Frame#

  • apply (apply function rowwise or columnwise to data frame),

  • combine (choose items from two data frames to form a new one),

  • where (replace items which do not satisfy a condition),

  • mask (replace items which satisfy a condition).

Data Frames from and to CSV Files#

The Pandas function read_csv reads a CSV file and returns a data frame.

The DataFrame method to_csv writes data to a CSV file.

Missing Values#

Missing values are a common problem in data science. Pandas provides several functions and mechanisms for handling missing values. Important functions:

  • isna (return boolean data frame with True at missing value positions),

  • notna (return boolean data frame with False at missing value positions),

  • fillna (fill all missing values, different fill methods are provided),

  • dropna (remove rows or columns containing missing values or consisting completely of missing values).

Details may be found in Pandas user guide.