Data Frames
Contents
Data Frames#
A Pandas data frame (a DataFrame
object) is a collection of Pandas series with common index. Each series can be interpreted as a column in a two-dimensional table. There is a second index object for indexing columns.
Data frames are the most important data structure provided by Pandas.
import pandas as pd
Creation of DataFrame
Objects#
A DataFrame
object can be created from lists/dictionaries of lists/dictionaries or from NumPy arrays or from lists/dictionaries of series, for instance. See DataFrame constructor in Pandas’ documentation. If necessary row labels and column labels can be provided via index
and columns
keyword arguments, respectively.
s1 = pd.Series({'a': 1, 'b': 2})
s2 = pd.Series({'b': 3, 'c': 4})
df = pd.DataFrame({'left': s1, 'right': s2})
df
left | right | |
---|---|---|
a | 1.0 | NaN |
b | 2.0 | 3.0 |
c | NaN | 4.0 |
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
index=['top', 'middle', 'bottom'],
columns=['left', 'middle', 'right'])
df
left | middle | right | |
---|---|---|---|
top | 1 | 2 | 3 |
middle | 4 | 5 | 6 |
bottom | 7 | 8 | 9 |
Hint
JupyterLab shows data frames as graphical table. In other words, Jupyter’s display
function yields output different from print
. With print
we get a text representation of the data frame.
The shape
member contains a tuple with number of rows and columns in the data frame:
df.shape
(3, 3)
Data Alignment#
Like for series data frames implement data alignment where possible. This applies to row indexing as well as to column indexing.
df1 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
index=['a', 'b', 'c'], columns=['A', 'B', 'C'])
df2 = pd.DataFrame([[11, 12, 13], [14, 15, 16], [17, 18, 19]],
index=['b', 'c', 'd'], columns=['A', 'C', 'D'])
display(df1)
display(df2)
display(df1 + df2)
A | B | C | |
---|---|---|---|
a | 1 | 2 | 3 |
b | 4 | 5 | 6 |
c | 7 | 8 | 9 |
A | C | D | |
---|---|---|---|
b | 11 | 12 | 13 |
c | 14 | 15 | 16 |
d | 17 | 18 | 19 |
A | B | C | D | |
---|---|---|---|---|
a | NaN | NaN | NaN | NaN |
b | 15.0 | NaN | 18.0 | NaN |
c | 21.0 | NaN | 24.0 | NaN |
d | NaN | NaN | NaN | NaN |
Underlying Data Structures#
The index object for row indexing is accessible via index
member and the index object for column indexing is accessible via columns
member.
Data frames also have a to_numpy
method returning a data frame’s data as NumPy array.
Indexing#
Indexing data frames is very similar to indexing series. The major difference is that for data frames we have to specify a row and a column instead of only a row.
Overview#
There exist four widely used mechanisms (here df
is some data frame):
df[...]
: Python style indexingdf.ix[...]
: old Pandas style indexing (removed from Pandas in January 2020)df.loc[...]
anddf.iloc[...]
: new Pandas style indexingdf.at[...]
anddf.iat[...]
: new Pandas style indexing for more efficient access to single items
Python style indexing and old Pandas style indexing (the ix
indexer) allow for position based indexing, label based indexing and mixed indexing (position for row, label for column, and vice versa).
Both [...]
and ix[...]
behave slightly differently. As already discussed for series, sometimes it is not clear whether positional or label based indexing shall be used. Thus, [...]
should be used with care. The ix
indexer has been removed from Pandas in January 2020, but may appear in old code and documents.
Label Based Column Indexing with [...]
#
Indexing with [...]
is mainly used for selecting columns by label. For other purposes new Pandas style indexing should be used.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
index=['a', 'b', 'c'], columns=['A', 'B', 'C'])
display(df)
s = df['A']
df_part = df[['B', 'C']]
display(s)
display(df_part)
A | B | C | |
---|---|---|---|
a | 1 | 2 | 3 |
b | 4 | 5 | 6 |
c | 7 | 8 | 9 |
a 1
b 4
c 7
Name: A, dtype: int64
B | C | |
---|---|---|
a | 2 | 3 |
b | 5 | 6 |
c | 8 | 9 |
If only one label is provided, then the result is a Series
object. If a list of labels is provided, then the result is a DataFrame
object. In case of integer labels, label based indexing is used when selecting columns:
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
index=['a', 'b', 'c'], columns=[2, 23, 45])
display(df)
display(df[2])
2 | 23 | 45 | |
---|---|---|---|
a | 1 | 2 | 3 |
b | 4 | 5 | 6 |
c | 7 | 8 | 9 |
a 1
b 4
c 7
Name: 2, dtype: int64
Hint
Label based column indexing can be used to create new columns: df['my_new_column'] = 0
creates a new column containing zeros, for instance.
Positional Indexing#
Positional indexing via iloc[...]
or iat[...]
works like for two-dimensional NumPy arrays.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
index=['a', 'b', 'c'], columns=['A', 'B', 'C'])
display(df)
display(df.iloc[1:3, 0:2]) # slicing
display(df.iloc[[1, 0], [2, 1, 0]]) # list of indices
display(df.iloc[[True, False, True], [False, True, True]]) # boolean indexing
display(df.iat[2, 1]) # efficient single element access
display(df.iloc[2, 1]) # less efficient single element access
A | B | C | |
---|---|---|---|
a | 1 | 2 | 3 |
b | 4 | 5 | 6 |
c | 7 | 8 | 9 |
A | B | |
---|---|---|
b | 4 | 5 |
c | 7 | 8 |
C | B | A | |
---|---|---|---|
b | 6 | 5 | 4 |
a | 3 | 2 | 1 |
B | C | |
---|---|---|
a | 2 | 3 |
c | 8 | 9 |
8
8
Variants (slicing, boolean and so on) may be mixed for rows and columns.
Label Based Indexing#
Label based indexing works like with dictionaries. But slicing is allowed.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
index=['a', 'b', 'c'], columns=['A', 'B', 'C'])
display(df)
display(df.loc['c':'b':-1, 'A':'B']) # slicing
display(df.loc[['c', 'a'], ['B', 'A']]) # list of labels
display(df.loc[[True, False, True], [False, True, True]]) # boolean indexing
display(df.at['b', 'B']) # efficient single element access
display(df.loc['b', 'B']) # less efficient single element access
A | B | C | |
---|---|---|---|
a | 1 | 2 | 3 |
b | 4 | 5 | 6 |
c | 7 | 8 | 9 |
A | B | |
---|---|---|
c | 7 | 8 |
b | 4 | 5 |
B | A | |
---|---|---|
c | 8 | 7 |
a | 2 | 1 |
B | C | |
---|---|---|
a | 2 | 3 |
c | 8 | 9 |
5
5
Important
Like for series label based slicing includes the stop label!
Indexing by Callables#
Both loc[...]
and iloc[...]
accept a function for row index and column index. The function has to take a data frame as argument and has to return something allowed for indexing (list of indices/labels, boolean array and so on).
Mixing Positional and Label Based Indexing#
Indexing with [...]
allows for mixing positional and label based indexing. But [...]
should be avoided as discussed for series. The only exception is column selection. With Pandas’ new indexing style mixed indexing is still possible but requires some more bytes of code:
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
index=['a', 'b', 'c'], columns=['A', 'B', 'C'])
display(df)
display(df.loc['a':'b', df.columns[0:2]])
A | B | C | |
---|---|---|---|
a | 1 | 2 | 3 |
b | 4 | 5 | 6 |
c | 7 | 8 | 9 |
A | B | |
---|---|---|
a | 1 | 2 |
b | 4 | 5 |
The idea is to use loc[...]
and the columns
member variable. With columns
we get access to the index object for column indexing. This index object allows for usual positional indexing techniques (slicing, boolean and so on). The result of indexing an index object is an index object again. But the new index object only contains the desired subset of indices. This smaller index object than is passed to loc[...]
. Same is possible for rows via index
member.
Some Useful Member Functions#
A full list of member functions for DataFrame
objects is provided in Pandas’ documentation. Here we only list a few.
A First Look at a Data Frame#
With describe
we get basic statistical information about each column holding numerical values. The function returns a DataFrame
object containing the collected information. Only columns with numerical data are considered by describe
.
First and last rows are returned by head
and tail
. Both take an optional argument specifying the number of rows to return. Default is 5.
The info
method prints memory usage and other useful information.
df = pd.DataFrame([[1, 2, 3, 'some'], [4, 5, 6, 'string'], [7, 8, 9, 'here']],
index=['a', 'b', 'c'], columns=['A', 'B', 'C', 'D'])
display(df)
display(df.describe())
display(df.head(2))
display(df.tail(2))
df.info()
A | B | C | D | |
---|---|---|---|---|
a | 1 | 2 | 3 | some |
b | 4 | 5 | 6 | string |
c | 7 | 8 | 9 | here |
A | B | C | |
---|---|---|---|
count | 3.0 | 3.0 | 3.0 |
mean | 4.0 | 5.0 | 6.0 |
std | 3.0 | 3.0 | 3.0 |
min | 1.0 | 2.0 | 3.0 |
25% | 2.5 | 3.5 | 4.5 |
50% | 4.0 | 5.0 | 6.0 |
75% | 5.5 | 6.5 | 7.5 |
max | 7.0 | 8.0 | 9.0 |
A | B | C | D | |
---|---|---|---|---|
a | 1 | 2 | 3 | some |
b | 4 | 5 | 6 | string |
A | B | C | D | |
---|---|---|---|---|
b | 4 | 5 | 6 | string |
c | 7 | 8 | 9 | here |
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, a to c
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 3 non-null int64
1 B 3 non-null int64
2 C 3 non-null int64
3 D 3 non-null object
dtypes: int64(3), object(1)
memory usage: 120.0+ bytes
Iterating Over a Data Frame#
To iterate over columns use items
, which returns tuples containing the column label and the column’s data as Series
object.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
index=['a', 'b', 'c'], columns=['A', 'B', 'C'])
display(df)
for label, s in df.items():
print(label)
print(s, '\n')
A | B | C | |
---|---|---|---|
a | 1 | 2 | 3 |
b | 4 | 5 | 6 |
c | 7 | 8 | 9 |
A
a 1
b 4
c 7
Name: A, dtype: int64
B
a 2
b 5
c 8
Name: B, dtype: int64
C
a 3
b 6
c 9
Name: C, dtype: int64
Alternativly, iterate over the columns
member variable:
for label in df.columns:
print(label)
print(df[label], '\n')
A
a 1
b 4
c 7
Name: A, dtype: int64
B
a 2
b 5
c 8
Name: B, dtype: int64
C
a 3
b 6
c 9
Name: C, dtype: int64
Iteration over rows can be implemented via iterrows
method. Analogously to items
it returns tuples containing the row label and a Series
object with the data.
for label, s in df.iterrows():
print(label)
print(s, '\n')
a
A 1
B 2
C 3
Name: a, dtype: int64
b
A 4
B 5
C 6
Name: b, dtype: int64
c
A 7
B 8
C 9
Name: c, dtype: int64
Hint
Data in each column of a data frame has identical type, but types in a row may differ (from column to column). Thus, calling iterrows
may involve type casting to get row data as Series
object.
Important
Usually there is no need to iterate over rows, because Pandas provides much faster vectorized code for almost all operations needed for typical data science projects.
Vectorized Operators#
Mathematical operations and comparisons work in complete analogy to Series
objects.
Removing and Adding Items#
Functions for removing and adding items:
Modifying Data in a Data Frame#
Data Frames from and to CSV Files#
The Pandas function read_csv
reads a CSV file and returns a data frame.
The DataFrame
method to_csv
writes data to a CSV file.
Missing Values#
Missing values are a common problem in data science. Pandas provides several functions and mechanisms for handling missing values. Important functions:
isna
(return boolean data frame withTrue
at missing value positions),notna
(return boolean data frame withFalse
at missing value positions),fillna
(fill all missing values, different fill methods are provided),dropna
(remove rows or columns containing missing values or consisting completely of missing values).
Details may be found in Pandas user guide.