Feature Reduction
Contents
Feature Reduction#
Feature reduction or dimensionality reduction aims at reducing the number of features in a data set without deterioration results of supervised learning methods. This saves resources (memory, CPU time), but also may help interpreting the data. We distinguish two classes of methods for feature reduction:
feature selection,
feature transform.
A straight forward idea is feature selection or removal. One tries to identify relevant features and removes features of little or no relevance. Relevance here means influnce on the accuracy of the trained model. If we remove a feature, but the model trained on the reduced data set yields almost identical predictions as before, then the feature is considered irrelevant.
Feature transform refers to more complex methods which map the whole feature space to a lower dimensional space. Here, resulting new features lack easy interpretability because each contains information from many different original features.
Here we consider two methods for feature selection and one feature transforms method.
Manual Feature Removal#
Exploratory data analysis gives a first impression of feature properties and relations between features. Observations made during this phase of doing data science should be used to reduce the number of features.
One observation frequenly made in raw data sets are features, that contain (almost) the same value in each sample. Such features can be removed without loss of information.
Another frequent observation are highly correlated features. Plotting one feature versus another we see whether there is a functional dependence between them or whether they are uncorrelated. In the first case the plot shows a line, in the second case we see a cloud of points.
In case of functional dependence one of the two considered features can be removed, because its values can be deduced from the remaining one. In other words, both features contain the same information, but in different representation.
There might also be features uncorrelated to the targets. If the plot of a feature against the targets shows a very homogeneous distribution of points all over the plotting area, then the feature is likely to have no influence on the targets. Thus, it can be removed. If in doubt, don’t remove the feature because eyes may trick you about equal distribution of points.
Permutation Feature Importance#
Given a trained model we ask for the importance of each feature. We could remove one feature, retrain the model and compare prediction quality to the original model. Doing this for all features we see which features can be removed without deteriorating results. The drawback of that approach is that we have to train as many models as we have features.
A similar idea, which does not require additional training, is calculating permutation feature importance. Given a data set we permutate the values of a fixed feature and look at corresponding predictions. The smaller the deviation from the original predictions the less important is the feature. So deviation in predictions after permutating feature values is a measure of feature importance.
Scikit-Learn supports permutation feature importance via permutation_importance
function in the inspection
module. We have to pass the trained model and a data set. The object returned by the function has a member importances
containing a NumPy array with the feature importances. By default the function calculates feature importances for several different permutations. Mean and standard deviation can be used to get a summary for all permuations. Both are accessible via the importances_mean
and importances_std
members, respectively.
Principal Component Analysis (PCA)#
Principal component analysis is a powerful technique for deriving new features from existing ones (feature transformation). Original features may suffer from two problems:
A feature might have low variance, that is, it contains few information. The extreme case is a feature with variance 0. Then all values of the feature are identical.
Features might be correlated, that is, they contain redundant information. The extreme case is a feature whoes values are a fixed multiple of another feature’s values.
Principal component analysis tackles both problems:
PCA creates features with high variance and removes low variance features.
Features created by PCA are uncorrelated.
We now go into the details of this important technique.
Prerequisites#
PCA only works for numerical features and all features should be centered. More sloppily we may say that PCA assumes that the data set is a point cloud centered at the origin.
Let \(x_1,\ldots,x_n\) be the feature vectors of the data set and assume we have \(m\) features. Writing the feature vectors as rows of a matrix we obtain a matrix \(X\) of size \(n\times m\). Denote the columns of \(X\) by \(\xi_1,\ldots,\xi_m\), that is, \(\xi_1\) contains all values of the first feature, \(\xi_2\) all values of the second feature, and so on; in formulas:
Let’s call \(\xi_1,\ldots,\xi_m\) value vectors.
Linear Feature Transforms#
Given numbers \(a^{(1)},\ldots,a^{(m)}\) the linear combination
of the value vectors \(\xi_1,\ldots,\xi_m\) can be considered the value vector of a new feature. Taking more such linear combinations we derive more and more features from the original ones.
Say we create \(\tilde{m}\) new features with value vectors \(\tilde{\xi}_1,\ldots,\tilde{\xi}_{\tilde{m}}\), then the corresponding matrix \(\tilde{X}\) of size \(n\times\tilde{m}\) may be written as
where each column of the transform matrix \(A\) contains the coefficients of the linear combination yielding the corresponding new feature. The first column of \(A\) contains the coefficients for \(\tilde{\xi}_1\), for instance. Dimensions of \(A\) are \(m\times\tilde{m}\). This can be seen by carrying out usual matrix multiplication: the first column of \(X\,A\) is \(X\) times the first column of \(A\) and so on.
If the transform matrix \(A\) is invertible, we may write
That is, we may switch back and forth between both feature sets as we like. Obviously, invertibility of \(A\) requires at least \(m=\tilde{m}\).
If the transform matrix \(A\) is invertible, then both feature sets contain identical information, but in different representation. There exist several approaches for contructing transform matrices. PCA is only one. Like PCA, most approaches aim at removing features carrying only few information. Thus, \(A\) will not be invertible.
Variance Maximization#
By assumption the original value vectors \(\xi_1,\ldots,\xi_m\) are centered (mean 0). One easily verifies, that derived value vectors then are centered (mean 0), too.
The idea of PCA is to find coefficient vectors \(a\) such that the corresponding new value vector \(X\,a\) has maximum variance with respect to all vectors \(a\) of length 1. Considering other lengths, too, would add no value. Looking at length 2, for instance, would yield a factor 4 in all variances (factor 2 for standard deviation), but the maximizing vector \(a\) would have the same direction as for length 1.
Since \(X\,a\) is centered, its variance is simply the sum of the squares of its components (up to constant factor); in other words: its squared length. Taking the square root does not change the maximizer. Thus, maximizing variance boils down to solving
Some deeper mathematics reveals that maximum variance is given by the largest eigenvalue of the matrix \(X^{\mathrm{T}}\,X\) and that the maximizer \(a^\ast\) is a corresponding eigenvector. The square matrix \(X^{\mathrm{T}}\,X\) is known as covariance matrix in statistics.
The optimal coefficient vector \(a^\ast\) yields a derived feature which contains a maximum of information compared to all other derivable features. Such a feature is called a principal component of the original features.
Geometric Interpretation#
Although we derived the PCA procedure from statistical considerations PCA has a nice geometric interpretation.
A value vector \(\xi=X\,a\) of a derived feature contains the inner products of all samples with the coefficient vector \(a\). The coefficient vector \(a\) ‘lives’ in the same space as the samples: \(\mathbb{R}^m\). Thus, maximizing \(X\, a\) can be regarded as looking for a direction (1d subspace) such that the orthogonal projections of all samples onto that subspace have maximum range.
To get the second principal component we project all samples onto the \(m-1\) dimensional subspace orthogonal to the first principal component and then look for the projection range maximizing direction inside this subspace.
Following this procedure until the space is exhausted we obtain a set of orthogonal unit vectors in the original feature space. In other words, the \(a^\ast_1,a^\ast_2,\ldots\) define a new coordinate system. The transformed samples \(\tilde{x}_1,\ldots,\tilde{x}_n\) in \(\tilde{X}=X\,A\) simply are the original samples but expressed with respect to the new coordinate system.
Removing Low Variance Features#
PCA yields a list of uncorrelated centered features and their variances. Thus, me may remove low variance features from the list to reduce resource consumption when applying machine learning algorithms.
The remaining coefficient vectors in the list make up the transform matrix \(A\) introduced above.
Relation to Singular Value Decomposition (SVD)#
PCA looks for eigenvalues and eigenvectors of \(X^\mathrm{T}\,X\). This matrix is square, symmetric, and positive semidefinite. Thus, all eigenvalues are nonnegative real numbers.
The matrix \(X\) itself lacks all those nice properties. Nonetheless, there is a transform very similar to PCA which applies directly to \(X\), the singular value decomposition. SVD yields two orthonormal systems \(\{u_1,\ldots,u_n\}\) and \(\{v_1,\ldots,v_m\}\) and a list of nonnegative real numbers \(s_1,\ldots,s_m\) such that
Here \(S\) is the ‘diagonal’ matrix of size \(n\times m\) with \(s_1,\ldots,s_m\) on its diagonal, \(U\) and \(V\) are the matrices having \(u_1,\ldots,u_n\) and \(v_1,\ldots,v_m\) as columns, respectively.
We have the following relations to PCA:
\(v_\kappa=a^\ast_\kappa\) (normalized eigenvectors of \(X^{\mathrm{T}}\,X\)),
\(u_\kappa=\frac{X\,a^\ast_\kappa}{|X\,a^\ast_\kappa|}\) (normalized principal components or eigenvectors of \(X\,X^{\mathrm{T}}\)),
\(s_\kappa=|X\,a^\ast_\kappa|\) (length of principal components or square roots of eigenvalues of \(X^{\mathrm{T}}\,X\) and \(X\,X^{\mathrm{T}}\)).
Note that \(U\) and \(V\) are orthonormal matrices, implying \(U^{-1}=U^{\mathrm{T}}\) and \(V^{-1}=V^{\mathrm{T}}\). Thus,
Some of the \(s_1,\ldots,s_m\) may be zero, indicating that the number of features can be reduced without loss of information. If some of the \(s_1,\ldots,s_m\) are very small, then setting them to zero reduces the number of features while neglecting only very few information (truncated singular value decomposition or TSVD).
PCA with Scikit-Learn#
Scikit-Learn implements a PCA class in its decomposition
module.
The constructor takes the number n_components
of principal components to keep (None
for all).
The fit
method computes the principle components and transform
yields transformed data.
After fitting the PCA object offers principal components and corresponding variances as member variables components_
and explained_variance_
.
Scikit-Learn’s PCA automatically centers the input data and stores each component’s mean in the PCA
object’s mean_
member.
We start an illustrating example with all necessary imports and a function for plotting a point cloud.
import numpy as np
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import sklearn.preprocessing as preprocessing
import sklearn.decomposition as decomposition
from numpy.random import default_rng
rng = default_rng(0)
def plot_data(X, xlabel, ylabel, zlabel):
''' Scatter plot of data with properly configured axes. '''
fig = go.Figure()
fig.layout.width = 800
fig.layout.height = 600
fig.add_trace(go.Scatter3d(
x=X[:, 0], y=X[:, 1], z=X[:, 2],
mode='markers',
marker={'color': '#0000ff', 'size': 1}
))
data_min = X.min()
data_max = X.max()
fig.update_scenes(
xaxis_title_text=xlabel,
yaxis_title_text=ylabel,
zaxis_title_text=zlabel,
xaxis_range=[data_min, data_max],
yaxis_range=[data_min, data_max],
zaxis_range=[data_min, data_max]
)
fig.update_layout(scene_aspectmode='cube', showlegend=False)
return fig
We use simulated data with 3 features. For simple simulation we generate data following a multivariate normal distribution.
n = 500 # number of data points to generate
# generate data
X = rng.multivariate_normal([1, 2, 3], [[1, 0.2, 0.3], [0.2, 2, -0.1], [0.3, -0.1, 0.2]], n)
# plot data
fig = plot_data(X, 'feature 1', 'feature 2', 'feature 3')
fig.show()
Now we do a full PCA and plot the principal components. That is, we keep all principle components. In this case we do not have to pass an argument to the PCA
constructor. Note that we multiply the length of the principle components by 3 for better visibility.
pca = decomposition.PCA()
pca.fit(X)
# plot
fig = plot_data(X, 'feature 1', 'feature 2', 'feature 3')
# plot principal components
vec = np.stack((pca.mean_, np.empty(3)), axis=1)
for i in range(0, 3):
vec[:, 1] = vec[:, 0] + 3 * np.sqrt(pca.explained_variance_[i]) * pca.components_[i, :]
fig.add_trace(go.Scatter3d(
x=vec[0, :], y=vec[1, :], z=vec[2, :],
mode='lines',
line={'color': '#ff0000', 'width': 5}
))
fig.show()
It remains to transform the data according to the principal components.
X_transformed = pca.transform(X)
# plot
fig = plot_data(X_transformed, 'PCA feature 1', 'PCA feature 2', 'PCA feature 3')
# plot principal components
vec = np.stack((np.zeros(3), np.empty(3)), axis=1)
for i in range(0, 3):
vec[:, 1] = 0
vec[i, 1] = 3 * np.sqrt(pca.explained_variance_[i])
fig.add_trace(go.Scatter3d(
x=vec[0, :], y=vec[1, :], z=vec[2, :],
mode='lines',
line={'color': '#ff0000', 'width': 5}
))
fig.show()
If we want to reduce the number of features we may omit the last component (the one with smallest variance), resulting in a two-dimensional data set. To make PCA.transform
do this for us we have to restart PCA with n_components=2
.
pca = decomposition.PCA(n_components=2)
pca.fit(X)
X_transformed = pca.transform(X)
# plot transformed data
fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(X_transformed[:, 0], X_transformed[:, 1], c='#0000ff', s=3)
ax.set_xlabel('PCA feature 1')
ax.set_ylabel('PCA feature 2')
ax.set_aspect('equal')
plt.show()
Nonlinear Feature Reduction Methods#
PCA is a linear method, that is, we apply a linear transform to the data. There exist many nonlinear methods for feature reduction. Such nonlinear methods are considered a separate branch of unsupervised machine learning and will be considered later on.