Kernel PCA#

Principal component analysis (PCA) constructs a new coordinate system in the data space such that variance is maximal along the first axis of this new coordinate system. Variance along the other axes decreases with increasing axis index. The coordinates of the data set with respect to the new coordinate system have zero covariance.

With such a data adapted coordinate system at hand, data dimensionality can be reduced by dropping axes of low variance or, in other words, by projecting onto the subspace spanned by the first few axes.

While classical PCA is a linear technique (data transformation by matrix multiplication), kernel PCA extends the basic idea to nonlinear transforms by exploiting the kernel trick, which we already met when discussing support vector machines. In theory, the data is (nonlinearly) embedded into a higher dimensional space and PCA together with the corresponding dimensionality reduction is performed there. In practice, however, all computations are carried out in the original space by replacing inner products in the high dimensional space by special kernel functions.

In the following we always count eigenvalues with their multiplicity. So each real symmetric matrix has as many eigenvalues as it has rows and columns.

Recap of PCA#

Denote by \(X\in\mathbb{R}^{n\times m}\) the matrix containing all samples \(x_1,\ldots,x_n\) as rows. Here \(m\) is the dimension of the data space. We assume that data is centered, that is,

\[\begin{equation*} \sum_{l=1}^n x_l=0. \end{equation*}\]

The (up to a factor \(\frac{1}{n-1}\)) covariance matrix \(X^{\mathrm{T}}\,X\in\mathbb{R}^{m\times m}\) is symmetric and positive semi-definite. Thus, there exist \(m\) nonnegative numbers

\[\begin{equation*} \lambda_1\geq\lambda_2\geq\cdots\geq\lambda_m\geq 0\qquad\text{(eigenvalues)} \end{equation*}\]

and \(m\) mutually orthogonal normalized vectors \(a_1,\ldots,a_m\) (eigenvectors) such that

\[\begin{equation*} X^{\mathrm{T}}\,X\,a_k=\lambda_k\,a_k\qquad\text{for $k=1,\ldots,m$}. \end{equation*}\]

The vectors \(a_k\) form the coordinate axes of the PCA coordinate system and the values \(\lambda_k\) are the variances (up to a factor \(\frac{1}{n-1}\)) of the data set along those axes. Denoting by \(A\in\mathbb{R}^{m\times m}\) the orthogonal matrix of eigenvectors (as columns) the coordinate matrix \(\tilde{X}\) with respect to the new coordinate system is given by

\[\begin{equation*} \tilde{X}=X\,A. \end{equation*}\]

Equivalently, for each sample \(x_l\) and its transformed variant \(\tilde{x}_l\) we have the relations

\[\begin{equation*} \tilde{x}_l=\begin{bmatrix}{x_l}^{\mathrm{T}}\,a_1\\\vdots\\{x_l}^{\mathrm{T}}\,a_m\end{bmatrix}\qquad\text{and}\qquad x_l=\tilde{x}_l^{(1)}\,a_1+\cdots+\tilde{x}_l^{(m)}\,a_m. \end{equation*}\]

Dimensionality reduction now boils down to replacing vectors \(a_k\) corresponding to small \(\lambda_k\) by zero vectors. The transform from \(x_l\) to \(\tilde{x}_l\) and back to \(x_l\) then does not yield \(x_l\) again, but its projection onto the subspace spanned by the remaining \(a_k\).
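
This recap translates directly into a few lines of NumPy. The following minimal sketch works on hypothetical random data: it centers the samples, computes the eigendecomposition of \(X^{\mathrm{T}}\,X\), forms the transformed coordinates \(\tilde{X}=X\,A\), and keeps only the two axes of largest variance.

import numpy as np

rng = np.random.default_rng(0)

# hypothetical data: 200 samples in 3 dimensions with very different spread per direction
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.1])
X = X - X.mean(axis=0)              # center the data

# eigendecomposition of X^T X (eigh returns eigenvalues in ascending order)
lams, A = np.linalg.eigh(X.T @ X)
lams, A = lams[::-1], A[:, ::-1]    # sort eigenpairs by decreasing eigenvalue

X_tilde = X @ A                     # coordinates w.r.t. the PCA coordinate system
X_reduced = X_tilde[:, :2]          # keep the two axes of largest variance

print(lams / (X.shape[0] - 1))      # variances along the new axes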

PCA with Kernel Trick#

Kernel PCA first appeared in the 1996 report Nonlinear Component Analysis as a Kernel Eigenvalue Problem. Introducing a (nonlinear) mapping \(\Phi:\mathbb{R}^m\rightarrow\mathbb{R}^q\) from the \(m\) dimensional data space into a much higher dimensional embedding space of dimension \(q\) with \(q\geq m\), the authors propose to perform PCA based dimensionality reduction for \(\{\Phi(x_1),\ldots,\Phi(x_n)\}\), the embedded data set. They showed that the PCA transformed data (in the \(q\) dimensional space) can be computed without touching \(q\) dimensional vectors and without expensive computations in \(q\) dimensions.

The Kernel PCA Theorem#

Given the embedding function \(\Phi\) as above we define the kernel matrix \(K\in\mathbb{R}^{n\times n}\) by

\[\begin{equation*} K_{l,\lambda}:=\Phi(x_l)^{\mathrm{T}}\,\Phi(x_\lambda). \end{equation*}\]

By \(U\in\mathbb{R}^{n\times q}\) we denote the matrix containing \(\Phi(x_1),\ldots,\Phi(x_n)\) as rows (by analogy with \(X\) for usual PCA in data space). Then one can show (and we will do this below) that

  • the eigenvalues of \(U^\mathrm{T}\,U\) and \(K\) coincide,

  • the eigenvectors of \(K\) contain the coordinates of \(\Phi(x_1),\ldots,\Phi(x_n)\) with respect to the PCA coordinate system in the \(q\) dimensional space.

In other words, we only have to compute eigenvalues and eigenvectors of the kernel matrix \(K\) when performing PCA for the embedded data. The size of \(K\) depends on the number of samples, which might be large but is often much smaller than \(q\). Remember: for a polynomial kernel of degree 2 and 1000 dimensional data (images!) we would have \(q=501501\) (see the chapter on SVMs).

The first statement above requires some clarifying remarks.

  • The (up to a factor) covariance matrix \(U^\mathrm{T}\,U\) is of size \(q\times q\). Thus, PCA yields \(q\) eigenvalues and \(q\) eigenvectors. But the embedded data lives in a subspace of dimension at most \(n\). Thus, there are at most \(n\) nonzero eigenvalues for \(U^\mathrm{T}\,U\).

  • Writing \(K=U\,U^\mathrm{T}\) we see that \(K\) is a symmetric matrix. Thus, the number of nonzero eigenvalues equals the rank of \(K\) (standard math result). Because \(U\,U^\mathrm{T}\) and \(U^\mathrm{T}\,U\) have identical rank (another standard math result), \(K\) and \(U^\mathrm{T}\,U\) have the same number of nonzero eigenvalues.

  • The maximum number of nonzero eigenvalues of \(K\) and \(U^\mathrm{T}\,U\), and thus also the maximum number of nontrivial features after dimensionality reduction via kernel PCA, is \(\min\{n,q\}\).

The second statement above on the eigenvectors of \(K\) requires some clarification, too. In addition, we should formulate it more precisely.

  • The new data adapted coordinate system in \(q\) dimensions has \(q\) axes.

  • The embedded data \(\{\Phi(x_1),\ldots,\Phi(x_n)\}\) uses no more than the first \(\min\{n,q\}\) axes. Coordinates for the remaining axes are zero.

  • In the context of dimensionality reduction we are interested in the coordinates with respect to the first \(p\) axes only. Here \(p\) is small, in particular \(p\leq\min\{n,q\}\).

  • If \(B\in\mathbb{R}^{q\times q}\) contains the eigenvectors of \(U^\mathrm{T}\,U\) (in columns), then by analogy to usual PCA the coordinates with respect to the new coordinate system are given by

\[\begin{equation*} \tilde{U}:=U\,B. \end{equation*}\]
  • From the previous two items we see that the desired result from dimensionality reduction via kernel PCA is contained in the first \(p\) columns of \(\tilde{U}\).

  • The kernel PCA theorem states that the columns of \(\tilde{U}\) are the eigenvectors of \(K\).
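
Both statements can be checked numerically with a small, explicitly given embedding. The following minimal sketch uses a hypothetical map \(\Phi\) consisting of all degree 2 monomials; all names are illustrative.

import numpy as np

rng = np.random.default_rng(1)
n, m = 50, 3
X = rng.normal(size=(n, m))

# hypothetical embedding: all degree 2 monomials, hence q = 6 for m = 3
def phi(x):
    return np.array([x[i] * x[j] for i in range(m) for j in range(i, m)])

U = np.stack([phi(x) for x in X])    # embedded data as rows, shape (n, q)
U = U - U.mean(axis=0)               # center the embedded data (see next subsection)
K = U @ U.T                          # kernel matrix

# the nonzero eigenvalues of U^T U and K coincide
eig_UtU = np.sort(np.linalg.eigvalsh(U.T @ U))[::-1]
eig_K = np.sort(np.linalg.eigvalsh(K))[::-1]
print(np.allclose(eig_UtU, eig_K[:len(eig_UtU)]))    # True

# each column of U B satisfies the eigenvector equation of K
mu, B = np.linalg.eigh(U.T @ U)
U_tilde = U @ B
print(np.allclose(K @ U_tilde, U_tilde * mu))        # True

Since eigenvectors are only determined up to scaling, the second check verifies the eigenvector equation for the columns of \(\tilde{U}=U\,B\) instead of comparing them to unit-norm eigenvectors of \(K\).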

Before we come to the proof, we have to think about centering the embedded data. Remember: PCA only makes sense for centered data.

Centering Embedded Data#

Before applying PCA we have to center the data. Given the embedding function \(\Phi\) we thus have to apply PCA to the data

\[\begin{equation*} \Phi(x_1)-\frac{1}{n}\,\sum_{l=1}^n\Phi(x_l)\quad,\quad\ldots\quad,\quad\Phi(x_n)-\frac{1}{n}\,\sum_{l=1}^n\Phi(x_l). \end{equation*}\]

The new embedding function

\[\begin{equation*} \Phi_\mathrm{center}(x):=\Phi(x)-\frac{1}{n}\,\sum_{l=1}^n\Phi(x_l) \end{equation*}\]

automatically yields centered embedded data.

We may rewrite this as

\[\begin{equation*} \Phi_\mathrm{center}(x):=\Phi(x)-U^\mathrm{T}\,\begin{bmatrix}\frac{1}{n}\\\vdots\\\frac{1}{n}\end{bmatrix} \end{equation*}\]

and introducing the matrix

\[\begin{equation*} M:=\begin{bmatrix}\frac{1}{n}&\cdots&\frac{1}{n}\\\vdots&&\vdots\\\frac{1}{n}&\cdots&\frac{1}{n}\end{bmatrix}\in\mathbb{R}^{n\times n} \end{equation*}\]

we obtain

\[\begin{equation*} {U_\mathrm{center}}^\mathrm{T}=U^\mathrm{T}-U^\mathrm{T}\,M=U^\mathrm{T}\,(I-M). \end{equation*}\]

The kernel corresponding to \(\Phi_\mathrm{center}\) then is

\[\begin{equation*} K_\mathrm{center}=U_\mathrm{center}\,{U_\mathrm{center}}^\mathrm{T} =(I-M)\,U\,U^\mathrm{T}\,(I-M)=(I-M)\,K\,(I-M). \end{equation*}\]

Centering the embedded data thus reduces to a simple transform of the kernel. If the kernel \(K\) is positive semi-definite (which is a standard assumption), then the new kernel \(K_\mathrm{center}\) is positive semi-definite, too. So centering the embedded data set does not cause any trouble; the kernel trick still works.

Passing remark: The matrix \(I-M\) is known as the centering matrix. Multiplying a matrix \(K\) from the right by \(I-M\) centers its columns (the sum of all columns is the zero vector). Multiplication from the left centers its rows (the sum of all rows is the zero vector). Applying both multiplications yields the double centered version of \(K\). The mean over all entries of a double centered matrix is zero, too.
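
A minimal NumPy sketch with hypothetical embedded data confirms that double centering the kernel matrix is the same as centering the embedded data before forming the kernel:

import numpy as np

rng = np.random.default_rng(2)
n, q = 20, 5
U = rng.normal(size=(n, q))              # hypothetical embedded data, rows Phi(x_l)

# centering in the embedding space ...
U_center = U - U.mean(axis=0)
K_from_centered = U_center @ U_center.T

# ... equals double centering of the kernel matrix
K = U @ U.T
M = np.full((n, n), 1 / n)
K_center = (np.eye(n) - M) @ K @ (np.eye(n) - M)

print(np.allclose(K_from_centered, K_center))    # True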

Proof of Kernel PCA Theorem#

We assume \(n\leq q\) (which is the typical situation in practice) to avoid repeated use of \(\min\{n,q\}\) in the formulas. But the proof also works for \(n>q\) if most occurrences of \(n\) are replaced by \(\min\{n,q\}\). Further we assume that \(K\) has rank \(n\), that is, all eigenvalues are nonzero. So we may write \(n\) instead of \(\mathrm{rank}\,K\) to simplify formulas. All arguments also work for rank below \(n\) if considerations are restricted to nonzero eigenvalues of \(K\).

Denoting the nonzero eigenvalues of the kernel matrix \(K\) by \(\mu_1,\ldots,\mu_n\) and corresponding eigenvectors by \(\omega_1,\ldots,\omega_n\in\mathbb{R}^n\) we define vectors \(b_1,\ldots,b_n\in\mathbb{R}^q\) by

\[\begin{equation*} b_i:=\frac{1}{\mu_i}\,\sum_{l=1}^n\omega_i^{(l)}\,\Phi(x_l),\qquad i=1,\ldots,n. \end{equation*}\]

To prove the theorem it suffices to show that

  • \(b_1,\ldots,b_n\) are eigenvectors of \(U^\mathrm{T}\,U\) and \(\mu_1,\ldots,\mu_n\) are corresponding eigenvalues,

  • \(\omega_1,\ldots,\omega_n\) contain the transformed coordinates, more precisely,

\[\begin{equation*} \omega_i=U\,b_i,\qquad i=1,\ldots,n. \end{equation*}\]

For the first statement we have to show \(U^\mathrm{T}\,U\,b_i=\mu_i\,b_i\). This follows from the definition of \(b_i\), some simple manipulations, and the fact that \(\omega_i\) is an eigenvector of \(K\) corresponding to the eigenvalue \(\mu_i\):

\[\begin{align*} U^\mathrm{T}\,U\,b_i &=\frac{1}{\mu_i}\,\sum_{l=1}^n\omega_i^{(l)}\,U^\mathrm{T}\,U\,\Phi(x_l) =\frac{1}{\mu_i}\,\sum_{l=1}^n\omega_i^{(l)}\,U^\mathrm{T}\,\begin{bmatrix}\Phi(x_1)^\mathrm{T}\,\Phi(x_l)\\\vdots\\\Phi(x_n)^\mathrm{T}\,\Phi(x_l)\end{bmatrix}\\ &=\frac{1}{\mu_i}\,\sum_{l=1}^n\omega_i^{(l)}\,\sum_{\lambda=1}^n\underbrace{\Phi(x_\lambda)^\mathrm{T}\,\Phi(x_l)}_{\in\mathbb{R}}\,\Phi(x_\lambda)\\ &=\frac{1}{\mu_i}\,\sum_{l=1}^n\sum_{\lambda=1}^n\omega_i^{(l)}\,\Phi(x_\lambda)^\mathrm{T}\,\Phi(x_l)\,\Phi(x_\lambda)\\ &=\frac{1}{\mu_i}\,\sum_{\lambda=1}^n\left(\sum_{l=1}^n\omega_i^{(l)}\,\Phi(x_\lambda)^\mathrm{T}\,\Phi(x_l)\right)\,\Phi(x_\lambda) =\frac{1}{\mu_i}\,\sum_{\lambda=1}^n\left(\sum_{l=1}^n K_{\lambda,l}\,\omega_i^{(l)}\right)\,\Phi(x_\lambda)\\ &=\frac{1}{\mu_i}\,\sum_{\lambda=1}^n[K\,\omega_i]_\lambda\,\Phi(x_\lambda)\\ &=\frac{1}{\mu_i}\,\sum_{\lambda=1}^n\mu_i\,\omega_i^{(\lambda)}\,\Phi(x_\lambda) =\mu_i\,b_i. \end{align*}\]

The second statement is a direct consequence of the definition of \(b_i\):

\[\begin{equation*} U\,b_i=\frac{1}{\mu_i}\,\sum_{l=1}^n\omega_i^{(l)}\,U\,\Phi(x_l) =\frac{1}{\mu_i}\,\sum_{l=1}^n\omega_i^{(l)}\,K_{\bullet,l} =\frac{1}{\mu_i}\,K\,\omega_i =\omega_i. \end{equation*}\]
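
The construction used in the proof can be checked numerically, too. The following sketch, again with hypothetical embedded data and illustrative names, builds the vectors \(b_i\) from the eigenpairs of \(K\) and verifies both statements:

import numpy as np

rng = np.random.default_rng(3)
n, q = 8, 20                    # n <= q as assumed above
U = rng.normal(size=(n, q))     # hypothetical embedded data, rows Phi(x_l)
K = U @ U.T

mu, W = np.linalg.eigh(K)       # eigenpairs of K, columns of W are the omega_i

# b_i = (1/mu_i) sum_l omega_i^(l) Phi(x_l), that is, B = U^T W scaled column-wise
B = U.T @ W / mu

print(np.allclose(U.T @ U @ B, B * mu))    # b_i are eigenvectors of U^T U
print(np.allclose(U @ B, W))               # omega_i = U b_i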

Kernel PCA with Scikit-Learn#

Scikit-Learn provides the KernelPCA class in its decomposition module.

import numpy as np
import sklearn.decomposition as decomposition
import plotly.graph_objects as go
from plotly.subplots import make_subplots

data_files = ['omega.npz', 'sphere.npz', 'cube.npz', 'clouds.npz']

for file in data_files:

    # load 3d point coordinates and per-point colors
    loaded = np.load(file)
    x = loaded['x']
    y = loaded['y']
    z = loaded['z']
    red = loaded['red']
    green = loaded['green']
    blue = loaded['blue']

    # kernel PCA with RBF kernel, reducing from 3 to 2 dimensions
    kpca = decomposition.KernelPCA(2, kernel='rbf', gamma=5)
    U = kpca.fit_transform(np.stack((x, y, z), axis=1))

    # left: original data in 3d, right: kernel PCA projection to 2d
    fig = make_subplots(rows=1, cols=2, specs=[[{'type': 'scatter3d'}, {'type': 'xy'}]])
    fig.update_layout(width=1000, height=600, scene_aspectmode='cube')
    fig.add_trace(go.Scatter3d(
        x=x, y=y, z=z,
        mode='markers',
        marker={'size': 1.5, 'color': [f'rgb({r},{g},{b})' for r, g, b in zip(red, green, blue)]},
        hoverinfo='none',
        showlegend=False
    ), row=1, col=1)
    fig.add_trace(go.Scatter(
        x=U[:, 0], y=U[:, 1],
        mode='markers',
        marker={'size': 5, 'color': [f'rgb({r},{g},{b})' for r, g, b in zip(red, green, blue)]},
        hoverinfo='none',
        showlegend=False
    ), row=1, col=2)
    fig.show()

Why Kernel PCA?#

Kernel PCA only yields useful results if the embedding map \(\Phi\) somehow unrolls the nonlinear structure of the data set such that in higher dimensions the data set is almost linear. Kernel PCA is rarely used in practice, because one does not know how to choose the kernel. But we will see in subsequent chapters that other nonlinear dimensionality reduction techniques turn out to be equivalent to kernel PCA with a specific data set adapted kernel.