Multidimensional Scaling#

We already considered techniques for dimensionality reduction in the context of feature reduction. There the aim was to construct more expressive features and to remove unnecessary ones. Now the focus is on visualization, that is, reduction of data sets to 2 or 3 dimensions. This way we may get a better understanding of a data set and we may also check results obtained from clustering methods.

We already met two techniques for dimensionality reduction:

  • (Kernel) principal component analysis (PCA), which simply projects a data set orthogonally onto a lower dimensional linear manifold,

  • Autoencoders, which try to find two mappings between the high and a low dimensional space such that their composition reproduces the data set in the high dimensional space as well as possible.

PCA is a linear dimensionality reduction technique, because it performs a linear transform (multiplication by a matrix). Kernel PCA and autoencoders are nonlinear dimensionality reduction techniques, because the mapping from high to low dimensions cannot be expressed by matrix multiplication.

Multidimensional scaling (MDS) refers to a third fundamental concept for dimensionality reduction. Here we try to find a set of points in low dimensions such that pairwise distances are as close as possible to pairwise distances in the high dimensional space. MDS comes in several variants differing in the choice of the distance measure and also in the weighting of pairwise distances.


Abstract Mathematical Formulation#

To make things precise, denote by \((x_1,\ldots,x_n)\) the data set in the high dimensional space \(\mathbb{R}^m\) and by \((u_1,\ldots,u_n)\) a set of points in a lower dimensional space \(\mathbb{R}^p\). Note that both sets are of equal size \(n\). Further let \(d_m\) and \(d_p\) be similarity measures in high and low dimensions, respectively. By \(w_{l,\lambda}\in(0,\infty)\), \(l,\lambda=1,\ldots,n\) we denote some weights. Then MDS aims at solving

\[\begin{equation*} \sum_{l=1}^n\sum_{\lambda=1}^n w_{l,\lambda}\,\bigl(d_m(x_l,x_\lambda)-d_p(u_l,u_\lambda)\bigr)^2\to\min_{u_1,\ldots,u_n}. \end{equation*}\]

Squaring is somewhat arbitrary here. We could apply any positive and increasing function to the difference of pairwise distances. But in each concrete MDS variant the square is the most fortunate choice, because it ensures differentiability and simplifies computation of gradients.

MDS solely relies on similarities \(d_m(x_l,x_\lambda)\) between samples and does not touch samples directly. This observation allows for applications with arbitrary types of data (text data, for instance) as long as there is some notion of similarity. In addition, the similarity measure in high dimensions is not required to be some precise mathematical construct. Human scoring is possible, too, for instance.
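
To make the objective tangible, here is a minimal NumPy sketch (all names are chosen for illustration) evaluating the weighted stress for given high dimensional samples, candidate low dimensional points and weights, using squared Euclidean distances for both \(d_m\) and \(d_p\):

import numpy as np

def weighted_stress(X, U, W):
    # squared Euclidean distance matrices in high and low dimensions
    D_high = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    D_low = ((U[:, None, :] - U[None, :, :]) ** 2).sum(axis=2)
    # weighted sum of squared differences of pairwise distances
    return (W * (D_high - D_low) ** 2).sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))    # data set in high dimensions (m = 10)
U = rng.normal(size=(100, 2))     # candidate points in low dimensions (p = 2)
W = np.ones((100, 100))           # unweighted variant
print(weighted_stress(X, U, W))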

Metric MDS and Sammon’s Method#

In metric MDS the low dimensional similarity measure is the squared Euclidean distance

\[\begin{equation*} d_p(u_l,u_\lambda)=|u_l-u_\lambda|^2. \end{equation*}\]

So metric MDS tries to preserve pairwise Euclidean distances (if distances in high dimensions are Euclidean, too).

The minimization problem has to be solved numerically by gradient descent or Newton-type methods (iterative methods using second order derivatives). Results may be inaccurate due to local minima or saddle points. The starting guess may be determined by PCA or chosen at random.

Without weights (all weights set to 1) large distances will dominate the objective, because a relative error of, say, 10 per cent in distance fitting contributes more to the objective for large distances than for small ones. Thus, without weights the local structure of the data set is not well reconstructed in low dimensions.

Usually one is more interested in the local structure than in the global one. For instance, the shape of a cluster or the boundary region between closely spaced clusters are more interesting than the correct distance between clusters far apart from each other.

Weighting by inverse distances solves this issue and puts emphasis on local structures. Typically, weights are

\[\begin{equation*} w_{l,\lambda}=\frac{1}{d_m(x_l,x_\lambda)}. \end{equation*}\]

This weighted variant of metric MDS is known as Sammon’s method.
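
Scikit-Learn's MDS class implements metric MDS, but not Sammon's weighting. The following sketch minimizes the weighted objective from above with SciPy's general purpose optimizer; the function name sammon and all details are our own choices, not a library API, and gradients are approximated numerically, so this is only feasible for small data sets.

import numpy as np
import scipy.optimize as opt

def sammon(X, p=2):
    # squared Euclidean distances in high dimensions and Sammon weights 1/d
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.zeros_like(D)
    W[D > 0] = 1 / D[D > 0]    # diagonal entries get weight 0 (their error is 0 anyway)

    def stress(u_flat):
        U = u_flat.reshape(-1, p)
        d = ((U[:, None, :] - U[None, :, :]) ** 2).sum(axis=2)
        return (W * (D - d) ** 2).sum()

    # random starting guess (PCA would be an alternative)
    u0 = np.random.default_rng(0).normal(size=(X.shape[0], p))
    result = opt.minimize(stress, u0.ravel(), method='L-BFGS-B')
    return result.x.reshape(-1, p)

U = sammon(np.random.default_rng(1).normal(size=(30, 5)))

Scikit-Learn's metric MDS (without Sammon weights) is demonstrated below on several test data sets: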

import numpy as np
import sklearn.manifold as manifold
import plotly.graph_objects as go
from plotly.subplots import make_subplots
data_files = ['omega.npz', 'sphere.npz', 'cube.npz', 'clouds.npz']

for file in data_files:
    
    loaded = np.load(file)
    x = loaded['x']
    y = loaded['y']
    z = loaded['z']
    red = loaded['red']
    green = loaded['green']
    blue = loaded['blue']

    mds = manifold.MDS(2, normalized_stress='auto')
    U = mds.fit_transform(np.stack((x, y, z), axis=1))
    
    fig = make_subplots(rows=1, cols=2, specs=[[{'type': 'scatter3d'}, {'type': 'xy'}]])
    fig.update_layout(width=1000, height=600, scene_aspectmode='cube')
    fig.add_trace(go.Scatter3d(
        x=x, y=y, z=z,
        mode='markers',
        marker={'size': 1.5, 'color': [f'rgb({r},{g},{b})' for r, g, b in zip(red, green, blue)]},
        hoverinfo = 'none',
        showlegend=False
    ), row=1, col=1)
    fig.add_trace(go.Scatter(
        x=U[:, 0], y=U[:, 1],
        mode='markers',
        marker={'size': 5, 'color': [f'rgb({r},{g},{b})' for r, g, b in zip(red, green, blue)]},
        hoverinfo = 'none',
        showlegend=False
    ), row=1, col=2)
    fig.show()

Classical MDS#

Like metric MDS, classical MDS aims at fitting Euclidean distances, but avoids iterative minimization. We will see that applying a linear transform to the Euclidean distance matrices in both low and high dimensions before fitting allows for analytical minimization. Here, by Euclidean distance matrix we denote a matrix containing in row \(i\) and column \(j\) the squared Euclidean distance between sample \(i\) and sample \(j\).

We will discuss several aspects step by step:

  • the transform applied to Euclidean distances to simplify minimization,

  • reconstruction of the (lower dimensional) data set from transformed distances,

  • formulation and solution of the minimization problem,

  • modifications to allow for non-Euclidean distances in high dimensions,

  • relations to PCA and kernel PCA.

From Distances to Inner Products#

Let \(z_1,\ldots,z_n\in\mathbb{R}^q\) be a collection of points in \(q\) dimensions, think of \(q=m\) with \(z_l=x_l\) or \(q=p\) with \(z_l=u_l\). The corresponding Euclidean distance matrix \(D\in\mathbb{R}^{n\times n}\) is given by

\[\begin{equation*} D_{l,\lambda}:=|z_l-z_\lambda|^2 \end{equation*}\]

and the inner product matrix \(S\in\mathbb{R}^{n\times n}\) of the centered data set \(z_1-\bar{z},\ldots,z_n-\bar{z}\) with \(\bar{z}:=\frac{1}{n}\,\sum_{l=1}^n z_l\) is defined by

\[\begin{equation*} S_{l,\lambda}:=(z_l-\bar{z})^\mathrm{T}\,(z_\lambda-\bar{z}). \end{equation*}\]

With

\[\begin{equation*} M:=\begin{bmatrix}\frac{1}{n}&\cdots&\frac{1}{n}\\\vdots&&\vdots\\\frac{1}{n}&\cdots&\frac{1}{n}\end{bmatrix}\in\mathbb{R}^{n\times n} \end{equation*}\]

we now show

\[\begin{equation*} S=-\frac{1}{2}\,(I-M)\,D\,(I-M), \end{equation*}\]

that is, double centering the Euclidean distance matrix yields the inner product matrix of the centered data set (up to the factor \(-\frac{1}{2}\)).

From the definition of matrix multiplication and then from

\[\begin{equation*} |z_l-z_\lambda|^2=|z_l|^2+|z_\lambda|^2-2\,{z_l}^\mathrm{T}\,z_\lambda \end{equation*}\]

we see

\[\begin{align*} [(I-M)\,D\,(I-M)]_{l,\lambda} &=D_{l,\lambda}-[M\,D]_{l,\lambda}-[D\,M]_{l,\lambda}+[M\,D\,M]_{l,\lambda}\\ &=D_{l,\lambda}-\frac{1}{n}\,\sum_{i=1}^n D_{i,\lambda}-\frac{1}{n}\,\sum_{j=1}^n D_{l,j}+\frac{1}{n^2}\,\sum_{i=1}^n\sum_{j=1}^n D_{i,j}\\ &=|z_l|^2+|z_\lambda|^2-2\,{z_l}^\mathrm{T}\,z_\lambda-\left(|z_\lambda|^2+\frac{1}{n}\,\sum_{i=1}^n\bigl(|z_i|^2-2\,{z_i}^\mathrm{T}\,z_\lambda\bigr)\right)\\ &\qquad-\left(|z_l|^2+\frac{1}{n}\,\sum_{j=1}^n\bigl(|z_j|^2-2\,{z_l}^\mathrm{T}\,z_j\bigr)\right)+\frac{1}{n^2}\,\sum_{i=1}^n\sum_{j=1}^n\bigl(|z_i|^2+|z_j|^2-2\,{z_i}^\mathrm{T}\,z_j\bigr). \end{align*}\]

Summands \(|z_l|^2\) and \(|z_\lambda|^2\) each appear twice with opposite signs and thus cancel. The double sum can be simplified to

\[\begin{align*} \frac{1}{n^2}\,\sum_{i=1}^n\sum_{j=1}^n\bigl(|z_i|^2+|z_j|^2-2\,{z_i}^\mathrm{T}\,z_j\bigr) &=\frac{1}{n^2}\,\sum_{i=1}^n\sum_{j=1}^n|z_i|^2+\frac{1}{n^2}\,\sum_{i=1}^n\sum_{j=1}^n|z_j|^2-\frac{2}{n^2}\,\sum_{i=1}^n\sum_{j=1}^n\,{z_i}^\mathrm{T}\,z_j\\ &=\frac{1}{n}\,\sum_{i=1}^n|z_i|^2+\frac{1}{n}\,\sum_{j=1}^n|z_j|^2-2\,\left(\frac{1}{n}\,\sum_{i=1}^n z_i\right)^\mathrm{T}\,\left(\frac{1}{n}\,\sum_{j=1}^n z_j\right), \end{align*}\]

so that the sums over \(|z_i|^2\) and \(|z_j|^2\) cancel out, too. Thus, we have

\[\begin{align*} [(I-M)\,D\,(I-M)]_{l,\lambda} &=-2\,{z_l}^\mathrm{T}\,z_\lambda+\frac{2}{n}\,\sum_{i=1}^n{z_i}^\mathrm{T}\,z_\lambda+\frac{2}{n}\,\sum_{j=1}^n{z_l}^\mathrm{T}\,z_j-2\,\bar{z}^\mathrm{T}\,\bar{z}\\ &=-2\,{z_l}^\mathrm{T}\,z_\lambda+2\,\bar{z}^\mathrm{T}\,z_\lambda+2\,{z_l}^\mathrm{T}\,\bar{z}-2\,\bar{z}^\mathrm{T}\,\bar{z}\\ &=-2\,(z_l-\bar{z})^\mathrm{T}\,(z_\lambda-\bar{z})=-2\,S_{l,\lambda}. \end{align*}\]
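
The identity is easy to check numerically. The following sketch (with illustrative names) compares the double centered distance matrix with the inner product matrix of the centered data set:

import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(50, 4))                             # some data set, q = 4

D = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)   # squared Euclidean distances
n = Z.shape[0]
M = np.full((n, n), 1 / n)

S_from_D = -0.5 * (np.eye(n) - M) @ D @ (np.eye(n) - M)  # double centering

Z_centered = Z - Z.mean(axis=0)
S = Z_centered @ Z_centered.T                            # inner products of centered data

print(np.allclose(S, S_from_D))                          # True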

From Inner Products to Data#

From an inner product matrix \(S\) we want to reconstruct the original data \(z_1,\ldots,z_n\). As we will see, the mean \(\bar{z}\) cannot be reconstructed from \(S\). Choosing an arbitrary \(\bar{z}\) amounts to a translation of the whole data set compared to the original one. In addition there will be some freedom in rotating and mirroring the reconstructed data set.

The data set can be reconstructed from the eigendecomposition of \(S\). The matrix \(S\) is symmetric and positive semi-definite. So there exist \(n\) nonnegative numbers \(\lambda_1\geq\ldots\geq\lambda_n\geq 0\) (eigenvalues of \(S\)) and \(n\) vectors \(a_1,\ldots,a_n\in\mathbb{R}^n\) (eigenvectors of \(S\)) with

\[\begin{equation*} S=A\,\Lambda\,A^\mathrm{T}, \end{equation*}\]

where \(A\) contains the eigenvectors as columns and \(\Lambda\) is the diagonal matrix of eigenvalues.

If \(Z\in\mathbb{R}^{n\times q}\) contains \(z_1,\ldots,z_n\) as rows, then we have

\[\begin{equation*} S=\bigl((I-M)\,Z\bigr)\,\bigl((I-M)\,Z\bigr)^\mathrm{T}=(I-M)\,Z\,Z^\mathrm{T}\,(I-M). \end{equation*}\]

The rank of \(Z\) is at most \(q\) and, thus, the rank of \(S\) is at most \(q\), too. So all but the first \(q\) eigenvalues of \(S\) are zero. Here we assume \(q\leq n\). The case \(q>n\) is not of interest to us, because for MDS we will have \(q=p\) (dimension of low dimensional space, in practice 2 or 3).

With an arbitrary orthogonal matrix \(R\in\mathbb{R}^{q\times q}\) (rotation and/or mirroring) we set

\[\begin{equation*} \tilde{z}_l:=R\,\begin{bmatrix}\sqrt{\lambda_1}\,a_1^{(l)}\\\vdots\\\sqrt{\lambda_q}\,a_q^{(l)}\end{bmatrix}\qquad\text{for }l=1,\ldots,n. \end{equation*}\]

Denoting by \(A_q\in\mathbb{R}^{n\times q}\) the matrix with the first \(q\) eigenvectors as columns and by \(\Lambda_q^{\frac{1}{2}}\in\mathbb{R}^{q\times q}\) the diagonal matrix of square roots of the first \(q\) eigenvalues we see

\[\begin{equation*} \tilde{Z}=A_q\,\Lambda_q^{\frac{1}{2}}\,R^\mathrm{T} \end{equation*}\]

and (using that all but the first \(q\) eigenvalues are zero)

\[\begin{equation*} \tilde{Z}\,\tilde{Z}^\mathrm{T}=A_q\,\Lambda_q^{\frac{1}{2}}\,R^\mathrm{T}\,R\,\Lambda_q^{\frac{1}{2}}\,{A_q}^\mathrm{T} =A_q\,\Lambda_q\,{A_q}^\mathrm{T}=A\,\Lambda\,A^\mathrm{T}=S. \end{equation*}\]

That is, \(S\) is the inner product matrix of \(\tilde{z}_1,\ldots,\tilde{z}_n\).

Now from

\[\begin{equation*} |\tilde{z}_l-\tilde{z}_\lambda|^2 =|\tilde{z}_l|^2+|\tilde{z}_\lambda|^2-2\,{\tilde{z}_l}^\mathrm{T}\,\tilde{z}_\lambda =S_{l,l}+S_{\lambda,\lambda}-2\,S_{l,\lambda} =|z_l-\bar{z}-(z_\lambda-\bar{z})|^2=|z_l-z_\lambda|^2 \end{equation*}\]

we see that the reconstructed data set \(\tilde{z}_1,\ldots,\tilde{z}_n\) and the original data set \(z_1,\ldots,z_n\) have identical distance matrices.
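
The reconstruction can be sketched in a few lines of NumPy (illustrative names again): eigendecomposition of \(S\), scaling of the eigenvectors by square roots of the eigenvalues (here with \(R=I\)), and a check that pairwise distances are preserved.

import numpy as np

rng = np.random.default_rng(1)
Z = rng.normal(size=(40, 3))                         # original data, q = 3
Zc = Z - Z.mean(axis=0)
S = Zc @ Zc.T                                        # inner product matrix of centered data

vals, vecs = np.linalg.eigh(S)                       # eigenvalues in ascending order
vals, vecs = vals[::-1], vecs[:, ::-1]               # reorder to descending
q = 3
Z_tilde = vecs[:, :q] * np.sqrt(np.clip(vals[:q], 0, None))   # rows are the reconstructed points

D_orig = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
D_rec = ((Z_tilde[:, None, :] - Z_tilde[None, :, :]) ** 2).sum(axis=2)
print(np.allclose(D_orig, D_rec))                    # True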

The Minimization Problem#

Classical MDS fits inner products, not distances. The following steps lead from a Euclidean distance matrix \(D\in\mathbb{R}^{n\times n}\) in high dimensions to a set of points \(u_1,\ldots,u_n\in\mathbb{R}^p\) in low dimensions:

  • Calculate the inner product matrix \(S\in\mathbb{R}^{n\times n}\) in high dimensions:

\[\begin{equation*} S=-\frac{1}{2}\,(I-M)\,D\,(I-M). \end{equation*}\]

Here we do not have to know the underlying data set \(x_1,\ldots,x_n\in\mathbb{R}^m\) explicitly, because inner products are determined by pairwise Euclidean distances (up to centering).

  • Solve

\[\begin{equation*} \sum_{l=1}^n\sum_{\lambda=1}^n(S_{l,\lambda}-T_{l,\lambda})^2\to\min_{T\in\mathbb{R}^{n\times n}}\qquad\text{considering only symmetric $T$ of rank $p$.} \end{equation*}\]
  • Interpret the optimal \(T\) as inner product matrix of \(n\) points in \(\mathbb{R}^p\) and reconstruct corresponding points \(u_1,\ldots,u_n\).

In contrast to metric MDS, classical MDS does not fit pairwise distances directly, but transforms distances to inner products and fits those inner products. Distance matrices and inner product matrices carry identical information (up to translation, rotation, mirroring of the data set), but due to different objective functions both variants of MDS yield different results. Metric MDS minimizes the mean squared error (MSE) of pairwise distances:

\[\begin{equation*} \sum_{l=1}^n\sum_{\lambda=1}^n\bigl(|x_l-x_\lambda|^2-|u_l-u_\lambda|^2\bigr)^2. \end{equation*}\]

Classical MDS minimizes (assuming centered data in high dimensions):

\[\begin{equation*} \sum_{l=1}^n\bigl(|x_l|^2-|u_l|^2\bigr)^2+\sum_{l=1}^n\sum_{\substack{\lambda=1\\\lambda\neq l}}^n\bigl({x_l}^\mathrm{T}\,x_\lambda-{u_l}^\mathrm{T}\,u_\lambda\bigr)^2, \end{equation*}\]

which is the MSE of squared vector lengths (the diagonal of the inner product matrix) plus the MSE of angles (an inner product is the cosine of the angle multiplied by both vector lengths).

It remains to answer the question how to solve the above minimization problem. We aim at a set of points in \(\mathbb{R}^p\). So the corresponding inner product matrix has rank at most \(p\). That’s the reason why we restrict optimization to symmetric matrices of rank \(p\). A standard result from linear algebra (Eckart-Young-Mirsky theorem) tells us that we have to look at the eigendecomposition of \(S\):

\[\begin{equation*} S=A\,\Lambda\,A^\mathrm{T}\qquad\text{cf. above}. \end{equation*}\]

Then the optimal \(T\) is (assuming \(p\leq n\))

\[\begin{equation*} T=B\,\Theta\,B^\mathrm{T}, \end{equation*}\]

where \(B\in\mathbb{R}^{n\times p}\) contains the first \(p\) eigenvectors of \(S\) as columns and \(\Theta\in\mathbb{R}^{p\times p}\) is the diagonal matrix of the highest \(p\) eigenvalues of \(S\). So solving the minimization problem reduces to an eigenvalue problem for \(S\).

As a by-product of minimization we get the eigendecomposition of \(T\), which is required for reconstructing the data set \(u_1,\ldots,u_n\) in low dimensions from \(T\). Eigenvalues and eigenvectors of \(T\) coincide with the first \(p\) eigenvalues and eigenvectors of \(S\).
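
Putting the steps together, classical MDS fits into a few lines of NumPy. The following sketch (the function name classical_mds is our own choice) also clips negative eigenvalues to zero, which anticipates the discussion of non-Euclidean distances below:

import numpy as np

def classical_mds(D, p=2):
    # D: matrix of squared pairwise distances, shape (n, n)
    n = D.shape[0]
    J = np.eye(n) - np.full((n, n), 1 / n)      # centering matrix I - M
    S = -0.5 * J @ D @ J                         # inner product matrix
    vals, vecs = np.linalg.eigh(S)
    vals, vecs = vals[::-1], vecs[:, ::-1]       # descending order
    vals_p = np.clip(vals[:p], 0, None)          # keep the p largest eigenvalues, drop negative ones
    return vecs[:, :p] * np.sqrt(vals_p)         # rows are u_1, ..., u_n

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
U = classical_mds(D, p=2)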

Non-Euclidean Distances#

Classical MDS solely relies on the distance matrix \(D\) in high dimensions. The assumption that distances are Euclidean ensures that the transformed distance matrix \(S=-\frac{1}{2}\,(I-M)\,D\,(I-M)\) is symmetric and positive semi-definite. So we may replace the assumption of Euclidean distances by the more direct assumption that the transformed distance matrix \(S\) is symmetric and positive semi-definite.

\(D\) may contain arbitrary (non-Euclidean) pairwise distances as long as \(D\) is symmetric and the double centered matrix \(S=-\frac{1}{2}\,(I-M)\,D\,(I-M)\) is positive semi-definite. Then \(-\frac{1}{2}\,D\) can be treated like a kernel matrix resulting from inner products of nonlinearly transformed data: the transform from \(-\frac{1}{2}\,D\) to \(S\) is simple double centering, which is equivalent to centering the transformed data set.

We may even drop the assumption that \(S\) has to be positive semi-definite. If \(S\) is not positive semi-definite, some eigenvalues will be negative. When approximating \(S\) by some \(T\) of rank \(p\) we only consider the \(p\) largest positive eigenvalues. As long as the absolute values of the negative eigenvalues are small, the error induced by ignoring them is not higher than the error resulting from low-rank approximation (that is, from dropping small positive eigenvalues). If \(D\) is based on some kind of distance, possible negative eigenvalues will be small.

As long as \(D\) is symmetric and related to some almost arbitrary kind of distance, classical MDS is applicable.

Relation to PCA and Kernel PCA#

For centered data and Euclidean distances we have \(S=X\,X^\mathrm{T}\), where \(X\in\mathbb{R}^{n\times m}\) contains the samples \(x_1,\ldots,x_n\) as rows. Else, \(S\) can be regarded as a double centered kernel matrix originating from the kernel matrix \(K=-\frac{1}{2}\,D\). The transform from \(S\) to \(u_1,\ldots,u_n\) in classical MDS coincides with the transform from \(K\) to the data set’s first \(p\) coordinates w.r.t. the kernel PCA coordinate system. In both cases we use the same parts of the eigendecomposition of the centered matrix \(S\).

Classical MDS and kernel PCA are two different motivations for one and the same algorithm. In the case of centered data and Euclidean distances classical MDS and (non-kernel) PCA coincide. Classical Euclidean MDS originated in the 1950s. Kernel PCA appeared in 1996 and the connection to classical MDS was discovered a few years later.
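
This relation can be checked numerically: feeding \(K=-\frac{1}{2}\,D\) to Scikit-Learn's KernelPCA with a precomputed kernel (KernelPCA centers the kernel matrix internally) should reproduce the embedding of the classical_mds sketch above, up to signs of the coordinate axes.

import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)

U_mds = classical_mds(D, p=2)                    # sketch from above

kpca = KernelPCA(n_components=2, kernel='precomputed')
U_kpca = kpca.fit_transform(-0.5 * D)            # double centering happens inside KernelPCA

print(np.allclose(np.abs(U_mds), np.abs(U_kpca)))   # True (up to numerical precision)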

Non-Metric MDS#

For the sake of completeness we mention a third variant of MDS known as non-metric MDS, but we do not go into the details. Non-metric MDS does not focus on getting the distances right, but only on preserving the ordering of distances.

The similarity measure in low dimensions is

\[\begin{equation*} d_p(u_l,u_\lambda)=f\bigl(|u_l-u_\lambda|^2\bigr) \end{equation*}\]

with some monotonically increasing function \(f\).

Next to \(u_1,\ldots,u_n\), the function \(f\) is a variable in the optimization process. A typical numerical approach is to alternate optimization steps for \(u_1,\ldots,u_n\) and for \(f\). Concrete algorithms following this idea are smallest space analysis (SSA) and the Shepard-Kruskal algorithm.
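
Scikit-Learn's MDS class also covers the non-metric variant via the metric parameter; a minimal usage sketch on random data:

import numpy as np
import sklearn.manifold as manifold

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))        # some data set in 5 dimensions

# non-metric MDS: only the ordering of pairwise distances is to be preserved
nmds = manifold.MDS(n_components=2, metric=False, normalized_stress='auto')
U = nmds.fit_transform(X)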

Isomap#

Isomap assumes that the data set lies on a low dimensional manifold in the high dimensional space. Instead of Euclidean distances it calculates (approximate) geodesic distances with respect to the manifold. In other words, it looks for the shortest distance between two points if a traveller between both points is not allowed to leave the manifold. With pairwise geodesic distances at hand, classical MDS is applied to the resulting distance matrix.

The only thing we have to discuss is how to find (approximate) geodesic distances. This requires two steps:

  • Create a neighborhood graph. The nodes are the samples. A node is connected to another node by an (undirected) edge if the other node belongs to the \(k\) nearest neighbors of the first one for some prescribed \(k\). Each edge is assigned a weight, which equals the Euclidean distance between samples connected by the edge.

  • Get the length of the shortest path in the neighborhood graph between each pair of nodes. The length of a path is the sum of the weights of all edges belonging to the path. Finding shortest paths in a graph is a standard task and can be solved by Dijkstra’s algorithm.

[Figure: neighborhood graph and one shortest path for Isomap]

Fig. 69 Isomap uses approximate geodesic distances instead of Euclidean distances.#

An alternative to the \(k\) nearest neighbors is to connect a node to all other nodes within a prescribed distance.
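
Both steps can be sketched with Scikit-Learn's kneighbors_graph and SciPy's shortest path routine; this is only an illustration of the idea, not the exact implementation used by Scikit-Learn's Isomap class. The final line reuses the classical_mds sketch from above.

import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 3))        # some data set in 3 dimensions

# step 1: weighted k nearest neighbor graph (weights are Euclidean distances)
graph = kneighbors_graph(X, n_neighbors=10, mode='distance')

# step 2: shortest path lengths between all pairs of nodes (Dijkstra);
# entries are infinite if the graph is not connected (increase n_neighbors then)
geo = shortest_path(graph, method='D', directed=False)

# classical MDS on squared geodesic distances
U = classical_mds(geo ** 2, p=2)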

Next to high computational costs a major problem with Isomap is that there may be so-called short-circuit errors. That is, the neighborhood graph contains edges between non-neighboring points on the manifold. This happens especially for large \(k\) or sparse data sets.

[Figure: short-circuit error]

Fig. 70 Isomap may suffer from short-circuit errors.#

Scikit-Learn has the Isomap class in the manifold module.

data_files = ['omega.npz', 'sphere.npz', 'cube.npz', 'clouds.npz']

for file in data_files:
    
    loaded = np.load(file)
    x = loaded['x']
    y = loaded['y']
    z = loaded['z']
    red = loaded['red']
    green = loaded['green']
    blue = loaded['blue']

    im = manifold.Isomap(n_components=2, n_neighbors=5)
    U = im.fit_transform(np.stack((x, y, z), axis=1))
    
    fig = make_subplots(rows=1, cols=2, specs=[[{'type': 'scatter3d'}, {'type': 'xy'}]])
    fig.update_layout(width=1000, height=600, scene_aspectmode='cube')
    fig.add_trace(go.Scatter3d(
        x=x, y=y, z=z,
        mode='markers',
        marker={'size': 1.5, 'color': [f'rgb({r},{g},{b})' for r, g, b in zip(red, green, blue)]},
        hoverinfo = 'none',
        showlegend=False
    ), row=1, col=1)
    fig.add_trace(go.Scatter(
        x=U[:, 0], y=U[:, 1],
        mode='markers',
        marker={'size': 5, 'color': [f'rgb({r},{g},{b})' for r, g, b in zip(red, green, blue)]},
        hoverinfo = 'none',
        showlegend=False
    ), row=1, col=2)
    fig.show()
UserWarning: The number of connected components of the neighbors graph is 2 > 1. Completing the graph to fit Isomap might be slow. Increase the number of neighbors to avoid this issue.