Stochastic Neighbor Embedding#

Up to now we have only considered deterministic dimensionality reduction methods. Stochastic neighbor embedding (SNE) does not use neighborhood relations and distances directly. Instead, it estimates from the pairwise distances the probability that two samples are neighbors in high dimensions. Then it tries to find points in low dimensions which yield identical neighborhood probabilities.

SNE appeared in 2002 and several variants have been developed since then. The most prominent one is known as t-SNE and originated in 2008. Like MDS, SNE does not require direct knowledge of the underlying data set \(x_1,\ldots,x_n\), but only uses pairwise distances.

Basic idea#

Probabilities in High Dimensions#

Given pairwise distances \(D_{l,\lambda}\) (\(l,\lambda=1,\ldots,n\)) in high dimensions we may fix a sample \(x_l\) and assign to every other sample \(x_\lambda\) a probability \(\tilde{p}_{l,\lambda}\) reflecting its neighborhood relation to \(x_l\). The closer \(x_\lambda\) is to \(x_l\), the higher \(\tilde{p}_{l,\lambda}\). We may use Gaussian probabilities:

\[\begin{equation*} \tilde{p}_{l,\lambda}:=\frac{\mathrm{e}^{-\frac{D_{l,\lambda}^2}{2\,\sigma_l^2}}}{\sum\limits_{\substack{i=1\\i\neq l}}^n\mathrm{e}^{-\frac{D_{l,i}^2}{2\,\sigma_l^2}}},\quad\lambda\neq l\qquad\text{and}\qquad\tilde{p}_{l,l}:=0. \end{equation*}\]
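For illustration, here is how this formula could be written with NumPy for a fixed \(l\). The row D2_row of squared pairwise distances and the name sigma_l are assumptions made for this sketch, not part of the definition above:

import numpy as np

def neighbor_probs(D2_row, l, sigma_l):
    # D2_row: squared distances D_{l,i}^2 from sample x_l to all n samples
    # sigma_l: width of the Gaussian for sample x_l
    p = np.exp(-D2_row / (2 * sigma_l ** 2))
    p[l] = 0                      # a sample is not its own neighbor
    return p / p.sum()            # normalize over all i != l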

The parameter \(\sigma_l\) is chosen numerically (by bisection, for instance) to fix the entropy of the neighborhood distribution of \(x_l\) at some prescribed value, which is independent of \(l\). The entropy here is

\[\begin{equation*} -\sum_{\lambda=1}^n \tilde{p}_{l,\lambda}\,\log\tilde{p}_{l,\lambda}. \end{equation*}\]

If \(\sigma_l\) is too small, only a few neighbors of \(x_l\) get high probabilities and the entropy is very low. If \(\sigma_l\) is too large, many neighbors of \(x_l\) have similar probabilities and the entropy is high. Adjusting \(\sigma_l\) to reach some prescribed medium entropy ensures that the neighborhood distribution takes the data set’s local density into account.
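A sketch of this bisection, reusing neighbor_probs from above; the target entropy and the initial search interval are assumptions, and base-2 logarithms are used here to match the perplexity convention of Scikit-Learn appearing below:

def entropy(p):
    p = p[p > 0]                          # convention: 0 * log(0) = 0
    return -np.sum(p * np.log2(p))

def find_sigma(D2_row, l, target_entropy, lo=1e-10, hi=1e10, iters=60):
    # entropy grows monotonically with sigma_l, so bisection applies
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if entropy(neighbor_probs(D2_row, l, mid)) < target_entropy:
            lo = mid                      # entropy too low -> increase sigma_l
        else:
            hi = mid                      # entropy too high -> decrease sigma_l
    return 0.5 * (lo + hi)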

In general \(\tilde{p}_{l,\lambda}\neq\tilde{p}_{\lambda,l}\). To enforce symmetry (which simplifies some computations) we set

\[\begin{equation*} p_{l,\lambda}:=\frac{\tilde{p}_{l,\lambda}+\tilde{p}_{\lambda,l}}{2\,n}. \end{equation*}\]

The sum of all these \(n^2\) values equals \(1\). Instead of \(n\) probability distributions (one for each \(l\)) we now have only one distribution, and this distribution is symmetric.
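If all rows \(\tilde{p}_{l,\cdot}\) are collected in an \(n\times n\) NumPy array p_tilde (a naming assumption for this sketch), the symmetrization reads:

# rows of p_tilde sum to 1; after symmetrization all n*n entries of P sum to 1
P = (p_tilde + p_tilde.T) / (2 * n)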

All in all, we converted pairwise distances to pairwise probabilities that two samples are neighbors. The conversion is not a simple proportionality, however, but also takes the local density of the data set into account.

Probabilities in Low Dimensions#

Given points \(u_1,\ldots,u_n\) in low dimensions we may use the same construction as in high dimensions to obtain probabilities \(q_{l,\lambda}\). Distances are Euclidean, and all \(\sigma\)-values can be set to one to get uniform local densities in low dimensions.

There are several, rather involved reasons to choose non-Gaussian probabilities in low dimensions. Several choices have been proposed, yielding a range of different SNE variants. Below we give the details for the variant known as t-SNE.

Fitting Probabilities#

SNE tries to find low dimensional points \(u_1,\ldots,u_n\) such that the corresponding probability distribution fits the high dimensional probability distribution as well as possible. Instead of measuring the fit between both sets of probability values by the MSE, SNE uses the Kullback-Leibler divergence:

\[\begin{equation*} \sum_{l=2}^n\sum_{\lambda=1}^{l-1}p_{l,\lambda}\,\log\frac{p_{l,\lambda}}{q_{l,\lambda}(u_1,\ldots,u_n)}\to\min_{u_1,\ldots,u_n}. \end{equation*}\]

This minimization problem can be solved numerically via gradient descent. There also exist more efficient methods adapted to the specifics of SNE.
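For illustration, a sketch of the objective for given probability matrices; P holds the \(p_{l,\lambda}\) and Q the \(q_{l,\lambda}(u_1,\ldots,u_n)\), both names being assumptions:

def kl_divergence(P, Q):
    # sum over the strict lower triangle (l = 2,...,n and lambda < l)
    mask = np.tril(np.ones_like(P, dtype=bool), k=-1) & (P > 0)
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))

Gradient descent then repeatedly moves \(u_1,\ldots,u_n\) in the direction of the negative gradient of this expression with respect to the low dimensional points.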

t-SNE#

The t-SNE variant of SNE defines the low dimensional probabilities \(q_{l,\lambda}\) based on Student’s t-distribution. There are two reasons for this choice:

  • It’s computationally more efficient because no exponentiation is required (see the sketch below).

  • Student’s t-distribution decays more slowly than a Gaussian distribution, which compensates (to some degree) for effects caused by the curse of dimensionality. A ball around a sample in high dimensions has a much higher volume than a ball of the same radius in low dimensions. To get similar sample densities (neighbors per volume) in both high and low dimensions, we have to assign higher probabilities to more distant samples in low dimensions than in high dimensions. Otherwise the low dimensional embedding would look much denser than the original data set and clusters might become indistinguishable.

[figure: neighbors per volume in high and low dimensions]

Fig. 73 Identical number of neighbors yields higher neighbor density in low dimensions than in high dimensions.#
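In the usual t-SNE formulation the t-distribution has one degree of freedom, that is, \(q_{l,\lambda}\) is proportional to \(\bigl(1+\lVert u_l-u_\lambda\rVert^2\bigr)^{-1}\) with one joint normalization over all pairs. A sketch of this computation (the function name is an assumption):

def low_dim_probs(U):
    # U: n x 2 array of the low dimensional points u_1, ..., u_n
    d2 = np.sum((U[:, None, :] - U[None, :, :]) ** 2, axis=-1)   # squared Euclidean distances
    q = 1 / (1 + d2)              # Student's t kernel, no exponentiation needed
    np.fill_diagonal(q, 0)        # a point is not its own neighbor
    return q / q.sum()            # one joint distribution over all pairs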

t-SNE with Scikit-Learn#

Scikit-Learn has the TSNE class in the manifold module. The perplexity parameter controls the desired entropy. Entropy is the base 2 logarithm of perplexity.

import numpy as np
import sklearn.manifold as manifold
import plotly.graph_objects as go
from plotly.subplots import make_subplots
data_files = ['omega.npz', 'sphere.npz', 'cube.npz', 'clouds.npz']

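# for each data set: load the 3D points and their colors, embed them in 2D with t-SNE, plot both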
for file in data_files:
    
    loaded = np.load(file)
    x = loaded['x']
    y = loaded['y']
    z = loaded['z']
    red = loaded['red']
    green = loaded['green']
    blue = loaded['blue']

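    # embed the 3D points in 2D; perplexity 30 equals Scikit-Learn's default value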
    sne = manifold.TSNE(n_components=2, perplexity=30)
    U = sne.fit_transform(np.stack((x, y, z), axis=1))
    
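    # left: original 3D data set, right: its 2D t-SNE embedding (identical point colors)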
    fig = make_subplots(rows=1, cols=2, specs=[[{'type': 'scatter3d'}, {'type': 'xy'}]])
    fig.update_layout(width=1000, height=600, scene_aspectmode='cube')
    fig.add_trace(go.Scatter3d(
        x=x, y=y, z=z,
        mode='markers',
        marker={'size': 1.5, 'color': [f'rgb({r},{g},{b})' for r, g, b in zip(red, green, blue)]},
        hoverinfo='none',
        showlegend=False
    ), row=1, col=1)
    fig.add_trace(go.Scatter(
        x=U[:, 0], y=U[:, 1],
        mode='markers',
        marker={'size': 5, 'color': [f'rgb({r},{g},{b})' for r, g, b in zip(red, green, blue)]},
        hoverinfo='none',
        showlegend=False
    ), row=1, col=2)
    fig.show()