Kernel SVMs#

SVMs as introduced above only yield linear classifiers. With a simple modification, known as the kernel trick, we can extend soft margin SVMs to nonlinear classification, where the decision boundary is defined by a nonlinear function instead of a hyperplane.

Feature Transforms#

For linear regression we obtained nonlinear models by applying functions \(\varphi_1,\ldots,\varphi_\mu:\mathbb{R}^m\rightarrow\mathbb{R}\) to the feature values and then fitting a linear model to the transformed features. Exactly the same idea applies to SVMs. Given a vector-valued function \(\varphi:\mathbb{R}^m\rightarrow\mathbb{R}^\mu\) with \(\mu>m\) we transform all inputs \(x\) to \(\varphi(x)\) and train a linear SVM classifier in \(\mathbb{R}^\mu\).
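As a sketch of this idea (assuming scikit-learn and NumPy are available; the data set and the value \(C=1\) are made up for illustration), we can apply an explicit degree-2 polynomial feature transform and then train an ordinary linear soft margin SVM on the transformed features:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import LinearSVC

# made-up two-class data set that is not linearly separable in the original space
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5, 1, -1)

# explicit feature transform: all monomials of degree <= 2
phi = PolynomialFeatures(degree=2, include_bias=True)
X_trans = phi.fit_transform(X)      # shape (200, 6) for m = 2

# linear soft margin SVM trained in the transformed space
clf = LinearSVC(C=1.0)
clf.fit(X_trans, y)

print(clf.score(X_trans, y))        # training accuracy
```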

Example

For \(m=3\), polynomial features of degree 2 are given by

\[\begin{equation*} \varphi(x)=\left(\tfrac{1}{\sqrt{2}},\;x^{(1)},\;x^{(2)},\;x^{(3)},\;\tfrac{1}{\sqrt{2}}\,\left(x^{(1)}\right)^2,\;\tfrac{1}{\sqrt{2}}\,\left(x^{(2)}\right)^2,\;\tfrac{1}{\sqrt{2}}\,\left(x^{(3)}\right)^2,\;x^{(1)}\,x^{(2)},\;x^{(1)}\,x^{(3)},\;x^{(2)}\,x^{(3)}\right). \end{equation*}\]

Why we use \(\sqrt{2}\) here will become clear below.


Fig. 55 Data set and margin in original and in transformed space.#

Kernels#

Training and prediction with soft margin SVMs do not use the feature values directly but only inner products of the inputs. Thus, the transform \(\varphi\) only appears in expressions of the form

\[\begin{equation*} K(x,\tilde{x}):=\varphi(x)^\mathrm{T}\,\varphi(\tilde{x}). \end{equation*}\]

Given some feature transform \(\varphi\), the corresponding function \(K:\mathbb{R}^m\times\mathbb{R}^m\rightarrow\mathbb{R}\) is called a kernel.

Kernels can be interpreted as similarity measures: for many kernels (the Gaussian kernel below, for instance) \(K\) attains its maximum for \(x=\tilde{x}\), and \(K\) is zero if \(\varphi(x)\) is orthogonal to \(\varphi(\tilde{x})\).

Example

For \(m=3\), the polynomial features of degree 2 from above yield the kernel

\[\begin{equation*} K(x,\tilde{x})=\tfrac{1}{2}\,\left(x^\mathrm{T}\,\tilde{x}+1\right)^2. \end{equation*}\]

The \(\sqrt{2}\) in the transform \(\varphi\) ensures that we get such a simple expression for the inner product of two transformed feature vectors.
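The identity is easy to check numerically. Below is a small sketch (plain NumPy, arbitrary random vectors) that evaluates the inner product once via the transform \(\varphi\) in \(\mathbb{R}^{10}\) and once via the closed-form kernel in \(\mathbb{R}^3\):

```python
import numpy as np

def phi(x):
    """Degree-2 feature transform for m = 3 with the sqrt(2) scalings from above."""
    s = 1.0 / np.sqrt(2.0)
    return np.array([
        s,
        x[0], x[1], x[2],
        s * x[0] ** 2, s * x[1] ** 2, s * x[2] ** 2,
        x[0] * x[1], x[0] * x[2], x[1] * x[2],
    ])

def K(x, x_tilde):
    """Inhomogeneous degree-2 kernel in closed form."""
    return 0.5 * (x @ x_tilde + 1.0) ** 2

rng = np.random.default_rng(0)
x, x_tilde = rng.standard_normal(3), rng.standard_normal(3)

print(phi(x) @ phi(x_tilde))   # inner product in R^10
print(K(x, x_tilde))           # same value, computed in R^3
```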

Working with kernels instead of feature transforms is much more efficient. In the above example, computing \(\varphi(x)^\mathrm{T}\,\varphi(\tilde{x})\) requires two feature transforms and an inner product in \(\mathbb{R}^{10}\), whereas computing \(K(x,\tilde{x})\) only requires an inner product in \(\mathbb{R}^3\) plus one addition and two multiplications in \(\mathbb{R}\).

For general \(m\) and polynomial features of degree 2 we would have \(\mu=\frac{m\,(m+1)}{2}+m+1\). For \(m=1000\) this yields \(\mu=501501\). Thus, computing inner products in \(\mathbb{R}^\mu\) is much more expensive than computing inner products in \(\mathbb{R}^m\). The kernel trick allows us to work with high-dimensional feature transforms without this additional computational effort.
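The cost difference can be made visible with a rough timing experiment. The sketch below (plain NumPy; the transform uses unscaled monomials, so the two computed values differ by the missing constants, but the dimension \(\mu=501501\) and the cost are representative) compares the two approaches for \(m=1000\):

```python
import numpy as np
from time import perf_counter

def phi(x):
    """All monomials of degree <= 2 (constant, linear, squares, cross terms).
    Scaling constants are omitted, so only dimension and cost are representative."""
    iu = np.triu_indices(x.size)
    return np.concatenate(([1.0], x, np.outer(x, x)[iu]))

m = 1000
rng = np.random.default_rng(0)
x, x_tilde = rng.standard_normal(m), rng.standard_normal(m)

print(phi(x).size)                      # 501501 = m*(m+1)/2 + m + 1

t0 = perf_counter()
value_explicit = phi(x) @ phi(x_tilde)  # two transforms plus inner product in R^501501
t1 = perf_counter()
value_kernel = 0.5 * (x @ x_tilde + 1.0) ** 2   # inner product in R^1000 plus a few scalar ops
t2 = perf_counter()

print(t1 - t0, t2 - t1)                 # kernel evaluation is far cheaper
```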

More Kernels#

Besides the inhomogeneous polynomial kernels

\[\begin{equation*} (x^\mathrm{T}\,\tilde{x}+1)^p\qquad\text{with some $p\in\mathbb{N}$} \end{equation*}\]

and the homogeneous polynomial kernels

\[\begin{equation*} (x^\mathrm{T}\,\tilde{x})^p\qquad\text{with some $p\in\mathbb{N}$}, \end{equation*}\]

there are several other kernels used in practice. The most important one is the Gaussian kernel

\[\begin{equation*} \mathrm{e}^{-\gamma\,|x-\tilde{x}|^2}\qquad\text{with some $\gamma>0$}, \end{equation*}\]

also known as the radial basis function (RBF) kernel. Deriving the corresponding feature transform requires some advanced math, because the transform maps inputs into an infinite-dimensional space. The Gaussian kernel can be interpreted as an inhomogeneous polynomial kernel of infinite degree.
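In software these kernels are typically selected via a parameter rather than implemented by hand. A minimal sketch with scikit-learn's SVC (made-up data and parameter values; note that scikit-learn's polynomial kernel is \((\gamma\,x^\mathrm{T}\tilde{x}+c_0)^p\), so coef0 switches between the inhomogeneous and the homogeneous variant):

```python
import numpy as np
from sklearn.svm import SVC

# made-up two-class data set
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5, 1, -1)

# inhomogeneous polynomial kernel (x^T x~ + 1)^3
poly_svm = SVC(kernel="poly", degree=3, gamma=1.0, coef0=1.0, C=1.0).fit(X, y)

# homogeneous polynomial kernel (x^T x~)^3
hom_svm = SVC(kernel="poly", degree=3, gamma=1.0, coef0=0.0, C=1.0).fit(X, y)

# Gaussian (RBF) kernel exp(-gamma |x - x~|^2)
rbf_svm = SVC(kernel="rbf", gamma=2.0, C=1.0).fit(X, y)

for clf in (poly_svm, hom_svm, rbf_svm):
    print(clf.kernel, clf.score(X, y))
```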

For fixed \(\tilde{x}\) and \(m=2\) the RBF kernel has a bell-shaped graph with the bell centered at \(\tilde{x}\). Predictions of an SVM for inputs \(x\) are the signs of

\[\begin{equation*} \sum_{l=1}^n c_l\,y_l\,K(x_l,x)+\text{constant}. \end{equation*}\]

This function is a weighted sum of bells centered at the training inputs \(x_1,\ldots,x_n\), where the weights are zero for non-support vectors \(x_l\).
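This structure can be verified directly on a trained model. The sketch below (assuming scikit-learn; data and \(\gamma\) are made up) rebuilds the decision function from the support vectors, the dual coefficients \(c_l\,y_l\) and the constant, and compares it with the library's own decision_function:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

# made-up two-class data set
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5, 1, -1)

gamma = 2.0
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

# dual_coef_ holds c_l * y_l for the support vectors, intercept_ the constant
X_test = rng.uniform(-1, 1, size=(5, 2))
K = rbf_kernel(X_test, clf.support_vectors_, gamma=gamma)   # bells centered at the support vectors
decision = K @ clf.dual_coef_.ravel() + clf.intercept_

print(np.allclose(decision, clf.decision_function(X_test)))  # True
print(np.sign(decision))                                      # predicted labels
```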