Soft Margin SVMs#

Hard margin SVMs only work if the two classes are linearly separable. This is rarely the case in practice, because some samples might be mislabeled or we do not have enough information to obtain clearly separated classes. In such cases the quadratic optimization problem has no feasible point and, thus, no solution.

From Constraints to Loss Functions#

To overcome the non-existence of solutions we may relax the constraints. Instead of requiring all constraints to be fully satisfied, we measure the violation of the constraints. Then we have two objectives: minimize constraint violation and maximize the margin. Remember that maximizing the margin is equivalent to minimizing the length of the separating hyperplane’s normal vector. So we have to solve two minimization problems at once. The standard approach is to minimize a weighted sum of both objective functions:

\[\begin{equation*} \text{measure for constraint violation}+\alpha\,|a|^2\to\min_{a,b}. \end{equation*}\]

The parameter \(\alpha\) controls the trade-off between satisfaction of the constraints and the size of the margin. Small \(\alpha\) yields well-satisfied constraints (almost all samples on the correct side of the margin) but a small margin (large \(|a|\)). Large \(\alpha\) leads to a wide margin but violated constraints (incorrect predictions on the training set).

Fig. 53 Margin width and separation quality depend on the parameter \(\alpha\).#
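To get a feeling for this trade-off, the following minimal sketch fits scikit-learn's SVC with a linear kernel on hypothetical, slightly overlapping point clouds. Its parameter C plays the role of \(\frac{1}{2\,n\,\alpha}\) for the hinge-loss objective derived below (a rescaling of the weighted sum, so this correspondence is an assumption about bookkeeping, not part of the derivation here):

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical data: two slightly overlapping point clouds in the plane.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)
n = len(y)

for alpha in (1e-3, 1.0):
    # SVC minimizes (1/2)|a|^2 + C * (sum of slacks); with C = 1/(2*n*alpha)
    # this agrees with the soft margin objective below up to a constant factor.
    clf = SVC(kernel="linear", C=1.0 / (2 * n * alpha)).fit(X, y)
    a = clf.coef_.ravel()
    print(f"alpha={alpha:g}:  |a| = {np.linalg.norm(a):.2f},  "
          f"margin width = {2 / np.linalg.norm(a):.2f},  "
          f"training accuracy = {clf.score(X, y):.2f}")
```

Small \(\alpha\) should report a large \(|a|\) (narrow margin) and high training accuracy; large \(\alpha\) the opposite.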

The measure for constraint violation is a typical loss function, because for SVMs constraint violation is equivalent to misclassification. A loss function measures the distance between a model’s predictions and true labels. An SVM model’s prediction for input \(x\) is the sign of \(a^{\mathrm{T}}\,x+b\), but the value \(a^{\mathrm{T}}\,x+b\) carries more information than just the sign (that is, the predicted class): if \(|a^{\mathrm{T}}\,x+b|\) is large, the prediction is very reliable; if it is small, the sample is very close to the decision boundary. We may interpret \(a^{\mathrm{T}}\,x+b\) as a score and, thus, as the model’s prediction.

For a given loss function \(L:\mathbb{R}\times\{-1,1\}\rightarrow\mathbb{R}\) we want to solve

\[\begin{equation*} \frac{1}{n}\,\sum_{l=1}^n L\bigl(a^{\mathrm{T}}\,x_l+b,y_l\bigr)+\alpha\,|a|^2\to\min_{a,b}. \end{equation*}\]

The loss function \(L\) should be zero if and only if \(x_l\) is on the correct side of the margin, that is, if and only if \(y_l\,(a^{\mathrm{T}}\,x_l+b)\geq 1\). If \(x_l\) is not on the correct side (on the wrong side or inside the margin), then \(L\) should grow with the distance of \(x_l\) from the correct side. In this case \(1-y_l\,(a^{\mathrm{T}}\,x_l+b)\) is a reasonable choice. Both cases can be expressed in one formula:

\[\begin{equation*} L(z,y):=\max\{0,1-y\,z\}. \end{equation*}\]

This loss function is known as hinge loss.
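As a quick numerical illustration (the function name and the sample scores are made up), the hinge loss in NumPy:

```python
import numpy as np

def hinge_loss(z, y):
    """Hinge loss L(z, y) = max(0, 1 - y*z) for scores z and labels y in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * z)

# Zero on the correct side of the margin, growing linearly beyond it.
print(hinge_loss(np.array([2.0, 1.0, 0.3, -0.5]), np.array([1, 1, 1, 1])))
# -> [0.  0.  0.7 1.5]
```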

The minimization problem of soft margin SVMs now reads

\[\begin{equation*} \boxed{\frac{1}{n}\,\sum_{l=1}^n\max\bigl\{0,1-y_l\,(a^{\mathrm{T}}\,x_l+b)\bigr\}+\alpha\,|a|^2\to\min_{a,b}.} \end{equation*}\]

Soft margin SVMs still try to maximize the margin between both classes, but some samples are allowed to lie inside the margin or even on the wrong side. So the margin is not a hard one, but in some sense soft.
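The boxed objective is convex and can be minimized directly, for instance by subgradient descent. The following sketch (assuming NumPy; the function names, step size, and iteration count are hypothetical choices, not part of the text) implements plain subgradient descent on this objective:

```python
import numpy as np

def soft_margin_objective(a, b, X, y, alpha):
    """(1/n) * sum of hinge losses + alpha * |a|^2."""
    margins = y * (X @ a + b)
    return np.mean(np.maximum(0.0, 1.0 - margins)) + alpha * (a @ a)

def fit_subgradient(X, y, alpha, steps=5000, lr=0.01):
    """Minimize the soft margin objective by plain subgradient descent."""
    n, m = X.shape
    a, b = np.zeros(m), 0.0
    for _ in range(steps):
        margins = y * (X @ a + b)
        active = margins < 1.0                      # samples violating the margin
        grad_a = -(y[active] @ X[active]) / n + 2.0 * alpha * a
        grad_b = -np.sum(y[active]) / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b
```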

Quadratic Optimization#

The minimization problem above is not differentiable, but it is convex. There are several efficient algorithms for approximating the minimizer, for instance subgradient descent as sketched above. Alternatively, we may rewrite the problem as a quadratic optimization problem. For this purpose we start with the quadratic hard margin problem

\[\begin{equation*} |a|^2\to\min_{a\in\mathbb{R}^m}\qquad\text{with constraints}\quad y_l\,(a^{\mathrm{T}}\,x_l+b)\geq 1 \end{equation*}\]

and introduce \(n\) additional variables \(s_1,\ldots,s_n\) (sometimes called slack variables) expressing the violation of the hard margin constraints. Instead of the hard margin constraints we require

\[\begin{equation*} y_l\,(a^{\mathrm{T}}\,x_l+b)\geq 1-s_l\qquad\text{and}\qquad s_l\geq 0. \end{equation*}\]

The additional nonnegativity constraints ensure that satisfied hard margin constraints always yield a violation of zero (instead of a negative violation). Minimal constraint violation can be reached by minimizing the sum of all \(s_l\). Because we also want to minimize \(|a|\), we minimize a weighted sum

\[\begin{equation*} \boxed{\frac{1}{n}\,\sum_{l=1}^n s_l+\alpha\,|a|^2\to\min_{s,a,b}\qquad\text{with constraints}\quad y_l\,(a^{\mathrm{T}}\,x_l+b)\geq 1-s_l,\quad s_l\geq 0.} \end{equation*}\]
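This boxed quadratic program (quadratic objective, linear constraints) can be handed to any QP solver. A minimal sketch, assuming the cvxpy package is available and with a hypothetical function name:

```python
import cvxpy as cp

def fit_soft_margin_qp(X, y, alpha):
    """Solve the slack-variable formulation as a quadratic program."""
    n, m = X.shape
    a, b, s = cp.Variable(m), cp.Variable(), cp.Variable(n)
    objective = cp.Minimize(cp.sum(s) / n + alpha * cp.sum_squares(a))
    constraints = [cp.multiply(y, X @ a + b) >= 1 - s, s >= 0]
    cp.Problem(objective, constraints).solve()
    return a.value, b.value, s.value
```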

For fixed \(a\) and \(b\) the constraints can always be satisfied by choosing \(s_1,\ldots,s_n\) large enough. The smallest feasible \(s_l\) is

\[\begin{equation*} \max\bigl\{0,1-y_l\,(a^{\mathrm{T}}\,x_l+b)\bigr\}. \end{equation*}\]

Thus, first solving the minimization problem with respect to \(s_1,\ldots,s_n\) (with \(a\) and \(b\) fixed) and then minimizing over \(a\) and \(b\) yields

\[\begin{equation*} \min_{a,b}\frac{1}{n}\,\sum_{l=1}^n\max\bigl\{0,1-y_l\,(a^{\mathrm{T}}\,x_l+b)\bigr\}+\alpha\,|a|^2. \end{equation*}\]

This shows that the quadratic problem with slack variables is equivalent to the original non-differentiable problem.
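A tiny numerical check of this equivalence (with made-up data and an arbitrary choice of \(a\), \(b\), \(\alpha\)): the smallest feasible slacks are exactly the hinge losses, and plugging them in recovers the hinge-loss objective.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))
y = np.array([1, -1, 1, -1, 1, -1])
a, b, alpha = np.array([0.5, -0.3]), 0.1, 0.01

# smallest feasible slacks for the fixed a and b
s_opt = np.maximum(0.0, 1.0 - y * (X @ a + b))
# they satisfy both constraint families ...
assert np.all(y * (X @ a + b) >= 1.0 - s_opt - 1e-12) and np.all(s_opt >= 0.0)
# ... and plugging them in gives the hinge-loss objective evaluated at (a, b)
print(np.mean(s_opt) + alpha * (a @ a))
```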

Another Reformulation#

Applying standard techniques for transforming optimization problems (Lagrange duality), we may derive another reformulation of the soft margin SVM minimization problem:

\[\begin{equation*} \sum_{l=1}^n c_l-\frac{1}{2}\,\sum_{l=1}^n\sum_{\lambda=1}^n y_l\,y_\lambda\,\big(x_l^\mathrm{T}\,x_\lambda\big)\,c_l\,c_\lambda\to\max_{c_1,\ldots,c_n} \end{equation*}\]
\[\begin{equation*} \text{with constraints}\quad \sum_{l=1}^n c_l\,y_l=0\quad\text{and}\quad 0\leq c_l\leq\frac{1}{2\,n\,\alpha},\;l=1,\ldots,n. \end{equation*}\]

This again is a quadratic optimization problem with linear constraints. From a solution \(c_1,\ldots,c_n\) we obtain the centered separating hyperplane with maximum margin as follows (without proof):

\[\begin{equation*} a=\sum_{l=1}^nc_l\,y_l\,x_l,\qquad b=y_\lambda-\sum_{l=1}^n c_l\,y_l\,x_l^\mathrm{T}\,x_\lambda\quad\text{for some $\lambda$ with $0<c_\lambda<\frac{1}{2\,n\,\alpha}$.} \end{equation*}\]
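A sketch of solving this dual problem, again assuming cvxpy and a hypothetical function name. It uses the identity \(\sum_{l,\lambda}y_l\,y_\lambda\,(x_l^\mathrm{T}x_\lambda)\,c_l\,c_\lambda=\bigl|\sum_l c_l\,y_l\,x_l\bigr|^2\) to state the objective in solver-friendly form and assumes at least one \(c_\lambda\) lies strictly between the bounds:

```python
import cvxpy as cp
import numpy as np

def fit_soft_margin_dual(X, y, alpha, tol=1e-6):
    """Solve the dual problem and recover a and b from its solution."""
    n = X.shape[0]
    C = 1.0 / (2 * n * alpha)                 # upper bound on the c_l
    c = cp.Variable(n)
    # sum over l, lambda of y_l y_lambda (x_l^T x_lambda) c_l c_lambda = |sum_l c_l y_l x_l|^2
    objective = cp.Maximize(cp.sum(c) - 0.5 * cp.sum_squares(X.T @ cp.multiply(y, c)))
    constraints = [c >= 0, c <= C, y @ c == 0]
    cp.Problem(objective, constraints).solve()
    c_val = c.value
    a = X.T @ (c_val * y)
    # pick some sample with 0 < c_lambda < C (assumed to exist) to compute b
    lam = np.where((c_val > tol) & (c_val < C - tol))[0][0]
    b = y[lam] - X[lam] @ a
    return a, b, c_val
```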

From duality theory one obtains the following interpretation of the \(c_l\):

  • If \(c_l=0\), then \(x_l\) is on the correct side of the margin.

  • If \(c_l=\frac{1}{2\,n\,\alpha}\), then \(x_l\) is inside the margin or on the wrong side.

  • If \(0<c_l<\frac{1}{2\,n\,\alpha}\), then \(x_l\) is on the boundary between margin and correct side.

Support Vectors#

In the context of soft margin SVMs, support vectors are samples \(x_l\) which lie on the margin’s boundary, inside the margin, or on the wrong side of the decision boundary. With the above reformulation of the soft margin minimization problem, support vectors are characterized by \(c_l>0\).

Fig. 54 In contrast to hard margin SVMs support vectors may lie inside the margin.#

From the above reformulation we immediately see that the separating hyperplane (that is, \(a\) and \(b\)) can be calculated from the support vectors alone. Thus, all other training samples have no influence on the classifier.

Given some input \(x\), the prediction is the sign of

\[\begin{equation*} \sum_{l=1}^n c_l\,y_l\,x_l^\mathrm{T}\,x-\sum_{l=1}^n c_l\,y_l\,x_l^\mathrm{T}\,x_\lambda+y_\lambda\quad\text{for some $\lambda$ with $0<c_\lambda<\frac{1}{2\,n\,\alpha}$}. \end{equation*}\]

Typically most of the \(c_l\) are zero, so only the (few) support vectors are required for calculating predictions. Thus, predictions from SVMs are very fast.
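A corresponding prediction sketch (hypothetical function name, assuming \(c\) and \(b\) come from the dual solution above); it sums only over the support vectors:

```python
import numpy as np

def predict(x, X, y, c, b, tol=1e-6):
    """Class prediction for a new input x, using only the support vectors."""
    sv = c > tol                                  # support vectors: c_l > 0
    score = np.sum(c[sv] * y[sv] * (X[sv] @ x)) + b
    return np.sign(score)
```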

Another remarkable feature of the reformulation is that both the minimization problem and the corresponding predictions depend only on inner products of (training) inputs, not on the \(x_l\) themselves.
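For instance, the dual objective can be evaluated from the Gram matrix of pairwise inner products alone, without ever touching the inputs directly (made-up numbers, assuming NumPy):

```python
import numpy as np

def dual_objective_from_gram(G, y, c):
    """Dual objective, written purely in terms of the Gram matrix G[l, k] = x_l^T x_k."""
    v = c * y
    return np.sum(c) - 0.5 * v @ G @ v

# The optimization never sees more of the inputs than their pairwise inner products.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
G = X @ X.T
print(dual_objective_from_gram(G, np.array([1, -1, 1, -1, 1]), np.full(5, 0.1)))
```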