Hard Margin SVMs#

Given training samples \((x_1,y_1),\ldots,(x_n,y_n)\) with inputs in \(\mathbb{R}^m\) and labels in \(\{-1,1\}\) (binary classification), we want to find a separating hyperplane with maximum distance to both classes. The distance between both classes with respect to a hyperplane is called the margin.

distance vs. margin

Fig. 48 Typically, the margin w.r.t. a hyperplane is smaller than the distance between two sets.#

Hyperplanes#

A hyperplane is the set of all \(x\) in \(\mathbb{R}^m\) satisfying

\[\begin{equation*} a^\mathrm{T}\,x+b=0. \end{equation*}\]

The vector \(a\in\mathbb{R}^m\setminus\{0\}\) controls the direction of the hyperplane (normal vector) and \(b\in\mathbb{R}\) controls the distance to the origin (the distance is \(\frac{|b|}{|a|}\)). If two normal vectors differ only in length, not in direction, then the corresponding hyperplanes are parallel. The hyperplane equation can be multiplied by any nonzero real number without affecting the hyperplane. Thus, many different pairs \((a,b)\) yield the same hyperplane.

A hyperplane is the zero level set of a function

\[\begin{equation*} h_{a,b}:\mathbb{R}^m\rightarrow\mathbb{R},\quad h_{a,b}(x):=a^\mathrm{T}\,x+b. \end{equation*}\]

On one side of the hyperplane we have \(h_{a,b}(x)>0\). On the other side we have \(h_{a,b}(x)<0\). The absolute value \(|h_{a,b}(x)|\) grows linearly with the distance of \(x\) to the hyperplane. All level sets of \(h_{a,b}\) are hyperplanes parallel to the hyperplane \(h_{a,b}(x)=0\).

hyperplane as level set of a linear function

Fig. 49 A hyperplane can be regarded as the zero level set of a linear function.#
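A minimal NumPy sketch (with arbitrarily chosen \(a\), \(b\), and a test point \(x\)) illustrating \(h_{a,b}\), the sign that indicates the side of the hyperplane, and the two distance formulas from above:

```python
import numpy as np

# arbitrarily chosen hyperplane parameters in R^2 and a test point
a = np.array([3.0, 4.0])   # normal vector, a != 0
b = -5.0
x = np.array([2.0, 1.0])

def h(x, a, b):
    """Linear function h_{a,b}; its zero level set is the hyperplane."""
    return a @ x + b

print(h(x, a, b))                             # positive: x lies on the positive side
print(abs(h(x, a, b)) / np.linalg.norm(a))    # distance of x to the hyperplane
print(abs(b) / np.linalg.norm(a))             # distance of the hyperplane to the origin
```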

Hint

For brevity we’ll write ‘the hyperplane \(h_{a,b}(x)=0\)’ instead of the more correct ‘the hyperplane \(\{x\in\mathbb{R}^m:\,h_{a,b}(x)=0\}\)’.

Separating Hyperplanes#

Let \(L^+\) and \(L^-\) be the index sets of positive and negative samples, respectively. That is,

\[\begin{equation*} L^+:=\bigl\{l\in\{1,\ldots,n\}:\,y_l=1\bigr\}\quad\text{and}\quad L^-:=\bigl\{l\in\{1,\ldots,n\}:\,y_l=-1\bigr\}. \end{equation*}\]

A hyperplane \(h_{a,b}(x)=0\) is called separating (with respect to the given data set) if

\[\begin{equation*} h_{a,b}(x_l)>0\quad\text{for}\quad l\in L^+\qquad\text{and}\qquad h_{a,b}(x_l)<0\quad\text{for}\quad l\in L^-. \end{equation*}\]

We may rewrite this condition as

\[\begin{equation*} y_l\,h_{a,b}(x_l)>0\quad\text{for all $l$}. \end{equation*}\]

Hint

In our definition of ‘separating’ we not only require separation of both classes but also that the negative class is on the negative side of the hyperplane and the positive class is on the positive side. If classes are on the wrong side, that is, if \(y_l\,h_{a,b}(x_l)<0\) for all \(l\), then \(h_{-a,-b}(x)=0\) is a separating hyperplane in terms of our definition. But note that \(h_{a,b}(x)=0\) and \(h_{-a,-b}(x)=0\) in fact are two descriptions of one and the same hyperplane.
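A minimal NumPy sketch of this separation criterion, using a small made-up data set; the second call also illustrates the remark about \(h_{-a,-b}\):

```python
import numpy as np

# small made-up data set: rows of X are samples, y holds labels +-1
X = np.array([[2.0, 2.0], [3.0, 1.5], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])

def is_separating(a, b, X, y):
    """True if y_l * (a^T x_l + b) > 0 for all samples."""
    return bool(np.all(y * (X @ a + b) > 0))

a, b = np.array([1.0, 1.0]), -1.0
print(is_separating(a, b, X, y))     # True: separating in the sense defined above
print(is_separating(-a, -b, X, y))   # False: same hyperplane, but classes on the wrong sides
```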

Given a separating hyperplane \(h_{a,b}(x)=0\) the margin is

\[\begin{align*} \text{margin}&=\min_{l\in L^+}\frac{|a^\mathrm{T}\,x_l+b|}{|a|}+\min_{l\in L^-}\frac{|a^\mathrm{T}\,x_l+b|}{|a|}\\ &=\min_{l\in L^+}\frac{a^\mathrm{T}\,x_l+b}{|a|}+\min_{l\in L^-}\frac{-(a^\mathrm{T}\,x_l+b)}{|a|}\\ &=\min_{l\in L^+}\frac{a^\mathrm{T}\,x_l+b}{|a|}-\max_{l\in L^-}\frac{a^\mathrm{T}\,x_l+b}{|a|}\\ &=\min_{l\in L^+}\frac{a^\mathrm{T}\,x_l}{|a|}-\max_{l\in L^-}\frac{a^\mathrm{T}\,x_l}{|a|}\\ &=\min_{l\in L^+}\Bigl(\frac{a}{|a|}\Bigr)^\mathrm{T}\,x_l-\max_{l\in L^-}\Bigl(\frac{a}{|a|}\Bigr)^\mathrm{T}\,x_l. \end{align*}\]

Obviously, the margin does not depend on \(b\) and it does not depend on the length of \(a\) (because \(a\) gets normalized in the formula above). Solely the direction of \(a\) matters. Note that \(\bigl(\frac{a}{|a|}\bigr)^\mathrm{T}\,x_l\) is the (signed) distance between the origin and the projection of \(x_l\) onto the subspace spanned by \(a\). Set

\[\begin{equation*} d_a^+:=\min_{l\in L^+}\Bigl(\frac{a}{|a|}\Bigr)^\mathrm{T}\,x_l\qquad\text{and}\qquad d_a^-:=\max_{l\in L^-}\Bigl(\frac{a}{|a|}\Bigr)^\mathrm{T}\,x_l. \end{equation*}\]

Then

\[\begin{equation*} \text{margin}=d_a^+-d_a^-. \end{equation*}\]
margin with respect to a separating hyperplane

Fig. 50 Margin with respect to a separating hyperplane with corresponding notation.#
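A minimal NumPy sketch of the margin computation for a given direction \(a\), again with a small made-up data set; it also illustrates that only the direction of \(a\) matters:

```python
import numpy as np

# small made-up data set (same as before)
X = np.array([[2.0, 2.0], [3.0, 1.5], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])

def margin(a, X, y):
    """Margin d_a^+ - d_a^- w.r.t. the direction a (independent of b and of |a|)."""
    proj = X @ (a / np.linalg.norm(a))   # signed distances of projections onto span{a}
    d_plus = proj[y == 1].min()
    d_minus = proj[y == -1].max()
    return d_plus - d_minus

print(margin(np.array([1.0, 1.0]), X, y))   # margin for direction (1, 1)
print(margin(np.array([2.0, 2.0]), X, y))   # same value: only the direction matters
```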

A Centered Separating Hyperplane#

Given a vector \(a\) with \(d_a^+>d_a^-\), each \(b\) with

\[\begin{equation*} -|a|\,d_a^+<b<-|a|\,d_a^- \end{equation*}\]

yields a separating hyperplane \(h_{a,b}(x)=0\) (prove this!).

For fixed \(a\) there is only one separating hyperplane with equal distances to both classes. The corresponding \(b\) is

\[\begin{equation*} b=-|a|\,\frac{d_a^++d_a^-}{2}. \end{equation*}\]

To see this simply calculate the distances:

\[\begin{align*} \text{distance to positive class}&=\min_{l\in L^+}\frac{|a^\mathrm{T}\,x_l+b|}{|a|} =\min_{l\in L^+}\frac{a^\mathrm{T}\,x_l-|a|\,\frac{d_a^++d_a^-}{2}}{|a|} =d_a^+-\frac{d_a^++d_a^-}{2} =\frac{d_a^+-d_a^-}{2}\\ \text{distance to negative class}&=\min_{l\in L^-}\frac{|a^\mathrm{T}\,x_l+b|}{|a|} =\min_{l\in L^-}\frac{-\left(a^\mathrm{T}\,x_l-|a|\,\frac{d_a^++d_a^-}{2}\right)}{|a|} =-\max_{l\in L^-}\frac{a^\mathrm{T}\,x_l-|a|\,\frac{d_a^++d_a^-}{2}}{|a|}\\ &=-\left(d_a^--\frac{d_a^++d_a^-}{2}\right) =-\frac{d_a^--d_a^+}{2} =\frac{d_a^+-d_a^-}{2} \end{align*}\]
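A short NumPy sketch of the centered choice of \(b\) for a fixed direction \(a\) (same made-up data as above); it confirms numerically that both classes then have equal distance to the hyperplane:

```python
import numpy as np

# small made-up data set and a fixed direction a
X = np.array([[2.0, 2.0], [3.0, 1.5], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])
a = np.array([1.0, 1.0])

proj = X @ (a / np.linalg.norm(a))
d_plus, d_minus = proj[y == 1].min(), proj[y == -1].max()

# centered choice of b for the fixed direction a
b = -np.linalg.norm(a) * (d_plus + d_minus) / 2

dist = np.abs(X @ a + b) / np.linalg.norm(a)
print(dist[y == 1].min(), dist[y == -1].min())   # equal distances to both classes
```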

Maximum Margin#

So far we know how to find a centered separating hyperplane given a fixed direction \(a\) and we also know how to calculate the margin. To find a (centered) separating hyperplane with maximum margin we have to solve

\[\begin{equation*} \min_{l\in L^+}\Bigl(\frac{a}{|a|}\Bigr)^\mathrm{T}\,x_l-\max_{l\in L^-}\Bigl(\frac{a}{|a|}\Bigr)^\mathrm{T}\,x_l\to\max_{a\in\mathbb{R}^m\setminus\{0\}} \end{equation*}\]

for \(a\) and then calculate \(b\). This maximization problem is non-differentiable and lacks any other useful structure for analytical or numerical optimization.

Although the idea of a separating hyperplane with maximum margin is simple and straightforward, the major contribution of the inventors of SVMs (Vapnik and Chervonenkis) is a reformulation of the margin maximization problem as a quadratic minimization problem. A minimization problem is called quadratic if the objective function is quadratic and all constraints are linear (the set of feasible points is an intersection of half-spaces). There exist several very efficient algorithms for solving quadratic minimization problems, making margin maximization a computationally tractable task.

Parameters \(a\) and \(b\) for the centered separating hyperplane with maximum margin are the solution to

\[\begin{equation*} \boxed{|a|^2\to\min_{a\in\mathbb{R}^m,\,b\in\mathbb{R}}\qquad\text{with constraints}\quad\begin{cases}a^{\mathrm{T}}\,x_l+b\geq 1,&\text{for }l\in L^+,\\a^{\mathrm{T}}\,x_l+b\leq -1,&\text{for }l\in L^-.\end{cases}} \end{equation*}\]
feasibility and optimality of different hyperplanes

Fig. 51 A hyperplane has to be feasible and has to have minimum slope to solve the optimization problem.#
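A minimal sketch of the boxed quadratic minimization problem, here solved with SciPy's general-purpose SLSQP solver on a small made-up data set (a dedicated QP solver would be the more typical choice):

```python
import numpy as np
from scipy.optimize import minimize

# small made-up data set: rows of X are samples, y holds labels +-1
X = np.array([[2.0, 2.0], [3.0, 1.5], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])
n, m = X.shape

def objective(z):
    """Objective |a|^2 for the variable vector z = (a_1, ..., a_m, b)."""
    return z[:m] @ z[:m]

# one linear constraint y_l (a^T x_l + b) - 1 >= 0 per sample
constraints = [{'type': 'ineq',
                'fun': lambda z, l=l: y[l] * (X[l] @ z[:m] + z[m]) - 1}
               for l in range(n)]

result = minimize(objective, np.ones(m + 1), method='SLSQP', constraints=constraints)
a, b = result.x[:m], result.x[m]
print(a, b)                      # parameters of the maximum margin hyperplane
print(2 / np.linalg.norm(a))     # the maximum margin equals 2 / |a|
```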

We now derive the quadratic minimization problem from our considerations above.

We start with fixed \(a\) and consider the corresponding centered separating hyperplane (assuming there is a separating hyperplane with normal vector \(a\)):

\[\begin{equation*} a^{\mathrm{T}}\,x-|a|\,\frac{d_a^++d_a^-}{2}=0. \end{equation*}\]

Dividing the equation by \(|a|\,\frac{d_a^+-d_a^-}{2}\) (the second factor is half the margin) does not change the hyperplane. The resulting parameters

\[\begin{equation*} a^\ast:=\frac{2}{d_a^+-d_a^-}\,\frac{a}{|a|}\qquad\text{and}\qquad b^\ast:=-\frac{d_a^++d_a^-}{d_a^+-d_a^-} \end{equation*}\]

do not depend on the length of \(a\) but solely on its direction. We now have

\[\begin{equation*} h_{a^\ast,b^\ast}(x_l)\geq\frac{2}{d_a^+-d_a^-}\,d_a^+-\frac{d_a^++d_a^-}{d_a^+-d_a^-}=\frac{d_a^+-d_a^-}{d_a^+-d_a^-}=1\qquad\text{for}\quad l\in L^+ \end{equation*}\]

and

\[\begin{equation*} h_{a^\ast,b^\ast}(x_l)\leq\frac{2}{d_a^+-d_a^-}\,d_a^--\frac{d_a^++d_a^-}{d_a^+-d_a^-}=\frac{d_a^--d_a^+}{d_a^+-d_a^-}=-1\qquad\text{for}\quad l\in L^-, \end{equation*}\]

that is, \(a^\ast\) and \(b^\ast\) satisfy the constraints of the quadratic minimization problem.
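A small NumPy sketch of this rescaling step: starting from an arbitrary direction \(a\) (made-up data as before), it computes \(a^\ast\) and \(b^\ast\) and checks the constraints numerically:

```python
import numpy as np

# small made-up data set and an arbitrary direction a
X = np.array([[2.0, 2.0], [3.0, 1.5], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])
a = np.array([1.0, 1.0])

proj = X @ (a / np.linalg.norm(a))
d_plus, d_minus = proj[y == 1].min(), proj[y == -1].max()

# rescaled parameters a* and b* from the derivation above
a_star = 2 / (d_plus - d_minus) * a / np.linalg.norm(a)
b_star = -(d_plus + d_minus) / (d_plus - d_minus)

print(y * (X @ a_star + b_star))   # all values are at least 1: the constraints hold
print(np.linalg.norm(a_star))      # equals 2 / margin
```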

Next we show that whenever a hyperplane \(h_{a,b}(x)=0\) satisfies the constraints and \(a\) has the same fixed direction as \(a^\ast\), then \(|a|\geq|a^\ast|\). That is, \(a^\ast\) and \(b^\ast\) solve the quadratic minimization problem if we restrict attention to one direction. Remember that as long as all considered normal vectors \(a\) have the same direction (but different lengths), they all yield the same \(a^\ast\). With

\[\begin{equation*} |a|\,d_a^+=\min_{l\in L^+}a^{\mathrm{T}}\,x_l\geq 1-b \end{equation*}\]

and

\[\begin{equation*} |a|\,d_a^-=\max_{l\in L^-}a^{\mathrm{T}}\,x_l\leq -1-b \end{equation*}\]

we see

\[\begin{align*} |a^\ast|=\frac{2}{d_a^+-d_a^-}\leq\frac{2}{\frac{1-b}{|a|}-\frac{-1-b}{|a|}}=|a|. \end{align*}\]

The final step is to show that the larger the margin is, the smaller \(|a^\ast|\) becomes. This follows immediately from

\[\begin{align*} |a^\ast|=\frac{2}{d_a^+-d_a^-} \end{align*}\]

because \(d_a^+-d_a^-\) is the margin.

Support Vectors#

Given the centered separating hyperplane \(h_{a,b}(x)=0\) with maximum margin, each sample \(x_l\) satisfying \(h_{a,b}(x_l)=\pm 1\) is called a support vector. If we remove all samples from the data set except the support vectors, the SVM solution does not change. The solution is supported by the support vectors.

support vectors

Fig. 52 Support vectors are data points on the margin’s boundary.#

Additional training samples alter the trained model only if they lie inside the margin! Thus, SVM classifiers are very robust to changes in the training set.
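A minimal sketch for identifying the support vectors numerically: solve the quadratic problem as above (made-up data, SciPy's SLSQP as a stand-in for a dedicated QP solver) and select the samples with \(y_l\,h_{a,b}(x_l)=1\) up to numerical tolerance:

```python
import numpy as np
from scipy.optimize import minimize

# small made-up data set; solve the hard margin problem as sketched above
X = np.array([[2.0, 2.0], [3.0, 1.5], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])
n, m = X.shape
result = minimize(lambda z: z[:m] @ z[:m], np.ones(m + 1), method='SLSQP',
                  constraints=[{'type': 'ineq',
                                'fun': lambda z, l=l: y[l] * (X[l] @ z[:m] + z[m]) - 1}
                               for l in range(n)])
a, b = result.x[:m], result.x[m]

# support vectors are the samples with y_l * h(x_l) = 1 (up to numerical tolerance)
support = np.isclose(y * (X @ a + b), 1.0, atol=1e-4)
print(X[support])
```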