If you are viewing this file in preview mode, some links won't work. Find the fully featured Jupyter Notebook file on the website of Prof. Jens Flemming at Zwickau University of Applied Sciences. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Overfitting and regularization

Whenever we try to fit a model to a finite data set we have to find a compromise between two competing aims: on the one hand, the hypothesis should fit the given data as well as possible; on the other hand, it should capture the essential properties of the underlying truth, that is, it should generalize beyond the data at hand.

One problem is that data usually contains noise and thus does not provide arbitrarily precise information about the underlying truth. On the other hand, in most applications there is no single underlying truth. If some relevant features are not contained in the data set, then even a complete data set does not allow us to recover the underlying truth.

An example of the second issue is the prediction of prices, say house prices. The price depends on many features, which cannot be recorded completely. Thus, the data set might contain the same feature vector twice, but with different target values (prices). Which of the two is the better one, that is, which one contains more truth?

Fitting the data as well as possible is quite easy. The hard part is to avoid overfitting. By overfitting we mean neglecting the second aim: the hypothesis fits the data very well, but does not represent essential properties of the underlying truth.

Example

Let's have a look at an illustrative example. We consider data with only one feature, so we can plot everything and see the problem.

First some standard imports and initialization of the random number generator.
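The corresponding code cell is not reproduced here. A minimal sketch of such a setup, with assumed module choices and an arbitrarily fixed seed, could look like this:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

# fix the seed to make the simulated data reproducible (seed value is arbitrary)
rng = np.random.default_rng(0)
```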

To investigate overfitting we choose an underlying truth and simulate data based on this truth. This way we have access to the truth, which is unknown in practice, and can compare predictions to it.

To simulate data we generate uniformly distributed arguments, calculate the corresponding true function values, and add some noise. We model the noise as normally distributed, which is by far the most common assumption.
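A sketch of the simulation, assuming the example truth $x\sin x$ on $[0, 10]$, 30 samples, and noise with standard deviation 1 (the actual choices in the lecture may differ):

```python
# underlying truth (assumed example function)
def f_true(x):
    return x * np.sin(x)

n = 30                                   # number of samples (assumption)
x = rng.uniform(0, 10, n)                # uniformly distributed arguments
y = f_true(x) + rng.normal(0, 1, n)      # true values plus normally distributed noise

# plot data and truth
x_plot = np.linspace(0, 10, 200)
plt.plot(x, y, 'ob', label='data')
plt.plot(x_plot, f_true(x_plot), '-g', label='truth')
plt.legend()
plt.show()
```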

We use polynomial regression to obtain a model explaining our data. Different degrees of the polynomial will yield very different results (try 1, 2, 5, 10, 15, 20, 25).
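A sketch of the fitting and plotting step, implementing polynomial regression as linear regression on polynomial features (plotting details are assumptions):

```python
degree = 5    # try 1, 2, 5, 10, 15, 20, 25

# polynomial regression = linear regression on polynomial features
poly = PolynomialFeatures(degree)
X = poly.fit_transform(x.reshape(-1, 1))
model = LinearRegression().fit(X, y)

# plot data, truth, and hypothesis
X_plot = poly.transform(x_plot.reshape(-1, 1))
plt.plot(x, y, 'ob', label='data')
plt.plot(x_plot, f_true(x_plot), '-g', label='truth')
plt.plot(x_plot, model.predict(X_plot), '-r', label='hypothesis')
plt.ylim(-12, 12)
plt.legend()
plt.show()
```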

Obviously, there is an optimal degree, say 4 or 5 or 6. The degree in polynomial regression is a hyperparameter. Techniques for choosing optimal hyperparameters will be discussed in a subsequent lecture.

For lower degrees our model is not versatile enough to grasp the truth's structure. For higher degrees we observe overfitting: the model adapts very well to the data points, but tends to oscillate to reach as many data points as possible. These oscillations are an artifact and not a characteristic of the underlying truth.

Before we discuss how to avoid overfitting, we have to think about a different issue: how to detect overfitting? In our illustrative example we know the underlying truth and can compare the hypothesis to the truth. But in practice we do not know the truth!

Detecting overfitting

We split our data set into two subsets: one for fitting the model and one for detecting overfitting. If our model is close to the (unknown) truth, then the error on both subsets should be almost identical. In case of overfitting the error on the subset used for fitting the model will be much smaller than on the withheld subset.

Here, the error is the mean squared distance between predictions and target values, known as the mean squared error (MSE): \begin{equation*} \frac{1}{n}\,\sum_{k=1}^n\bigl(f_{\mathrm{approx}}(x_k)-y_k\bigr)^2, \end{equation*} where $(x_1,y_1),\ldots,(x_n,y_n)$ are the samples from the considered subset.

Let's test this with the above example. First we split the data set.
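A simple sketch of the split, assuming half of the samples for training and half for validation (Scikit-Learn's train_test_split from the model_selection module would do the same job):

```python
# the arguments were generated in random order, so a plain index split is fine
n_train = n // 2
x_train, y_train = x[:n_train], y[:n_train]
x_val, y_val = x[n_train:], y[n_train:]
```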

Now we fit models for different degrees and plot corresponding errors on the training set and on the validation set.
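A sketch of the corresponding loop; the range of degrees is an arbitrary choice:

```python
degrees = range(1, 21)
train_errors, val_errors = [], []

for degree in degrees:
    poly = PolynomialFeatures(degree)
    X_train = poly.fit_transform(x_train.reshape(-1, 1))
    X_val = poly.transform(x_val.reshape(-1, 1))
    model = LinearRegression().fit(X_train, y_train)
    train_errors.append(mean_squared_error(y_train, model.predict(X_train)))
    val_errors.append(mean_squared_error(y_val, model.predict(X_val)))

plt.semilogy(degrees, train_errors, '-ob', label='training error')
plt.semilogy(degrees, val_errors, '-or', label='validation error')
plt.xlabel('degree')
plt.ylabel('MSE')
plt.legend()
plt.show()
```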

The higher the degree, the smaller the error on the training set, whereas the error on the validation set eventually grows again. Starting at degree 9 the gap between both errors widens. This shows that the small error on the training set is not the result of a well approximated truth, but stems from overfitting.

Here we also see that the error on the validation set is slightly larger than on the training set, because the hypothesis has been fitted to the training data.

Avoiding overfitting

Overfitting almost always goes along with very large parameter values after fitting the model. Thus, penalizing large parameter values should be a good idea.

Let's have a look at the parameters in our illustrative example for both cases: a good fit and overfitting.
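A sketch of such an inspection, assuming degree 5 as the good fit and degree 20 as the overfitting case:

```python
for degree in (5, 20):
    poly = PolynomialFeatures(degree)
    X_train = poly.fit_transform(x_train.reshape(-1, 1))
    model = LinearRegression().fit(X_train, y_train)
    # overfitting models typically show coefficients that are orders of magnitude larger
    print(f'degree {degree}: max |coefficient| = {np.abs(model.coef_).max():.2e}')
```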

In linear regression and most other methods one minimizes a loss function expressing the distance between the hypothesis $f_{\mathrm{approx}}$ and the targets in the training data: \begin{equation*} \frac{1}{n}\,\sum_{l=1}^n\bigl(f_{\mathrm{approx}}(x_l)-y_l\bigr)^2\to\min_{a_1,\ldots,a_\mu}, \end{equation*} where $a_1,\ldots,a_\mu$ are the parameters of the model. If we add the squares of the parameters to this function, then we not only force the hypothesis to be close to the data, but also ensure that the parameters cannot become too large. As mentioned above, large parameters correlate with overfitting. Modifying a minimization problem in this way is known as regularization.

To control the trade-off between data fitting and regularization, we introduce a regularization parameter $\alpha\geq 0$: \begin{equation*} \frac{1}{n}\,\sum_{l=1}^n\bigl(f_{\mathrm{approx}}(x_l)-y_l\bigr)^2 +\alpha\,\frac{1}{\mu}\,\sum_{\kappa=1}^\mu a_\kappa^2\to\min_{a_1,\ldots,a_\mu}. \end{equation*} The regularization parameter $\alpha$ is an additional hyperparameter of the model. How to choose hyperparameters will be described in a subsequent lecture.

There are several other penalty terms, which will be discussed below. Adding squares of the model parameters is the simplest version from the point of view of computational efficiency. Linear regression regularized this way is also known as Ridge regression.

Scikit-Learn implements Ridge regression in the linear_model module: Ridge.
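A sketch of regularized fitting with Ridge; note that Scikit-Learn minimizes the sum of squared residuals plus $\alpha$ times the sum of squared coefficients (without the factors $\frac{1}{n}$ and $\frac{1}{\mu}$, and without penalizing the intercept), so concrete values of $\alpha$ are not directly comparable to the formula above. Degree and $\alpha$ below are assumptions:

```python
degree = 20     # deliberately too high, so that regularization is needed
alpha = 0.1     # regularization parameter (assumed value)

poly = PolynomialFeatures(degree)
X_train = poly.fit_transform(x_train.reshape(-1, 1))
model = Ridge(alpha=alpha).fit(X_train, y_train)

# plot training data, truth, and regularized hypothesis
X_plot = poly.transform(x_plot.reshape(-1, 1))
plt.plot(x_train, y_train, 'ob', label='training data')
plt.plot(x_plot, f_true(x_plot), '-g', label='truth')
plt.plot(x_plot, model.predict(X_plot), '-r', label='regularized hypothesis')
plt.ylim(-12, 12)
plt.legend()
plt.show()
```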

In our example we know the underlying truth. Thus, we may compare predictions from the regularized model to the truth for different regularization parameters.
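A sketch of this comparison over a logarithmic grid of regularization parameters (the grid and the degree 20 features from above are assumptions):

```python
alphas = np.logspace(-6, 2, 20)
truth_errors = []

for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    # error with respect to the known truth, evaluated on a fine grid
    truth_errors.append(mean_squared_error(f_true(x_plot), model.predict(X_plot)))

plt.loglog(alphas, truth_errors, '-ob')
plt.xlabel('alpha')
plt.ylabel('MSE w.r.t. truth')
plt.show()
```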

For $\alpha$ close to zero overfitting leads to large errors. For large $\alpha$ model parameters are close to zero, which leads to very bad data fitting and, thus, to large errors, too (overregularization). Between both ends there is a local minimum, yielding the optimal $\alpha$.

In practice we do not know the truth. But analogously to detecting overfitting we may find values of $\alpha$ for which overfitting vanishes. We simply start with a very small $\alpha$ leading to overfitting. Then we increase $\alpha$ until training and validation data yield similar mean squared errors with respect to the hypothesis.
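A sketch of this procedure, reusing the grid of $\alpha$ values and the degree 20 features from above:

```python
train_errors, val_errors = [], []
X_val = poly.transform(x_val.reshape(-1, 1))

for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    train_errors.append(mean_squared_error(y_train, model.predict(X_train)))
    val_errors.append(mean_squared_error(y_val, model.predict(X_val)))

plt.loglog(alphas, train_errors, '-ob', label='training error')
plt.loglog(alphas, val_errors, '-or', label='validation error')
plt.xlabel('alpha')
plt.ylabel('MSE')
plt.legend()
plt.show()
```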

Note that in this way only values of $\alpha$ leading to overfitting can be detected. Overregularization does not lead to large differences between the errors. Thus, overregularization is indistinguishable from a good fit if only errors on training and validation sets are compared.

Besides adding squares of the model parameters there are several other choices for the penalty. Here we only consider two of them: the LASSO penalty, which uses the sum of the absolute values of the parameters, and the Elastic Net penalty, which combines squared and absolute values.

Regularized linear regression with LASSO and Elastic Net penalties is available in Scikit-Learn's linear_model module: Lasso, ElasticNet.
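A sketch of using both estimators, with $\alpha$, l1_ratio, and max_iter chosen arbitrarily (the high-degree polynomial features from above may trigger convergence warnings, which is harmless for this illustration):

```python
# LASSO: penalizes the sum of absolute coefficient values
lasso = Lasso(alpha=0.1, max_iter=100000).fit(X_train, y_train)

# Elastic Net: mixes squared and absolute penalties, weighted by l1_ratio
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=100000).fit(X_train, y_train)

# compare both regularized hypotheses to data and truth
plt.plot(x_train, y_train, 'ob', label='training data')
plt.plot(x_plot, f_true(x_plot), '-g', label='truth')
plt.plot(x_plot, lasso.predict(X_plot), '-r', label='LASSO')
plt.plot(x_plot, enet.predict(X_plot), '-m', label='Elastic Net')
plt.ylim(-12, 12)
plt.legend()
plt.show()
```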