Scaling#

Scaling of numeric data may influence the results obtained from supervised learning methods, and often this influence is not obvious. The method itself might be sensitive to scaling, but more often scaling issues arise from the underlying numerical algorithms (e.g., minimization procedures) used to implement a method.

We already met an example showing the importance of scaling in Introductory Example (k-NN), and more will follow when we discuss further machine learning techniques. Here we have a look at two standard approaches to scaling: normalization and standardization.

Normalization#

One common method for scaling data is to choose an interval, often \([0,1]\), and to linearly transform values to fit this interval. If a feature’s values lie in the interval \([a,b]\), then the transformation to \([0,1]\) is given by

\[\begin{equation*} x_{\mathrm{new}}=\frac{x_{\mathrm{old}}-a}{b-a}. \end{equation*}\]

Care has to be taken if the data contains outliers: a single very large value would force all values in the usual range to be mapped very close to zero.
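A minimal sketch of this effect (the array values are made up for illustration):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# ten values in a 'usual' range plus one large outlier (made-up data)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1000], dtype=float).reshape(-1, 1)

scaler = MinMaxScaler()
x_scaled = scaler.fit_transform(x)

print(x_scaled.ravel().round(3))
# all 'usual' values end up below 0.01, only the outlier reaches 1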

Scikit-Learn offers normalization as the MinMaxScaler class in the preprocessing module. MinMaxScaler objects (like most of Scikit-Learn’s objects) offer the three methods fit, transform, and fit_transform. The latter is simply a convenience method which calls fit and then transform. The fit method looks at the passed data and determines its range. The transform method applies the actual transformation. Thus, if multiple data sets are to be transformed, call fit only once and then apply transform to all data sets:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(X_train)    # get range (no transform here)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

Alternatively:

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
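Note that the scaler’s parameters come from the training data only. If the test data contains values outside the training range, the transformed test values fall outside \([0,1]\). A quick check with made-up arrays:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[0.0], [4.0]])          # outside the training range

scaler = MinMaxScaler()
scaler.fit(X_train)                        # learns minimum 1 and maximum 3

print(scaler.transform(X_train).ravel())   # [0.  0.5 1. ]
print(scaler.transform(X_test).ravel())    # [-0.5  1.5]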

Standardization#

More often than normalization the following approach is used for scaling data: first subtract the mean, then divide by the standard deviation. The result is a feature whose values have mean 0 and standard deviation 1. That is, the values are centered at 0 and their typical deviation from 0 is 1.

Given values \(x_1,\ldots,x_n\) the mean \(\mu\) is

\[\begin{equation*} \mu=\frac{1}{n}\,\sum_{l=1}^n x_l \end{equation*}\]

and standard deviation \(\sigma\) is

\[\begin{equation*} \sigma=\sqrt{\frac{1}{n}\,\sum_{l=1}^n(x_l-\mu)^2}. \end{equation*}\]

The corresponding transformation reads

\[\begin{equation*} x_{\mathrm{new}}=\frac{x_{\mathrm{old}}-\mu}{\sigma}. \end{equation*}\]

From the viewpoint of mathematical statistics we are slightly imprecise here. Our \(\mu\) is not the mean of the data’s underlying probability distribution, but an estimate for it, known as the empirical mean in statistics. The same holds for \(\sigma\). In addition, our estimate \(\sigma\) is in some sense worse than the usual empirical standard deviation, because it divides by \(n\) instead of \(n-1\), so the corresponding variance estimate is not unbiased (see statistics lecture).
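A small NumPy sketch of the transformation (with made-up values); note that np.std divides by \(n\) by default, which corresponds to the biased estimate used above, while ddof=1 gives the usual empirical standard deviation:

import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])   # made-up values

mu = x.mean()
sigma = x.std()          # divides by n (biased estimate, as above)
# x.std(ddof=1) would divide by n - 1 instead

x_new = (x - mu) / sigma

print(x_new.mean(), x_new.std())   # 0.0 and 1.0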

Scikit-Learn offers the StandardScaler class in the preprocessing module for standardizing data. Usage is exactly the same as described above for normalization.
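Analogously to the MinMaxScaler example above (with X_train and X_test as before):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)      # computes mean and standard deviation per feature

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)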

Scaling of Interdependent Features#

In many cases features may be scaled independently (age and kilometers driven for cars, for instance). But in other cases information isn’t solely contained in isolated features; differences between features may carry information, too. The most important example here is images. If we have a set of images and scale each pixel/feature independently, we may destroy information contained in the images.

[Figure: images before and after pixelwise scaling]

Fig. 31 Pixelwise normalization of images may destroy content. Pixels not covering the full color range get modified, while pixels with values spanning the full range remain untouched.#

Thus, for images and similar data, we have to apply the same scaling to all pixels/features to keep information encoded as differences between features.
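A minimal sketch of such a common scaling, assuming the images are stored as rows of a NumPy array with pixel values in \([0,255]\) (the array here is randomly generated, purely for illustration):

import numpy as np

# hypothetical image data: each row is one flattened grayscale image
rng = np.random.default_rng(0)
X_images = rng.integers(0, 256, size=(10, 28 * 28))

# one common minimum and maximum for all pixels (not per pixel/column)
x_min = X_images.min()
x_max = X_images.max()
X_scaled = (X_images - x_min) / (x_max - x_min)

# differences between pixels are preserved up to the common scaling factor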