Scaling#
Scaling of numeric data may influence results obtained from supervised learning methods. Often this influence is not obvious. The method itself might be sensitive to scaling, but more often scaling issues arise from the underlying numerical algorithms (e.g., minimization procedures) used to implement a method.
We already met an example showing the importance of scaling in Introductory Example (k-NN). More will follow when we discuss further machine learning techniques. Here we have a look at two standard approaches to scaling: normalization and standardization.
Normalization#
One common method for scaling data is to choose an interval, often \([0,1]\), and to linearly transform values to fit this interval. If a feature's values lie in the interval \([a,b]\), then the transformation to \([0,1]\) is
\[\tilde{x} = \frac{x - a}{b - a}.\]
Care has to be taken if the data contains outliers: a single very large value would force all values in the usual range to be mapped very close to zero.
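The formula and the outlier effect can be illustrated with a small NumPy sketch (the data here is made up for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
a, b = x.min(), x.max()
x_scaled = (x - a) / (b - a)   # 0, 0.25, 0.5, 0.75, 1

# A single outlier compresses all ordinary values towards 0:
y = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])
y_scaled = (y - y.min()) / (y.max() - y.min())
print(x_scaled)
print(y_scaled)   # first four values end up very close to 0
```

In the second array the values 1 to 4 are all mapped below 0.004, so their differences become almost invisible after scaling.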
Scikit-Learn offers normalization as the `MinMaxScaler` class in the `preprocessing` module. `MinMaxScaler` objects (like most of Scikit-Learn's objects) provide the three methods `fit`, `transform`, and `fit_transform`. The latter is simply a convenience method which calls `fit` and then `transform`. The `fit` method looks at the passed data and determines its range. The `transform` method applies the actual transformation. Thus, if multiple data sets shall be transformed, call `fit` only once and then apply `transform` to all data sets:
```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(X_train)                  # get range (no transform here)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
```
Alternatively:
```python
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```
Standardization#
More often than normalization, the following approach is used for scaling data: first subtract the mean, then divide by the standard deviation. The result is features whose values have mean 0 and standard deviation 1. That is, values are centered at 0 and their typical deviation from 0 is 1.
Given values \(x_1,\ldots,x_n\), the mean \(\mu\) is
\[\mu = \frac{1}{n}\sum_{i=1}^{n} x_i\]
and the standard deviation \(\sigma\) is
\[\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2}.\]
The corresponding transform reads
\[\tilde{x}_i = \frac{x_i - \mu}{\sigma}.\]
From the mathematical statistics point of view we are slightly imprecise here. Our \(\mu\) is not the mean of the data's underlying probability distribution, but an estimate for it, known as the empirical mean in statistics. The same holds for \(\sigma\). In addition, our estimate \(\sigma\) is in some sense worse than the usual empirical standard deviation in statistics, because it is not unbiased (see statistics lecture).
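The difference between the two estimates shows up in NumPy's `std`: by default it divides by \(n\) (as in our formula above); passing `ddof=1` divides by \(n-1\) instead, which makes the variance estimate unbiased. A small sketch with made-up numbers:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
mu = x.mean()                    # 5.0
sigma_biased = x.std()           # divides by n (ddof=0), as in the formula above
sigma_corrected = x.std(ddof=1)  # divides by n - 1 (Bessel's correction)
print(mu, sigma_biased, sigma_corrected)   # sigma_corrected is slightly larger
```

For large \(n\) the two values are nearly identical, so the distinction rarely matters in practice.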
Scikit-Learn offers the `StandardScaler` class in the `preprocessing` module for standardizing data. Usage is exactly the same as described above for normalization.
Scaling of Interdependent Features#
In many cases features may be scaled independently (age and kilometers driven for cars, for instance). But in other cases information isn't solely contained in isolated features; differences between features may carry information, too. The most important example here is images. If we have a set of images and scale each pixel/feature independently, we may destroy information contained in the images.
Thus, for images and similar data, we have to apply the same scaling to all pixels/features to keep information encoded as differences between features.
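The effect can be demonstrated with two tiny hypothetical "images" flattened to feature vectors. Per-feature min-max scaling gives each pixel column its own range and wipes out the intensity differences within each image, while one global transform preserves them:

```python
import numpy as np

# Two 2x2 images flattened to rows (illustrative data, not from a real data set).
X = np.array([[0.0, 50.0, 100.0, 150.0],
              [10.0, 60.0, 110.0, 160.0]])

# Per-feature scaling: every pixel column is mapped to [0, 1] separately.
per_feature = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Global scaling: one transform for all pixels preserves within-image contrast.
global_scaled = (X - X.min()) / (X.max() - X.min())
print(per_feature)    # each image collapses to a constant vector
print(global_scaled)  # intensity differences survive
```

After per-feature scaling both images become constant vectors (all zeros and all ones), so the contrast within each image is gone; the globally scaled version keeps it.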