Problem and Workflow#

Here we introduce basic terminology used throughout the book and present the major problems that arise when implementing supervised learning methods.

Statement of the Problem#

Abstract Formulation#

Let \(f:X\rightarrow Y\) be a function between two sets \(X\) and \(Y\) and assume we only know \(f\) at finitely many elements of \(X\). The aim is to find an approximation \(f_{\mathrm{approx}}\) of \(f\), which can be evaluated at arbitrary elements of \(X\).

basic notation and setting for supervised learning

Fig. 27 Given a mapping \(f\) on finitely many points, supervised learning aims at finding a mapping \(f_{\mathrm{approx}}\) on the whole space.#

\(X\) is the set of inputs, \(Y\) is the set of outputs or labels. The finitely many elements \(x_1,\ldots,x_n\) in \(X\) at which \(f\) is known are called training inputs. Corresponding labels \(y_1:=f(x_1),\ldots,y_n:=f(x_n)\) are called training labels. The pairs \((x_1,y_1),\ldots,(x_n,y_n)\) are the training examples or training data. Ultimately, the function \(f_{\mathrm{approx}}\) will be represented by a Python program or a Python function: it takes an input and yields some output.
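To make the idea concrete, a trained hypothesis can be represented by an ordinary Python function mapping an input to a predicted label. The rule below is a made-up toy example, not a trained model:

```python
# A (hypothetical) hypothesis f_approx represented as a plain Python
# function: it takes an input and yields some output (a label).
def f_approx(x):
    # Toy rule: predict label 1 for non-negative inputs, else 0.
    return 1 if x >= 0 else 0

print(f_approx(2.5), f_approx(-1.0))  # 1 0
```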

Standard Situation#

We only consider \(X=\mathbb{R}^m\), that is, \(m\)-dimensional vectors as inputs. Here, \(m\) is a positive integer. The elements of \(X\) are called feature vectors. Each feature vector contains values for \(m\) features. In addition, we restrict our attention to labels in finite or infinite subsets \(Y\subseteq\mathbb{R}\).

If there are only finitely many labels, then the supervised learning problem of finding \(f_{\mathrm{approx}}\) is a classification problem. With infinitely many labels it’s a regression problem.

Example: Classification of Handwritten Digits#

Recognizing handwritten digits is a typical classification task. Given an image of a handwritten digit we have to decide which digit it contains. Assuming grayscale images, where each pixel's brightness is described by a real number, the set of inputs has dimension \(m=\text{width}\times\text{height}\). If the images are 28 by 28 pixels in size (as in the QMNIST data set), then each input has 784 features. The label set is \(Y=\{0,1,2,3,4,5,6,7,8,9\}\).

example of inputs and label for QMNIST image classification

Fig. 28 The sample \(x_1\) is a vector with 784 components, one component (or feature) per pixel.#
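The flattening of an image into a feature vector can be sketched with NumPy; the pixel values below are randomly generated placeholders, not actual QMNIST data:

```python
import numpy as np

# A made-up 28-by-28 grayscale image: one real number per pixel.
image = np.random.default_rng(0).random((28, 28))

# Flattening turns the image into a feature vector with
# 28 * 28 = 784 components, one component per pixel.
x1 = image.reshape(-1)
print(x1.shape)  # (784,)
```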

Example: Predicting House Prices#

Given several properties of a house we would like to predict the price we have to pay for buying that house. The properties we take into account yield the feature vectors. The output is the price, a real number. Thus, it’s a regression problem. Observing house prices on the market yields training examples, which then can be used to predict prices for houses not sold during our market observation.

Hypotheses, Parameters, Models#

An approximate function \(f_{\mathrm{approx}}\) is sometimes called a hypothesis in machine learning contexts. The set of all functions taken into consideration is the hypothesis space.

Usually one does not consider general functions. Instead, one restricts the search to some family of functions, e.g. linear functions, described by a number of parameters. If that family has \(p\) parameters, then we have a \(p\)-dimensional hypothesis space. Fitting the parameters of such parameter-dependent hypotheses to training examples is much easier than fitting a general hypothesis.

We have to distinguish between parameters of the hypothesis and parameters of the whole approach. The latter are called hyperparameters. The dimension of the hypothesis space, for instance, is a hyperparameter.

A parameter-dependent hypothesis is sometimes called a model. If a fixed set of parameters has been chosen, then it's a trained model.

Principal Steps in Supervised Learning#

The starting point for all supervised machine learning methods is a finite collection of correctly labeled inputs:

\[\begin{equation*} \{(x_1,y_1),(x_2,y_2),\ldots,(x_n,y_n)\}. \end{equation*}\]

Based on experience, technical limitations, and more involved considerations, which will be discussed later on, we have to choose a model we want to fit to the examples. From this point on the process of supervised learning almost always follows a fixed scheme:

  1. Split the data into training, validation and test sets.

  2. Train the model (fit the parameters to the training set).

  3. Optimize hyperparameters based on the validation set.

  4. Evaluate the model on the test set.

  5. Use the trained model.

Training-Validation-Test Split#

First, we split the set of examples into three disjoint sets: training set, validation set, test set. The training set is usually the largest subset and will be used for training the model. The validation set is used for optimizing hyperparameters, and the test set is used in the evaluation phase. Typical ratios are 60:20:20 or 40:30:30, but others might be appropriate depending on the complexity of the model and of the underlying true function \(f\). The number of hyperparameters should also be taken into account when splitting the set of examples. We will come back to this issue when dealing with concrete learning tasks.
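A 60:20:20 split can be obtained, for instance, with two calls to Scikit-Learn's `train_test_split` function; the data below is purely synthetic:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic example data: 100 feature vectors with 3 features each.
X = np.arange(300, dtype=float).reshape(100, 3)
y = np.arange(100, dtype=float)

# First keep 60 % for training, then halve the remaining 40 %
# into validation and test sets (20 % each).
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```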

Training#

By training we mean the determination of parameters in a parameterized hypothesis. This typically involves mathematical optimization methods. The result of training a model is a concrete hypothesis \(f_{\mathrm{approx}}\), which can be used to predict labels for unknown feature vectors.
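A minimal training sketch with a Scikit-Learn linear model; the training data is made up and follows \(y=2x+1\) exactly:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data following y = 2*x + 1 (one feature).
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([1.0, 3.0, 5.0, 7.0])

# Training fits the model's parameters (slope and intercept here)
# to the training examples.
model = LinearRegression()
model.fit(X_train, y_train)

# The trained model is our hypothesis f_approx: it predicts labels
# for feature vectors not seen during training.
prediction = model.predict(np.array([[4.0]]))
print(prediction)  # close to [9.]
```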

Hyperparameter Optimization#

The trained model might contain hyperparameters, the number of parameters, for instance. Choosing different hyperparameters could yield more accurate predictions. Thus, we should optimize prediction accuracy of the trained model with respect to the hyperparameters.

For this purpose we evaluate the trained model on the validation set and compare the predictions to the true labels. It's important to use examples different from the training examples, since on the training set the model always yields very good predictions (if our training was successful). With a separate validation set, results are much more realistic. There are many different error measures for expressing the difference between predictions and true labels on a validation set. We will come back to this topic later on.

By choosing different sets of hyperparameters and training and evaluating the corresponding models, we see which set of hyperparameters yields the best predictions.
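This search can be sketched as a loop over candidate hyperparameter values. Below, the number of neighbors \(k\) of a nearest-neighbors classifier serves as the hyperparameter; the data is synthetic:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data: points right of 0 get label 1, else 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] > 0).astype(int)
X_train, y_train = X[:150], y[:150]
X_val, y_val = X[150:], y[150:]

# The number of neighbors k is a hyperparameter: train one model per
# candidate value and keep the one with the best validation accuracy.
best_k, best_acc = None, -1.0
for k in [1, 3, 5, 7]:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    acc = model.score(X_val, y_val)   # accuracy on the validation set
    if acc > best_acc:
        best_k, best_acc = k, acc

print(best_k, best_acc)
```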

Evaluation#

After choosing optimal hyperparameters we have to test the final model on an independent data set: the test set. Now we see whether the model also yields good predictions on feature vectors which did not take part in the training and validation process.

Python Packages for Supervised Learning#

There exist many different Python packages implementing standard methods of supervised machine learning. In this book we restrict ourselves to a few packages only; the focus will be on understanding methods and algorithms. Packages we use:

  • Scikit-Learn is a very well structured and easy to use package. It provides insight into all algorithms and their inner workings. Thus, it's the best choice for learning and understanding new methods. Over the years, Scikit-Learn has become more and more efficient; for instance, parallel computing is supported.

  • Tensorflow is a library for exploiting the computational power of GPUs (graphics processing units). Typically, it's used for fast matrix computations in neural network training.

  • Keras provides a user friendly interface to Tensorflow (and other backends). It’s the standard tool for deep learning with neural networks.

Other tools not covered here:

  • PyTorch is an alternative to Keras, which provides a more object-oriented programming interface.

Non-Numeric Data#

All machine learning methods expect purely numeric data (because, ultimately, computers only process numbers). Thus, we have to convert everything to numbers.

We already discussed one-hot encoding for categorical data in Pandas, see Categorical Data. Scikit-Learn supports one-hot encoding, too, with its OneHotEncoder class in the preprocessing module.
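A minimal sketch of Scikit-Learn's OneHotEncoder on a made-up categorical feature (the encoder returns a sparse matrix, so we convert it to a dense array for printing):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up categorical feature with three categories.
colors = np.array([['red'], ['green'], ['blue'], ['green']])

# fit_transform returns a sparse matrix; toarray() makes it dense.
encoder = OneHotEncoder()
encoded = encoder.fit_transform(colors).toarray()

print(encoder.categories_)  # categories sorted per feature
print(encoded)              # one column per category, one 1 per row
```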

Ordinal categories (weekdays, for instance) may also be converted to sequences of integers (0 to 6 or 1 to 7 for weekdays). But beware of additional or incorrect structure we may add to the data by conversion to numbers. Sometimes it's more difficult than one might expect (the weekdays 0 and 6 are as close to each other in the weekly cycle as 0 and 1, for instance, but the integer encoding does not reflect this).
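Such an integer encoding, together with the cyclic-distance caveat, can be sketched in a few lines of plain Python:

```python
# A simple integer encoding for the ordinal category "weekday".
weekdays = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
encoding = {day: index for index, day in enumerate(weekdays)}

print(encoding['Mon'], encoding['Sun'])  # 0 6

# Caveat: the encoding places Sunday (6) far away from Monday (0),
# although the two days are adjacent in the weekly cycle.
print(abs(encoding['Sun'] - encoding['Mon']))  # 6, but the cyclic distance is 1
```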

Later on we will discuss the conversion of text to numbers and other non-obvious conversion problems.