Detecting Forgery with k-NN

Banknotes have many security features, some well known (see Deutsche Bundesbank) and some less known, such as the EURion constellation. Machine learning methods make it possible to investigate features that were not designed for human vision or simple algorithmic evaluation.

One approach is to take a closer look at the printing quality. Banknotes are printed using a technique known as intaglio. If steel plates are used, this technique produces extremely sharp edges that cannot be reproduced with off-the-shelf machines. Researchers from the Institut für industrielle Informationstechnik at Technische Hochschule Ostwestfalen-Lippe created a data set for training banknote authentication systems based on scanned images of banknotes.

The Data Set

The data set is available from the UCI Machine Learning Repository. It comes as a simple CSV file without a header row.

Task: Download the data set and read it into a Pandas data frame. Get the column names from the UCI web page of the data set. Adjust data types if necessary. Look at this blog post by James D. McCaffrey to find out how to interpret the class labels.

Solution:

# your solution
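As a starting point, here is a minimal sketch of reading a header-less CSV file with explicit column names. The column names follow the attribute list on the UCI page (including the data set's own spelling "curtosis"). The helper name `load_banknotes`, the `int8` label type, and the made-up sample rows are assumptions for illustration only. In practice you would pass the path of the downloaded file instead of the in-memory sample.

```python
import io

import pandas as pd

# Column names taken from the attribute list on the UCI page
# ("curtosis" is the spelling used by the data set itself).
COLUMNS = ["variance", "skewness", "curtosis", "entropy", "class"]


def load_banknotes(source):
    """Read the header-less CSV from a path, URL, or file-like object.

    Hypothetical helper, not part of any library.
    """
    data = pd.read_csv(source, header=None, names=COLUMNS)
    feature_cols = COLUMNS[:-1]
    # Features are real-valued; the class label is a small integer.
    data[feature_cols] = data[feature_cols].astype("float64")
    data["class"] = data["class"].astype("int8")
    return data


# Made-up rows in the file's format, standing in for the real download.
sample = io.StringIO(
    "1.5,2.0,-0.5,0.1,0\n"
    "-2.0,-3.5,4.0,-1.2,1\n"
    "0.3,1.1,0.2,-0.4,0\n"
)
data = load_banknotes(sample)
print(data.dtypes)
```

Passing `header=None` matters here: without it, Pandas would treat the first sample as column names and silently drop one row.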

Task: Determine class sizes.

# your solution
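One possible way to count class members is `value_counts` on the label column. The sketch below uses a short made-up label series instead of the real data, so the numbers it prints say nothing about the actual data set.

```python
import pandas as pd

# Made-up labels standing in for the data frame's "class" column.
labels = pd.Series([0, 0, 1, 0, 1], name="class")

# Absolute class sizes, ordered by label.
class_sizes = labels.value_counts().sort_index()

# Relative class sizes, useful for judging class imbalance.
class_shares = labels.value_counts(normalize=True).sort_index()

print(class_sizes)
print(class_shares)
```

Relative sizes are worth a look because accuracy is only a meaningful score if the classes are reasonably balanced.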

The data set does not contain the scanned images themselves, only a few summary statistics of histograms computed from them.

Task: Read sections 1 and 2 of Banknote Authentication to get a rough idea of what the features in the data set express. Note that wavelet transforms are similar to Fourier transforms.

# your notes

Visualization

Task: Create a pairplot of the 4 features with different colors for the two classes.

# your solution

Task: For each combination of 3 features create a 3d scatter plot (again different colors for classes).

# your solution
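Four features yield four different triples, which `itertools.combinations` can enumerate. The following sketch draws one 3D scatter plot per triple with Matplotlib. Again, the data are random stand-in values, and the 2x2 layout and color map are assumptions made for illustration.

```python
import itertools

import matplotlib

matplotlib.use("Agg")  # headless backend; drop this line inside a notebook

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

features = ["variance", "skewness", "curtosis", "entropy"]

# Random stand-in data, NOT the real measurements.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(60, 4)), columns=features)
data["class"] = rng.integers(0, 2, size=60)

# All combinations of 3 out of 4 features: 4 triples.
triples = list(itertools.combinations(features, 3))

fig = plt.figure(figsize=(10, 10))
for i, (fx, fy, fz) in enumerate(triples, start=1):
    ax = fig.add_subplot(2, 2, i, projection="3d")
    # Color each point by its class label.
    ax.scatter(data[fx], data[fy], data[fz], c=data["class"], cmap="coolwarm", s=8)
    ax.set_xlabel(fx)
    ax.set_ylabel(fy)
    ax.set_zlabel(fz)
```

In a notebook, an interactive backend makes it possible to rotate the plots, which helps a lot when judging whether the classes are separable.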

Note

From the 3d plots we see that variance, skewness, and curtosis (the data set's spelling of kurtosis) should suffice to separate forged from genuine banknotes. From this point of view the entropy feature can be neglected. Furthermore, the data set description does not say how the entropy was calculated. If we want to use a model trained on the data set to classify new samples, we would not know how to derive this model input from scanned images. This is a second reason to drop the entropy feature.

k-NN Predictions

Task: Create model inputs and outputs in Scikit-Learn format. That is, create a two-dimensional NumPy array X with one row per sample and one column per feature (do not include entropy) and a one-dimensional NumPy array y with the outputs for all samples (use integers, not booleans).

# your solution
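The conversion from a data frame to Scikit-Learn's input format might look like the sketch below. The tiny data frame holds made-up values with the data set's column names. Only the shapes and data types matter here.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for the real data frame.
data = pd.DataFrame(
    {
        "variance": [1.5, -2.0, 0.3, 2.2],
        "skewness": [2.0, -3.5, 1.1, 0.7],
        "curtosis": [-0.5, 4.0, 0.2, -1.0],
        "entropy": [0.1, -1.2, -0.4, 0.3],
        "class": [0, 1, 0, 1],
    }
)

# Inputs: one row per sample, one column per feature, entropy left out.
X = data[["variance", "skewness", "curtosis"]].to_numpy()

# Outputs: one-dimensional integer array.
y = data["class"].to_numpy().astype(int)

print(X.shape, y.shape, y.dtype)
```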

Task: Create a k-NN classifier with Scikit-Learn. For the moment we do not consider hyperparameter optimization. Choose \(k=11\) and no weighting. The test set should contain 30 per cent of the samples. Compute the accuracy on the test and training sets. Get the number of misclassified samples in the test set.

Solution:

# your solution
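The workflow described in the task can be sketched as follows. The sketch runs on two synthetic Gaussian clusters rather than on the banknote data, so its scores say nothing about the real problem. The fixed `random_state` and the stratified split are assumptions added for reproducibility and balanced subsets. The task itself does not ask for them.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data: two Gaussian clusters in 3 dimensions.
rng = np.random.default_rng(0)
X = np.vstack(
    [
        rng.normal(-2.0, 1.0, size=(100, 3)),
        rng.normal(2.0, 1.0, size=(100, 3)),
    ]
)
y = np.repeat([0, 1], 100)

# Hold out 30 per cent of the samples for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# k = 11 neighbors, all with equal weight ("no weighting").
knn = KNeighborsClassifier(n_neighbors=11, weights="uniform")
knn.fit(X_train, y_train)

# score() returns the accuracy, the fraction of correctly classified samples.
train_acc = knn.score(X_train, y_train)
test_acc = knn.score(X_test, y_test)

# Count test samples whose prediction differs from the true label.
n_wrong = int((knn.predict(X_test) != y_test).sum())

print(f"training accuracy: {train_acc:.4f}")
print(f"test accuracy:     {test_acc:.4f}")
print(f"misclassified test samples: {n_wrong} of {len(y_test)}")
```

An odd value of k avoids ties in the majority vote for a two-class problem, which is one reason why \(k=11\) is a convenient choice.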