If you are viewing this file in preview mode, some links won't work. Find the fully featured Jupyter Notebook file on the website of Prof. Jens Flemming at Zwickau University of Applied Sciences. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Improving CNN performance

So far we only considered the basics of CNNs. Now we discuss techniques for improving prediction quality and for decreasing training time. First we introduce the ideas, then we implement all of these techniques to improve our cats and dogs classifier.

Data augmentation

Prediction accuracy heavily depends on the amount and variety of data available for training. Collecting more data is expensive, so instead we may generate synthetic data from existing data. In the case of image data we may rotate, scale, translate, or distort the images to get new images showing identical objects in slightly different ways. This idea is known as data augmentation and increases the amount of data as well as its variety.

Keras' ImageDataGenerator class provides several types of data augmentation (rotation, zoom, pan, brightness, flip, and more). Activating this feature yields a stream of augmented images.
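As a minimal sketch (the argument values below are illustrative choices, not the notebook's exact settings), an augmenting generator may be configured like this:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# illustrative augmentation settings; see the Keras docs for the full list
datagen = ImageDataGenerator(
    rotation_range=30,            # random rotations up to 30 degrees
    width_shift_range=0.1,        # random horizontal pan
    height_shift_range=0.1,       # random vertical pan
    zoom_range=0.2,               # random zoom in/out
    brightness_range=(0.7, 1.3),  # random brightness variation
    horizontal_flip=True          # random left-right flips
)
```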

Pre-trained CNNs

CNNs have two major components: the feature extraction stack (convolutional and pooling layers) and the decision stack (dense layers for classification or regression). The task of the feature extraction stack is to automatically preprocess images, resulting in a set of feature maps containing higher-level information than just colored pixels. Based on this higher-level information the decision stack predicts the targets.

With this two-step approach in mind we may use more powerful feature extraction. The feature extraction part is more or less the same for all object classification problems in image processing. Thus, we might use a feature extraction stack trained on much larger data sets and with much more computational resources. Such pre-trained CNNs are available on the internet, and Keras ships with some, too. See Keras Applications for a list of pre-trained CNNs in Keras.

In Keras' documentation the feature extraction stack is called the convolutional base and the decision stack is the head of the CNN. When loading a pre-trained model we have to decide whether to load the full model or only the convolutional base. If we do not use the pre-trained head, we have to specify the input shape for the network. This sounds a bit strange at first, but the convolutional base works for arbitrary input shapes, and specifying a concrete shape fixes the output shape of the convolutional base. If we use the pre-trained head, then the output shape of the convolutional base has to fit the input shape of the head. Thus, the head determines the input shape of the CNN.
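As a sketch (the model choice and input shape are illustrative assumptions), loading a model from Keras Applications with and without the pre-trained head may look like this:

```python
from tensorflow.keras.applications import VGG16

# full model including the pre-trained head; the head fixes the input shape
full_model = VGG16(weights='imagenet')

# convolutional base only; now we have to choose the input shape ourselves
conv_base = VGG16(weights='imagenet', include_top=False,
                  input_shape=(128, 128, 3))
```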

Other minimization algorithms

Up to now we only considered simple gradient descent. But there are much better algorithms for minimizing loss functions. Keras implements some of them, and we should use them even though at the moment we do not know in detail what those algorithms do. We will have a look at advanced minimization techniques next semester in the lecture series on numerical methods.
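For illustration (assuming model is an already constructed Keras model; loss and metric are the usual choices for binary classification), switching the minimization algorithm is just a matter of the optimizer argument:

```python
# 'adam' is one of the advanced optimizers shipped with Keras
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])
```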

Faster preprocessing

Loading images from disk and preprocessing them during training may slow down training. One solution is to load all images (including augmented ones) into memory before training, but this requires a lot of memory. Another solution is to load and preprocess data asynchronously: while the GPU does some calculations, the CPU loads and preprocesses the next images. Keras and TensorFlow support such advanced techniques, but we will not cover them in detail here.
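Just as a pointer (the path and sizes below are assumptions), asynchronous loading can be sketched with TensorFlow's tf.data API, where prefetch lets the CPU prepare batches while the GPU trains:

```python
import tensorflow as tf

# build a dataset that reads images from disk batch by batch
ds = tf.keras.utils.image_dataset_from_directory(
    'data/train', image_size=(128, 128), batch_size=32)

# overlap data preparation (CPU) with training (GPU)
ds = ds.prefetch(tf.data.AUTOTUNE)
```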

Example

We consider classification of cat and dog images again.

We load a pre-trained convolutional base.
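The corresponding code cell is not shown here; judging from the 4x4x2048 output shape and the layer types mentioned below, the base could be Xception with 128x128 inputs, so a sketch might read:

```python
from tensorflow.keras.applications import Xception

# pre-trained convolutional base without the classification head
conv_base = Xception(weights='imagenet', include_top=False,
                     input_shape=(128, 128, 3))
conv_base.summary()
```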

We see that there are new layer types: separable convolutions and batch normalization. Separable convolutions are a special case of the usual convolution, allowing for more efficient computation by restricting to specially structured filters. Batch normalization is a kind of rescaling of layer outputs. The more important observation is the output shape: 4x4x2048. That is, we obtain 2048 feature maps, each of size 4x4. This is where we connect our decision stack.

Models in Keras behave like layers (the Model class inherits from Layer). Thus, we may create a new model with the pre-trained convolutional base as one layer.
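A minimal sketch (the sizes of the dense layers are illustrative assumptions):

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    conv_base,                    # pre-trained base used like a single layer
    layers.Flatten(),             # 4x4x2048 feature maps -> flat vector
    layers.Dense(256, activation='relu'),
    layers.Dense(1, activation='sigmoid')  # binary decision: cat or dog
])
```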

Before we start training we have to tell Keras to keep the weights of the convolutional base constant. We simply have to set the layer's trainable attribute to False:
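In code this is a one-liner (note that freezing has to happen before compiling the model):

```python
conv_base.trainable = False  # keep the pre-trained weights fixed
```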

For training we use Keras' default optimizer, RMSprop.
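A sketch of the corresponding compile call (loss and metric are the usual choices for binary classification, assumed here):

```python
# 'rmsprop' is what Keras uses if no optimizer is specified
model.compile(optimizer='rmsprop', loss='binary_crossentropy',
              metrics=['accuracy'])
```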

To speed up training we would like to have all data in memory. Images have $128^2=16384$ pixels, each taking 3 bytes for the colors (one byte per channel) if color values are integers. For colors scaled to $[0,1]$ we need 4 bytes per channel with np.float32 as data type, that is, 12 bytes per pixel. Thus, we need $16384\cdot 12=196608$ bytes per image, say 200 kB. That makes 5 images per MB or 5000 images per GB. Our data set has 25000 images, and we could increase it to arbitrary size by data augmentation. Note that data augmentation is only useful for training data; validation and test data should not be augmented. To save memory we do augmentation in real time, that is, we only keep the original training images in memory and generate batches of augmented images as needed.

To implement data augmentation we simply have to pass corresponding arguments to ImageDataGenerator. Since we do not want to augment validation and test images, we use a two-step approach: we first load all images into memory, then we use a second ImageDataGenerator object to create an iterator yielding augmented training images.
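A sketch of the first step (the directory layout, image size, and batch size are assumptions); a plain ImageDataGenerator without augmentation arguments loads the original images:

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# step 1: load all original training images into memory, no augmentation yet
loader = ImageDataGenerator()
it = loader.flow_from_directory('data/train', target_size=(128, 128),
                                class_mode='binary', batch_size=100)
x_train = np.concatenate([it[i][0] for i in range(len(it))])
y_train = np.concatenate([it[i][1] for i in range(len(it))])
```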

When using pre-trained models, data preprocessing has to be done in exactly the same way as during the model's original training. For each pre-trained model in Keras there is a preprocess_input function doing the necessary preprocessing. If images are provided to Model.fit by an iterator, we have to tell the iterator to apply the preprocessing function before yielding an image. For this purpose ImageDataGenerator accepts the preprocessing_function argument.
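The second step then might look as follows (the augmentation parameters are illustrative; preprocess_input is taken from the module of the chosen pre-trained model, here assumed to be Xception):

```python
from tensorflow.keras.applications.xception import preprocess_input

# step 2: augmenting generator for the training images only
train_datagen = ImageDataGenerator(
    rotation_range=20,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.2,
    horizontal_flip=True,
    preprocessing_function=preprocess_input  # model-specific preprocessing
)
```

Validation and test images still have to be passed through preprocess_input, just without augmentation.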

The ImageDataGenerator.flow method streams images from memory while augmenting them.
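For instance (the batch size is an assumption):

```python
# infinite stream of augmented batches drawn from the in-memory arrays
batches = train_datagen.flow(x_train, y_train, batch_size=32)
```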

Now training can be started. Since augmentation yields an infinite stream of training data, we have to tell fit the length of an epoch by providing the number of batches per epoch via the steps_per_epoch argument.
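A sketch of the call (the epoch count is an assumption, and x_val, y_val are assumed to hold the validation images loaded in step one):

```python
# one epoch = one pass over the original number of training images
model.fit(batches,
          steps_per_epoch=len(x_train) // 32,
          epochs=20,
          validation_data=(preprocess_input(x_val), y_val))
```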