Jitter, Convolutional Neural Networks, and a Kaggle Framework

In this post, we’re going to look at the Digit Recognizer challenge from Kaggle. This challenge uses the MNIST dataset of handwritten digits. We’re going to train a Convolutional Neural Network with Keras to recognize the digits. We will employ some preprocessing steps to improve the generalization of our model.

In our analysis, we’ll use pandas, NumPy, and even a tool from sci-kit learn. A side goal is to build a framework that could be used in other Kaggle competitions as well. The key idea is to make a validation set from the training set to test your ideas, rather than submitting to the leaderboards.

Setup

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
    %%capture
    %matplotlib inline
    import pandas as pd
    import numpy as np
    from keras.callbacks import ModelCheckpoint
    from keras.models import Sequential
    from keras.layers.core import Dense, Activation, Dropout, Flatten
    from keras.layers.convolutional import Convolution2D, MaxPooling2D
    from scipy.ndimage.interpolation import rotate, shift, zoom
    from keras.datasets import mnist
    
    # Load data
    train = pd.read_csv('train.csv', dtype=np.float32)
    X_test = pd.read_csv('test.csv', dtype=np.float32).values
    
    np.random.seed(182)

Prepare the Training, Validation, and Test Set

When we take a look at the training set, we see that the label column has multiple values. Our eventual goal is to use a SoftMax layer for our network, so we will need to convert it to multiple columns with binary values. Because we are going to modify the images as we train, we need to use a separate validation set to see the progress of the Neural Network, as opposed to using Keras built in functionality. Out of convenience, we will use sklearn’s train_test_split function. We will use pandas get_dummies to convert the labels column to multiple values.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
    from sklearn.cross_validation import train_test_split
    
    y_train = pd.get_dummies(train.loc[:, "label"])
    print(y_train.head())
    
    # Get numpy version
    y_train = y_train.values.astype(np.int32)
    X_train = train.drop("label", axis = 1).values
    
    X_train, X_valid, y_train, y_valid = train_test_split(X_train,
                                                          y_train, 
                                                          test_size=0.1)
    
    X_train = X_train.reshape((X_train.shape[0],1,28,28))/255
    X_valid = X_valid.reshape((X_valid.shape[0],1,28,28))/255
    X_test = X_test.reshape((X_test.shape[0],1,28,28))/255

1  2  3  4  5  6  7  8  9
0  1  0  0  0  0  0  0  0  0
1  0  0  0  0  0  0  0  0  0
0  1  0  0  0  0  0  0  0  0
0  0  0  0  1  0  0  0  0  0
1  0  0  0  0  0  0  0  0  0

Model Definition

We are going to set up a simple CNN with two convolutional layers. The architecture will be:

Convolutional Layer (32x3x3) -> ReLu Activation (x2)
MaxPooling (2,2) -> Dropout (.25)
Fully Connected Layer (128 Units) -> ReLu Activation -> Dropout (.5)
Fully Connected Layer (10 Units) -> SoftMax

This is not a particularly creative architecture, and in fact, it’s the same architecture used in Keras’ example.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
    # Keras 0.3.2
    model = Sequential()
    model.add(Convolution2D(32,3,3, border_mode="valid", input_shape=(1,28,28)))
    model.add(Activation('relu'))
    
    model.add(Convolution2D(32,3,3))
    model.add(Activation('relu'))
    
    model.add(MaxPooling2D(pool_size=(2,2)))
    model.add(Dropout(.25))
    
    model.add(Flatten())
    model.add(Dense(128))
    model.add(Activation('relu'))
    model.add(Dropout(.5))
    model.add(Dense(10))
    model.add(Activation('softmax'))
    
    model.compile(loss='categorical_crossentropy', optimizer='adadelta')

Preprocessing

An important piece of this post is the preprocessing we will do to the images. We are going to introduce random noise (jargon: jitter) to the image. We’re going to randomly apply the jitter in the following ways:

Deleting a column
Deleting a row
Shifting the image
Rotating the image

We will use NumPy for all of these effects. We will perform each action with probability \(p=.7\). The probability that an image is perturbed is \(1-(1-.7)^4 = .99\). With a training set of over 32000, we expect approximately 320 images to stay the same. We will repeat this process each epoch.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
    def rand_jitter(temp):
        if np.random.random() > .7:
            temp[np.random.randint(0,28,1), :] = 0
        if np.random.random() > .7:
            temp[:, np.random.randint(0,28,1)] = 0
        if np.random.random() > .7:
            temp = shift(temp, shift=(np.random.randint(-3,3,2)))
        if np.random.random() > .7:
            temp = rotate(temp, angle = np.random.randint(-20,20,1), reshape=False)
        return temp


    import matplotlib.pyplot as plt
    import matplotlib.cm as cm
    
    # Copy to not effect the original
    ind = np.random.randint(len(X_train))
    test_image = lambda : np.copy(X_train[ind,0,:,:]) 
    
    # Jitter examples
    plt.figure()
    f, ax = plt.subplots(2, 2)
    for k in range(2):
        for j in range(2):
            ax[k,j].imshow(rand_jitter(test_image()))

png

Training

In the interest of time, I will only train two epochs. Each epoch we copy the training set so that we can modify the array itself when we apply the jitter. The first epoch we use the original training set, but every epoch afterwards we use the perturbed data set.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
    # Callback for model saving:
    checkpointer = ModelCheckpoint(filepath="auto_save_weights.hdf5", 
                                   verbose=1, save_best_only=True)
    
    # Parameters
    n_epochs = 2
    
    # Training
    for k in range(0, n_epochs):
        X_train_temp = np.copy(X_train) # Copy to not effect the originals
        
        # Add noise on later epochs
        if k > 0:
            for j in range(0, X_train_temp.shape[0]):
                X_train_temp[j,0, :, :] = rand_jitter(X_train_temp[j,0,:,:])
    
        model.fit(X_train_temp, y_train, nb_epoch=1, batch_size=128, 
                  validation_data=(X_valid, y_valid), show_accuracy=True, verbose=1, 
                  callbacks=[checkpointer])

Train on 37800 samples, validate on 4200 samples
Epoch 1/1
37800/37800 [==============================] - 508s - loss: 0.2920 - acc: 0.9100 - val_loss: 0.0816 - val_acc: 0.9710
Epoch 00000: val_loss improved from inf to 0.08161, saving model to auto_save_weights.hdf5
Train on 37800 samples, validate on 4200 samples
Epoch 1/1
37800/37800 [==============================] - 506s - loss: 0.2615 - acc: 0.9199 - val_loss: 0.0529 - val_acc: 0.9807
Epoch 00000: val_loss improved from 0.08161 to 0.05290, saving model to auto_save_weights.hdf5

Predicting on the Test Set

If we were going to submit this model to the competition, then we’d want to retrain the model on the full training set when we are happy with our validation score. If we were using a less time consuming approach, we could choose to use cross validation rather than a validation set. Either way, it’s important to train the final model on the full data set.

But it will not make a difference for our demonstration, so we’ll make our predictions on the test set. As a sanity check, we will validate a prediction by visualizing the input image and comparing to our prediction.

1
2
3
4
5
6
7
    # Make predictions
    predictions = model.predict(X_test, batch_size=128)


    # Visualize prediction
    print("Predicted value: %d" % np.argmax(predictions[1]))
    plt.imshow(X_test[1,0,:,:])

Predicted value: 0

png

Scott Rome

Jitter, Convolutional Neural Networks, and a Kaggle Framework

Table of Contents

Setup

Prepare the Training, Validation, and Test Set

Model Definition

Preprocessing

Training

Predicting on the Test Set

You May Also Enjoy

Combining Thompson Sampling with Neural Networks for Classification Tasks Practical Machine Learning

Online Calculation of the Mean and Variance of Logistic Regression Weights for Thompson Sampling Practical Machine Learning

Predicting Watch Time like YouTube via Weighted Logistic Regression Practical Machine Learning

Comparing Online Linear Regression and Kalman Filter Approaches to Online Models Plus Derivations Practical Machine Learning