Recurrent Neural Network (RNN) introduction

[Update 2017.06.11] Add chainer v2 code

How can we deal with sequential data in a deep neural network?

This question is especially important in the natural language processing (NLP) field. For example, text is made of a sequence of words. If we want to predict the next word of a given sentence, the probability of the next word depends on the whole sequence of past words.

So the neural network needs the ability to “remember” the past sentence in order to predict the next word.

In this chapter, Recurrent Neural Network (RNN) and Long Short Term Memory (LSTM) are introduced to deal with sequential data.

Recurrent Neural Network (RNN)

Recurrent Neural Network

A Recurrent Neural Network is similar to the Multi Layer Perceptron introduced before, but a loop is added to its hidden layer (shown in the above figure as \( W_{hh} \)).
Here the subscript \(t\) represents the time step (sequence index). Due to this loop, the hidden unit \(h_{t-1}\) is fed back in to construct the hidden unit \(h_{t}\) of the next time step. Therefore, information about the past sequence can be “stored” (memorized) in the hidden layer and passed on to the next time step.
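
Written as an equation, the recurrence described above can be sketched as follows. Only \(W_{hh}\), \(x\), \(y\) and \(h_t\) appear in this chapter; \(W_{xh}\), \(W_{hy}\), \(b_h\) and \(b_y\) are names assumed here for the input-to-hidden weight, the hidden-to-output weight and the biases, and \(\tanh\) is assumed as the activation, as in the implementation later in this chapter.

$$ h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h), \qquad y_t = W_{hy} h_t + b_y $$

At the first time step there is no previous hidden unit, so the \(W_{hh} h_{t-1}\) term is simply omitted.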

You might wonder how the loop in the above figure works in a neural network. The figure below is the expanded version, which shows explicitly how the loop works.

Expanded figure of Recurrent Neural Network.

In this figure, data flows from bottom (\(x\)) to top (\(y\)), and the horizontal axis represents the time step from left (time step=1) to right (time step=\(t\)).

Each forward computation depends on the previous hidden unit \(h_{t-1} \), so the RNN needs to keep this hidden unit as a state; see the implementation below.
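
To make the idea of "keeping the hidden unit as a state" concrete, here is a minimal NumPy-only sketch. It is not Chainer code: the class name TinyRNN, the weight names W_xh and W_hh, and the random initialization are all assumptions made for illustration.

```python
import numpy as np


class TinyRNN:
    """NumPy-only illustration of a recurrent layer that keeps h_{t-1} as state."""

    def __init__(self, n_in, n_units, seed=0):
        rng = np.random.RandomState(seed)
        # Hypothetical weights: input -> hidden, and hidden -> hidden (the loop)
        self.W_xh = rng.normal(scale=0.1, size=(n_units, n_in))
        self.W_hh = rng.normal(scale=0.1, size=(n_units, n_units))
        self.h = None  # no history yet

    def reset_state(self):
        """Forget the stored hidden unit, e.g. before a new sequence."""
        self.h = None

    def step(self, x):
        """One forward computation; uses and then updates the stored state."""
        pre = self.W_xh @ x
        if self.h is not None:
            pre = pre + self.W_hh @ self.h
        self.h = np.tanh(pre)
        return self.h


rnn = TinyRNN(n_in=3, n_units=4)
x = np.ones(3)
h1 = rnn.step(x)        # first step: no previous state
h2 = rnn.step(x)        # same input, but h1 is fed back, so the result differs
rnn.reset_state()
h1_again = rnn.step(x)  # after a reset, the first step is reproduced
```

The same input gives a different output on the second call because the stored state has changed, which is why the state has to be reset between independent sequences.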

Also, we need to be careful when executing back propagation, because it depends on the history of consecutive forward computations. The details will be explained later.

RNN implementation in Chainer

The code below shows the simplest RNN implementation with one hidden (recurrent) layer, as drawn in the above figure.

import chainer
import chainer.functions as F
import chainer.links as L


class RNN(chainer.Chain):
    """Simple Recurrent Neural Network implementation"""
    def __init__(self, n_vocab, n_units):
        super(RNN, self).__init__()
        with self.init_scope():
            self.embed = L.EmbedID(n_vocab, n_units)
            self.l1 = L.Linear(n_units, n_units)
            self.r1 = L.Linear(n_units, n_units)
            self.l2 = L.Linear(n_units, n_vocab)
        self.recurrent_h = None

    def reset_state(self):
        self.recurrent_h = None

    def __call__(self, x):
        h = self.embed(x)
        if self.recurrent_h is None:
            self.recurrent_h = F.tanh(self.l1(h))
        else:
            self.recurrent_h = F.tanh(self.l1(h) + self.r1(self.recurrent_h))
        y = self.l2(self.recurrent_h)
        return y

EmbedID link

L.EmbedID is used in the above RNN implementation. It is a convenient link when the input data can be represented as IDs.

EmbedID takes an integer ID as input and outputs a 1-d vector of size out_size.

In NLP text processing, each word is represented as an integer ID. The EmbedID layer converts this ID into a vector, which can be considered the vector representation of the word.

More precisely, the EmbedID layer works as a combination of 2 operations:

  1. Convert the integer ID into an in_size-dimensional one-hot vector.
  2. Apply a Linear layer (with bias \(b = 0\)) to this one-hot vector to output out_size units.
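
The two operations above can be checked with a small NumPy sketch. It does not use Chainer; the weight matrix W, its (in_size, out_size) layout and its values are assumptions chosen only to make the result easy to read.

```python
import numpy as np

in_size, out_size = 5, 3  # vocabulary size, embedding dimension
# Hypothetical weight matrix: row i is the vector for ID i
W = np.arange(in_size * out_size, dtype=np.float32).reshape(in_size, out_size)

word_id = 2
one_hot = np.zeros(in_size, dtype=np.float32)
one_hot[word_id] = 1.0            # operation 1: integer ID -> one-hot vector

via_linear = one_hot @ W          # operation 2: linear layer without bias
via_lookup = W[word_id]           # equivalent shortcut: pick one row of W

print(via_linear)  # [6. 7. 8.]
```

Both paths give the same vector, so the embedding can be computed as a simple row lookup without ever building the one-hot vector.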

See the official documentation for details.

Creating RecurrentBlock as sub-module

If you want to create a deeper RNN, you can make the recurrent block a sub-module layer, as below.

import chainer
import chainer.functions as F
import chainer.links as L


class RecurrentBlock(chainer.Chain):
    """Subblock for RNN"""
    def __init__(self, n_in, n_out, activation=F.tanh):
        super(RecurrentBlock, self).__init__()
        with self.init_scope():
            self.l = L.Linear(n_in, n_out)
            # The recurrent input is the previous output, which has n_out units
            self.r = L.Linear(n_out, n_out)
        self.rh = None
        self.activation = activation

    def reset_state(self):
        self.rh = None

    def __call__(self, h):
        if self.rh is None:
            self.rh = self.activation(self.l(h))
        else:
            self.rh = self.activation(self.l(h) + self.r(self.rh))
        return self.rh


class RNN2(chainer.Chain):
    """RNN implementation using RecurrentBlock"""
    def __init__(self, n_vocab, n_units, activation=F.tanh):
        super(RNN2, self).__init__()
        with self.init_scope():
            self.embed = L.EmbedID(n_vocab, n_units)
            self.r1 = RecurrentBlock(n_units, n_units, activation=activation)
            self.r2 = RecurrentBlock(n_units, n_units, activation=activation)
            self.r3 = RecurrentBlock(n_units, n_units, activation=activation)
            #self.r4 = RecurrentBlock(n_units, n_units, activation=activation)
            self.l5 = L.Linear(n_units, n_vocab)


    def reset_state(self):
        self.r1.reset_state()
        self.r2.reset_state()
        self.r3.reset_state()
        #self.r4.reset_state()

    def __call__(self, x):
        h = self.embed(x)
        h = self.r1(h)
        h = self.r2(h)
        h = self.r3(h)
        #h = self.r4(h)
        y = self.l5(h)
        return y

Next: Training RNN with simple sequence dataset

CIFAR-10, CIFAR-100 inference code

The code structure of the inference/predict stage is quite similar to the MNIST inference code; please refer to it for a detailed explanation.

Here, I will simply put the code and its results.

CIFAR-10 inference code

Code is uploaded on github as predict_cifar10.py.

"""Inference/predict code for CIFAR-10

model must be trained before inference, 
train_cifar10.py must be executed beforehand.
"""
from __future__ import print_function
import os
import argparse

import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import chainer
import chainer.functions as F
import chainer.links as L
from chainer import training, iterators, serializers, optimizers, Variable, cuda
from chainer.training import extensions

from CNNSmall import CNNSmall
from CNNMedium import CNNMedium

CIFAR10_LABELS_LIST = [
    'airplane',
    'automobile',
    'bird',
    'cat',
    'deer',
    'dog',
    'frog',
    'horse',
    'ship',
    'truck'
]


def main():
    archs = {
        'cnnsmall': CNNSmall,
        'cnnmedium': CNNMedium,
    }

    parser = argparse.ArgumentParser(description='Cifar-10 CNN predict code')
    parser.add_argument('--arch', '-a', choices=archs.keys(),
                        default='cnnsmall', help='Convnet architecture')
    #parser.add_argument('--batchsize', '-b', type=int, default=64,
    #                    help='Number of images in each mini-batch')
    parser.add_argument('--modelpath', '-m', default='result-cifar10-cnnsmall/cnnsmall-cifar10.model',
                        help='Model path to be loaded')
    parser.add_argument('--gpu', '-g', type=int, default=-1,
                        help='GPU ID (negative value indicates CPU)')
    args = parser.parse_args()

    print('GPU: {}'.format(args.gpu))
    #print('# Minibatch-size: {}'.format(args.batchsize))
    print('')

    # 1. Setup model
    class_num = 10
    model = archs[args.arch](n_out=class_num)
    classifier_model = L.Classifier(model)
    if args.gpu >= 0:
        chainer.cuda.get_device(args.gpu).use()  # Make a specified GPU current
        classifier_model.to_gpu()  # Copy the model to the GPU
    xp = np if args.gpu < 0 else cuda.cupy

    serializers.load_npz(args.modelpath, model)

    # 2. Load the CIFAR-10 dataset
    train, test = chainer.datasets.get_cifar10()

    basedir = 'images'
    plot_predict_cifar(os.path.join(basedir, 'cifar10_predict.png'), model,
                       train, 4, 5, scale=5., label_list=CIFAR10_LABELS_LIST)


def plot_predict_cifar(filepath, model, data, row, col,
                       scale=3., label_list=None):
    fig_width = data[0][0].shape[1] / 80 * row * scale
    fig_height = data[0][0].shape[2] / 80 * col * scale
    fig, axes = plt.subplots(row,
                             col,
                             figsize=(fig_height, fig_width))
    for i in range(row * col):
        # train[i][0] is i-th image data with size 32x32
        image, label_index = data[i]
        xp = model.xp  # numpy on CPU, cupy on GPU (follows the model's device)
        x = Variable(xp.asarray(image.reshape(1, 3, 32, 32)))    # test data
        #t = Variable(xp.asarray([test[i][1]]))  # labels
        y = model(x)                              # Inference result
        prediction = y.data.argmax(axis=1)
        image = image.transpose(1, 2, 0)
        print('Predicted {}-th image, prediction={}, actual={}'
              .format(i, prediction[0], label_index))
        r, c = divmod(i, col)
        axes[r][c].imshow(image)  # cmap='gray' is for black and white picture.
        if label_list is None:
            axes[r][c].set_title('Predict:{}, Answer: {}'
                                 .format(prediction[0], label_index))
        else:
            pred = int(prediction[0])
            axes[r][c].set_title('Predict:{} {}\nAnswer:{} {}'
                                 .format(pred, label_list[pred],
                                         label_index, label_list[label_index]))
        axes[r][c].axis('off')  # do not show axis value
    plt.tight_layout(pad=0.01)   # automatic padding between subplots
    plt.savefig(filepath)
    print('Result saved to {}'.format(filepath))


if __name__ == '__main__':
    main()

This outputs the result as,

You can see that even a small CNN successfully classifies most of the images. Of course this is just a simple example, and you can improve the model accuracy by tuning the deep neural network!
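
The prediction step inside plot_predict_cifar boils down to taking the argmax over the class scores. Here is a minimal NumPy sketch of that step and of computing accuracy from it; the scores and labels below are made-up numbers for three images and four classes, not model output.

```python
import numpy as np

# Hypothetical class scores, shape (number of images, number of classes)
scores = np.array([[0.1, 2.0, -1.0, 0.3],
                   [1.5, 0.2, 0.1, 0.0],
                   [0.0, 0.1, 0.2, 3.0]])
labels = np.array([1, 2, 3])  # hypothetical ground-truth labels

prediction = scores.argmax(axis=1)               # index of the highest score per image
accuracy = float((prediction == labels).mean())  # fraction of correct predictions
```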

CIFAR-100 inference code

In the same way, code is uploaded on github as predict_cifar100.py.

CIFAR-100 is generally more difficult than CIFAR-10 because there are more classes to classify but fewer training images per class.

Again, the accuracy can be improved by tuning the deep neural network model, try it!

That’s all for understanding CNN. Next is to understand RNN and LSTM, which are used in Natural Language Processing.

Next: Recurrent Neural Network (RNN) introduction

CIFAR-10, CIFAR-100 training with Convolutional Neural Network

[Update 2017.06.11] Add chainer v2 code

Writing your CNN model

This is an example of a small Convolutional Neural Network definition, CNNSmall:

import chainer
import chainer.functions as F
import chainer.links as L


class CNNSmall(chainer.Chain):
    def __init__(self, n_out):
        super(CNNSmall, self).__init__()
        with self.init_scope():
            self.conv1 = L.Convolution2D(None, 16, 3, 2)
            self.conv2 = L.Convolution2D(16, 32, 3, 2)
            self.conv3 = L.Convolution2D(32, 32, 3, 2)
            self.fc4 = L.Linear(None, 100)
            self.fc5 = L.Linear(100, n_out)

    def __call__(self, x):
        h = F.relu(self.conv1(x))
        h = F.relu(self.conv2(h))
        h = F.relu(self.conv3(h))
        h = F.relu(self.fc4(h))
        h = self.fc5(h)
        return h

I also made a slightly bigger CNN, called CNNMedium,

import chainer
import chainer.functions as F
import chainer.links as L


class CNNMedium(chainer.Chain):
    def __init__(self, n_out):
        super(CNNMedium, self).__init__()
        with self.init_scope():
            self.conv1 = L.Convolution2D(None, 16, 3, 1)
            self.conv2 = L.Convolution2D(16, 32, 3, 2)
            self.conv3 = L.Convolution2D(32, 32, 3, 1)
            self.conv4 = L.Convolution2D(32, 64, 3, 2)
            self.conv5 = L.Convolution2D(64, 64, 3, 1)
            self.conv6 = L.Convolution2D(64, 128, 3, 2)
            self.fc7 = L.Linear(None, 100)
            self.fc8 = L.Linear(100, n_out)

    def __call__(self, x):
        h = F.relu(self.conv1(x))
        h = F.relu(self.conv2(h))
        h = F.relu(self.conv3(h))
        h = F.relu(self.conv4(h))
        h = F.relu(self.conv5(h))
        h = F.relu(self.conv6(h))
        h = F.relu(self.fc7(h))
        h = self.fc8(h)
        return h


It is useful to know the computational cost of a convolution layer, which is approximated as,

$$ H_I \times W_I \times CH_I \times CH_O \times k ^ 2 $$
  • \( CH_I \)  : Input image channel
  • \( CH_O \) : Output image channel
  • \( H_I \)     : Input image height
  • \( W_I \)    : Input image width
  • \( k \)           : kernel size (assuming same for width & height)

In the above CNN definitions, the number of channels is larger in deeper layers. This can be understood by calculating the computational cost of each layer.

When L.Convolution2D with stride=2 is used, the image size becomes almost half. This means \( H_I\) and \( W_I \) become smaller, so \(CH_I \) and \( CH_O \) can take larger values at a similar cost.
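
As a rough illustration, the approximation above can be evaluated for the convolution layers of CNNMedium. This is only a sketch under stated assumptions: the input is a 32 x 32 RGB image (so the `None` input channel of conv1 resolves to 3), there is no padding, and the output size is rounded down. The helper names conv_out_size and conv_cost are made up for this example.

```python
def conv_out_size(size, ksize, stride, pad=0):
    """Output height/width of a convolution, rounded down."""
    return (size + 2 * pad - ksize) // stride + 1


def conv_cost(h_in, w_in, ch_in, ch_out, ksize):
    """Approximate cost H_I * W_I * CH_I * CH_O * k^2 from the formula above."""
    return h_in * w_in * ch_in * ch_out * ksize ** 2


# (in_channels, out_channels, ksize, stride) of conv1..conv6 in CNNMedium
layers = [(3, 16, 3, 1), (16, 32, 3, 2), (32, 32, 3, 1),
          (32, 64, 3, 2), (64, 64, 3, 1), (64, 128, 3, 2)]

size = 32  # CIFAR images are 32 x 32
sizes, costs = [], []
for ch_in, ch_out, ksize, stride in layers:
    costs.append(conv_cost(size, size, ch_in, ch_out, ksize))
    size = conv_out_size(size, ksize, stride)
    sizes.append(size)

for i, (s, c) in enumerate(zip(sizes, costs), start=1):
    print('conv{}: output {}x{}, approx. cost {:,}'.format(i, s, s, c))
```

Under these assumptions the image shrinks from 32 to 1 pixel per side across the six layers, and the later layers stay cheap despite having many more channels.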

[TODO: add computational cost table for CNN Medium example]

Training CIFAR-10

Once you have written the CNN, it is easy to train the model. The code, train_cifar10.py, is quite similar to the MNIST training code.

The only small differences are the dataset preparation for CIFAR-10,

    # 3. Load the CIFAR-10 dataset
    train, test = chainer.datasets.get_cifar10()

and model setup

from CNNSmall import CNNSmall
from CNNMedium import CNNMedium

    archs = {
        'cnnsmall': CNNSmall,
        'cnnmedium': CNNMedium,
    }

    ...

    class_num = 10
    model = archs[args.arch](n_out=class_num)

The whole source code is the following,

from __future__ import print_function
import argparse

import chainer
import chainer.functions as F
import chainer.links as L
from chainer import training, iterators, serializers, optimizers
from chainer.training import extensions

from CNNSmall import CNNSmall
from CNNMedium import CNNMedium


def main():
    archs = {
        'cnnsmall': CNNSmall,
        'cnnmedium': CNNMedium,
    }

    parser = argparse.ArgumentParser(description='Cifar-10 CNN example')
    parser.add_argument('--arch', '-a', choices=archs.keys(),
                        default='cnnsmall', help='Convnet architecture')
    parser.add_argument('--batchsize', '-b', type=int, default=64,
                        help='Number of images in each mini-batch')
    parser.add_argument('--epoch', '-e', type=int, default=20,
                        help='Number of sweeps over the dataset to train')
    parser.add_argument('--gpu', '-g', type=int, default=-1,
                        help='GPU ID (negative value indicates CPU)')
    parser.add_argument('--out', '-o', default='result-cifar10',
                        help='Directory to output the result')
    parser.add_argument('--resume', '-r', default='',
                        help='Resume the training from snapshot')
    args = parser.parse_args()

    print('GPU: {}'.format(args.gpu))
    print('# Minibatch-size: {}'.format(args.batchsize))
    print('# epoch: {}'.format(args.epoch))
    print('')

    # 1. Setup model
    class_num = 10
    model = archs[args.arch](n_out=class_num)
    classifier_model = L.Classifier(model)
    if args.gpu >= 0:
        chainer.cuda.get_device(args.gpu).use()  # Make a specified GPU current
        classifier_model.to_gpu()  # Copy the model to the GPU

    # 2. Setup an optimizer
    optimizer = optimizers.Adam()
    optimizer.setup(classifier_model)

    # 3. Load the CIFAR-10 dataset
    train, test = chainer.datasets.get_cifar10()

    # 4. Setup an Iterator
    train_iter = iterators.SerialIterator(train, args.batchsize)
    test_iter = iterators.SerialIterator(test, args.batchsize,
                                         repeat=False, shuffle=False)

    # 5. Setup an Updater
    updater = training.StandardUpdater(train_iter, optimizer, device=args.gpu)
    # 6. Setup a trainer (and extensions)
    trainer = training.Trainer(updater, (args.epoch, 'epoch'), out=args.out)

    # Evaluate the model with the test dataset for each epoch
    trainer.extend(extensions.Evaluator(test_iter, classifier_model, device=args.gpu))

    trainer.extend(extensions.dump_graph('main/loss'))
    trainer.extend(extensions.snapshot(), trigger=(1, 'epoch'))
    trainer.extend(extensions.LogReport())
    trainer.extend(extensions.PrintReport(
        ['epoch', 'main/loss', 'validation/main/loss',
         'main/accuracy', 'validation/main/accuracy', 'elapsed_time']))
    trainer.extend(extensions.PlotReport(
        ['main/loss', 'validation/main/loss'],
        x_key='epoch', file_name='loss.png'))
    trainer.extend(extensions.PlotReport(
        ['main/accuracy', 'validation/main/accuracy'],
        x_key='epoch',
        file_name='accuracy.png'))

    trainer.extend(extensions.ProgressBar())

    # Resume from a snapshot
    if args.resume:
        serializers.load_npz(args.resume, trainer)

    # Run the training
    trainer.run()
    serializers.save_npz('{}/{}-cifar10.model'
                         .format(args.out, args.arch), model)

if __name__ == '__main__':
    main()

See how clean the code is! Chainer abstracts the training process, so the code can be reused for training other deep learning models.

[hands on] Try running train code.

Below is an example from my environment.

  • CNNSmall model
$ python train_cifar10.py -g 0 -o result-cifar10-cnnsmall -a cnnsmall
GPU: 0
# Minibatch-size: 64
# epoch: 20

epoch       main/loss   validation/main/loss  main/accuracy  validation/main/accuracy  elapsed_time
1           1.66603     1.44016               0.397638       0.477807                  6.22123
2           1.36101     1.31731               0.511324       0.527568                  12.0878
3           1.23553     1.20439               0.559119       0.568073                  17.9239
4           1.14553     1.13121               0.589609       0.595541                  23.7497
5           1.08058     1.09946               0.617747       0.606588                  29.5948
6           1.02242     1.1259                0.638784       0.605295                  35.4604
7           0.97847     1.0797                0.65533        0.615048                  41.3058
8           0.938967    1.0584                0.669494       0.621815                  47.184
9           0.902363    1.00883               0.681985       0.646099                  53.0965
10          0.872796    1.00743               0.692782       0.644904                  58.982
11          0.838787    0.993791              0.705226       0.651971                  64.9511
12          0.813549    0.987916              0.714609       0.655454                  70.3869
13          0.785552    0.987968              0.723825       0.659236                  75.8247
14          0.766127    1.0092                0.730074       0.656748                  81.4311
15          0.743967    1.04623               0.738496       0.650876                  86.9175
16          0.723779    0.991238              0.744518       0.665008                  92.6226
17          0.704939    1.02468               0.752058       0.655354                  98.1399
18          0.68687     0.999966              0.756962       0.660629                  103.657
19          0.668204    1.00803               0.763564       0.660928                  109.226
20          0.650081    1.01554               0.769906       0.667197                  114.705

The Chainer extension PlotReport automatically creates graphs of loss and accuracy for each epoch.

We can achieve around 65% validation accuracy with such a simple CNN construction.

  • CNNMedium
$ python train_cifar10.py -g 0 -o result-cifar10-cnnmedium -a cnnmedium
GPU: 0
# Minibatch-size: 64
# epoch: 20

epoch       main/loss   validation/main/loss  main/accuracy  validation/main/accuracy  elapsed_time
1           1.62656     1.3921                0.402494       0.493133                  7.61706
2           1.31508     1.2771                0.526448       0.54588                   15.209
3           1.14961     1.12021               0.589749       0.603603                  22.7185
4           1.04442     1.05119               0.631182       0.629877                  30.1564
5           0.947944    1.00655               0.66624        0.648288                  37.8547
6           0.876341    1.0247                0.690021       0.644705                  46.9253
7           0.819997    0.983303              0.711968       0.662719                  54.9994
8           0.757557    0.933339              0.733795       0.677846                  62.4761
9           0.699673    0.948701              0.751539       0.682126                  69.8784
10          0.652811    0.965453              0.769006       0.680533                  77.2829
11          0.606698    0.990516              0.785551       0.671278                  84.6915
12          0.559568    0.999138              0.799996       0.682822                  92.068
13          0.521884    1.07451               0.814158       0.678742                  99.4703
14          0.477247    1.08184               0.829445       0.673865                  107.249
15          0.443625    1.08582               0.840109       0.680832                  114.609
16          0.406318    1.26192               0.853573       0.660529                  122.218
17          0.378328    1.2075                0.86507        0.670183                  129.655
18          0.349719    1.27795               0.87548        0.673467                  137.098
19          0.329299    1.32094               0.881702       0.664709                  144.553
20          0.297305    1.39914               0.894426       0.666202                  151.959

As expected, CNNMedium takes a little longer to compute, but it achieves higher accuracy on the training data.

※ It is also important to notice that validation accuracy is almost the same between CNNSmall and CNNMedium, which means CNNMedium may be overfitting to the training data (its validation loss rises after around epoch 8 while the training loss keeps falling). To avoid overfitting, data augmentation (flipping, rotating, clipping, resizing or adding Gaussian noise to the input images to increase the effective data size) is often used in practice.
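
As a minimal sketch of what such augmentation can look like, the NumPy code below applies a random horizontal flip and a random crop to one image in (channel, height, width) layout. The function names, the padding width of 4 pixels and the flip probability of 0.5 are assumptions for illustration; this is not part of train_cifar10.py.

```python
import numpy as np


def random_flip(image, rng):
    """Horizontally flip a (channel, height, width) image with probability 0.5."""
    if rng.rand() < 0.5:
        return image[:, :, ::-1]
    return image


def random_crop(image, rng, pad=4):
    """Zero-pad by `pad` pixels on each side, then crop back to the original size."""
    _, h, w = image.shape
    padded = np.pad(image, ((0, 0), (pad, pad), (pad, pad)), mode='constant')
    top = rng.randint(0, 2 * pad + 1)   # random offset in [0, 2 * pad]
    left = rng.randint(0, 2 * pad + 1)
    return padded[:, top:top + h, left:left + w]


rng = np.random.RandomState(0)
# Dummy image standing in for one CIFAR sample
image = np.arange(3 * 32 * 32, dtype=np.float32).reshape(3, 32, 32)
augmented = random_crop(random_flip(image, rng), rng)
```

The augmented image keeps the original shape, so it can be fed to the same model; applying a fresh random transform every epoch is what increases the effective data size.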

Training CIFAR-100

Again, training CIFAR-100 is quite similar to the training of CIFAR-10.

See train_cifar100.py. The only differences are the model setup, where the number of output classes is changed (the model definition itself is unchanged and can be reused!),

    # 1. Setup model
    class_num = 100
    model = archs[args.arch](n_out=class_num)

and dataset preparation

    # 3. Load the CIFAR-100 dataset
    train, test = chainer.datasets.get_cifar100()


[hands on] Try running train code.

Summary

We have learned how to train a CNN with Chainer. CNNs are widely used in many image processing tasks, not only image classification. For example,

  • Bounding Box detection
    • SSD, YOLO v2
  • Semantic segmentation
    • FCN
  • Colorization
    • PaintsChainer
  • Image generation
    • GAN
  • Style transfer
    • chainer-gogh
  • Super resolution
    • SeRanet

etc. Now you are ready to explore these advanced image processing tasks with deep learning!

[hands on]

Try modifying the CNN model, or create your own CNN model and train it, to see the computational speed and its performance (accuracy). You may try changing the following:

  • model depth
  • channel size of each layer
  • Layer (Ex. use F.max_pooling_2d instead of L.Convolution2D with stride 2)
  • activation function (F.relu to F.leaky_relu, F.sigmoid, F.tanh etc…)
  • Try inserting another layer, Ex. L.BatchNormalization or F.dropout.
    etc.

You can refer to the Chainer example code to see network definition examples.

Also, try changing the hyperparameters to see the effect on performance:

  • Change optimizer
  • Change learning rate of optimizer
    etc.

Next: CIFAR-10, CIFAR-100 inference code

CIFAR-10, CIFAR-100 dataset introduction

Source code is uploaded on github.

CIFAR-10 and CIFAR-100 are small image datasets with classification labels. They are widely used as simple image classification tasks/benchmarks in the research community.

In Chainer, the CIFAR-10 and CIFAR-100 datasets can be obtained with built-in functions.

Setup code: 

from __future__ import print_function
import os
import matplotlib.pyplot as plt
%matplotlib inline

import numpy as np

import chainer

basedir = './src/cnn/images'

CIFAR-10

The chainer.datasets.get_cifar10 method is provided in Chainer to get the CIFAR-10 dataset. The dataset is automatically downloaded from https://www.cs.toronto.edu the first time only, and the cached copy is used from the second time on.

CIFAR10_LABELS_LIST = [
    'airplane', 
    'automobile',
    'bird',
    'cat',
    'deer',
    'dog',
    'frog',
    'horse',
    'ship',
    'truck'
]

train, test = chainer.datasets.get_cifar10()

The dataset structure is the same as that of the MNIST dataset: it is a TupleDataset.
train[i] represents the i-th data; there are 50000 training data.
The test data structure is the same, with 10000 test data.

print('len(train), type ', len(train), type(train))
print('len(test), type ', len(test), type(test))

len(train), type 50000 <class 'chainer.datasets.tuple_dataset.TupleDataset'>
len(test), type 10000 <class 'chainer.datasets.tuple_dataset.TupleDataset'>

train[i] represents i-th data, type=tuple \( (x_i, y_i) \), where \(x_i\) is image data and \(y_i\) is label data.

train[i][0] represents \(x_i\), the CIFAR-10 image data. This is a 3-dimensional array of shape (3, 32, 32), which represents the RGB channel, height 32 px and width 32 px respectively.

train[i][1] represents \(y_i\), the label of the CIFAR-10 image data. This is a scalar value which can be converted to the actual label name with CIFAR10_LABELS_LIST.

Let’s see 0-th data, train[0], in detail.

print('train[0]', type(train[0]), len(train[0]))

x0, y0 = train[0]
print('train[0][0]', x0.shape, x0)
print('train[0][1]', y0.shape, y0, '->', CIFAR10_LABELS_LIST[y0])

train[0] <class 'tuple'> 2
train[0][0] (3, 32, 32) [[[ 0.23137257 0.16862746 0.19607845 ..., 0.61960787 0.59607846 0.58039218]
  [ 0.0627451 0. 0.07058824 ..., 0.48235297 0.4666667 0.4784314 ]
  [ 0.09803922 0.0627451 0.19215688 ..., 0.46274513 0.47058827 0.42745101]
  ...,
  [ 0.81568635 0.78823537 0.77647066 ..., 0.627451 0.21960786 0.20784315]
  [ 0.70588237 0.67843139 0.72941178 ..., 0.72156864 0.38039219 0.32549021]
  [ 0.69411767 0.65882355 0.7019608 ..., 0.84705889 0.59215689 0.48235297]]

 [[ 0.24313727 0.18039216 0.18823531 ..., 0.51764709 0.49019611 0.48627454]
  [ 0.07843138 0. 0.03137255 ..., 0.34509805 0.32549021 0.34117648]
  [ 0.09411766 0.02745098 0.10588236 ..., 0.32941177 0.32941177 0.28627452]
  ...,
  [ 0.66666669 0.60000002 0.63137257 ..., 0.52156866 0.12156864 0.13333334]
  [ 0.54509807 0.48235297 0.56470591 ..., 0.58039218 0.24313727 0.20784315]
  [ 0.56470591 0.50588238 0.55686277 ..., 0.72156864 0.46274513 0.36078432]]

 [[ 0.24705884 0.17647059 0.16862746 ..., 0.42352945 0.40000004 0.4039216 ]
  [ 0.07843138 0. 0. ..., 0.21568629 0.19607845 0.22352943]
  [ 0.08235294 0. 0.03137255 ..., 0.19607845 0.19607845 0.16470589]
  ...,
  [ 0.37647063 0.13333334 0.10196079 ..., 0.27450982 0.02745098 0.07843138]
  [ 0.37647063 0.16470589 0.11764707 ..., 0.36862746 0.13333334 0.13333334]
  [ 0.45490199 0.36862746 0.34117648 ..., 0.54901963 0.32941177 0.28235295]]]
train[0][1] () 6 -> frog

def plot_cifar(filepath, data, row, col, scale=3., label_list=None):
    fig_width = data[0][0].shape[1] / 80 * row * scale
    fig_height = data[0][0].shape[2] / 80 * col * scale
    fig, axes = plt.subplots(row, 
                             col, 
                             figsize=(fig_height, fig_width))
    for i in range(row * col):
        # train[i][0] is i-th image data with size 32x32
        image, label_index = data[i]
        image = image.transpose(1, 2, 0)
        r, c = divmod(i, col)
        axes[r][c].imshow(image)  # cmap='gray' is for black and white picture.
        if label_list is None:
            axes[r][c].set_title('label {}'.format(label_index))
        else:
            axes[r][c].set_title('{}: {}'.format(label_index, label_list[label_index]))
        axes[r][c].axis('off')  # do not show axis value
    plt.tight_layout()   # automatic padding between subplots
    plt.savefig(filepath)


plot_cifar(os.path.join(basedir, 'cifar10_plot.png'), train, 4, 5, 
           scale=4., label_list=CIFAR10_LABELS_LIST)
plot_cifar(os.path.join(basedir, 'cifar10_plot_more.png'), train, 10, 10, 
           scale=4., label_list=CIFAR10_LABELS_LIST)

CIFAR-100

CIFAR-100 is very similar to CIFAR-10. The difference is that the number of class labels is 100. The chainer.datasets.get_cifar100 method is provided in Chainer to get the CIFAR-100 dataset.

CIFAR100_LABELS_LIST = [
    'apple', 'aquarium_fish', 'baby', 'bear', 'beaver', 'bed', 'bee', 'beetle', 
    'bicycle', 'bottle', 'bowl', 'boy', 'bridge', 'bus', 'butterfly', 'camel', 
    'can', 'castle', 'caterpillar', 'cattle', 'chair', 'chimpanzee', 'clock', 
    'cloud', 'cockroach', 'couch', 'crab', 'crocodile', 'cup', 'dinosaur', 
    'dolphin', 'elephant', 'flatfish', 'forest', 'fox', 'girl', 'hamster', 
    'house', 'kangaroo', 'keyboard', 'lamp', 'lawn_mower', 'leopard', 'lion',
    'lizard', 'lobster', 'man', 'maple_tree', 'motorcycle', 'mountain', 'mouse',
    'mushroom', 'oak_tree', 'orange', 'orchid', 'otter', 'palm_tree', 'pear',
    'pickup_truck', 'pine_tree', 'plain', 'plate', 'poppy', 'porcupine',
    'possum', 'rabbit', 'raccoon', 'ray', 'road', 'rocket', 'rose',
    'sea', 'seal', 'shark', 'shrew', 'skunk', 'skyscraper', 'snail', 'snake',
    'spider', 'squirrel', 'streetcar', 'sunflower', 'sweet_pepper', 'table',
    'tank', 'telephone', 'television', 'tiger', 'tractor', 'train', 'trout',
    'tulip', 'turtle', 'wardrobe', 'whale', 'willow_tree', 'wolf', 'woman',
    'worm'
]

train_cifar100, test_cifar100 = chainer.datasets.get_cifar100()

The dataset structure is the same as that of the MNIST dataset: it is a TupleDataset.

train_cifar100[i] represents the i-th data; there are 50000 training data. The total amount of training data is the same as in CIFAR-10 while the number of class labels is larger, so there are fewer training data per class label than in the CIFAR-10 dataset.
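
The per-class count follows from simple arithmetic, assuming the images are spread evenly over the classes (which is the case for both CIFAR datasets):

```python
n_train = 50000  # number of training images in both datasets

per_class_cifar10 = n_train // 10     # 10 class labels
per_class_cifar100 = n_train // 100   # 100 class labels

print(per_class_cifar10, per_class_cifar100)  # 5000 500
```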

The test data structure is the same, with 10000 test data.

print('len(train_cifar100), type ', len(train_cifar100), type(train_cifar100))
print('len(test_cifar100), type ', len(test_cifar100), type(test_cifar100))

print('train_cifar100[0]', type(train_cifar100[0]), len(train_cifar100[0]))

x0, y0 = train_cifar100[0]
print('train_cifar100[0][0]', x0.shape)  # , x0
print('train_cifar100[0][1]', y0.shape, y0)

len(train_cifar100), type 50000 <class 'chainer.datasets.tuple_dataset.TupleDataset'>
len(test_cifar100), type 10000 <class 'chainer.datasets.tuple_dataset.TupleDataset'>
train_cifar100[0] <class 'tuple'> 2
train_cifar100[0][0] (3, 32, 32)
train_cifar100[0][1] () 19

plot_cifar(os.path.join(basedir, 'cifar100_plot_more.png'), train_cifar100,
           10, 10, scale=4., label_list=CIFAR100_LABELS_LIST)

Next: CIFAR-10, CIFAR-100 training with Convolutional Neural Network

Understanding convolutional layer

Source code is uploaded on github.
The sample image is obtained from PEXELS.

What is the difference between a convolutional layer and a linear layer? What is the intuition behind using convolutional layers in a deep neural network?

This hands-on shows some effects of a convolutional layer to provide some intuition about what a convolutional layer does.

import os

import numpy as np
import matplotlib.pyplot as plt
import cv2

%matplotlib inline

basedir = './src/cnn/images'


def read_rgb_image(imagepath):
    """Read an image file as an RGB array of shape (height, width, channel)."""
    image = cv2.imread(imagepath)  # Height, Width, Channel (BGR order)
    if image is None:
        raise IOError('Failed to read image: {}'.format(imagepath))
    # cv2.imread returns BGR channel order regardless of the OpenCV version,
    # so always convert to RGB (the order matplotlib expects).
    return cv2.cvtColor(image, cv2.COLOR_BGR2RGB)


def read_gray_image(imagepath):
    """Read an image file as a grayscale array of shape (height, width)."""
    image = cv2.imread(imagepath)  # Height, Width, Channel (BGR order)
    if image is None:
        raise IOError('Failed to read image: {}'.format(imagepath))
    # The loaded image is in BGR order, so BGR2GRAY is the matching conversion.
    return cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)


def plot_channels(array, filepath='out.jpg'):
    """Plot each channel component separately

    Args:
        array (numpy.ndarray): 3-D array (height, width, channel)

    """
    ch_number = array.shape[2]

    fig, axes = plt.subplots(1, ch_number)
    for i in range(ch_number):
        # Save each image
        # cv2.imwrite(os.path.join(basedir, 'output_conv1_{}.jpg'.format(i)), array[:, :, i])
        axes[i].set_title('Channel {}'.format(i))
        axes[i].axis('off')
        axes[i].imshow(array[:, :, i], cmap='gray')

    plt.savefig(filepath)

The above type of diagram often appears in the convolutional neural network field. The figure below explains its notation.

The cuboid represents an “image” array, although this image is not necessarily a meaningful picture. The horizontal axis represents the channel number, the vertical axis the image height, and the depth axis the image width.

Convolution layer – basic usage

The input format of a convolutional layer is in the order (batch index, channel, height, width). Since the OpenCV image format is in the order (height, width, channel), the dimension order needs to be converted before the image is fed into a convolution layer.

It can be done by using transpose method.

L.Convolution2D(in_channels, out_channels, ksize)

  • in_channels: input channel number.
  • out_channels: output channel number.
  • ksize: kernel size.

Also, the following parameters are often set:

  • pad: padding
  • stride: stride

To understand the behavior of the convolution layer, I recommend watching the animations in conv_arithmetic.

import chainer.links as L

# Read image from file, save image with matplotlib using `imshow` function
imagepath = os.path.join(basedir, 'sample.jpeg')

image = read_rgb_image(imagepath)

# height and width shows pixel size of this image 
# Channel=3 indicates the RGB channel 
print('image.shape (Height, Width, Channel) = ', image.shape)

conv1 = L.Convolution2D(None, 3, 5)

# Need to input image of the form (batch index, channel, height, width)
image = image.transpose(2, 0, 1)
image = image[np.newaxis, :, :, :]

# Convert from int to float
image = image.astype(np.float32)
print('image shape', image.shape)
out_image = conv1(image).data
print('shape', out_image.shape)
out_image = out_image[0].transpose(1, 2, 0)
print('shape 2', out_image.shape)
plot_channels(out_image,
              filepath=os.path.join(basedir, 'output_conv1.jpg'))

#plt.imshow(image)
#plt.savefig('./src/cnn/images/out.jpg')

image.shape (Height, Width, Channel) = (380, 512, 3)
image shape (1, 3, 380, 512)
shape (1, 3, 376, 508)
shape 2 (376, 508, 3)

The Convolution2D layer takes a 4-dim array as input and outputs a 4-dim array. The graphical meaning of this input-output relationship is drawn in the figure below.

When in_channels is set to None, its size is determined the first time the layer is used, i.e., at out_image = conv1(image).data in the above code.

The internal parameter W is initialized randomly at that time. As you can see, output_conv1.jpg shows the result after a random filter is applied.

Some “features” can be extracted by applying a convolution layer.

For example, a random filter sometimes acts as a “blurring” or “edge extracting” filter.

To understand the intuitive meaning of convolutional layer in more detail, please see below example.

gray_image = read_gray_image(imagepath)
print('gray_image.shape (Height, Width) = ', gray_image.shape)

# Need to input image of the form (batch index, channel, height, width)
gray_image = gray_image[np.newaxis, np.newaxis, :, :]
# Convert from int to float
gray_image = gray_image.astype(np.float32)

conv_vertical = L.Convolution2D(1, 1, 3)
conv_horizontal = L.Convolution2D(1, 1, 3)

print(conv_vertical.W.data)
conv_vertical.W.data = np.asarray([[[[-1., 0, 1], [-1, 0, 1], [-1, 0, 1]]]])
conv_horizontal.W.data = np.asarray([[[[-1., -1, -1], [0, 0., 0], [1, 1, 1]]]])

print('image.shape', image.shape)
out_image_v = conv_vertical(gray_image).data
out_image_h = conv_horizontal(gray_image).data
print('out_image_v.shape', out_image_v.shape)
out_image_v = out_image_v[0].transpose(1, 2, 0)
out_image_h = out_image_h[0].transpose(1, 2, 0)
print('out_image_v.shape (after transpose)', out_image_v.shape)

cv2.imwrite(os.path.join(basedir, 'output_conv_vertical.jpg'), out_image_v[:, :, 0])
cv2.imwrite(os.path.join(basedir, 'output_conv_horizontal.jpg'), out_image_h[:, :, 0])

gray_image.shape (Height, Width) = (380, 512)
[[[[-0.17837302 0.2948513 -0.0661072 ]
    [ 0.02076577 -0.14251317 -0.05151904]
    [ 0.01675515 0.07612066 0.37937522]]]]
image.shape (1, 3, 380, 512)
out_image_v.shape (1, 1, 378, 510)
out_image_v.shape (after transpose) (378, 510, 1)

As you can see from the result, each convolution layer emphasizes/extracts the color difference along a specific direction. In this way a “filter”, also called a “kernel”, can be considered as a feature extractor.
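To see the same effect on numbers small enough to check by hand, here is a minimal numpy sketch. It assumes the layer computes a plain sliding-window sum of products with no padding and ignores the bias term; `correlate2d_valid` and `toy_image` are names I made up for this illustration, not part of Chainer.

```python
import numpy as np


def correlate2d_valid(image, kernel):
    """Slide `kernel` over `image` with stride 1 and no padding."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=np.float32)
    for i in range(out_h):
        for j in range(out_w):
            # Sum of element-wise products inside the current window
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


# Toy 5x6 image: dark (0) on the left half, bright (1) on the right half
toy_image = np.zeros((5, 6), dtype=np.float32)
toy_image[:, 3:] = 1.0

# Same kernel values as conv_vertical / conv_horizontal above
kernel_v = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], dtype=np.float32)
kernel_h = np.array([[-1, -1, -1], [0, 0, 0], [1, 1, 1]], dtype=np.float32)

out_v = correlate2d_valid(toy_image, kernel_v)
out_h = correlate2d_valid(toy_image, kernel_h)
print(out_v)  # non-zero only around the left/right boundary
print(out_h)  # all zeros: nothing changes from top to bottom
```

The first kernel responds only where the brightness changes from left to right, while the second one stays silent because the toy image has no change from top to bottom.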

Convolution with stride

The default value of stride is 1. If a larger value is specified, the convolution layer reduces the output image size.

Practically, stride=2 is often used to generate the output image of the height & width almost half of the input image.

print('image.shape (Height, Width, Channel) = ', image.shape)

conv2 = L.Convolution2D(None, 5, 3, 2)

print('input image.shape', image.shape)
out_image = conv2(conv1(image)).data
print('out_image.shape', out_image.shape)
out_image = out_image[0].transpose(1, 2, 0)
plot_channels(out_image,
              filepath=os.path.join(basedir, 'output_conv2.jpg'))

image.shape (Height, Width, Channel) = (1, 3, 380, 512)
input image.shape (1, 3, 380, 512)
out_image.shape (1, 5, 187, 253)

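As written in the Chainer docs, the relation between the input and output shapes is given by the formulas below:

$$ h_O = (h_I + 2h_P - h_K) / s_Y + 1 $$
$$ w_O = (w_I + 2w_P - w_K) / s_X + 1 $$

The division is rounded down, as the shapes printed above show: \((376 - 3) / 2 + 1 = 187.5\) gives a height of 187.

The symbols mean

  • \(h\): height
  • \(w\): width
  • \(I\): input
  • \(O\): output
  • \(P\): padding
  • \(K\): kernel size
  • \(s_Y\), \(s_X\): stride along height and width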
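The formula can be checked against the shapes printed above with a few lines of Python. This is only a sketch: `conv_output_size` is a helper name I introduce here, and it assumes the division is rounded down, which is what the printed shapes suggest.

```python
def conv_output_size(size, ksize, stride=1, pad=0):
    """Output length along one axis; the division is rounded down."""
    return (size + 2 * pad - ksize) // stride + 1


# conv1: ksize=5, stride=1, no padding, applied to the 380 x 512 image
h1 = conv_output_size(380, 5)
w1 = conv_output_size(512, 5)

# conv2: ksize=3, stride=2, applied to the output of conv1
h2 = conv_output_size(h1, 3, stride=2)
w2 = conv_output_size(w1, 3, stride=2)

print((h1, w1), (h2, w2))  # (376, 508) (187, 253)
```

Both pairs agree with the output shapes of conv1 and conv2 shown earlier.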

Max pooling

A convolution layer with stride can be used to look at features over a wide range; another popular method is max pooling.

The max pooling function extracts the maximum value in the kernel window and discards the information of the remaining pixels.

This behavior is beneficial for imposing translational symmetry. For example, consider a picture of a dog. Even if every pixel is shifted by one pixel, it should still be recognized as a dog. So translational symmetry can be exploited to reduce the model’s calculation time and the number of internal parameters for an image classification task.

from chainer import functions as F
print('image.shape (Height, Width, Channel) = ', image.shape)

print('input image.shape', image.shape)
out_image = F.max_pooling_2d(image, 2).data
print('out_image.shape', out_image.shape)
out_image = out_image[0].transpose(1, 2, 0)
plot_channels(out_image,
              filepath=os.path.join(basedir, 'output_max_pooling.jpg'))

image.shape (Height, Width, Channel) = (1, 3, 380, 512)
input image.shape (1, 3, 380, 512)
out_image.shape (1, 3, 190, 256)
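To make the pooling operation itself concrete, here is a small numpy sketch of 2x2 max pooling on a single-channel array. `max_pool_2x2` is a name I made up, and it only handles arrays whose height and width are even; it is not how F.max_pooling_2d is implemented.

```python
import numpy as np


def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a (height, width) array with even sides."""
    h, w = x.shape
    # Split each axis into (blocks, 2) and take the maximum inside every block
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))


x = np.array([[1, 2, 0, 0],
              [3, 4, 0, 5],
              [0, 0, 9, 8],
              [0, 1, 7, 6]], dtype=np.float32)

pooled = max_pool_2x2(x)
print(pooled)
# [[4. 5.]
#  [1. 9.]]
```

Each 2x2 block is reduced to its largest value, so the height and width are halved just like in the output shape above.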

Convolutional neural network

By combining above functions with non-linear activation units, Convolutional Neural Network (CNN) can be constructed.

For non-linear activation, relu, leaky_relu, sigmoid or tanh are often used.

import chainer


class SimpleCNN(chainer.Chain):
    def __init__(self):
        super().__init__(
            conv1=L.Convolution2D(None, 5, 3),
            conv2=L.Convolution2D(None, 5, 3),
        )

    def __call__(self, x):
        # Links registered in __init__ are accessed as attributes of self
        h = F.relu(self.conv1(x))
        h = F.max_pooling_2d(h, 2)
        h = F.relu(self.conv2(h))
        h = F.max_pooling_2d(h, 2)
        return h
  
model = SimpleCNN()
print('input image.shape', image.shape)
out_image = model(image).data
print('out_image.shape', out_image.shape)
out_image = out_image[0].transpose(1, 2, 0)
plot_channels(out_image,
              filepath=os.path.join(basedir, 'output_simple_cnn.jpg'))

input image.shape (1, 3, 380, 512)
out_image.shape (1, 5, 47, 63)

Let’s see how this CNN can be used for image classification in the following posts. Before that, the next post explains the CIFAR-10 and CIFAR-100 datasets, which are famous image classification datasets for research.

Next: CIFAR-10, CIFAR-100 dataset introduction

Basic image processing tutorial

Basic image processing for deep learning. Refer to github for the source code.
The sample image is obtained from PEXELS.

If you are not familiar with image processing, you can read this article before going on to convolutional neural networks.

OpenCV is an image processing library which supports

  • loading an image in numpy.ndarray format, and saving an image
  • converting the image color format (RGB, YUV, gray scale etc)
  • resizing

and other useful image processing functionality.

To install OpenCV, execute

$conda install -c https://conda.binstar.org/menpo -y opencv3

import os

import matplotlib.pyplot as plt
import cv2

%matplotlib inline

def readRGBImage(imagepath):
    # cv2.imread returns (Height, Width, Channel) with channels in B, G, R
    # order regardless of the OpenCV version, so always convert to RGB.
    image = cv2.imread(imagepath)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    return image

Loading and saving images

  • cv2.imread for loading an image.
  • cv2.imwrite for saving an image.
  • plt.imshow for plotting, and plt.savefig for saving the plot image.

The OpenCV image format is usually 3-dimensional (or 2-dimensional if the image is gray scale).

The 1st dimension is for height, the 2nd dimension is for width, and the 3rd dimension is for channel (RGB, YUV etc).

To convert the color format, cv2.cvtColor can be used. Details are written in the next section.

# Read image from file, save image with matplotlib using `imshow` function
basedir = './src/cnn/images'
imagepath = os.path.join(basedir, 'sample.jpeg')
#image = cv2.imread(imagepath, cv2.IMREAD_GRAYSCALE)
image = readRGBImage(imagepath)
# Width and Height shows pixel size of this image 
# Channel=3 indicates the RGB channel 
print('image.shape (Height, Width, Channel) = ', image.shape)
# Save image with openCV
# Colors may look wrong (red and blue swapped) because cv2.imwrite expects B, G, R order.
cv2.imwrite('./src/cnn/images/out.jpg', image)  
# bgr_image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
# cv2.imwrite('./src/cnn/images/out.jpg', bgr_image)  
# Plotting
plt.imshow(image)
plt.savefig('./src/cnn/images/out_plt.png')

image.shape (Height, Width, Channel) = (380, 512, 3)

out_plt.jpg

Change color format

  • cv2.cvtColor for converting color format.

Note that OpenCV (cv2.imread) reads the image color in the order B, G, R. However, matplotlib deals with the image color in the order R, G, B. So you need to convert the color order; refer to the readRGBImage function.
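The channel order is easy to see on a tiny array. The sketch below uses only numpy and a made-up one-pixel image; reversing the last axis gives the same channel order as cv2.cvtColor with cv2.COLOR_BGR2RGB.

```python
import numpy as np

# One-pixel "image" in B, G, R order: pure blue (B=255, G=0, R=0)
bgr_pixel_image = np.array([[[255, 0, 0]]], dtype=np.uint8)

# Reversing the last axis turns (B, G, R) into (R, G, B)
rgb_pixel_image = bgr_pixel_image[:, :, ::-1]

print(bgr_pixel_image[0, 0])  # [255   0   0]
print(rgb_pixel_image[0, 0])  # [  0   0 255]
```

If the B, G, R array were passed to plt.imshow directly, matplotlib would read the first value as red, and the blue pixel would be drawn in red.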

If the image is gray scale, the image is a 2-dimensional array.

The 1st dimension is for height, and the 2nd dimension is for width.

gray_image = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)

# Gray scale image is 2 dimension, No channel dimension. 
print('gray_image.shape (Height, Width) = ', gray_image.shape)
cv2.imwrite('./src/cnn/images/out_gray.jpg', gray_image)

gray_image.shape (Height, Width) = (380, 512)

out_gray.jpg

Resize

  • cv2.resize for resizing.

Note that size should be specified in the order width, height.

%matplotlib inline
print('image.shape (Height, Width, Channel) = ', image.shape)

# Resize image to half size
height, width = image.shape[:2]
half_image = cv2.resize(image, (width//2, height//2))  # size must be int
print('half_image.shape (Height, Width, Channel) = ', half_image.shape)
plt.imshow(half_image)
plt.savefig('./src/cnn/images/out_half.jpeg')


# Resize image by specifying longer side size
def resize_longedge(image, pixel):
    """Resize the input image
    
    Longer edge size will be `pixel`, and aspect ratio doesn't change
    """
    height, width = image.shape[:2]
    longer_side = max(height, width)
    ratio = float(pixel) / longer_side
    return cv2.resize(image, None, fx=ratio, fy=ratio)  # scale both axes by ratio

resized128_image = resize_longedge(image, 128)
print('resized128_image.shape (Height, Width, Channel) = ', resized128_image.shape)
plt.imshow(resized128_image)
plt.savefig('./src/cnn/images/out_resized128.jpg')

image.shape (Height, Width, Channel) = (380, 512, 3)
half_image.shape (Height, Width, Channel) = (190, 256, 3)
resized128_image.shape (Height, Width, Channel) = (95, 128, 3)

out_resized128.jpg

Crop

  • numpy slicing can be used for cropping image
# Crop center of half_image 

height, width = half_image.shape[:2]

crop_length = min(height, width)

height_start = (height - crop_length) // 2
width_start = (width - crop_length) // 2

cropped_image = half_image[
                height_start:height_start+crop_length, 
                width_start:width_start+crop_length,
                :] 
print('cropped_image.shape (Height, Width, Channel) = ', cropped_image.shape)
plt.imshow(cropped_image)
plt.savefig('./src/cnn/images/out_cropped.jpg')

cropped_image.shape (Height, Width, Channel) = (190, 190, 3)

Image processing with channels

RGB channel manipulation.

Understanding the meaning of “channel” is important in deep learning. The code below gives some insight into what each channel represents.

%matplotlib inline
# Show RGB channel separately in gray scale

fig, axes = plt.subplots(1, 3)

# image[:, :, 0] is R channel.
axes[0].set_title('R channel')
axes[0].imshow(image[:, :, 0], cmap='gray')
# image[:, :, 1] is G channel.
axes[1].set_title('G channel')
axes[1].imshow(image[:, :, 1], cmap='gray')
# image[:, :, 2] is B channel.
axes[2].set_title('B channel')
axes[2].imshow(image[:, :, 2], cmap='gray')
plt.savefig(os.path.join(basedir, 'RGB_gray.jpg'))
RGB_gray.jpg
# Show RGB channel separately in color
fig, axes = plt.subplots(1, 3)

# image[:, :, 0] is R channel, replace the rest by 0.
imageR = image.copy()
imageR[:, :, 1:3] = 0
axes[0].set_title('R channel')
axes[0].imshow(imageR)

# image[:, :, 1] is G channel, replace the rest by 0.
imageG = image.copy()
imageG[:, :, [0, 2]] = 0
axes[1].set_title('G channel')
axes[1].imshow(imageG)

# image[:, :, 2] is B channel, replace the rest by 0.
imageB = image.copy()
imageB[:, :, 0:2] = 0
axes[2].set_title('B channel')
axes[2].imshow(imageB)
plt.savefig(os.path.join(basedir, 'RGB_color.jpg'))
RGB_color.jpg

Each of the R, G, B channels is shown in red, green and blue respectively.

Next: Understanding convolutional layer

MNIST inference code

We have already learned how to write training code in Chainer. The last task is to use the trained model for inference, i.e. to predict the label of a test input MNIST image.

The inference code structure usually looks as follows:

  • Prepare input data
  • Instantiate the trained model
  • Load the trained model
  • Feed input data into loaded model to get inference result

You have already learned the necessary pieces, so it is easy. See inference_mnist.py for the source code.

Prepare input data

For MNIST, this takes one line:

    # Load the MNIST dataset
    train, test = chainer.datasets.get_mnist()

Instantiate the trained model and load the model

    # Load trained model
    model = mlp.MLP(args.unit, 10)
    if args.gpu >= 0:
        chainer.cuda.get_device(args.gpu).use()  # Make a specified GPU current
        model.to_gpu()  # Copy the model to the GPU
    xp = np if args.gpu < 0 else cuda.cupy

    serializers.load_npz(args.modelpath, model)

Here, note that the parameters can be loaded only after instantiating the model. This model must have the same structure (hidden unit size, layer depth etc) as when you saved the model in the training stage.

Feed input data into loaded model to get inference result

Below code is to get inference result y from test input data x.

    for i in range(len(test)):
        x = Variable(xp.asarray([test[i][0]]))    # test data
        # t = Variable(xp.asarray([test[i][1]]))  # labels
        y = model(x)                              # Inference result

Visualize the result

You might want to see the inference result together with the input image to understand more precisely. This code draws a plot for test input image and its inference result.

    """Original code referenced from https://github.com/hido/chainer-handson"""
    ROW = 4
    COLUMN = 5
    # show graphical results of first 20 data to understand what's going on in inference stage
    plt.figure(figsize=(15, 10))
    for i in range(ROW * COLUMN):
        # Example of predicting the test input one by one.
        x = Variable(xp.asarray([test[i][0]]))  # test data
        # t = Variable(xp.asarray([test[i][1]]))  # labels
        y = model(x)
        np.set_printoptions(precision=2, suppress=True)
        print('{}-th image: answer = {}, predict = {}'.format(i, test[i][1], F.softmax(y).data))
        prediction = y.data.argmax(axis=1)
        example = (test[i][0] * 255).astype(np.int32).reshape(28, 28)
        plt.subplot(ROW, COLUMN, i+1)
        plt.imshow(example, cmap='gray')
        plt.title("No.{0} / Answer:{1}, Predict:{2}".format(i, test[i][1], prediction))
        plt.axis("off")
    plt.tight_layout()
    plt.savefig('inference.png')
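
The two steps used in the loop above, softmax for printing probabilities and argmax for the predicted label, can be tried on made-up numbers without a trained model. The sketch below is plain numpy; `softmax` is my own helper and the scores are invented for illustration.

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax; the row maximum is subtracted for numerical stability."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


# Made-up raw model outputs for one input and three classes
scores = np.array([[1.0, 3.0, 0.5]], dtype=np.float32)

probs = softmax(scores)
prediction = scores.argmax(axis=1)

print(probs)       # probabilities that sum to 1 along each row
print(prediction)  # [1]
```

Softmax does not change which entry is the largest, so taking argmax of the raw output y, as the code above does, gives the same label as taking argmax of the probabilities.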

Even though only 50 hidden units are used, the accuracy of inferring the MNIST digits is quite high.

That’s all for the MNIST dataset tutorial. Now you have learned the basics of how to use a deep learning framework: how to write training code and how to write inference code with Chainer. You are now ready to go further into specialized categories. Convolutional Neural Networks are widely used, especially in image processing, and Recurrent Neural Networks in language processing, etc.

Chainer family

[Updated on May 31: Add detail description for ChainerCV & ChainerMN]

Recently several sub-libraries for Chainer have been released:

ChainerRL

RL: Reinforcement Learning

Deep Reinforcement Learning library.

Atari 2600 play
cite from http://chainer.org/general/2017/02/22/ChainerRL-Deep-Reinforcement-Learning-Library.html

Recent state-of-the-art deep reinforcement algorithms are implemented, including

  • A3C (Asynchronous Advantage Actor-Critic)
  • ACER (Actor-Critic with Experience Replay) (only the discrete-action version for now)
  • Asynchronous N-step Q-learning
  • DQN (including Double DQN, Persistent Advantage Learning (PAL), Double PAL, Dynamic Policy Programming (DPP))
  • DDPG (Deep Deterministic Policy Gradients) (including SVG(0))
  • PGT (Policy Gradient Theorem) 

How to install

pip install chainerrl

ChainerCV

CV: Computer Vision

Image processing library for deep learning training. Common data augmentations are implemented. Also, trained models for bounding box detection and semantic segmentation are provided.

How to install

pip install chainercv 

ChainerMN

MN: Multi Node

Distributed deep learning framework for Chainer.

cite from https://research.preferred.jp/2017/05/chainermn-beta-release/

It was announced at Deep Learning Summit 2017 that training for the ImageNet classification task took 4.4 hours (ResNet-50, 100 Epochs, 128 GPUs), which is the fastest among distributed deep learning frameworks known to date.

Reference

How to install

You need to install CUDA-aware API and NCCL beforehand, and then,

pip install cython
pip install chainermn

Writing organized, reusable, clean training code using Trainer module

Training code abstraction with Trainer

Until now, I have implemented the training code in a “primitive” way to explain what kind of operations are going on in deep learning training (※). However, the code can be written in a much cleaner way using the Trainer modules in Chainer.

※ Trainer modules have been available since version 1.11, and some open source projects are implemented without Trainer. So knowing the training implementation without the Trainer module also helps in understanding that code.

Motivation for using Trainer

We can notice there are many “typical” operations widely used in machine learning, for example

  • Iterating minibatch training, with minibatches sampled randomly
  • Separating train data & validation data, where validation data is used only for checking the loss to prevent overfitting
  • Outputting the log and saving the trained model at regular intervals

These operations are commonly applied, and Chainer provides these features at the library level so that users don’t need to implement them again and again. Trainer will manage the training code for you!

Details are also explained in the official documentation of Trainer.

Source code with Trainer

Seeing is better than hearing: train_mnist_4_trainer.py is the source code which uses the Trainer module. With the comments removed, the source code looks like below,

from __future__ import print_function
import argparse

import chainer
import chainer.functions as F
import chainer.links as L
from chainer import training
from chainer.training import extensions
from chainer import serializers

import mlp as mlp


def main():
    parser = argparse.ArgumentParser(description='Chainer example: MNIST')
    parser.add_argument('--batchsize', '-b', type=int, default=100,
                        help='Number of images in each mini-batch')
    parser.add_argument('--epoch', '-e', type=int, default=20,
                        help='Number of sweeps over the dataset to train')
    parser.add_argument('--gpu', '-g', type=int, default=-1,
                        help='GPU ID (negative value indicates CPU)')
    parser.add_argument('--out', '-o', default='result/4',
                        help='Directory to output the result')
    parser.add_argument('--resume', '-r', default='',
                        help='Resume the training from snapshot')
    parser.add_argument('--unit', '-u', type=int, default=50,
                        help='Number of units')
    args = parser.parse_args()

    print('GPU: {}'.format(args.gpu))
    print('# unit: {}'.format(args.unit))
    print('# Minibatch-size: {}'.format(args.batchsize))
    print('# epoch: {}'.format(args.epoch))
    print('')

    model = mlp.MLP(args.unit, 10)
    classifier_model = L.Classifier(model)
    if args.gpu >= 0:
        chainer.cuda.get_device(args.gpu).use()  # Make a specified GPU current
        classifier_model.to_gpu()  # Copy the model to the GPU

    optimizer = chainer.optimizers.Adam()
    optimizer.setup(classifier_model)

    train, test = chainer.datasets.get_mnist()

    train_iter = chainer.iterators.SerialIterator(train, args.batchsize)
    test_iter = chainer.iterators.SerialIterator(test, args.batchsize, repeat=False, shuffle=False)

    updater = training.StandardUpdater(train_iter, optimizer, device=args.gpu)
    trainer = training.Trainer(updater, (args.epoch, 'epoch'), out=args.out)

    trainer.extend(extensions.Evaluator(test_iter, classifier_model, device=args.gpu))
    trainer.extend(extensions.dump_graph('main/loss'))
    trainer.extend(extensions.snapshot(), trigger=(1, 'epoch'))
    trainer.extend(extensions.LogReport())
    trainer.extend(extensions.PrintReport(
        ['epoch', 'main/loss', 'validation/main/loss',
         'main/accuracy', 'validation/main/accuracy', 'elapsed_time']))
    trainer.extend(extensions.ProgressBar())

    if args.resume:
        # Resume from a snapshot
        serializers.load_npz(args.resume, trainer)

    trainer.run()
    serializers.save_npz('{}/mlp.model'.format(args.out), model)

if __name__ == '__main__':
    main()


See how clean the code is! Compare the above code with train_mnist_2_predictor_classifier.py. The code does not even contain an explicit for loop or a random permutation for minibatches.

The code length is also reduced to almost half, even though it supports more functionality than the previous train_mnist_2_predictor_classifier.py code:

  • Calculating validation loss and accuracy
  • Saving a trainer snapshot at regular intervals (it includes optimizer and model data).
    You can pause and resume training.
  • Printing the log in a formatted way, together with a progress bar which shows the training status.
  • Outputting the training result to a log file as JSON formatted text.

However, it has changed so much from the previous code that users might not understand what’s going on. Several modules are used together with the Trainer. Let’s see an overview of the role of each module one by one.

  • Dataset
  • Iterator
  • Updater
  • Trainer
    • extensions
      – Evaluator
      – LogReport
      – PrintReport
      – ProgressBar
      – snapshot
      – dump_graph

More detailed functionality and usage are explained later.

Dataset

Input data should be prepared in Dataset format so that Iterator can handle it.

In this example, the dataset does not appear explicitly, but it is already prepared by

    train, test = chainer.datasets.get_mnist()

These train and test are TupleDataset instances. Recall the MNIST dataset introduction.

There are several Dataset classes, TupleDataset, ImageDataset etc, and you can even define your custom Dataset class by using DatasetMixin.

All Datasets follow the common rule that when data is a Dataset instance, data[i] points to the i-th data.

Usually it consists of input data and target data (answer), where data[i][0] is the i-th input data and data[i][1] is the i-th target data. However, it can be only one element or even more than 2 elements depending on the problem.

Role: Prepares input values and provides index access to the data. Specifically, the i-th data can be accessed by data[i], so that Iterator can handle it.
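
The index-access rule can be illustrated with a few lines of plain Python. `PairDataset` below is a hypothetical class written only for this explanation; it is not Chainer's TupleDataset, it just follows the same data[i] convention.

```python
class PairDataset:
    """Hypothetical minimal dataset: data[i] returns (input, target)."""

    def __init__(self, inputs, targets):
        if len(inputs) != len(targets):
            raise ValueError('inputs and targets must have the same length')
        self.inputs = inputs
        self.targets = targets

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, i):
        # data[i][0] is the i-th input, data[i][1] is the i-th target
        return self.inputs[i], self.targets[i]


toy = PairDataset([[0.0, 0.1], [0.2, 0.3], [0.4, 0.5]], [0, 1, 0])
x1, t1 = toy[1]
print(len(toy), x1, t1)  # 3 [0.2, 0.3] 1
```

Anything that supports len() and integer indexing in this way can be consumed by code that only relies on the data[i] rule.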

See also official document

Iterator

The for loop over training minibatches is replaced and managed by Iterator.

    train_iter = chainer.iterators.SerialIterator(train, args.batchsize)

This one line provides almost the same functionality as the following training loop,

    # Learning loop
    for epoch in six.moves.range(1, n_epoch + 1):
        # training
        perm = np.random.permutation(N)
        for i in six.moves.range(0, N, batchsize):
            x = chainer.Variable(xp.asarray(train[perm[i:i + batchsize]][0]))
            t = chainer.Variable(xp.asarray(train[perm[i:i + batchsize]][1]))

            # Pass the loss function (Classifier defines it) and its arguments
            optimizer.update(classifier_model, x, t)

and the same applies to the validation (test) dataset,

    test_iter = chainer.iterators.SerialIterator(test, args.batchsize, repeat=False, shuffle=False)

corresponds to

        for i in six.moves.range(0, N_test, batchsize):
            index = np.asarray(list(range(i, i + batchsize)))
            x = chainer.Variable(xp.asarray(test[index][0]), volatile='on')
            t = chainer.Variable(xp.asarray(test[index][1]), volatile='on')
            loss = classifier_model(x, t)

Minibatch random sampling, implemented with np.random.permutation, can be replaced by just setting the shuffle flag to True or False (default True).

Currently 2 Iterator classes are provided, 

  • SerialIterator is the most basic class.
  • MultiProcessIterator provides multi process data preparation support in background.

Both of them have the

Role: Construct minibatch from Dataset (including background preparation support using multi process), and pass it to Updater.

See also official document
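
To show what constructing minibatches from a dataset means, here is a small numpy sketch of one epoch of minibatch iteration. `iterate_minibatches` is a made-up generator, not SerialIterator itself; in particular it only covers a single pass over the data and has no repeat behavior.

```python
import numpy as np


def iterate_minibatches(dataset, batchsize, shuffle=True, seed=0):
    """Yield lists of examples that together cover the dataset once."""
    n = len(dataset)
    if shuffle:
        order = np.random.RandomState(seed).permutation(n)
    else:
        order = np.arange(n)
    for start in range(0, n, batchsize):
        # Index access data[i] is all that is required from the dataset
        yield [dataset[int(i)] for i in order[start:start + batchsize]]


# Toy dataset of (input, target) pairs
toy_dataset = [(float(i), i % 2) for i in range(10)]

batches = list(iterate_minibatches(toy_dataset, batchsize=4))
print([len(b) for b in batches])  # [4, 4, 2]
```

The shuffle argument plays the role of the random permutation in the hand-written loop, and with shuffle=False the examples come out in their original order, as wanted for validation.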

Updater

After creating the Iterator, it is set to the Updater together with the optimizer,

    updater = training.StandardUpdater(train_iter, optimizer, device=args.gpu)

Updater is in charge of calling the optimizer’s update function, which corresponds to calling

            # Pass the loss function (Classifier defines it) and its arguments
            optimizer.update(classifier_model, x, t)

Currently 2 Updater classes are (and 1 Updater will be) provided, 

  • StandardUpdater is the basic class.
  • ParallelUpdater is for utilizing multiple GPU at the same time.

Role: Receives a minibatch from Iterator, calculates the loss and calls the optimizer’s update.
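
The role described above can be imitated with a toy example in plain numpy. `ToyUpdater` is a hypothetical class for illustration only: the "model" is a single weight w in y = w * x, the loss is the mean squared error, and the "optimizer" is a hand-written gradient descent step. StandardUpdater itself works with Chainer optimizers and is not implemented like this.

```python
import numpy as np


class ToyUpdater:
    """Hypothetical stand-in for an updater: one call makes one parameter update."""

    def __init__(self, batches, w, lr=0.05):
        self.batches = iter(batches)
        self.w = w
        self.lr = lr
        self.iteration = 0

    def update(self):
        x, t = next(self.batches)            # receive a minibatch
        y = self.w * x                       # forward computation
        loss = np.mean((y - t) ** 2)         # calculate the loss
        grad = np.mean(2 * (y - t) * x)      # gradient of the loss w.r.t. w
        self.w -= self.lr * grad             # the "optimizer update"
        self.iteration += 1
        return loss


# Toy data generated from t = 2 * x, repeated as 50 identical minibatches
x = np.array([1.0, 2.0, 3.0])
t = 2.0 * x
updater = ToyUpdater([(x, t)] * 50, w=0.0)

losses = [updater.update() for _ in range(50)]
print(updater.iteration, updater.w)  # w ends up very close to 2.0
```

Each call to update consumes one minibatch and moves the parameter once, which is the unit of work that the Trainer repeats.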

Currently I could not find the official documentation for Updater, but you can refer to the source code docstrings.

Trainer

Finally, a Trainer instance can be created from the Updater

trainer = training.Trainer(updater, (args.epoch, 'epoch'), out=args.out)

To start training, just call run

trainer.run()

Usually extensions are registered before calling the trainer’s run, see below.

Role: Manages the training lifecycle. Extensions can be registered.

Trainer extension

A Trainer extension can be registered with the trainer.extend() function.

These extensions are used in this example:

  • Evaluator
    Calculates the validation loss and accuracy, which are printed out and logged to file.
  • LogReport
    Writes the log file in JSON format to the directory specified by the out argument of trainer.
  • PrintReport
    Prints the log to standard output (console) to show the training status.
  • ProgressBar
    Shows a progress bar indicating the current progress of training.
  • snapshot
    Saves the trainer state (including model and optimizer information) at regular intervals.
    By setting this extension, you can pause and resume training.
  • dump_graph
    Dumps the neural network computational graph.

Role: Hooks into the trainer to run certain actions at specific timings.
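
The idea of hooking actions into the training loop at specific timings can be sketched in plain Python. `ToyTrainer` is a hypothetical class for illustration, with a simple "every N iterations" interval standing in for the trigger; it is not how Chainer's Trainer is implemented.

```python
class ToyTrainer:
    """Hypothetical training loop that calls registered extensions at intervals."""

    def __init__(self, update_fn, stop_iteration):
        self.update_fn = update_fn
        self.stop_iteration = stop_iteration
        self.iteration = 0
        self.extensions = []  # list of (function, interval) pairs

    def extend(self, extension, interval=1):
        self.extensions.append((extension, interval))

    def run(self):
        while self.iteration < self.stop_iteration:
            self.update_fn()          # one parameter update
            self.iteration += 1
            for extension, interval in self.extensions:
                if self.iteration % interval == 0:
                    extension(self)   # the extension receives the trainer


log = []
snapshots = []

toy_trainer = ToyTrainer(update_fn=lambda: None, stop_iteration=6)
toy_trainer.extend(lambda tr: log.append(tr.iteration), interval=1)
toy_trainer.extend(lambda tr: snapshots.append(tr.iteration), interval=3)
toy_trainer.run()

print(log)        # [1, 2, 3, 4, 5, 6]
print(snapshots)  # [3, 6]
```

The training loop itself stays unchanged while logging and snapshot-like actions are added or removed from outside, which is what makes extensions reusable.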

Trainer architecture summary

Refer to the above figure for the training abstraction procedure using the Trainer module.

Advantage of using Trainer module

– Multi process data preparation using MultiProcessIterator

Python has the GIL (Global Interpreter Lock), so even if you use multiple threads, they are not executed in “parallel”. If the code contains heavy data preprocessing (e.g. data augmentation, adding noise before feeding as input), you can benefit from using MultiProcessIterator.

– Multiple GPU utilization

 – ParallelUpdater or MultiProcessParallelUpdater

– Trainer extensions are useful and reusable once you made your own extension

 – PrintReport

 – ProgressBar

 – LogReport

 — The log is in json format, it is easy to load and plot learning curve graph etc.

 – snapshot

 etc etc… Why don’t we use it!
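
As a small illustration of the point about the JSON log, the sketch below extracts a learning curve from log text. It assumes the log is a JSON list with one dict per report, keyed by the same names passed to PrintReport above; the numbers are made up for illustration and are not real training results.

```python
import json

# Made-up excerpt in the assumed shape of the log file (values are illustrative only)
log_text = '''
[
  {"epoch": 1, "main/loss": 0.9, "validation/main/loss": 0.8},
  {"epoch": 2, "main/loss": 0.5, "validation/main/loss": 0.6}
]
'''

entries = json.loads(log_text)

# One list per curve, ready to be passed to a plotting function
epochs = [entry["epoch"] for entry in entries]
train_loss = [entry["main/loss"] for entry in entries]
val_loss = [entry["validation/main/loss"] for entry in entries]

print(epochs, train_loss, val_loss)
```

With a real log you would read the text from the file in the out directory instead of the inline string, and then plot train_loss and val_loss against epochs.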

Next: MNIST inference code