[Update 2017.06.11] Add chainer v2 code.


## Training MNIST

You have already studied the basics of Chainer and the MNIST dataset. Now we can proceed to the MNIST classification task. We want to create a classifier that maps an MNIST handwritten digit image to its digit. In other words, the classifier receives an array representing an MNIST image as input and outputs its label.

※ Chainer contains modules called `Trainer`, `Iterator`, and `Updater`, which make your training code more organized. It is quite nice to write training code with them in a higher-level syntax. However, their abstraction makes it difficult to understand what is going on during training. For those who want to learn deep learning in more detail, I think it is good to know the “primitive way” of writing training code. Therefore, I intentionally do not use these modules at first to explain the training code.

The source code below is based on `train_mnist_1_minimum.py`.

[hands on] Before reading the explanation, try to execute `train_mnist_1_minimum.py`. If you are using an IDE like PyCharm, just press the run button. If you run it from the command line, go to the `src` directory first and execute

```
python mnist/train_mnist_1_minimum.py
```

You will see a log like the one below, indicating that the loss decreases and the accuracy increases as training proceeds.

```
GPU: -1
# unit: 50
# Minibatch-size: 100
# epoch: 20
out directory: result/1_minimum
epoch 1
train mean loss=0.41262895802656807, accuracy=0.8860333333909511, throughput=54883.71423542936 images/sec
test mean loss=0.21496000131592155, accuracy=0.9357000035047531
epoch 2
train mean loss=0.1967763691022992, accuracy=0.942733335296313, throughput=66559.17396858479 images/sec
test mean loss=0.17020921929739416, accuracy=0.9499000030755996
epoch 3
train mean loss=0.1490274258516729, accuracy=0.9558166695634523, throughput=66375.93210754421 images/sec
test mean loss=0.1352944350033067, accuracy=0.9595000040531159
...
```

Of course, it is fine if you do not understand this log yet. I will explain the details one by one below.

## Define Network and loss function

Let’s adopt the Multi Layer Perceptron (MLP), which is the simplest neural network, as our model. It is written as follows with Chainer,

```python
class MLP(chainer.Chain):
    """Neural Network definition, Multi Layer Perceptron"""
    def __init__(self, n_units, n_out):
        super(MLP, self).__init__()
        with self.init_scope():
            # the size of the inputs to each layer will be inferred when `None`
            self.l1 = L.Linear(None, n_units)  # n_in -> n_units
            self.l2 = L.Linear(None, n_units)  # n_units -> n_units
            self.l3 = L.Linear(None, n_out)  # n_units -> n_out

    def __call__(self, x):
        h1 = F.relu(self.l1(x))
        h2 = F.relu(self.l2(h1))
        y = self.l3(h2)
        return y
```

[Memo] In chainer v1, it was written as follows,

```python
# Neural Network definition, Multi Layer Perceptron
class MLP(chainer.Chain):
    def __init__(self, n_units, n_out):
        super(MLP, self).__init__(
            # the size of the inputs to each layer will be inferred
            l1=L.Linear(None, n_units),  # n_in -> n_units
            l2=L.Linear(None, n_units),  # n_units -> n_units
            l3=L.Linear(None, n_out),  # n_units -> n_out
        )

    def __call__(self, x):
        h1 = F.relu(self.l1(x))
        h2 = F.relu(self.l2(h1))
        y = self.l3(h2)
        return y
```

This model is graphically drawn as follows. All nodes are fully connected, and a network with this kind of structure is called an MLP (Multi Layer Perceptron).

The first part is the input layer and the last part is the output layer. The layers in between are called “hidden layers”. This example contains only 1 hidden layer, but in general there may be more than one (the deeper you make the network, the more hidden layers it has).

As written in the `__call__` function, it takes `x` (an array representing the image) as input and returns `y` (a score for each label, from which the predicted probability is obtained by softmax) as output.
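To make the data flow in `__call__` concrete, here is a minimal NumPy sketch of the same three-layer forward pass. It is not Chainer code: the weights are random, and the input size of 784 (a flattened 28×28 image) as well as the helper names `relu` and `mlp_forward` are assumptions made only for this illustration.

```python
import numpy as np


def relu(a):
    """Element-wise max(a, 0), the operation F.relu applies."""
    return np.maximum(a, 0.0)


def mlp_forward(x, params):
    """Forward pass: linear -> relu -> linear -> relu -> linear."""
    (w1, b1), (w2, b2), (w3, b3) = params
    h1 = relu(x @ w1 + b1)
    h2 = relu(h1 @ w2 + b2)
    return h2 @ w3 + b3  # raw scores, one per label


rng = np.random.default_rng(0)
n_in, n_units, n_out = 784, 50, 10  # n_in = 28 * 28 is an assumption
params = [
    (rng.normal(scale=0.1, size=(n_in, n_units)), np.zeros(n_units)),
    (rng.normal(scale=0.1, size=(n_units, n_units)), np.zeros(n_units)),
    (rng.normal(scale=0.1, size=(n_units, n_out)), np.zeros(n_out)),
]

x = rng.random((100, n_in))  # a fake minibatch of 100 "images"
y = mlp_forward(x, params)
print(y.shape)               # (100, 10)
```

Each row of `y` holds 10 numbers, one per digit, which is the shape the loss function expects.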

However, this is not enough for training the model. We need a **loss function** to optimize. In classification tasks, the softmax cross entropy loss is often used.

The output of a Linear layer can be an arbitrary real number. The **softmax function** converts it to a value between 0 and 1, so we can regard it as the “probability for this label”. **Cross entropy** measures the loss between two probability distributions. Chainer has a utility function `F.softmax_cross_entropy(y, t)` that calculates the softmax of `y` followed by the cross entropy with `t`. The loss becomes smaller as the probability distribution predicted from `y` gets closer to the actual probability distribution `t`. Intuitively, the loss decreases when the model predicts the correct label for a given image.

I will skip the detailed explanation here; please study it on your own.
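As a rough illustration of the two steps, the following NumPy sketch computes softmax and then cross entropy by hand. The function names are chosen for this example, and averaging over the minibatch is an assumption about the default behaviour; treat it as a sketch of the idea rather than Chainer's actual implementation.

```python
import numpy as np


def softmax(y):
    """Convert each row of raw scores into probabilities that sum to 1."""
    shifted = y - y.max(axis=1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)


def softmax_cross_entropy(y, t):
    """Mean of -log(probability assigned to the correct label)."""
    p = softmax(y)
    return float(-np.log(p[np.arange(len(t)), t]).mean())


t = np.array([2])                   # the correct label is 2
good = np.array([[0.0, 0.0, 5.0]])  # high score on label 2
bad = np.array([[5.0, 0.0, 0.0]])   # high score on label 0

loss_good = softmax_cross_entropy(good, t)
loss_bad = softmax_cross_entropy(bad, t)
print(loss_good, loss_bad)          # loss_good is much smaller than loss_bad
```

The score vector that favours the correct label yields the smaller loss, which is exactly the behaviour we want the optimizer to push towards.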

To calculate the softmax cross entropy loss, we define another Chain class named `SoftmaxClassifier` as follows,

```python
class SoftmaxClassifier(chainer.Chain):
    def __init__(self, predictor):
        super(SoftmaxClassifier, self).__init__(
            predictor=predictor
        )

    def __call__(self, x, t):
        y = self.predictor(x)
        self.loss = F.softmax_cross_entropy(y, t)
        self.accuracy = F.accuracy(y, t)
        return self.loss
```
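Besides the loss, the classifier also stores the accuracy. As a sketch of what that number means (assuming accuracy is the fraction of samples whose highest-scoring label equals the true label), the same quantity can be computed with NumPy. The function name `accuracy` and the score values are made up for illustration.

```python
import numpy as np


def accuracy(y, t):
    """Fraction of rows whose highest-scoring label equals the true label."""
    return float((y.argmax(axis=1) == t).mean())


y = np.array([[0.1, 2.0, -1.0],  # predicts label 1
              [3.0, 0.2, 0.5],   # predicts label 0
              [0.0, 0.1, 0.9],   # predicts label 2
              [1.5, 0.3, 0.2]])  # predicts label 0
t = np.array([1, 0, 2, 1])       # the last prediction is wrong

print(accuracy(y, t))            # 0.75
```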

Then, the model is instantiated as

```python
unit = 50  # Number of hidden layer units
...
# Set up a neural network to train
model = MLP(unit, 10)
# Classifier will calculate classification loss, based on the output of model
classifier_model = SoftmaxClassifier(model)
```

First, the MLP `model` is created. `n_out` is set to 10 because MNIST has 10 label classes, the digits 0 through 9. Then `classifier_model` is created with the MLP `model` as its `predictor`. As you can see here, a network of the Chain class can be “chained” to construct a new network that is also a Chain. I guess this is where the name “Chainer” comes from.

Once the loss calculation is defined in the `__call__` function of the model, you can set this model in an `Optimizer` to proceed with training.

```python
# Setup an optimizer
optimizer = chainer.optimizers.Adam()
optimizer.setup(classifier_model)
```

As already explained in Chainer basic module introduction 2, training proceeds by calling

```python
# Pass the loss function (Classifier defines it) and its arguments
optimizer.update(classifier_model, x, t)
```

This code calculates the loss as `classifier_model(x, t)` and tunes (optimizes) the internal parameters of the model with the Optimizer’s algorithm (Adam in this case).

Note that back propagation is done automatically inside this update call, so you do not need to write it explicitly.

As explained below, we pass `x` and `t` in **minibatch** units.
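To give a feeling for what one update step hides, here is a toy NumPy sketch of the same sequence: compute the loss, compute the gradient (the job of back propagation), then change the parameters. It deliberately swaps in a single linear layer and plain gradient descent instead of the MLP and Adam, and the data, the learning rate `lr`, and the function name `loss_and_grad` are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 20))    # fake minibatch: 100 samples, 20 features
t = rng.integers(0, 3, size=100)  # fake labels in {0, 1, 2}
w = np.zeros((20, 3))             # the only parameter of this toy model
lr = 0.1                          # learning rate (assumed value)


def loss_and_grad(w, x, t):
    """Softmax cross entropy of the linear model x @ w and its gradient w.r.t. w."""
    y = x @ w
    y = y - y.max(axis=1, keepdims=True)
    p = np.exp(y)
    p /= p.sum(axis=1, keepdims=True)
    n = len(t)
    loss = -np.log(p[np.arange(n), t]).mean()
    dy = p.copy()
    dy[np.arange(n), t] -= 1.0    # d(loss)/d(y) is (p - onehot(t)) / n
    grad = x.T @ dy / n
    return float(loss), grad


loss_before, grad = loss_and_grad(w, x, t)  # forward pass and gradient
w -= lr * grad                              # parameter update
loss_after, _ = loss_and_grad(w, x, t)
print(loss_before, loss_after)              # the loss goes down after one step
```

In the real script all of this, for every layer of the MLP, happens inside the single `optimizer.update` call.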

## Use GPU

Chainer supports GPUs to speed up calculation. To use a GPU, your PC must have an NVIDIA GPU, and you need to install CUDA, and preferably cuDNN, before installing Chainer.

To write GPU compatible code, just add these lines.

```python
if gpu >= 0:
    chainer.cuda.get_device(gpu).use()  # Make a specified GPU current
    classifier_model.to_gpu()  # Copy the model to the GPU
xp = np if gpu < 0 else cuda.cupy
```

You need to set the GPU device id in the variable `gpu`.

If you don’t use a GPU, set `gpu=-1`, which means only the CPU is used. In that case `numpy` (written as `np` above) is used for array calculation.

If you want to use a GPU, set `gpu=0` etc. (a typical consumer PC with an NVIDIA GPU contains one GPU, so only device id 0 can be used; a GPU cluster may have several GPUs (0, 1, 2, 3 etc.) in one PC). In this case, call `chainer.cuda.get_device(gpu).use()` to specify which GPU device to use and `model.to_gpu()` to copy the model’s internal parameters to the GPU. In this case **cupy** is used for array calculation.

### cupy

In scientific computing with Python, `numpy` is widely used for vector, matrix, and general tensor calculation. `numpy` automatically optimizes these linear algebra calculations on the CPU. **cupy** can be considered a GPU version of `numpy`, so you can write **GPU** calculation code in almost the same way as with `numpy`. `cupy` was developed by the Chainer team and is available as `chainer.cuda.cupy` in Chainer version 1. However, cupy itself can be used as a GPU version of numpy, so it is applicable to a wider range of use cases, not only Chainer. Therefore `cupy` is separated from Chainer and provided as the cupy module from Chainer version 2.
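Because the two libraries share most of their interface, array code can be written once against a module variable such as `xp`. The sketch below shows the idea with a hypothetical helper `get_xp`. Falling back to `numpy` when `cupy` is not installed is a choice made for this example only; the tutorial script does not do that.

```python
import numpy as np


def get_xp(gpu):
    """Return numpy for CPU (gpu < 0), otherwise try to return cupy."""
    if gpu < 0:
        return np
    try:
        import cupy  # only needed when a GPU is requested
    except ImportError:
        return np    # fallback chosen for this sketch
    return cupy


xp = get_xp(-1)  # CPU only
a = xp.arange(6, dtype=xp.float32).reshape(2, 3)
print(xp.__name__, a.sum())  # numpy 15.0
```

The lines after `get_xp` do not mention `numpy` or `cupy` directly, so they stay the same whichever module is selected.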

### GPU performance

How much difference does a GPU make? The table below shows the image throughput for different hidden layer unit sizes of the model.

| unit | CPU: Intel Core i7-6700K | GPU: NVIDIA 980 Ti | How many times faster? |
|---|---|---|---|
| 1000 | 5500 | 38000 | ×6.9 |
| 3000 | 700 | 13000 | ×18.6 |

When the neural network is large, many calculations can be parallelized and the GPU advantage becomes larger. In some cases, the GPU is about **20 times faster** than the CPU.
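The last column of the table is simply the GPU throughput divided by the CPU throughput. A quick check of that arithmetic, assuming the numbers are in images/sec like the throughput printed in the training log:

```python
# Throughput values copied from the table above: unit -> (CPU, GPU)
throughput = {1000: (5500, 38000), 3000: (700, 13000)}

speedup = {unit: gpu / cpu for unit, (cpu, gpu) in throughput.items()}
for unit, ratio in speedup.items():
    print('unit={}: x{:.1f}'.format(unit, ratio))
# unit=1000: x6.9
# unit=3000: x18.6
```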

[Hands on] If you have an NVIDIA GPU, compare the performance of CPU and GPU in your environment.

## Train and Evaluation (Test)

`train_mnist_1_minimum.py` consists of 2 phases, the training phase and the evaluation (test) phase.

In regression/classification tasks in machine learning, you need to verify the model’s generalization performance. Even if the loss decreases on the training dataset, the loss on the test (unseen) dataset is not necessarily small.

In particular, we should watch out for the overfitting problem. To cope with this, check that the test dataset loss also decreases during training.

### Training phase

- The `optimizer.update` code updates the model’s internal parameters to decrease the loss.
- The random permutation is used to draw random samples for constructing each minibatch.

If the training loss does not decrease from the beginning, the root cause may be a bug or a wrong hyperparameter setting. When the training loss stops decreasing (saturates), it is fine to stop training.

```python
# training
perm = np.random.permutation(N)
sum_accuracy = 0
sum_loss = 0
start = time.time()
for i in six.moves.range(0, N, batchsize):
    x = chainer.Variable(xp.asarray(train[perm[i:i + batchsize]][0]))
    t = chainer.Variable(xp.asarray(train[perm[i:i + batchsize]][1]))

    # Pass the loss function (Classifier defines it) and its arguments
    optimizer.update(classifier_model, x, t)

    sum_loss += float(classifier_model.loss.data) * len(t.data)
    sum_accuracy += float(classifier_model.accuracy.data) * len(t.data)
end = time.time()
elapsed_time = end - start
throughput = N / elapsed_time
print('train mean loss={}, accuracy={}, throughput={} images/sec'.format(
    sum_loss / N, sum_accuracy / N, throughput))
```
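The indexing with `perm` can be tried in isolation. The sketch below uses a tiny made-up dataset size so the slices are easy to read. It shows that slicing the permutation in steps of `batchsize` visits every sample exactly once per epoch, in shuffled order.

```python
import numpy as np

N = 10         # tiny dataset size for illustration only
batchsize = 4

np.random.seed(0)
perm = np.random.permutation(N)

seen = []
for i in range(0, N, batchsize):
    batch_index = perm[i:i + batchsize]  # the last slice may be shorter
    seen.extend(batch_index.tolist())
    print(i, batch_index)

# every sample is used exactly once per epoch
assert sorted(seen) == list(range(N))
```

Drawing a new permutation at the start of each epoch means the minibatches differ from epoch to epoch.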

### Evaluation (test) phase

- We **must not** call the `optimizer.update` code. The test dataset is considered unseen data for the model and should not be included as training information.
- We don’t need to take a random permutation in the test phase; only `sum_loss` and `sum_accuracy` are necessary.

The evaluation code has (and should have) no effect on the model. It is just for checking the loss on the test dataset. The ideal pattern is, of course, that the test loss decreases as training proceeds.

```python
# evaluation
sum_accuracy = 0
sum_loss = 0
for i in six.moves.range(0, N_test, batchsize):
    index = np.asarray(list(range(i, i + batchsize)))
    x = chainer.Variable(xp.asarray(test[index][0]))
    t = chainer.Variable(xp.asarray(test[index][1]))

    loss = classifier_model(x, t)
    sum_loss += float(loss.data) * len(t.data)
    sum_accuracy += float(classifier_model.accuracy.data) * len(t.data)

print('test mean loss={}, accuracy={}'.format(
    sum_loss / N_test, sum_accuracy / N_test))
```
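Both loops above multiply each minibatch loss by `len(t.data)` before summing. Assuming the reported loss is the mean over the minibatch, this weighting gives the exact per-sample mean even when the last minibatch is smaller. A small sketch with made-up numbers:

```python
# (mean loss of the batch, number of samples in the batch); made-up numbers
batches = [(0.50, 100), (0.30, 100), (0.10, 50)]

n_total = sum(n for _, n in batches)
sum_loss = sum(loss * n for loss, n in batches)

weighted_mean = sum_loss / n_total                            # what the loops compute
naive_mean = sum(loss for loss, _ in batches) / len(batches)  # ignores batch sizes

print(weighted_mean, naive_mean)
```

When every minibatch has the same size the two values agree, so the difference only matters when the dataset size is not a multiple of `batchsize`.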

If the test loss is not decreasing while the training loss is decreasing, it is a sign that the model is overfitting. In that case, you need to take action:

- Increase the data size (if possible).
  - Data augmentation is one method to increase the data effectively.
- Decrease the number of internal parameters in the neural network.
  - Try a simpler network.
- Add a regularization term.
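The overfitting sign described above can also be checked programmatically from the per-epoch losses. The helper below is a hypothetical sketch, not part of the tutorial script: the name `looks_overfitting`, the `patience` parameter, and the loss values are all made up for illustration.

```python
def looks_overfitting(train_losses, test_losses, patience=3):
    """Return True if the train loss kept falling over the last `patience`
    epochs while the test loss did not improve on its earlier best."""
    if len(test_losses) <= patience:
        return False
    best_before = min(test_losses[:-patience])
    test_stalled = min(test_losses[-patience:]) >= best_before
    train_falling = train_losses[-1] < train_losses[-patience - 1]
    return test_stalled and train_falling


# made-up per-epoch losses for illustration
train = [0.41, 0.20, 0.15, 0.11, 0.08, 0.06]
test = [0.21, 0.17, 0.14, 0.15, 0.16, 0.18]
print(looks_overfitting(train, test))  # True
```

To use something like this with the training script you would need to collect the mean losses of each epoch in two lists.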

Putting all the code together,

```python
"""
Very simple implementation for MNIST training code with Chainer using
Multi Layer Perceptron (MLP) model

This code is to explain the basic of training procedure.

"""
from __future__ import print_function
import time
import os
import numpy as np
import six

import chainer
import chainer.functions as F
import chainer.links as L
from chainer import cuda
from chainer import serializers


class MLP(chainer.Chain):
    """Neural Network definition, Multi Layer Perceptron"""
    def __init__(self, n_units, n_out):
        super(MLP, self).__init__(
            # the size of the inputs to each layer will be inferred
            l1=L.Linear(None, n_units),  # n_in -> n_units
            l2=L.Linear(None, n_units),  # n_units -> n_units
            l3=L.Linear(None, n_out),  # n_units -> n_out
        )

    def __call__(self, x):
        h1 = F.relu(self.l1(x))
        h2 = F.relu(self.l2(h1))
        y = self.l3(h2)
        return y


class SoftmaxClassifier(chainer.Chain):
    """Classifier is for calculating loss, from predictor's output.
    predictor is a model that predicts the probability of each label.
    """
    def __init__(self, predictor):
        super(SoftmaxClassifier, self).__init__(
            predictor=predictor
        )

    def __call__(self, x, t):
        y = self.predictor(x)
        self.loss = F.softmax_cross_entropy(y, t)
        self.accuracy = F.accuracy(y, t)
        return self.loss


def main():
    # Configuration setting
    gpu = -1                  # GPU ID to be used for calculation. -1 indicates to use only CPU.
    batchsize = 100           # Minibatch size for training
    epoch = 20                # Number of training epoch
    out = 'result/1_minimum'  # Directory to save the results
    unit = 50                 # Number of hidden layer units, try increasing this value and see how accuracy changes.

    print('GPU: {}'.format(gpu))
    print('# unit: {}'.format(unit))
    print('# Minibatch-size: {}'.format(batchsize))
    print('# epoch: {}'.format(epoch))
    print('out directory: {}'.format(out))

    # Set up a neural network to train
    model = MLP(unit, 10)
    # Classifier will calculate classification loss, based on the output of model
    classifier_model = SoftmaxClassifier(model)

    if gpu >= 0:
        chainer.cuda.get_device(gpu).use()  # Make a specified GPU current
        classifier_model.to_gpu()           # Copy the model to the GPU
    xp = np if gpu < 0 else cuda.cupy

    # Setup an optimizer
    optimizer = chainer.optimizers.Adam()
    optimizer.setup(classifier_model)

    # Load the MNIST dataset
    train, test = chainer.datasets.get_mnist()

    n_epoch = epoch
    N = len(train)       # training data size
    N_test = len(test)   # test data size

    # Learning loop
    for epoch in range(1, n_epoch + 1):
        print('epoch', epoch)

        # training
        perm = np.random.permutation(N)
        sum_accuracy = 0
        sum_loss = 0
        start = time.time()
        for i in six.moves.range(0, N, batchsize):
            x = chainer.Variable(xp.asarray(train[perm[i:i + batchsize]][0]))
            t = chainer.Variable(xp.asarray(train[perm[i:i + batchsize]][1]))

            # Pass the loss function (Classifier defines it) and its arguments
            optimizer.update(classifier_model, x, t)

            sum_loss += float(classifier_model.loss.data) * len(t.data)
            sum_accuracy += float(classifier_model.accuracy.data) * len(t.data)
        end = time.time()
        elapsed_time = end - start
        throughput = N / elapsed_time
        print('train mean loss={}, accuracy={}, throughput={} images/sec'.format(
            sum_loss / N, sum_accuracy / N, throughput))

        # evaluation
        sum_accuracy = 0
        sum_loss = 0
        for i in six.moves.range(0, N_test, batchsize):
            index = np.asarray(list(range(i, i + batchsize)))
            x = chainer.Variable(xp.asarray(test[index][0]))
            t = chainer.Variable(xp.asarray(test[index][1]))

            loss = classifier_model(x, t)
            sum_loss += float(loss.data) * len(t.data)
            sum_accuracy += float(classifier_model.accuracy.data) * len(t.data)

        print('test mean loss={}, accuracy={}'.format(
            sum_loss / N_test, sum_accuracy / N_test))

    # Save the model and the optimizer
    if not os.path.exists(out):
        os.makedirs(out)
    print('save the model')
    serializers.save_npz('{}/classifier_mlp.model'.format(out), classifier_model)
    serializers.save_npz('{}/mlp.model'.format(out), model)
    print('save the optimizer')
    serializers.save_npz('{}/mlp.state'.format(out), optimizer)


if __name__ == '__main__':
    main()
```