MNIST training with Multi Layer Perceptron


[Update 2017.06.11]  Add chainer v2 code.

Training MNIST

You have already studied the basics of Chainer and the MNIST dataset. Now we can proceed to the MNIST classification task. We want to create a classifier that classifies an MNIST handwritten image into its digit. In other words, the classifier gets an array representing an MNIST image as input and outputs its label.

※ Chainer contains modules called Trainer, Iterator, and Updater, which make your training code more organized. It is quite nice to write training code with them in a higher-level syntax. However, their abstraction makes it difficult to understand what is going on during training. For those who want to learn deep learning in more detail, I think it is good to know the “primitive way” of writing training code. Therefore, I intentionally don’t use these modules at first when explaining the training code.

The source code below is based on

[hands on] Before reading the explanation, try to execute the code. If you are using an IDE like PyCharm, just press the run button. If you are running from the command line, go to the src directory first and execute the script.

You can see a log like the one below, indicating that the loss is decreasing and the accuracy is increasing through the training.

GPU: -1
# unit: 50
# Minibatch-size: 100
# epoch: 20
out directory: result/1_minimum
epoch 1
train mean loss=0.41262895802656807, accuracy=0.8860333333909511, throughput=54883.71423542936 images/sec
test  mean loss=0.21496000131592155, accuracy=0.9357000035047531
epoch 2
train mean loss=0.1967763691022992, accuracy=0.942733335296313, throughput=66559.17396858479 images/sec
test  mean loss=0.17020921929739416, accuracy=0.9499000030755996
epoch 3
train mean loss=0.1490274258516729, accuracy=0.9558166695634523, throughput=66375.93210754421 images/sec
test  mean loss=0.1352944350033067, accuracy=0.9595000040531159

Of course, it is OK if you don’t understand the meaning of this log yet. I will explain the details one by one in the following sections.

Define Network and loss function

Let’s adopt the Multi Layer Perceptron (MLP), one of the simplest neural networks, as our model. It is written as follows with Chainer:

[Memo] In Chainer v1, it was written as follows:


This model is drawn graphically as follows. All nodes are fully connected, and a network with this kind of structure is called an MLP (Multi Layer Perceptron).

The first part is the input layer and the last part is the output layer. The remaining middle layers are called “hidden layers”. This example contains only 1 hidden layer, but in general there may be more than one hidden layer (if you construct the network deeper, the number of hidden layers increases).


MLP (Multi Layer Perceptron)



As written in the __call__ function, it takes x (an array representing the image) as input and returns y (indicating the predicted probability for each label) as output.

However, this is not enough to train the model. We need a loss function to optimize. In classification tasks, the softmax cross entropy loss is often used.

The output of a Linear layer can take arbitrary real values. The softmax function converts them into the range 0–1, so we can consider them as “probabilities for each label”. Cross entropy calculates the loss between two probability distributions. Chainer has the utility function F.softmax_cross_entropy(y, t), which calculates the softmax of y followed by the cross entropy with t. The loss is smaller when the probability distribution predicted as y is closer to the actual probability distribution t. Intuitively, the loss decreases when the model can predict the correct label given an image.
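To see the math concretely, here is a plain numpy illustration of softmax followed by cross entropy (for intuition only; it is not Chainer’s actual implementation):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def softmax_cross_entropy(y, t):
    # mean negative log-likelihood of the correct labels t
    p = softmax(y)
    return -np.mean(np.log(p[np.arange(len(t)), t]))

y = np.array([[2.0, 0.5, 0.1],   # scores favor label 0
              [0.2, 3.0, 0.3]])  # scores favor label 1
loss_correct = softmax_cross_entropy(y, np.array([0, 1]))  # matching labels
loss_wrong = softmax_cross_entropy(y, np.array([2, 0]))    # mismatched labels
```

The loss is small when the prediction matches the true labels (loss_correct) and large when it does not (loss_wrong).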

I will skip a more detailed explanation here; please study it by yourself. Here are some references:

To calculate the softmax cross entropy loss, we define another Chain class, named SoftmaxClassifier, as follows:

Then, the model is instantiated as

First, the MLP model is created. n_out is set to 10 because MNIST labels have 10 patterns, from 0 to 9. Then classifier_model is created with the MLP model as its predictor. As you can see here, a network of Chain class can be “chained” to construct a new network which is also a Chain class. I guess this is where the name “Chainer” comes from.

Once the loss calculation is defined in the __call__ function of the model, you can set this model into an Optimizer to proceed with training.

As already explained in Chainer basic module introduction 2, training proceeds by calling

This code calculates the loss as classifier_model(x, t) and tunes (optimizes) the internal parameters of the model with the Optimizer’s algorithm (Adam in this case).

Note that back propagation is done automatically inside this update call, so you don’t need to write it explicitly.

As explained below, we will pass x and t in minibatch units.

Use GPU 

Chainer supports GPUs for calculation speed-up. To use a GPU, your PC must have an NVIDIA GPU, and you need to install CUDA (and preferably cuDNN) before installing Chainer.

To write GPU-compatible code, just add these 3 lines:


You need to set the GPU device id in the variable gpu.

If you don’t use a GPU, set gpu = -1, which indicates to use only the CPU. In that case numpy (written as np above) is used for array calculation.

If you want to use a GPU, set gpu = 0, etc. (a usual consumer PC with an NVIDIA GPU contains one GPU, thus only GPU device id 0 can be used; GPU clusters have several GPUs (0, 1, 2, 3, etc.) in one PC). In this case, call chainer.cuda.get_device(gpu).use() to specify which GPU device to use, and model.to_gpu() to copy the model’s internal parameters into the GPU. In this case cupy is used for array calculation.


In Python scientific computing, numpy is widely used for vector, matrix, and general tensor calculation. numpy automatically optimizes these linear algebra calculations on the CPU. cupy can be considered a GPU version of numpy, so you can write GPU calculation code almost the same as with numpy. cupy is developed by the Chainer team, and lives at chainer.cuda.cupy in Chainer version 1.

However, since cupy itself can be used as a GPU version of numpy, it is applicable to a wider range of use cases, not only Chainer. Therefore cupy becomes independent from Chainer and is provided as the cupy module from Chainer version 2.

GPU performance

How much difference does a GPU make? The table below shows the image throughput for different hidden layer unit sizes.


CPU: Intel Core i7-6700K
GPU: 2816 CUDA cores, 1 GHz base clock

unit size   CPU (images/sec)   GPU (images/sec)   How many times faster?
1000        5500               38000              ×6.9
3000        700                13000              ×18.6

When the neural network is large, many calculations can be parallelized, so the GPU advantage grows. In some cases, the GPU is about 20 times faster than the CPU.

[Hands on] If you have an NVIDIA GPU, compare the performance between CPU & GPU in your environment.

Train and Evaluation (Test)

The training code consists of 2 phases: the training phase and the evaluation (test) phase.

In regression/classification tasks in machine learning, you need to verify the model’s generalization performance. Even if the loss is decreasing on the training dataset, it is not always true that the loss on the test (unseen) dataset is also small.

Especially, we should take care of the overfitting problem. To deal with it, you can check that the test dataset loss also decreases through the training.

Training phase

  • The optimizer.update code updates the model’s internal parameters to decrease the loss.
  • The random permutation draws random samples to construct each minibatch.

If the training loss is not decreasing from the beginning, the root cause may be a bug or a wrong hyperparameter setting. When the training loss stops decreasing (saturates), it is OK to stop the training.



Evaluation (test) phase

  • We must not call the optimizer.update code. The test dataset is considered unseen data for the model and should not be included in the training information.
  • We don’t need a random permutation in the test phase; only sum_loss and sum_accuracy are necessary.

The evaluation code does (should have) no effect on the model. It just checks the loss on the test dataset. The ideal pattern, of course, is that the test loss decreases through the training.

If the test loss is not decreasing while the training loss is decreasing, it is a sign that the model is overfitting. Then you need to take action:

  • Increase the data size (if possible).
    – Data augmentation is one method to increase the data effectively.
  • Decrease the number of internal parameters of the neural network.
    – Try a simpler network.
  • Add a regularization term.


Putting all the code together:



Next: Refactoring MNIST training
