Library release: visualize saliency map of deep neural network

A Japanese version is available on Qiita.

From left: 1. Classification saliency map visualization of VGG16, CNN model. 2. iris dataset feature importance calculation of MLP model. 3. Water solubility contribution visualization of Graph convolutional network model.


Have you ever thought, “A deep neural network is a highly complicated black box; no one can see what happens inside to produce this output”?

Even though a NN consists of many layers and its mathematical analysis is difficult, there is research on showing saliency maps like the images above, to understand a model’s behavior or to gain new knowledge about the dataset.

These saliency map calculation methods are implemented in Chainer Chemistry (even though the name contains “chemistry”, the saliency module can be used in many domains, as explained below). I will briefly explain how they work and how to use them. After reading this (somewhat long) article, you will be able to produce these visualizations yourself. Enjoy!

The article starts with a theoretical explanation, followed by the code that uses the module. Please jump to the “Examples” section if you just want to use it.

The code in this article is uploaded on GitHub.


What is reasoning of NN?

Three saliency calculation methods are implemented so far in Chainer Chemistry.

– VanillaGrad
– IntegratedGradient
– Occlusion

These methods calculate, for each data sample, how much each input element contributes to the model’s prediction.

※ Note that the feature importance used in Random Forest or XGBoost is calculated for the model as a whole, not for each data sample.

Brief introduction – VanillaGrad

This method calculates the derivative of the output y with respect to the input x, and uses it as the input’s contribution to the output prediction.

$$ s_i = \frac{dy}{dx_i} $$

Here, \(s_i\) is the saliency score, calculated for the \(i\)-th element \(x_i\) of each input data. When the gradient is large for some element, a change in that element’s value causes a large change in the output prediction, so this element should have larger saliency (importance).

In terms of implementation, it can be written simply as follows with Chainer.

# x is a chainer.Variable; y is assumed to be a scalar output
y = model(x)
y.backward()
s = x.grad
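
To make the idea concrete without any framework, here is a minimal NumPy sketch on a toy function. The function `model`, the weights `w`, and the finite-difference check are all my own illustrative assumptions, not part of the library.

```python
import numpy as np

def model(x, w):
    # Toy differentiable "network" with a scalar output: y = sum(tanh(w * x))
    return np.tanh(w * x).sum()

def vanilla_grad(x, w):
    # Analytic gradient dy/dx_i = w_i * (1 - tanh(w_i * x_i)^2)
    return w * (1.0 - np.tanh(w * x) ** 2)

w = np.array([2.0, 0.1, -1.0])   # hypothetical weights
x = np.array([0.1, 0.1, 0.1])    # one input sample
saliency = vanilla_grad(x, w)

# Cross-check the gradient with central finite differences
eps = 1e-6
numeric = np.array([
    (model(x + eps * e, w) - model(x - eps * e, w)) / (2 * eps)
    for e in np.eye(len(x))
])
```

The element with the largest weight gets the largest absolute gradient, which is exactly the element whose change moves the output most.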

Saliency module usage

The Calculator class calculates the saliency score, using a method such as VanillaGrad, IntegratedGradient, or Occlusion.

The Visualizer class visualizes the calculated saliency score.

A Calculator can be used with various NN models and does not restrict the domain or application. A Visualizer is implemented per application, to provide a visualization suited to the domain.

The basic usage flow is to call Calculator compute -> aggregate -> Visualizer visualize.

# model is chainer.Chain, x is dataset
calculator = GradientCalculator(model)
saliency_samples = calculator.compute(x)
saliency = calculator.aggregate(saliency_samples)

visualizer = ImageVisualizer()
visualizer.visualize(saliency[0])  # visualize the 0-th data

Calculator class

Here I use GradientCalculator as an example, which calculates the VanillaGrad explained above. Let’s see how to call each method.


Instantiate it by passing the model, i.e. the target neural network for which saliency is calculated.

calculator = GradientCalculator(model)

compute method

The compute method calculates “saliency samples” for each data x.

# x (bs, num_feature) -> saliency_samples (M, bs, num_feature)
saliency_samples = calculator.compute(x, M=1)

Here, M saliency samples are calculated.

For VanillaGrad, M=1 suffices since the gradient is always the same. However, sampling is necessary for SmoothGrad or BayesGrad.

I will explain SmoothGrad and BayesGrad below to clarify the notion of sampling.

– SmoothGrad –

In practice, VanillaGrad tends to produce a noisy saliency map, so SmoothGrad proposes shifting the input x by a small noise ϵ, calculating the gradient at x+ϵ, and taking the average over the samples as the final saliency score.

$$s_{mi} = \frac{dy}{dx_i} |_{x=x+\epsilon_m}$$

$$s_{i} = \frac{1}{M} \sum_{m=1}^{M}{s_{mi}}$$

In the library, the compute method calculates the saliency samples \(s_{mi}\), and the aggregate method calculates the saliency

$$s_i = \frac{1}{M} \sum_{m=1}^{M} s_{mi}$$

– project page:
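
As a rough NumPy sketch of the two formulas above: the toy gradient function `grad_fn`, the noise scale `sigma`, and the seed are assumptions chosen only for illustration, and this is not the library's implementation.

```python
import numpy as np

def grad_fn(x):
    # Gradient of the toy scalar function y = sum(sin(10 * x))
    return 10.0 * np.cos(10.0 * x)

def smooth_grad_samples(x, M=50, sigma=0.1, seed=0):
    # Returns M gradient samples s_m, each evaluated at x + eps_m
    rng = np.random.RandomState(seed)
    noise = rng.normal(0.0, sigma, size=(M,) + x.shape)
    return np.stack([grad_fn(x + eps) for eps in noise])

x = np.array([0.0, 0.5, 1.0])
samples = smooth_grad_samples(x)   # shape (M, num_feature)
saliency = samples.mean(axis=0)    # s_i = (1/M) * sum_m s_mi
```

The first function plays the role of compute (it returns the samples), and the final mean plays the role of aggregate.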

– BayesGrad –

SmoothGrad samples by adding Gaussian noise to the input x. BayesGrad instead samples the neural network parameters \(\theta\) from the posterior \(p(\theta|D)\) given the training dataset D, which yields a distribution over the prediction \(y_\theta\). The samples are taken as follows:

$$ s_{mi} = \frac{dy_\theta}{dx_i} |_{\theta \sim p(\theta|D)} $$

$$ s_{i} = \frac{1}{M} \sum_{m=1}^{M}{s_{mi}} $$

– paper:

– code:

aggregate method

This method “aggregates” the M saliency samples \(s_{mi}\) calculated by the compute method, to obtain the saliency \(s_i\).

# saliency_samples (M, bs, num_feature) -> saliency (bs, num_feature)
saliency = calculator.aggregate(saliency_samples, method='raw')

Aggregation methods differ from paper to paper; the aggregate method in the library supports the following 3 methods.

‘raw’: simply take the average

$$ s_i = \frac{1}{M} \sum_{m=1}^{M} s_{mi} $$

‘abs’: take the average of absolute values

$$ s_i = \frac{1}{M} \sum_{m=1}^{M} |s_{mi}| $$

‘square’: take the average of squared values

$$ s_i = \frac{1}{M} \sum_{m=1}^{M} s_{mi}^2 $$
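
The three formulas can be sketched in a few lines of NumPy. The helper name `aggregate` below mirrors the library method only by name; it is a stand-alone illustration written for this article, not the library code.

```python
import numpy as np

def aggregate(saliency_samples, method='raw'):
    # saliency_samples: (M, bs, num_feature) -> saliency: (bs, num_feature)
    if method == 'raw':
        h = saliency_samples
    elif method == 'abs':
        h = np.abs(saliency_samples)
    elif method == 'square':
        h = saliency_samples ** 2
    else:
        raise ValueError('unknown method: {}'.format(method))
    return h.mean(axis=0)

# M=2 samples, bs=1, num_feature=2; the first feature has opposite signs
samples = np.array([[[1.0, 2.0]], [[-1.0, 2.0]]])
raw = aggregate(samples, 'raw')
abs_ = aggregate(samples, 'abs')
sq = aggregate(samples, 'square')
```

Note that with ‘raw’ the opposite signs on the first feature cancel out, while ‘abs’ and ‘square’ keep its magnitude. This is why the choice of aggregation matters.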

Visualizer class

It visualizes the saliency calculated by the Calculator class.

– TableVisualizer: plot feature importance for each table data
– ImageVisualizer: plot saliency map of image 
– MolVisualizer: plot saliency map of molecule

As shown, the Visualizer differs for each application.

visualize method

The Visualizer plots a figure with the visualize method.

Note that the Calculator class calculates saliency for a batch, while the Visualizer visualizes one data sample at a time, so you need to specify which one.

# Instantiation
visualizer = ImageVisualizer()

# Visualize `i`-th data
i = 0
visualizer.visualize(saliency[i])

The figure can be saved by setting save_filepath argument.

# Save saliency map
visualizer.visualize(saliency[i], save_filepath='saliency.png')


That was a long explanation. Now let’s use it!

Table data application: calculate feature importance

The neural network is an MLP (Multi Layer Perceptron), and the dataset is the iris dataset provided by sklearn.

The iris dataset is used to classify 3 flower species, ‘setosa’, ‘versicolor’, and ‘virginica’, from 4 features: ‘sepal length (cm)’, ‘sepal width (cm)’, ‘petal length (cm)’, and ‘petal width (cm)’.

# model
from chainer.functions import relu, dropout
from chainer_chemistry.models.mlp import MLP
from chainer_chemistry.models.prediction.classifier import Classifier

def activation_relu_dropout(h):
    return dropout(relu(h), ratio=0.5)

out_dim = len(iris.target_names)
predictor = MLP(out_dim=out_dim, hidden_dim=48, n_layers=2, activation=activation_relu_dropout)
classifier = Classifier(predictor)

# dataset
import sklearn
from sklearn import datasets
import numpy as np
from chainer_chemistry.datasets.numpy_tuple_dataset import NumpyTupleDataset

iris = datasets.load_iris()
# Use the whole dataset for training, for simplicity
dataset = NumpyTupleDataset(,
train = dataset

The model’s training code is omitted (please refer to the code on GitHub). After training the model, we can use the saliency module.

First, use Calculator compute -> aggregate to calculate the saliency.

from chainer_chemistry.saliency.calculator.gradient_calculator import GradientCalculator

# 1. instantiation
gradient_calculator = GradientCalculator(classifier)
# 2. compute
saliency_samples_vanilla = gradient_calculator.compute(train, M=1)
# 3. aggregate
saliency_vanilla = gradient_calculator.aggregate(
    saliency_samples_vanilla, ch_axis=None, method='square')

Second, use the Visualizer visualize method to plot the figure.

from chainer_chemistry.saliency.visualizer.table_visualizer import TableVisualizer
from chainer_chemistry.saliency.visualizer.common import normalize_scaler

visualizer = TableVisualizer()
# Visualize saliency of `i`-th data
i = 0
visualizer.visualize(saliency_vanilla[i], feature_names=iris.feature_names,

We can see how each feature contributes to the final output prediction loss.

We saw the saliency for the 0-th data above; now we can take the average over the dataset to show the feature importance for all data (which roughly corresponds to the model’s feature importance).

saliency_mean = np.mean(saliency_vanilla, axis=0)
visualizer.visualize(saliency_mean, feature_names=iris.feature_names, num_visualize=-1,

We can see that “petal length” and “petal width” are more important. (Note that the result depends on the model’s training conditions, so be careful.)

To check that the above result is plausible, I also plotted the feature importance of a Random Forest from sklearn (code).

Even though the absolute importance values differ, the order is the same. So I feel the saliency calculation of NN is also useful for feature selection etc 🙂

Image data: show saliency map for classification task

Training a CNN takes time, so I will use a pre-trained model: the VGG16 model provided by Chainer.

from chainer.links import VGG16Layers

predictor = VGG16Layers()

It automatically downloads the pretrained parameters with just this code.

The ImageNet label names are downloaded from here.

import numpy as np

with open('imagenet1000_clsid_to_human.txt') as f:
    lines = f.readlines()

def extract_value(s):
    quote_str = s[s.index(':') + 2]
    return s[s.find(quote_str)+1:s.rfind(quote_str)]

classes = np.array([extract_value(line) for line in lines])

classes holds the 1000 class label names, as follows:

array(['tench, Tinca tinca', 'goldfish, Carassius auratus',
'great white shark, white shark, man-eater, man-eating shark, Carcharodon carcharias',
'tiger shark, Galeocerdo cuvieri', 'hammerhead, hammerhead shark',
'electric ray, crampfish, numbfish, torpedo', 'stingray', 'cock', ...
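
To see what extract_value does, here is a small self-contained check. The three sample lines are hypothetical; I am assuming the label file uses an `id: 'name',` layout, with double quotes when the name itself contains an apostrophe.

```python
def extract_value(s):
    # The character two positions after ':' is the opening quote (' or ")
    quote_str = s[s.index(':') + 2]
    # Keep everything between the first and the last occurrence of that quote
    return s[s.find(quote_str) + 1:s.rfind(quote_str)]

# Hypothetical sample lines in the assumed "id: 'name'," layout
sample_lines = [
    "{0: 'tench, Tinca tinca',\n",
    " 1: 'goldfish, Carassius auratus',\n",
    ' 2: "jack-o\'-lantern",\n',
]
names = [extract_value(line) for line in sample_lines]
```

Detecting the quote character first is what lets the same function handle both single-quoted and double-quoted names.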

The images used in inference are downloaded from Pexels under CC0 license.

– Basketball image
– Bus image
– Dog image

Let’s try prediction first.

from PIL import Image
import numpy as np

import chainer
import chainer.functions as F

# basketball, bus, dog
image_paths = ['./input/pexels-photo-945471.jpeg', './input/pexels-photo-45923.jpeg',

imgs = [ for fp in image_paths]
x = xp.asarray([ for img in imgs])
with chainer.using_config('train', False):
    result = predictor.forward(x, layers=['prob'])
prob = result['prob']

labels_pred = np.argsort(cuda.to_cpu(prob.array), axis=1)[:, ::-1]

for i in range(len(labels_pred)):
    print('i', i, 'labels_pred', labels_pred[i, :5], classes[labels_pred[i, :5]])
i 0 classes ['basketball' 'punching bag, punch bag, punching ball, punchball'
'rugby ball' 'barrel, cask' 'barbell']
i 1 classes ['trailer truck, tractor trailer, trucking rig, rig, articulated lorry, semi'
'passenger car, coach, carriage'
'streetcar, tram, tramcar, trolley, trolley car'
'fire engine, fire truck' 'trolleybus, trolley coach, trackless trolley']
i 2 classes ['basenji' 'Pembroke, Pembroke Welsh corgi' 'Ibizan hound, Ibizan Podenco'
'dingo, warrigal, warragal, Canis dingo' 'kelpie']

Looking at the result, the 1st image is correctly predicted as basketball, the 2nd image is predicted as trailer truck though it is actually a bus, and the 3rd image is predicted as basenji (ImageNet contains various dog breeds as labels; I do not know whether this is indeed correct or not…).


So let’s proceed to the saliency calculation. This time, I will calculate saliency for “why the model predicted its top label”, not for the ground truth label. For example, for the 2nd image we calculate saliency for why the CNN model predicted “trailer truck”, so the ground truth label (and whether the model’s prediction is correct) does not matter.

I can set output_var to the softmax cross entropy with respect to the top predicted label (instead of the ground truth label).
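
The top-5 extraction in the prediction code relies on argsort being ascending, so each row is reversed. A tiny NumPy sketch with made-up probabilities and class names (both hypothetical) shows the pattern:

```python
import numpy as np

# Hypothetical probabilities for 2 images over 4 classes
prob = np.array([[0.1, 0.6, 0.05, 0.25],
                 [0.7, 0.1, 0.15, 0.05]])
classes = np.array(['cat', 'dog', 'bus', 'ball'])

# argsort is ascending, so reverse each row to get descending order
labels_pred = np.argsort(prob, axis=1)[:, ::-1]
top2 = labels_pred[:, :2]
top2_names = classes[top2]   # fancy indexing maps IDs to names
```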

import chainer.functions as F
from chainer import cuda

def eval_fun(x):
    result = predictor(x, layers=['fc8'])
    out = result['fc8']
    xp = cuda.get_array_module(out.array)
    labels_pred = xp.argmax(out.array, axis=1).astype(xp.int32)
    loss = F.softmax_cross_entropy(out, labels_pred)
    return loss

Once eval_fun is defined, we can follow the usual steps: Calculator compute -> aggregate, then ImageVisualizer visualize, to see the result.

from chainer_chemistry.saliency.calculator.gradient_calculator import GradientCalculator

# 1. instantiation
gradient_calculator = GradientCalculator(predictor, eval_fun=eval_fun, device=device)
# --- VanillaGrad ---
# 2. compute
saliency_samples_vanilla = gradient_calculator.compute(x)

# 3. aggregate
saliency_vanilla = gradient_calculator.aggregate(
    saliency_samples_vanilla, ch_axis=2, method='abs')

# saliency_samples (1, 3, 3, 224, 224) -> M, minibatch, ch, h, w
print('saliency_samples', saliency_samples_vanilla.shape)
# saliency (3, 224, 224) -> minibatch, h, w
print('saliency', saliency_vanilla.shape)

We set ch_axis=2 in the aggregate method. This differs from the channel axis of the usual (minibatch, ch, h, w) image shape because the sampling axis is added in front.
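
The shape bookkeeping can be sketched with NumPy as follows. I am assuming here that the channel axis is reduced by summation and the sampling axis by averaging; that is my reading of the shapes printed above, not a statement about the library internals.

```python
import numpy as np

M, bs, ch, h, w = 1, 3, 3, 224, 224
saliency_samples = np.random.RandomState(0).randn(M, bs, ch, h, w)

# Axis 0 is the sampling axis, so the channels sit at axis 2 (not 1)
per_pixel = np.abs(saliency_samples).sum(axis=2)   # (M, bs, h, w)
saliency = per_pixel.mean(axis=0)                  # (bs, h, w)
```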

ImageVisualizer visualization result is as follows:

from chainer_chemistry.saliency.visualizer.image_visualizer import ImageVisualizer

visualizer = ImageVisualizer()

for index in range(len(saliency_vanilla)):
    image = imgs[index].resize(saliency_vanilla[index].shape)
    visualizer.visualize(saliency_vanilla[index], image, show_colorbar=False)

It looks like the model focuses on the right places, but the map is too noisy to read.


Next, let’s calculate SmoothGrad. We can set the noise_sampler argument in the Calculator compute method.

from chainer_chemistry.saliency.calculator.common import GaussianNoiseSampler

M = 30

# --- SmoothGrad ---
# 2. compute
saliency_samples_smooth = gradient_calculator.compute(x, M=M, noise_sampler=GaussianNoiseSampler())

# 3. aggregate
saliency_smooth = gradient_calculator.aggregate(
    saliency_samples_smooth, ch_axis=2, method='abs')

for index in range(len(saliency_vanilla)):
    image = imgs[index].resize(saliency_smooth[index].shape)
    visualizer.visualize(saliency_smooth[index], image, show_colorbar=False)

The aggregate and visualize steps are the same as for VanillaGrad.

The figure looks much better; we can see that the model focuses on the edges of objects.


Finally, we will try BayesGrad. It requires that the model has a stochastic operation. VGG16 has dropout, so it is applicable.

To calculate BayesGrad, we only need to set train=True in the Calculator compute method. Chainer then enables dropout, so the output differs for each sample, and we can calculate saliency samples (gradients) over the prediction distribution.

M = 30
# --- BayesGrad ---
# 2. compute
saliency_samples_bayes = gradient_calculator.compute(x, M=M, train=True)

This time, the result is similar to VanillaGrad.
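
To illustrate why enabling dropout gives different gradient samples, here is a toy NumPy sketch. The one-layer model, the weights `w`, and the dropout ratio are assumptions made up for this example; it only mimics the idea of sampling with dropout and averaging.

```python
import numpy as np

def grad_with_dropout(w, ratio, rng):
    # Toy model: y = sum(w * mask * x / (1 - ratio)),
    # so dy/dx_i = w_i * mask_i / (1 - ratio) for a random dropout mask
    mask = rng.uniform(size=w.shape) >= ratio
    return w * mask / (1.0 - ratio)

rng = np.random.RandomState(0)
w = np.array([1.0, -2.0, 0.5])   # hypothetical weights
M = 2000
samples = np.stack([grad_with_dropout(w, 0.5, rng) for _ in range(M)])
saliency = samples.mean(axis=0)  # average over the M stochastic samples
```

In this linear toy case the averaged gradient converges to the deterministic gradient `w`, which gives some intuition for why the BayesGrad result can look similar to VanillaGrad.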

When I try combining both SmoothGrad & BayesGrad, the results are as follows:

Molecule data: plot property contribution map for regression task

For a regression task, we can calculate saliency that keeps its sign, to show whether each input contributes positively or negatively to the prediction.

In this last example, I will use a graph convolution model from Chainer Chemistry to visualize the water solubility contribution of each atom.

The ESOL dataset is used as the water solubility dataset.

import numpy as np
import chainer
from chainer.functions import relu, dropout

from chainer_chemistry.models.ggnn import GGNN
from chainer_chemistry.datasets.numpy_tuple_dataset import NumpyTupleDataset
from chainer_chemistry.datasets.molnet import get_molnet_dataset
from chainer_chemistry.dataset.preprocessors.ggnn_preprocessor import GGNNPreprocessor
from chainer_chemistry.models.mlp import MLP
from chainer_chemistry.models.prediction.regressor import Regressor

# Model
def activation_relu_dropout(h):
    return dropout(relu(h), ratio=0.25)

class GraphConvPredictor(chainer.Chain):
    def __init__(self, graph_conv, mlp=None):
        """Initializes the graph convolution predictor.

        Args:
            graph_conv: The graph convolution network required to obtain
                        molecule feature representation.
            mlp: Multi layer perceptron; used as the final fully connected
                 layer. Set it to `None` if no operation is necessary
                 after the `graph_conv` calculation.
        """
        super(GraphConvPredictor, self).__init__()
        with self.init_scope():
            self.graph_conv = graph_conv
            if isinstance(mlp, chainer.Link):
                self.mlp = mlp
        if not isinstance(mlp, chainer.Link):
            self.mlp = mlp

    def __call__(self, atoms, adjs):
        x = self.graph_conv(atoms, adjs)
        if self.mlp:
            x = self.mlp(x)
        return x

n_unit = 32
conv_layers = 4
class_num = 1
device = 0  # -1 for CPU

ggnn = GGNN(out_dim=n_unit, hidden_dim=n_unit, n_layers=conv_layers)
mlp = MLP(out_dim=class_num, hidden_dim=n_unit, activation=activation_relu_dropout)
predictor = GraphConvPredictor(ggnn, mlp)
regressor = Regressor(predictor, device=device)

# Dataset
preprocessor = GGNNPreprocessor()

result = get_molnet_dataset('delaney', preprocessor, labels=None, return_smiles=True)
train = result['dataset'][0]
smiles = result['smiles'][0]

After training the model (see repository for the code), we can proceed to visualization.

This time, we want to focus on the contribution to the output prediction instead of the loss. So we define eval_fun to set output_var as the predictor’s output.

Also, we need to take care that the input x is the label of each node, so the gradient is not propagated back to this input. Instead, we need to use the gradient of the variable after the embed layer, which is a hidden-layer variable.

In this kind of case, to set target_var to an intermediate variable in the model, we can use VariableMonitorLinkHook.

I use IntegratedGradientsCalculator this time to calculate saliency:

import chainer.functions as F

from chainer_chemistry.saliency.calculator.gradient_calculator import GradientCalculator
from chainer_chemistry.saliency.calculator.integrated_gradients_calculator import IntegratedGradientsCalculator
from chainer_chemistry.link_hooks.variable_monitor_link_hook import VariableMonitorLinkHook

def eval_fun(x, adj, t):
    pred = predictor(x, adj)
    pred_summed = F.sum(pred)
    return pred_summed

# 1. instantiation
calculator = IntegratedGradientsCalculator(
predictor, steps=5, eval_fun=eval_fun, target_extractor=VariableMonitorLinkHook(ggnn.embed, timing='post'),
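
For intuition about what integrated gradients computes, here is a minimal NumPy sketch of the standard formulation: the gradient is averaged along the straight path from a baseline to the input, then scaled by the difference. The toy function `f`, the zero baseline, and the midpoint rule are my assumptions; the library may discretize the path differently.

```python
import numpy as np

def f(x):
    # Toy scalar model
    return float(np.sum(x ** 2))

def grad_f(x):
    return 2.0 * x

def integrated_gradients(x, baseline, steps=5):
    # Midpoint Riemann sum of the gradient along the straight path
    # from baseline to x, scaled by (x - baseline)
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.stack([grad_f(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

x = np.array([1.0, -2.0, 3.0])
baseline = np.zeros_like(x)
attributions = integrated_gradients(x, baseline, steps=5)
```

A useful sanity check is that the attributions sum to f(x) - f(baseline). For this quadratic toy function the midpoint rule makes that exact even with few steps.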

Visualization results are as follows:

from chainer_chemistry.dataset.converters import concat_mols
from chainer_chemistry.saliency.visualizer.mol_visualizer import SmilesVisualizer
from chainer_chemistry.saliency.visualizer.common import abs_max_scaler

visualizer = SmilesVisualizer()
# 2. compute
saliency_samples_vanilla = calculator.compute(
    train, M=1, converter=concat_mols)
# 3. aggregate
method = 'raw'
saliency_vanilla = calculator.aggregate(
    saliency_samples_vanilla, ch_axis=3, method=method)
# Visualize saliency of `i`-th molecule
i = 153
visualizer.visualize(saliency_vanilla[i], smiles[i])

Red shows a positive effect on solubility (hydrophilic), and blue shows a negative effect on solubility (hydrophobic).

The figure above matches the common understanding that hydrophilic effects usually occur where there is polarization (OH), and hydrophobic effects appear where the carbon chain continues.


I introduced the saliency module, which is highly flexible and applicable to a wide range of domains.

You can try all the examples with modest machine resources, on CPU only, so please try! (The saliency map visualization for images uses a pre-trained model, so only inference is necessary.)

Penn Tree Bank (PTB) dataset introduction

This post is based on the jupyter notebook ptb_dataset_introduction.ipynb uploaded on github.

The Penn Treebank dataset, known as the PTB dataset, is widely used in machine learning for NLP (Natural Language Processing) research.

The dataset is provided by the official page: Treebank-3

In Chainer, the PTB dataset can be obtained with a built-in function.

Let’s see the dataset structure.

from __future__ import print_function
import os
import matplotlib.pyplot as plt
%matplotlib inline

import numpy as np

import chainer

Download PTB dataset

The chainer.datasets.get_ptb_words method is provided in Chainer to get the PTB dataset. The dataset is downloaded automatically only the first time, and the cache is used from the second time on.

train, val, test = chainer.datasets.get_ptb_words()

Each dataset is a numpy.ndarray.

train[i] is the i-th word, represented as an integer word ID.

print('train type: ', type(train), train.shape, train)
print('val   type: ', type(val), val.shape, val)
print('test  type: ', type(test), test.shape, test)

train type: <class 'numpy.ndarray'> (929589,) [ 0 1 2 ..., 39 26 24]
val type: <class 'numpy.ndarray'> (73760,) [2211 396 1129 ..., 108 27 24]
test type: <class 'numpy.ndarray'> (82430,) [142 78 54 ..., 87 214 24]

Word ID and word correspondence

Each word ID corresponds to a specific word or symbol.

The symbols include the following:

  • <eos> : end of sequence
  • <unk> : unknown word (I guess it is the word which was not in the 10000 vocabulary).

The mapping between words and word IDs can be obtained as a dictionary with the chainer.datasets.get_ptb_words_vocabulary() method.

ptb_dict = chainer.datasets.get_ptb_words_vocabulary()
print('Number of vocabulary', len(ptb_dict))
print('ptb_dict', ptb_dict)

Number of vocabulary 10000
ptb_dict {'representation': 7975, 'competent': 9733, 'unusual': 2825, 're-election': 2672, 'brewing': 7045, 'stunning': 9451, 'distributed': 6252, 'percentage': 72, 'compare': 2549, 'laughing': 3407, 'sci': 3311, 'suggested': 2611, 'incompetent': 9769, 'sandinistas': 9108, 'werner': 8877, 'poison': 6210, 'salon': 3963, 'now': 145, 'crest': 8679, 'dairy': 2018, 'lineup': 9597, 'hills': 1264, 'chip': 1157, 'creditor': 1374, 'actor': 2315, 'specialist': 2737, "'s": 119, 'flooded': 3700, 'aba': 3364, ... }

Convert to word sequences

Check the original sentences by converting word IDs back to words using the PTB dictionary.

Train text

It is same with

ptb_word_id_dict = ptb_dict
ptb_id_word_dict = dict((v,k) for k,v in ptb_word_id_dict.items())

# Same with
print([ptb_id_word_dict[i] for i in train[:30]])

['aer', 'banknote', 'berlitz', 'calloway', 'centrust', 'cluett', 'fromstein', 'gitano', 'guterman', 'hydro-quebec', 'ipo', 'kia', 'memotec', 'mlx', 'nahb', 'punts', 'rake', 'regatta', 'rubens', 'sim', 'snack-food', 'ssangyong', 'swapo', 'wachter', '<eos>', 'pierre', '<unk>', 'N', 'years', 'old']

Now you can see that the sequence of word IDs is indeed a list of words that forms meaningful text.

But the list representation is a little difficult for humans to read, so let’s convert it to natural text using the ' '.join() method.

# ' '.join() makes the list representation more readable

' '.join([ptb_id_word_dict[i] for i in train[:300]])

"aer banknote berlitz calloway centrust cluett fromstein gitano guterman hydro-quebec ipo kia memotec mlx nahb punts rake regatta rubens sim snack-food ssangyong swapo wachter <eos> pierre <unk> N years old will join the board as a nonexecutive director nov. N <eos> mr. <unk> is chairman of <unk> n.v. the dutch publishing group <eos> rudolph <unk> N years old and former chairman of consolidated gold fields plc was named a nonexecutive director of this british industrial conglomerate <eos> a form of asbestos once used to make kent cigarette filters has caused a high percentage of cancer deaths among a group of workers exposed to it more than N years ago researchers reported <eos> the asbestos fiber <unk> is unusually <unk> once it enters the <unk> with even brief exposures to it causing symptoms that show up decades later researchers said <eos> <unk> inc. the unit of new york-based <unk> corp. that makes kent cigarettes stopped using <unk> in its <unk> cigarette filters in N <eos> although preliminary findings were reported more than a year ago the latest results appear in today 's new england journal of medicine a forum likely to bring new attention to the problem <eos> a <unk> <unk> said this is an old story <eos> we 're talking about years ago before anyone heard of asbestos having any questionable properties <eos> there is no asbestos in our products now <eos> neither <unk> nor the researchers who studied the workers were aware of any research on smokers of the kent cigarettes <eos> we have no useful information on whether users are at risk said james a. <unk> of boston 's <unk> cancer institute <eos> dr. <unk> led a team of researchers from the national cancer institute and the medical schools of harvard university and boston university <eos> the <unk>"

Validation data text

It is same with

print(' '.join([ptb_id_word_dict[i] for i in val[:300]]))

consumers may want to move their telephones a little closer to the tv set <eos> <unk> <unk> watching abc 's monday night football can now vote during <unk> for the greatest play in N years from among four or five <unk> <unk> <eos> two weeks ago viewers of several nbc <unk> consumer segments started calling a N number for advice on various <unk> issues <eos> and the new syndicated reality show hard copy records viewers ' opinions for possible airing on the next day 's show <eos> interactive telephone technology has taken a new leap in <unk> and television programmers are racing to exploit the possibilities <eos> eventually viewers may grow <unk> with the technology and <unk> the cost <eos> but right now programmers are figuring that viewers who are busy dialing up a range of services may put down their <unk> control <unk> and stay <unk> <eos> we 've been spending a lot of time in los angeles talking to tv production people says mike parks president of call interactive which supplied technology for both abc sports and nbc 's consumer minutes <eos> with the competitiveness of the television market these days everyone is looking for a way to get viewers more excited <eos> one of the leaders behind the expanded use of N numbers is call interactive a joint venture of giants american express co. and american telephone & telegraph co <eos> formed in august the venture <unk> at&t 's newly expanded N service with N <unk> computers in american express 's omaha neb. service center <eos> other long-distance carriers have also begun marketing enhanced N service and special consultants are <unk> up to exploit the new tool <eos> blair entertainment a new york firm that advises tv stations and sells ads for them has just formed a

Test data text

It is same with

print(' '.join([ptb_id_word_dict[i] for i in test[:300]]))

no it was n't black monday <eos> but while the new york stock exchange did n't fall apart friday as the dow jones industrial average plunged N points most of it in the final hour it barely managed to stay this side of chaos <eos> some circuit breakers installed after the october N crash failed their first test traders say unable to cool the selling panic in both stocks and futures <eos> the N stock specialist firms on the big board floor the buyers and sellers of last resort who were criticized after the N crash once again could n't handle the selling pressure <eos> big investment banks refused to step up to the plate to support the beleaguered floor traders by buying big blocks of stock traders say <eos> heavy selling of standard & poor 's 500-stock index futures in chicago <unk> beat stocks downward <eos> seven big board stocks ual amr bankamerica walt disney capital cities\/abc philip morris and pacific telesis group stopped trading and never resumed <eos> the <unk> has already begun <eos> the equity market was <unk> <eos> once again the specialists were not able to handle the imbalances on the floor of the new york stock exchange said christopher <unk> senior vice president at <unk> securities corp <eos> <unk> james <unk> chairman of specialists henderson brothers inc. it is easy to say the specialist is n't doing his job <eos> when the dollar is in a <unk> even central banks ca n't stop it <eos> speculators are calling for a degree of liquidity that is not there in the market <eos> many money managers and some traders had already left their offices early friday afternoon on a warm autumn day because the stock market was so quiet <eos> then in a <unk> plunge the dow
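
Since `<eos>` marks the end of each sequence, the joined text can also be split back into individual sentences. A minimal sketch, using the first two sentences of the train text above as a hard-coded sample:

```python
# First two sentences of the train text, copied from the output above
text = ("pierre <unk> N years old will join the board as a nonexecutive "
        "director nov. N <eos> mr. <unk> is chairman of <unk> n.v. the "
        "dutch publishing group <eos>")

# Split on the end-of-sequence symbol and drop empty fragments
sentences = [s.strip() for s in text.split('<eos>') if s.strip()]
```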

Define your own trainer extensions in Chainer

So how do you implement custom extensions for the trainer in Chainer? There are mainly 3 approaches.

  1. Define function
  2. Use decorator
  3. Define class

In most cases, 1. Define function is the easiest way to quickly implement your extension.

1. Define function

A plain function can be a trainer extension. Simply define a function that takes one argument (“t” in the case below), which is the trainer instance.

1-1. define function

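    # 1-1. Define function for trainer extension
    def my_extension(t):
        print('my_extension function is called at epoch {}!'
              .format(t.updater.epoch_detail))
        # Change optimizer's learning rate *= 0.99
        # (assumes the optimizer exposes `lr`, e.g. SGD or MomentumSGD)
        optimizer = t.updater.get_optimizer('main')
        optimizer.lr *= 0.99
        print('Updated to {}'.format(optimizer.lr))

    trainer.extend(my_extension, trigger=(1, 'epoch'))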

Here the argument of the my_extension function, t, is the trainer instance. You can obtain a lot of information about the training procedure from the trainer. In this case, I get the current epoch by accessing the updater’s property (the trainer holds the updater instance), t.updater.epoch_detail.

The extension is invoked based on the trigger configuration. In the code above, trigger=(1, 'epoch') means that this extension is invoked once every epoch.

Try changing the code from trainer.extend(my_extension, trigger=(1, 'epoch')) to trainer.extend(my_extension, trigger=(1, 'iteration')). Then the extension is invoked every iteration (Caution: it outputs the log very frequently, so please stop it once you have checked the behavior).

1-2. Use lambda

Instead of defining a function explicitly, you can simply use lambda function if the extension’s logic is simple.

    # Use lambda function for extension
    trainer.extend(lambda t: print('lambda function called at epoch {}!'
                                   .format(t.updater.epoch_detail)),
                   trigger=(1, 'epoch'))

2. Use make_extension decorator on function


3. Define as a class
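
As a rough, framework-free sketch of the class-based pattern: an extension object is a callable that receives the trainer, and being a class lets it keep state between calls. In Chainer itself such a class would typically subclass chainer.training.Extension. The `FakeTrainer` and its `lr` attribute below are hypothetical stand-ins so that the example runs on its own.

```python
class MyExtension(object):
    """Callable object used as an extension; keeps state between calls."""

    def __init__(self, decay=0.99):
        self.decay = decay
        self.call_count = 0

    def __call__(self, trainer):
        self.call_count += 1
        # `lr` lives on the stand-in trainer here, not on a Chainer object
        trainer.lr *= self.decay


class FakeTrainer(object):
    """Hypothetical stand-in for a trainer, just for this sketch."""

    def __init__(self, lr):
        self.lr = lr


trainer = FakeTrainer(lr=0.1)
extension = MyExtension(decay=0.99)
for epoch in range(3):   # pretend the trigger fires once per epoch
    extension(trainer)
```

Compared with a plain function, the class form is handy when the extension needs its own configuration (here `decay`) or counters.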

Predict code for Penn Tree Bank (PTB) dataset

The predict code is pretty much the same as the predict code for the simple sequence dataset, so I won’t explain it in detail.


The code is on the

"""Inference/predict code for simple_sequence dataset

model must be trained before inference, must be executed beforehand.
"""
from __future__ import print_function

import argparse
import os
import sys

import matplotlib
import numpy as np

import matplotlib.pyplot as plt
import chainer
import chainer.functions as F
import chainer.links as L
from chainer import training, iterators, serializers, optimizers, Variable, cuda
from chainer.training import extensions

from RNN import RNN
from RNN2 import RNN2
from RNN3 import RNN3
from RNNForLM import RNNForLM

def main():
    archs = {
        'rnn': RNN,
        'rnn2': RNN2,
        'rnn3': RNN3,
        'lstm': RNNForLM
    }

    parser = argparse.ArgumentParser(description='simple_sequence RNN predict code')
    parser.add_argument('--arch', '-a', choices=archs.keys(),
                        default='rnn', help='Net architecture')
    #parser.add_argument('--batchsize', '-b', type=int, default=64,
    #                    help='Number of images in each mini-batch')
    parser.add_argument('--unit', '-u', type=int, default=100,
                        help='Number of LSTM units in each layer')
    parser.add_argument('--gpu', '-g', type=int, default=-1,
                        help='GPU ID (negative value indicates CPU)')
    parser.add_argument('--primeindex', '-p', type=int, default=1,
                        help='base index data, used for sequence generation')
    parser.add_argument('--length', '-l', type=int, default=100,
                        help='length of the generated sequence')
    parser.add_argument('--modelpath', '-m', default='',
                        help='Model path to be loaded')
    args = parser.parse_args()

    print('GPU: {}'.format(args.gpu))
    #print('# Minibatch-size: {}'.format(args.batchsize))

    train, val, test = chainer.datasets.get_ptb_words()
    n_vocab = max(train) + 1  # train is just an array of integers
    print('#vocab =', n_vocab)

    # load vocabulary
    ptb_word_id_dict = chainer.datasets.get_ptb_words_vocabulary()
    ptb_id_word_dict = dict((v, k) for k, v in ptb_word_id_dict.items())

    # Model Setup
    model = archs[args.arch](n_vocab=n_vocab, n_units=args.unit)
    classifier_model = L.Classifier(model)
    if args.gpu >= 0:
        chainer.cuda.get_device(args.gpu).use()  # Make a specified GPU current
        classifier_model.to_gpu()  # Copy the model to the GPU
    xp = np if args.gpu < 0 else cuda.cupy

    if args.modelpath:
        serializers.load_npz(args.modelpath, model)
    else:
        # Fall back to the default path written by the training script
        serializers.load_npz('result/{}_ptb.model'.format(args.arch), model)

    # Dataset preparation
    prev_index = args.primeindex

    # Predict
    predicted_sequence = [prev_index]
    for i in range(args.length):
        prev = chainer.Variable(xp.array([prev_index], dtype=xp.int32))
        current = model(prev)
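        # Pick the word id with the highest score as the next word
        current_index = np.argmax(cuda.to_cpu(current.data))
        predicted_sequence.append(current_index)
        prev_index = current_index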

    predicted_text_list = [ptb_id_word_dict[i] for i in predicted_sequence]
    print('Predicted sequence: ', predicted_sequence)
    print('Predicted text: ', ' '.join(predicted_text_list))

if __name__ == '__main__':
    main()

Given the first word by its index, args.primeindex, the model predicts the following sequence as word ids.

The last three lines convert the word id sequence into a readable sentence using ptb_id_word_dict.

    predicted_text_list = [ptb_id_word_dict[i] for i in predicted_sequence]
    print('Predicted sequence: ', predicted_sequence)
    print('Predicted text: ', ' '.join(predicted_text_list))


When I ran it (with the RNN model),

$ python -p 553

I got the text

Predicted text: executive vice president and chief operating officer of <unk> <unk> & <unk> a <unk> mass. newsletter <eos> the <unk> <unk> <unk> <unk> <unk> <unk> from the <unk> <eos> the <unk> <unk> <unk> <unk> <unk> <unk> from the <unk> <eos> the <unk> <unk> <unk> <unk> <unk> <unk> from the <unk> <eos> the <unk> <unk> <unk> <unk> <unk> <unk> from the <unk> <eos> the <unk> <unk> <unk> <unk> <unk> <unk> from the <unk> <eos> the <unk> <unk> <unk> <unk> <unk> <unk> from the <unk> <eos> the <unk> <unk> <unk> <unk> <unk> <unk> from the <unk> <eos> the <unk> <unk> <unk> <unk> <unk> <unk>

It seems the model can predict a plausible opening phrase, but once it reaches <unk> or <eos> it keeps returning the same symbols. Also, “the” appears much more often than other words.

I think the model is not trained well enough yet, and you may try training it longer to get a better result!
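To see why greedy decoding can get stuck like this, here is a minimal sketch. It is not the trained RNN: toy_scores and id_to_word are made-up stand-ins, and unlike a real RNN the toy has no hidden state, so the next word depends only on the previous one. The loop itself mirrors the predict code: take the argmax, append it, and feed it back in.

```python
# Hypothetical stand-in for the trained model:
# previous word id -> scores over the next word id (values are made up).
toy_scores = {
    0: [0.1, 0.7, 0.2, 0.0],
    1: [0.0, 0.1, 0.8, 0.1],
    2: [0.1, 0.1, 0.1, 0.7],
    3: [0.0, 0.0, 0.9, 0.1],
}
id_to_word = {0: 'executive', 1: 'vice', 2: 'the', 3: '<unk>'}


def greedy_generate(prime_index, length):
    """Always pick the highest scoring next id, as np.argmax does in the predict code."""
    sequence = [prime_index]
    prev_index = prime_index
    for _ in range(length):
        scores = toy_scores[prev_index]
        current_index = max(range(len(scores)), key=scores.__getitem__)
        sequence.append(current_index)
        prev_index = current_index
    return sequence


sequence = greedy_generate(prime_index=0, length=7)
print(' '.join(id_to_word[i] for i in sequence))
# executive vice the <unk> the <unk> the <unk>
```

Once the toy enters the pair “the” and <unk>, argmax can never leave it. A stateful RNN has more room to escape, but an undertrained one tends to show the same kind of cycle.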

Long Short Term Memory (LSTM) introduction

Long Short Term Memory

Diagram of Long Short Term Memory, originally created by Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton.

Long short term memory is an advanced version of RNN, which has a “Cell” c to keep long term information.
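To make the role of the cell c a bit more concrete, below is a rough sketch of one LSTM step for scalar input and state, written in plain Python. It follows the commonly used formulation with input, forget and output gates; I am assuming this form for illustration, and Chainer's L.LSTM may differ in details such as weight layout. The weight dictionary w and all its values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, w):
    """One LSTM step for scalars. w maps a gate name to (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = w[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre('i'))    # input gate: how much new information enters the cell
    f = sigmoid(pre('f'))    # forget gate: how much of the old cell is kept
    o = sigmoid(pre('o'))    # output gate: how much of the cell is exposed as h
    a = math.tanh(pre('a'))  # candidate value to write into the cell
    c = f * c_prev + i * a
    h = o * math.tanh(c)
    return h, c


# Hypothetical weights: forget gate almost open, input gate almost closed.
w = {
    'i': (0.0, 0.0, -10.0),
    'f': (0.0, 0.0, 10.0),
    'o': (0.0, 0.0, 0.0),
    'a': (1.0, 1.0, 0.0),
}
h, c = 0.0, 2.0
for x in [0.5, -0.3, 0.8]:
    h, c = lstm_step(x, h, c, w)
print(h, c)
```

With these weights the cell value stays very close to its initial 2.0 whatever the inputs are, which is the sense in which the cell can keep long term information.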

LSTM network Implementation with Chainer

The LSTM function and link are provided by Chainer, so we can just use them to construct a neural network with LSTM.

A sample implementation follows (taken from the official example code),

import numpy as np

import chainer
import chainer.functions as F
import chainer.links as L

# Copied from chainer examples code
class RNNForLM(chainer.Chain):
    """Definition of a recurrent net for language modeling"""

    def __init__(self, n_vocab, n_units):
        super(RNNForLM, self).__init__()
        with self.init_scope():
            self.embed = L.EmbedID(n_vocab, n_units)
            self.l1 = L.LSTM(n_units, n_units)
            self.l2 = L.LSTM(n_units, n_units)
            self.l3 = L.Linear(n_units, n_vocab)

        for param in self.params():
            param.data[...] = np.random.uniform(-0.1, 0.1, param.data.shape)

    def reset_state(self):
        self.l1.reset_state()
        self.l2.reset_state()

    def __call__(self, x):
        h0 = self.embed(x)
        h1 = self.l1(F.dropout(h0))
        h2 = self.l2(F.dropout(h1))
        y = self.l3(F.dropout(h2))
        return y

Update: [Note]

self.params() returns all the “learnable” parameters in this Chain class (for example, W and b in a Linear link, which calculates x * W + b).

Thus, the code below replaces all the initial parameters with uniformly distributed values between -0.1 and 0.1.

for param in self.params():
    param.data[...] = np.random.uniform(-0.1, 0.1, param.data.shape)

Appendix: chainer v1 code

It was written as follows in Chainer v1. From Chainer v2, the train flag in functions (e.g. the dropout function) has been removed and chainer's global config is used instead.

import numpy as np

import chainer
import chainer.functions as F
import chainer.links as L

# Copied from chainer examples code
class RNNForLM(chainer.Chain):
    """Definition of a recurrent net for language modeling"""

    def __init__(self, n_vocab, n_units, train=True):
        super(RNNForLM, self).__init__()
        with self.init_scope():
            self.embed = L.EmbedID(n_vocab, n_units)
            self.l1 = L.LSTM(n_units, n_units)
            self.l2 = L.LSTM(n_units, n_units)
            self.l3 = L.Linear(n_units, n_vocab)

        for param in self.params():
            param.data[...] = np.random.uniform(-0.1, 0.1, param.data.shape)
        self.train = train

    def reset_state(self):
        self.l1.reset_state()
        self.l2.reset_state()

    def __call__(self, x):
        h0 = self.embed(x)
        h1 = self.l1(F.dropout(h0, train=self.train))
        h2 = self.l2(F.dropout(h1, train=self.train))
        y = self.l3(F.dropout(h2, train=self.train))
        return y

Training LSTM model with Penn Treebank (ptb) dataset

This post mainly explains, uploaded on github.

We have already learned the RNN and LSTM network architectures, so let’s apply them to the PTB dataset.

It is quite similar to what was explained in Training RNN with simple sequence dataset, so not much explanation is necessary.

Train code

I will just paste the whole training code for PTB first,

RNN Training code with Penn Treebank (ptb) dataset
from __future__ import print_function

import os
import sys
import argparse

import numpy as np
import matplotlib
import matplotlib.pyplot as plt

import chainer
import chainer.functions as F
import chainer.links as L
from chainer import training, iterators, serializers, optimizers
from chainer.training import extensions

from RNN import RNN
from RNN2 import RNN2
from RNN3 import RNN3
from RNNForLM import RNNForLM
from parallel_sequential_iterator import ParallelSequentialIterator
from bptt_updater import BPTTUpdater

# Routine to rewrite the result dictionary of LogReport to add perplexity
# values
def compute_perplexity(result):
    result['perplexity'] = np.exp(result['main/loss'])
    if 'validation/main/loss' in result:
        result['val_perplexity'] = np.exp(result['validation/main/loss'])

def main():
    archs = {
        'rnn': RNN,
        'rnn2': RNN2,
        'rnn3': RNN3,
        'lstm': RNNForLM
    }

    parser = argparse.ArgumentParser(description='RNN example')
    parser.add_argument('--arch', '-a', choices=archs.keys(),
                        default='rnn', help='Net architecture')
    parser.add_argument('--unit', '-u', type=int, default=100,
                        help='Number of RNN units in each layer')
    parser.add_argument('--bproplen', '-l', type=int, default=20,
                        help='Number of words in each mini-batch '
                             '(= length of truncated BPTT)')
    parser.add_argument('--batchsize', '-b', type=int, default=10,
                        help='Number of images in each mini-batch')
    parser.add_argument('--epoch', '-e', type=int, default=10,
                        help='Number of sweeps over the dataset to train')
    parser.add_argument('--gpu', '-g', type=int, default=-1,
                        help='GPU ID (negative value indicates CPU)')
    parser.add_argument('--out', '-o', default='result',
                        help='Directory to output the result')
    parser.add_argument('--resume', '-r', default='',
                        help='Resume the training from snapshot')
    args = parser.parse_args()

    print('GPU: {}'.format(args.gpu))
    print('# Architecture: {}'.format(args.arch))
    print('# Minibatch-size: {}'.format(args.batchsize))
    print('# epoch: {}'.format(args.epoch))

    # 1. Load dataset: Penn Tree Bank long word sequence dataset
    train, val, test = chainer.datasets.get_ptb_words()
    n_vocab = max(train) + 1  # train is just an array of integers
    print('# vocab: {}'.format(n_vocab))

    # 2. Setup model
    model = archs[args.arch](n_vocab=n_vocab,
                             n_units=args.unit)  # , activation=F.tanh
    classifier_model = L.Classifier(model)
    classifier_model.compute_accuracy = False  # we only want the perplexity

    if args.gpu >= 0:
        chainer.cuda.get_device(args.gpu).use()  # Make a specified GPU current
        classifier_model.to_gpu()  # Copy the model to the GPU

    eval_classifier_model = classifier_model.copy()  # Model with shared params and distinct states
    eval_model = classifier_model.predictor

    # 3. Setup an optimizer
    optimizer = optimizers.Adam(alpha=0.001)
    #optimizer = optimizers.MomentumSGD()
    optimizer.setup(classifier_model)

    # 4. Setup an Iterator
    train_iter = ParallelSequentialIterator(train, args.batchsize)
    val_iter = ParallelSequentialIterator(val, 1, repeat=False)
    test_iter = ParallelSequentialIterator(test, 1, repeat=False)

    # 5. Setup an Updater
    updater = BPTTUpdater(train_iter, optimizer, args.bproplen, args.gpu)
    # 6. Setup a trainer (and extensions)
    trainer = training.Trainer(updater, (args.epoch, 'epoch'), out=args.out)

    # Evaluate the model with the validation dataset for each epoch
    trainer.extend(extensions.Evaluator(val_iter, eval_classifier_model,
                                        # Reset the RNN state at the beginning of each evaluation
                                        eval_hook=lambda _: eval_model.reset_state()))

    trainer.extend(extensions.snapshot(), trigger=(1, 'epoch'))
    interval = 500
    trainer.extend(extensions.LogReport(postprocess=compute_perplexity,
                                        trigger=(interval, 'iteration')))
    trainer.extend(extensions.PrintReport(
        ['epoch', 'iteration', 'perplexity', 'val_perplexity', 'elapsed_time']
    ), trigger=(interval, 'iteration'))
    trainer.extend(extensions.PlotReport(
        ['perplexity', 'val_perplexity'],
        x_key='epoch', file_name='perplexity.png'))


    # Resume from a snapshot
    if args.resume:
        serializers.load_npz(args.resume, trainer)

    # Run the training
    trainer.run()
    serializers.save_npz('{}/{}_ptb.model'
                         .format(args.out, args.arch), model)

    # Evaluate the final model
    evaluator = extensions.Evaluator(test_iter, eval_classifier_model, device=args.gpu)
    result = evaluator()
    print('test perplexity:', np.exp(float(result['main/loss'])))

if __name__ == '__main__':
    main()

I will explain the points that differ from the simple_sequence dataset in the following.

PTB dataset preparation: train, validation and test

Dataset preparation is done by get_ptb_words method provided by chainer,

    # 1. Load dataset: Penn Tree Bank long word sequence dataset
    train, val, test = chainer.datasets.get_ptb_words()
    n_vocab = max(train) + 1  # train is just an array of integers
    print('# vocab: {}'.format(n_vocab))
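As a small illustration of why n_vocab is max(train) + 1, and of how a vocabulary dictionary can be inverted to decode ids, here is a toy sketch. The five-word vocabulary is made up for the example; the real one comes from chainer.datasets.get_ptb_words_vocabulary().

```python
# Hypothetical toy vocabulary (word -> id), standing in for the PTB vocabulary.
word_to_id = {'the': 0, 'model': 1, 'predicts': 2, 'words': 3, '<eos>': 4}

# A "dataset" in the same form as train: just a flat sequence of integer ids.
text = 'the model predicts the words <eos>'
train = [word_to_id[w] for w in text.split()]

# Ids start from 0, so the vocabulary size is the largest id plus one.
n_vocab = max(train) + 1

# Invert the dictionary (id -> word) to turn ids back into readable text.
id_to_word = {v: k for k, v in word_to_id.items()}
decoded = ' '.join(id_to_word[i] for i in train)

print(train, n_vocab)
print(decoded)
```

Note that this only works when the largest id actually appears in train, which is the assumption behind the one-liner in the training code.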

Note that the PTB dataset consists of train, validation and test data, while previous projects like MNIST, CIFAR-10 and CIFAR-100 consisted of train and test data.

In the above training code, we use the train dataset to train the model, the validation dataset to monitor the validation loss during training (for example, you may tune hyperparameters using the validation loss), and the test dataset only after training is completely finished, just to check/evaluate the model’s performance.

Monitor the loss by perplexity

In NLP, it is common to measure the model’s performance by perplexity, instead of softmax cross entropy or accuracy.

Perplexity of a probability distribution

The perplexity of a discrete probability distribution \(p\) is defined as

$$ 2^{H(p)} = 2^{-\sum_{x} p(x) \log_2 p(x)} $$

Perplexity per word

In natural language processing, perplexity is a way of evaluating language models. A language model is a probability distribution over entire sentences or texts.

cite from
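The definition can be checked numerically with a few lines of plain Python. This is only a sketch: the function names are mine, and the second function assumes the loss is a mean cross entropy measured with the natural logarithm, in which case exp of the loss gives the same number as 2 to the power of the entropy in bits.

```python
import math


def perplexity_from_distribution(p):
    """2 ** H(p), with the entropy H(p) measured in bits."""
    entropy_bits = -sum(px * math.log2(px) for px in p if px > 0)
    return 2 ** entropy_bits


def perplexity_from_loss(mean_cross_entropy):
    """exp(loss), assuming the loss was computed with the natural logarithm."""
    return math.exp(mean_cross_entropy)


# A uniform distribution over 4 words is "as confused as choosing among 4 words".
uniform = [0.25, 0.25, 0.25, 0.25]
ppl_dist = perplexity_from_distribution(uniform)

# The cross entropy of always predicting that uniform distribution is ln(4).
ppl_loss = perplexity_from_loss(math.log(4))

print(ppl_dist, ppl_loss)
```

Both values come out as 4 (up to floating point error), so a perplexity of, say, 100 can be read as the model being about as uncertain as a uniform choice among 100 words.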

It is calculated easily by just taking the exponential of the mean softmax cross entropy loss (the loss uses the natural logarithm, so the base is e instead of 2),

result['perplexity'] = np.exp(result['main/loss'])

and in chainer, we can show it with the LogReport extension.

It is done by passing the post-processing function “compute_perplexity” as LogReport’s postprocess argument.

# Routine to rewrite the result dictionary of LogReport to add perplexity
# values
def compute_perplexity(result):
    result['perplexity'] = np.exp(result['main/loss'])
    if 'validation/main/loss' in result:
        result['val_perplexity'] = np.exp(result['validation/main/loss'])


    interval = 500
    trainer.extend(extensions.LogReport(postprocess=compute_perplexity,
                                        trigger=(interval, 'iteration')))

LogReport’s postprocess argument takes a function, and that function receives the argument “result”, a dictionary containing the reported values.

Since ‘main/loss’ and ‘validation/main/loss’ are reported by Classifier and Evaluator, we can extract these values from result to calculate perplexity and val_perplexity. Once they are set in the result dictionary, PrintReport can show them by the same key names.
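You can try the post-processing function by hand, without running any training. In the sketch below the result dictionary is hand-made with made-up loss values (in real training LogReport builds it), and math.exp stands in for np.exp only so that the snippet needs nothing but the standard library.

```python
import math


def compute_perplexity(result):
    # Same logic as in the training code; math.exp replaces np.exp here.
    result['perplexity'] = math.exp(result['main/loss'])
    if 'validation/main/loss' in result:
        result['val_perplexity'] = math.exp(result['validation/main/loss'])


# Hand-made stand-ins for what the function receives (values are made up).
with_validation = {'main/loss': 5.0, 'validation/main/loss': 5.2}
without_validation = {'main/loss': 5.0}

compute_perplexity(with_validation)
compute_perplexity(without_validation)

print(sorted(with_validation))
print(sorted(without_validation))
```

The function returns nothing and modifies the dictionary in place. When the validation loss is missing, for example at an iteration where the Evaluator has not run, only 'perplexity' is added.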

Setup IDE

If this is your first time using python and you have not built any python development environment, setting up an IDE (Integrated Development Environment) might be a good way to start coding easily. I will introduce how to set up PyCharm, one of the major python development tools, which I also use heavily 🙂

Skip this section if you have already set up your python development environment. It is ok to use your favorite development environment.


Refer to the official site for the details, and download the software to install it.

PyCharm supports Windows, Mac, Linux.


There are 2 editions, the free Community Edition and the paid Professional Edition.

The main difference is that the Professional Edition supports web frameworks, profiling and remote (ssh) development.

In terms of our purpose, machine learning, I personally feel the necessity of each feature is as follows,

  1. Web development: we do not develop web applications, so it is not necessary.
  2. Profiler: it is nice to have, but a profiler is necessary only when you need to optimize/tune the code behavior. Here we are just using the deep learning framework (chainer), and we can develop without a profiler.
  3. Remote support: this feature is quite useful when you are accessing a remote Linux server for calculation (for example, you have a GPU desktop PC and access it from a notebook PC). Code under development can be sent directly to the remote PC from the local PC via PyCharm. Also, you may run the code in the remote environment from the PyCharm GUI.

In summary, if you are accessing a remote Linux server for calculation, it is worth considering purchasing the Professional Edition. Otherwise, the Community Edition is good enough.


What is nice in PyCharm?

Here is a list of useful features supported by PyCharm,

[WIP] I’d like to explain these with animated gif in the future.

  • Easy to setup

GUI button to run the code, Color theme

  • Completion

When you are typing a method name, PyCharm shows a completion list so that you can just press TAB to complete the code.

  • Type hinting, PEP 8 code format hinting

PyCharm infers the types of variables in python code to show method completion etc.

Also, it statically analyzes the code to show PEP 8 code formatting notifications. You naturally learn to write python in the “recommended” way.

  • Auto-formatting

Indent will be added automatically when you go to next line.

  • Customizable shortcut key binding

Emacs and vim-like key bindings are also supported by default.

A great many functions are supported and can be bound to shortcut keys.

  • Live template code

You can register “live template”, which can be expanded with abbreviation, for fast coding.

  • Refactoring

Refactoring is one of the main strong points of PyCharm (IntelliJ products).

You can rename a variable across the source code at once, move a class to another file with import dependencies updated automatically, etc.

  • Jump around the code to see its parent definition

Source code reading is easy.

Ctrl+Q to see the explanation of methods.

Ctrl+Enter to jump to the parent definition.

  • Debug support

Next to the “run” button, there is a “debug” button to debug python scripts easily.

You may also visually put break points and see variables’ values at a specific point etc.

  • Easy project-wise python version control

PyCharm saves the configuration for each project, and you may specify a “conda version” for each project.

It is easy to switch between python 2 and python 3 depending on the project.

  • Plugin support

Third parties (even you) can develop PyCharm plugins, and you can install these plugins for more convenient use.

Chainer sklearn wrapper

If you were familiar with machine learning before deep learning became popular, you might have been using sklearn (scikit-learn), which is a very popular machine learning library in python.

Its interface has been in use for a long time, and I thought supporting this interface would allow users to try deep learning more easily, so I wrote a Chainer sklearn wrapper.

Here, I will explain how to use it.

Construct the model

Conventional machine learning tasks can mainly be categorized into the following three:

  • Classification – Classify the input’s class (label); sometimes the output is the probability of belonging to each class (label).
  • Regression – Predict the target feature’s value based on the input features’ values.
  • Clustering – Given only input without labels, group samples whose features are similar in input space.

I want to support classifier models and regression models in deep learning.

Classifier model

You can use the SklearnWrapperClassifier class. It can be constructed in almost the same way as the current Classifier class in Chainer: just define your own predictor model and set it to the classifier.

# Network definition
class MLP(chainer.Chain):

    def __init__(self, n_units, n_out):
        super(MLP, self).__init__()
        with self.init_scope():
            # the size of the inputs to each layer will be inferred
            self.l1 = L.Linear(None, n_units)  # n_in -> n_units
            self.l2 = L.Linear(None, n_units)  # n_units -> n_units
            self.l3 = L.Linear(None, n_out)  # n_units -> n_out

    def __call__(self, x):
        h1 = F.relu(self.l1(x))
        h2 = F.relu(self.l2(h1))
        return self.l3(h2)

# Construct classifier model
n_unit = 50
model = SklearnWrapperClassifier(MLP(n_unit, 10))

Regression model

[WIP] It is not implemented yet.

Training the model with fit

Once the model is constructed, you can call fit function in the same way as sklearn.
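For readers who have not used sklearn, “the same way as sklearn” means the usual estimator convention: fit(X, y) learns from the data and returns the estimator itself, and predict(X) returns one label per input row. The sketch below illustrates only that convention with a hypothetical toy nearest-centroid classifier in plain Python; it has nothing to do with how the wrapper is implemented.

```python
class ToyCentroidClassifier:
    """Hypothetical toy estimator that follows the sklearn fit/predict convention."""

    def fit(self, X, y):
        sums, counts = {}, {}
        for row, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(row))
            for j, value in enumerate(row):
                acc[j] += value
            counts[label] = counts.get(label, 0) + 1
        # One mean vector (centroid) per class
        self.centroids_ = {label: [s / counts[label] for s in acc]
                           for label, acc in sums.items()}
        return self  # returning self allows clf = Estimator().fit(X, y)

    def predict(self, X):
        def squared_distance(row, centroid):
            return sum((a - b) ** 2 for a, b in zip(row, centroid))

        return [min(self.centroids_,
                    key=lambda label: squared_distance(row, self.centroids_[label]))
                for row in X]


X = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]]
y = [0, 0, 1, 1]
clf = ToyCentroidClassifier().fit(X, y)
pred = clf.predict([[0.1, 0.1], [1.0, 1.0]])
print(pred)
```

Because every estimator shares this shape, tools such as GridSearchCV can drive any model that exposes it, which is the motivation for wrapping a Chainer model this way.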

Example 1. Iris data classification

Prepare the input data X and target data (label) y, and call fit.

    # Load the iris dataset
    data, target = load_iris(return_X_y=True)
    X = data.astype(np.float32)
    y = target.astype(np.int32)

    # Construct model
    model = SklearnWrapperClassifier(MLP(args.unit, 10), device=args.gpu)

    # Train the model with fit
    model.fit(X, y, epoch=args.epoch)

See for whole training code.

Example 2. MNIST data classification

You can also use the fit function with Chainer’s dataset class.

The example below shows how to fit the model using Chainer’s TupleDataset.

    # Load the MNIST dataset
    train, test = chainer.datasets.get_mnist()

    model = SklearnWrapperClassifier(MLP(args.unit, 10), device=args.gpu)

See for whole training code.

Predict with trained model

You can use the predict function to get the classification result, and the predict_proba method to get the probability of belonging to each class.

    # --- Example 1. Predict all test data ---
    outputs = model.predict(test, retain_inputs=True)

    x, t = model.inputs

Set the retain_inputs option to True to retrieve the model inputs. This is useful for chainer datasets because, for example, data augmentation preprocessing is done every time the data is accessed by index (get_example method of DatasetMixin), and thus it is not guaranteed that you get the same input when accessing the data with the same index.

You may also predict only sliced data,

    outputs = model.predict_proba(test[:20])
    x, t = model.inputs
    #y, = outputs
    y = outputs
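The relation between the two outputs can be sketched in plain Python. I am assuming here that the class probabilities are a softmax over the network's raw scores and that the predicted label is the most probable class; this is the usual convention for a classifier, not something I checked line by line in the wrapper. The logits are made-up numbers.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up raw outputs of a 3-class network for one sample
logits = [2.0, 1.0, 0.1]

proba = softmax(logits)                                  # like predict_proba
label = max(range(len(proba)), key=proba.__getitem__)    # like predict

print(proba, label)
```

So predict_proba is useful when you care about how confident the model is, while predict is enough when you only need the winning class.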

See for whole training code.

GridSearchCV, RandomizedSearchCV

It also supports GridSearchCV and RandomizedSearchCV implemented in sklearn for automated hyperparameter search.

One example is as follows,

    predictor = MLP(args.unit, 10)
    model = SklearnWrapperClassifier(predictor, device=args.gpu)
    optimizer1 = chainer.optimizers.SGD()
    optimizer2 = chainer.optimizers.Adam()
    gs = GridSearchCV(model,
                      {
                          # hyperparameter search for predictor
                          #'n_units': [10, 50],
                          # hyperparameter search for different optimizers
                          'optimizer': [optimizer1, optimizer2],
                          # 'batchsize', 'epoch' can be used as hyperparameter
                          'epoch': [args.epoch],
                          'batchsize': [100, 1000],
                          'progress_report': False,
                      }, verbose=2)

    best_model = gs.best_estimator_
    best_mlp = best_model.predictor

    # Save trained model
    serializers.save_npz('{}/best_mlp.model'.format(args.out), best_mlp)

Hyperparameters used in the predictor’s constructor or the optimizer’s constructor can be searched as well.
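To get a feeling for what a grid search tries, the sketch below enumerates the combinations of a parameter grid with the standard library only. The strings 'SGD' and 'Adam' are placeholders for the optimizer objects, and the epoch value 20 is a made-up stand-in for args.epoch. The real GridSearchCV additionally trains and cross-validates the model for every combination.

```python
import itertools

# Hypothetical grid mirroring the example above
param_grid = {
    'optimizer': ['SGD', 'Adam'],  # placeholders for optimizer1, optimizer2
    'epoch': [20],                 # made-up value standing in for args.epoch
    'batchsize': [100, 1000],
}

keys = sorted(param_grid)
combinations = [dict(zip(keys, values))
                for values in itertools.product(*(param_grid[k] for k in keys))]

for combo in combinations:
    print(combo)
print(len(combinations), 'combinations')
```

The number of combinations is the product of the list lengths (2 x 1 x 2 = 4 here), so the search cost grows quickly as you add candidates. RandomizedSearchCV samples combinations instead of trying them all.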

See for whole training code and more details.

Install Chainer

Once you have set up the python environment, we can install chainer.

Nowadays it is important for a deep learning library to use the GPU to speed up its calculation, and Chainer provides several levels of GPU support.

The packages which you need to install depend on your PC environment. Please follow this figure to understand which category you are in.

The chart to specify what you need to install.

Note that an upper category includes the lower dependencies. For example, if you are in category B, then you need to install A. chainer and B. CuPy, CUDA, cudnn.

I think most users fall into category A or B; C and D are only for professionals.

A: CPU user

If you don’t have an NVIDIA GPU or GPU support is not necessary, you only need to install chainer.

The difference from using a GPU (category B, C, D) is basically only the calculation speed, so you can still enjoy trying all the functionality of chainer.

※ It is ok to try using chainer with only CPU. But when you want to seriously run deep learning code with big data (such as images etc), you might feel it runs too slow with only CPU.

Install chainer

pip install chainer

Just one line, that’s all :).

Update chainer

Chainer development is very active and new versions are released very often (see Release Cycle for details).

To check the current version, type the following in the command line,

python -c "import chainer; print(chainer.__version__)"

To install the latest version of chainer,

pip install -U chainer

B: 1 GPU user

If you have an NVIDIA GPU, you can speed up calculation by installing cupy, which is like a GPU version of numpy. GPUs are good at parallel computing, and once you set up CUDA, cudnn and cupy, training may sometimes be more than 10 times faster than using only the CPU, depending on the task (for example, CNNs used in image processing can get this benefit a lot!).

Before installing cupy, you need to install/set up the CUDA and cudnn libraries.

Install CUDA

According to the official website of CUDA,

CUDA® is a parallel computing platform and programming model invented by NVIDIA. It enables dramatic increases in computing performance by harnessing the power of the graphics processing unit (GPU).

Please follow official download page to get CUDA library and install it.

Install cudnn

The NVIDIA CUDA® Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks.

Even without cudnn, and only with CUDA, chainer can use the GPU for calculation. However, if you install cudnn, calculation is more highly optimized. In particular, GPU memory usage drops dramatically, and thus I recommend installing cudnn as well.

Please go to the cudnn download page to get the cudnn library and install it. You might need to register as a member to download the cudnn library.

Install cupy

pip install cupy

Basically that’s all. See install cupy for details.

If you installed CuPy without CUDA and later want to use CUDA, please reinstall CuPy. You also need to reinstall CuPy when you upgrade CUDA.

Reinstall chainer, cupy

When you reinstall the packages, it is better to explicitly specify not to use cached files.

pip uninstall chainer
pip uninstall cupy

pip install chainer --no-cache-dir -vvvv
pip install cupy --no-cache-dir -vvvv
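After installing or reinstalling, a quick way to see which of the packages are visible to the current python is to ask the import system without actually importing them. This is just a small sketch using the standard library; the helper name is_installed is mine.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the top-level package can be found by the current python."""
    return importlib.util.find_spec(package_name) is not None


for name in ('chainer', 'cupy', 'chainermn'):
    print('{}: {}'.format(name, 'installed' if is_installed(name) else 'not installed'))
```

If chainer shows up but cupy does not, you are in category A (CPU only). Note that this only tells you the package is present, not that CUDA or cudnn are configured correctly.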

C: Multiple GPUs within 1 PC

If you have more than one GPU and want to use these GPUs for one model training, you can use the MultiProcessParallelUpdater module in chainer.

To use this module, you need to install the NCCL library.

※ If you are not going to train a model with multiple GPUs, you can skip installing NCCL even if you have multiple GPUs. For example, you can run the training process for “model A” with GPU 0 and another training process for “model B” with GPU 1 at the same time without NCCL setup.

Install NCCL

NCCL (pronounced “Nickel”) is a stand-alone library of standard collective communication routines, such as all-gather, reduce, broadcast, etc., that have been optimized to achieve high bandwidth over PCIe.

Please follow official github page to install NCCL.

Below was my case, but it might change in the future.

cd your-workspace-path
git clone
cd nccl
# use your own CUDA_HOME path
make CUDA_HOME=/usr/local/cuda-8.0 test

After installing NCCL, please set up the environment variables as well.

export NCCL_ROOT=<path to NCCL directory>
export CPATH=$NCCL_ROOT/include:$CPATH

and then reinstall chainer and cupy. If successful, you can use MultiProcessParallelUpdater for multiple GPUs model training.

D: Cluster with multiple GPUs

This is an especially professional use case, where you are setting up GPU clusters.

You can use chainermn for multi-node, distributed deep learning training.

It is reported that chainermn scales almost linearly even using 128 GPUs, and completes ImageNet training in 4.4 hours.

Install CUDA aware API

Please follow official chainermn documentation.

Install chainermn

pip install chainermn

For details, please follow official chainermn documentation.

Setup python environment

※ This post is mainly just a summary/translation of the Japanese blog,

TL;DR: I recommend installing “anaconda” instead of using the “official python package”.

If you just want to proceed with the environment setup, jump to “Environment setup for each OS”. First, I will explain a little bit of background knowledge about python & anaconda.

Python version

Python version 2 and 3 are both distributed; the latest versions at the time of writing are python 2.7 and python 3.6.

Several years ago, it was said that “some libraries are still not compatible with python 3.x, and thus python 2.x is recommended”. However, now most of the popular libraries work well with python 3.x.

Here, I recommend installing python 3.x as the default environment, and switching to python 2.7 if necessary using conda’s virtual environment functionality.

Problems for python development setup

When you use the pure system python, you will face the following problems. These problems can be solved with anaconda!

  • Version control: Change python version depending on the project.
    – Depending on the library, you may need to switch between python2/python3 environments.
    – When you want to run another person’s code, sometimes it is written in python2 and sometimes in python3.
    → Use the conda create command to create another environment with a specific python version.
  • Development environment management
    – You might want to use a development branch/specific version of a library only for a specific project. You need to prepare multiple development environments to control python library versions.
    → Use the conda create command to create another environment.
  • pip install fails for some libraries because compilation depends on the OS:
    – This happens especially for Windows users. Some libraries are only distributed for Linux users, and compilation fails when installing with the pip command.
    → Try 'conda install library-name' to install the library.
  • python 2.x is pre-installed on the system on Linux/Mac
    – How can you use python 3.x without conflicting with the system python 2.x?
    → Use pyenv to avoid conflict with the system python.

What is Anaconda?

Anaconda is a python distribution package which includes popular libraries by default (numpy, scipy, pandas, ipython, jupyter, scikit-learn etc…).

  • There is also miniconda, which includes a minimum set of packages.

Python version: Both python 2.x and python 3.x versions are distributed.

OS support: Linux, Mac and Windows versions are available, in both 32 bit & 64 bit.


Anaconda is BSD licensed which gives you permission to use Anaconda commercially and for redistribution.

What is conda?

When you install the anaconda package, you can use conda.

Package management: conda is a package management tool, which can be considered an alternative to pip.

  • It supports over 400 packages
  • pip tries to compile the package in the client environment, and the compilation sometimes fails depending on your environment (OS, library etc).
  • conda provides pre-compiled packages, which reduces install failures.
  • It does not interfere with the pip command; you can still use pip if the package is not included in conda

Version control: conda supports python version control, as an alternative to pyenv

For example, you can create python 2.7 virtual environment named ‘py27’ by

conda create -n py27 python=2.7

To enter this virtual environment,

source activate py27

Virtual environment management: as an alternative to virtualenv/venv

Environment setup for each OS


Python is not pre-installed on Windows, so you can just install anaconda.


1. Install anaconda

You can download the installer from the official anaconda download site. Follow the instructions of the exe installer; it also adds the system PATH environment variable for convenience.

I recommend installing the latest version (anaconda 4.4.0, python 3.6 at the time of writing, 2017/6/28). You can easily create a virtual environment with another python version (e.g. python 2.7) after the installation.

2. Check installation (you may skip this)

Launch command line (Press Windows key, type ‘cmd’ and enter).

C:\Users\corochann>python --version
Python 3.5.2 :: Anaconda custom (64-bit)

C:\Users\corochann>pip --version
pip 9.0.1 from c:\program files\anaconda3\lib\site-packages (python 3.5)

C:\Users\corochann>conda --version
conda 4.3.21


Python is pre-installed on the system, and it is usually python 2.7. You need to configure the system to use the installed anaconda.

However, if you only install anaconda, it also installs curl, sqlite and openssl and overrides additional commands, which might cause conflicts with the existing environment.

The recommended way is to install anaconda on top of pyenv.

python environment architecture. After the environment setup, user1 can use anaconda3 (python 3) or virtual env py27 (python 2.7), which is independent from pre-installed system python (/usr/bin/python).

See this figure, assuming you are user1.

  1. By default, you can use anaconda3, which is python 3.x.
  2. If you create a virtual env (e.g. ‘py27’), you can use python 2.7 as well.
  3. It is a per-user configuration, and does not affect other users.
    If other users (user2, user3) did not set it up, they will use the system python.

This configuration has another advantage: your configuration does not affect other users, which is good for constructing a work environment on a shared server.


1. Install pyenv

Execute below commands in terminal,

$ git clone ~/.pyenv
$ echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc
$ echo 'export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc
$ echo 'eval "$(pyenv init -)"' >> ~/.bashrc
$ source ~/.bashrc

The first line clones the package.
The 2nd – 4th lines add the necessary environment setup commands to .bashrc.
The last line reloads the modified .bashrc.

2. Install anaconda: you can install either the python 3.x package or the python 2.x package.

I think it is fine to install the python 3.x version as the default. You don’t need to install both, because the python version (2/3) can be switched with conda.

# Check latest version, anaconda3-4.2.0 (anaconda2-4.2.0 for python 2.7)
$ pyenv install -l | grep ana

# Install anaconda, and configure to set anaconda as default python
$ pyenv install anaconda3-4.2.0
$ pyenv rehash
$ pyenv global anaconda3-4.2.0

# set PATH to avoid `activate` command conflict between pyenv and anaconda (use anaconda's activate)
$ echo 'export PATH="$PYENV_ROOT/versions/anaconda3-4.2.0/bin/:$PATH"' >> ~/.bashrc
$ source ~/.bashrc

# update conda itself
$ conda update conda

[Note] If you prefer, you may install miniconda instead of anaconda with a similar procedure.

3. Check installation (you may skip this)

$ python --version
Python 3.5.2 :: Anaconda custom (64-bit)
$ pip --version
pip 9.0.1 from /home/corochann/.pyenv/versions/anaconda3-4.2.0/lib/python3.5/site-packages (python 3.5)
$ conda --version
conda 4.3.22

If python and pip resolve to anaconda’s commands under the user’s home directory, the installation is OK. If they still point to the system python (python 2.7), the installation was not successful.
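
The same check can be done from inside python. The sketch below is only an illustration: `describe_interpreter` is a hypothetical helper name, and it assumes that a conda-managed prefix contains a `conda-meta` directory and that paths are POSIX-style.

```python
import os
import sys


def describe_interpreter(executable=sys.executable, prefix=sys.prefix, home=None):
    """Summarize where the running python lives.

    `home` defaults to the current user's home directory.
    """
    home = home or os.path.expanduser("~")
    home = os.path.normpath(home) + os.sep
    return {
        "executable": executable,
        "version": "%d.%d.%d" % sys.version_info[:3],
        # True when the interpreter is installed under the user's directory.
        "under_home": os.path.normpath(executable).startswith(home),
        # Assumption: conda-managed prefixes contain a conda-meta directory.
        "conda_env": os.path.isdir(os.path.join(prefix, "conda-meta")),
    }


if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print(key, "=", value)
```

If `under_home` is False and the executable is something like /usr/bin/python, the shell is still picking up the system python.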


I don’t have a Mac, sorry. But the basic procedure is the same as on Linux, except that pyenv is installed via homebrew.

conda basic usage

virtual environment

  • Create virtual env

conda create -n <environment-name> python=<version> <install libraries with space separated>

$ conda create -n py27 python=2.7 numpy scipy pandas jupyter

# the "anaconda" package is a metapackage that installs the popular modules of the distribution
$ conda create -n anaconda2 python=2.7 anaconda
  • Check virtual env
conda env list

# or
conda info -e
  • Switch virtual env
# Enter virtual env
# `activate py27` for Windows
source activate py27

# Exit virtual env
# `deactivate` for windows
source deactivate
  • Delete virtual env
conda remove -n py27 --all
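
When a script needs to know whether an environment such as py27 already exists, the environment list can also be read programmatically. This is a minimal sketch, assuming that `conda env list --json` prints a JSON object with an `envs` key holding environment paths; `env_names` and `has_env` are hypothetical helper names, and the base installation shows up under its directory name.

```python
import json
import os
import subprocess


def env_names(json_text):
    """Extract environment names from `conda env list --json` style output."""
    data = json.loads(json_text)
    # Each entry in "envs" is assumed to be the full path of an environment.
    return [os.path.basename(path.rstrip("/\\")) for path in data.get("envs", [])]


def has_env(json_text, name):
    """Return True if an environment called `name` is listed."""
    return name in env_names(json_text)


if __name__ == "__main__":
    try:
        output = subprocess.check_output(["conda", "env", "list", "--json"])
    except (OSError, subprocess.CalledProcessError):
        print("conda is not available on this PATH")
    else:
        print(env_names(output.decode("utf-8")))
```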

package management

  • Install/uninstall package
# install
conda install numpy scipy  # specify multiple libraries, like pip
conda install numpy=1.10.4 # specify version
conda install -n py27 numpy scipy # -n option to specify virtual env

conda update numpy # update

pip install numpy  # pip can be used as well. Use it when the library is not in conda
source activate py27; pip install numpy # to install a library in a virtual env with pip, run pip after `activate`

# uninstall
conda uninstall -n py27 numpy
  • Check package
# Show current installed package list
conda list

# -n option to specify virtual env
conda list -n py27

# Export the list and use it in another environment
# However, packages installed with `pip` cannot be exported.
# Use `pip freeze` to output the list of packages installed with pip.
conda list --export > env.txt
conda create -n py27_copy --file env.txt
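
The exported env.txt is plain text, so it is easy to inspect before reusing it. The sketch below assumes the usual export layout: comment lines starting with `#` and one `name=version=build` entry per line. `parse_export` and `changed_versions` are hypothetical helper names, and the sample text is made up for illustration.

```python
def parse_export(text):
    """Parse `conda list --export` style text into {name: (version, build)}."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the header comments.
        if not line or line.startswith("#"):
            continue
        parts = line.split("=")
        version = parts[1] if len(parts) > 1 else None
        build = parts[2] if len(parts) > 2 else None
        packages[parts[0]] = (version, build)
    return packages


def changed_versions(old, new):
    """Return {name: (old_version, new_version)} for packages present in both
    exports whose version differs."""
    return {
        name: (old[name][0], new[name][0])
        for name in old
        if name in new and old[name][0] != new[name][0]
    }


# Illustrative sample, not the output of a real environment.
SAMPLE = """\
# platform: linux-64
numpy=1.11.1=py27_0
scipy=0.18.1=np111py27_0
"""

if __name__ == "__main__":
    print(parse_export(SAMPLE))
```

Comparing two exports this way is a quick check that a copied environment really has the same package versions.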
  • Search package in anaconda cloud

Even if a package is not distributed by official anaconda, a third party may have uploaded it to anaconda cloud.

It is useful to check whether the package is distributed on anaconda cloud and, if so, install it from there.

To search,

anaconda search -t conda <package-name-to-search>

Once found, install the third-party library with

conda install -c <USER> <PACKAGE>

Here, <USER> is the third party’s name and <PACKAGE> is the name of the package to install.

The example below shows how to install the rdkit package, which is distributed not by anaconda but by rdkit.

anaconda search -t conda rdkit
Using Anaconda API:
Run 'anaconda show <USER/PACKAGE>' to get more details:
     Name                      |  Version | Package Types   | Platforms
     ------------------------- |   ------ | --------------- | ---------------
     Clyde_Fare/rdkit          | 2015.09.2 | conda           | win-64
     Guillopflaume/rdkit       | 2014.09.1 | conda           | linux-64
     Guillopflaume/rdkit-postgresql | 2014.09.1 | conda           | linux-64
     RMG/rdkit                 | 2016.03.4 | conda           | linux-64, win-32, win-64, linux-32, osx-64
                                          : Open-Source Cheminformatics
     aschreyer/rdkit           | 2015.03.1 | conda           | osx-64
                                          : RDKit is an open source toolkit for cheminformatics.
     bioconda/rdkit            | 2016.03.3 | conda           | linux-64
                                          : Open-Source Cheminformatics Software
     connie/rdkit              | 2015.09.2 | conda           | linux-32
     eleonora1990/rdkit        | 2014.09.2 | conda           | linux-64
     eleonora1990/rdkit-postgresql | 2014.09.2 | conda           | linux-64
     greglandrum/rdkit         | 2017.03.1 | conda           | linux-64, win-32, win-64, osx-64
     greglandrum/rdkit-postgresql95 | 2016.03.4 | conda           | osx-64
     grizzly41/rdkit           | 2016.09.1.dev20160806 | conda           | osx-64
     jeprescottroy/rdkit       | 2016.03.3 | conda           | osx-64
     jochym/rdkit              | 2015_03_1 | conda           | linux-64
     karlleswing/rdkit         | 2016.09.2 | conda           | linux-64
     mforsythe/rdkit           | 2014.03.1 | conda           | osx-64
     mgbarnes/rdkit            | 2016.03.1 | conda           | osx-64
     mobleylab/rdkit           | 2016.03.1 | conda           | linux-64, osx-64
     mpharrigan/rdkit          |          | conda           | linux-64
     mwojcikowski/rdkit        | 2016.03.1 | conda           | linux-64
     nickvandewiele/rdkit      | 2015.09.2 | conda           | linux-64, win-32, osx-64, linux-32, win-64
                                          : Open-Source Cheminformatics
     nividic/rdkit             | 2016.03.1 | conda           | linux-64, osx-64
                                          : Cheminformatics Molecule Framework
     olexandr/rdkit            | 2016.03.1 | conda           | linux-64
     omnia/rdkit               | 2015.09.1 | conda           | linux-64, osx-64
                                          : Open-Source Cheminformatics Software
     rdkit/rdkit               | 2017.03.2 | conda           | linux-64, win-32, win-64, linux-32, osx-64
     rdkit/rdkit-postgresql    | 2016.03.4 | conda           | linux-64
                                          : RDKit cartridge for PostgreSQL
     rdkit/rdkit-postgresql95  | 2016.09.4 | conda           | linux-64
                                          : RDKit cartridge for PostgreSQL v9.5
     richlewis/rdkit           | 2016.03.1 | conda           | linux-64, win-64, linux-32, osx-64
     rmcgibbo/rdkit            | 2014.03.1 | conda           | linux-64
                                          : Open source toolkit for cheminformatics
     rmcgibbo/rdkit-utils      |      0.1 | conda           | linux-64
                                          : Utilities for working with the RDKit
     skearnes/rdkit            | 2014.03.1 | conda           | linux-64
                                          : Vanilla RDKit build without Avalon or InChI support.
     thibaudfreyd/rdkit        | 2016.03.1 | conda           | osx-64
     twz915/rdkit              | 2016.03.1 | conda           | osx-64
     zero323/rdkit             | 2015.09.2 | conda           | linux-64
Found 34 packages

rdkit is distributed by many third parties. Here, let’s install rdkit/rdkit.

conda install -c rdkit rdkit
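
With a long search result like the one above, it can help to filter the table by package name and platform before choosing a `<USER>`. This is a rough sketch that assumes the `|`-separated layout shown in the output above; `parse_search_rows` and `users_for` are hypothetical helper names, and the sample rows are copied from that output.

```python
def parse_search_rows(text):
    """Parse `anaconda search` table rows into a list of dicts."""
    rows = []
    for line in text.splitlines():
        parts = [part.strip() for part in line.split("|")]
        # Data rows have four columns and a USER/PACKAGE name in the first.
        if len(parts) != 4 or "/" not in parts[0]:
            continue
        user, package = parts[0].split("/", 1)
        rows.append({
            "user": user,
            "package": package,
            "version": parts[1],
            "platforms": [p.strip() for p in parts[3].split(",")],
        })
    return rows


def users_for(rows, package, platform):
    """Return the users that distribute `package` for `platform`."""
    return [
        row["user"] for row in rows
        if row["package"] == package and platform in row["platforms"]
    ]


SAMPLE = """\
     Name                      |  Version | Package Types   | Platforms
     ------------------------- |   ------ | --------------- | ---------------
     Clyde_Fare/rdkit          | 2015.09.2 | conda           | win-64
     Guillopflaume/rdkit       | 2014.09.1 | conda           | linux-64
     rdkit/rdkit               | 2017.03.2 | conda           | linux-64, win-32, win-64, linux-32, osx-64
     rdkit/rdkit-postgresql    | 2016.03.4 | conda           | linux-64
                                          : RDKit cartridge for PostgreSQL
"""

if __name__ == "__main__":
    print(users_for(parse_search_rows(SAMPLE), "rdkit", "win-64"))
```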

Appendix: install R with conda

R is also a popular language in the data science community. Not only python but also R can be installed with conda.

conda create -n r -c r r-irkernel
