Library release: visualize saliency map of deep neural network

Japanese is available at Qiita.

From left: 1. Classification saliency map visualization of VGG16, CNN model. 2. iris dataset feature importance calculation of MLP model. 3. Water solubility contribution visualization of Graph convolutional network model.


Have you ever thought “Deep neural network is highly complicated black box, no one ever able to see what happens inside to result this output.”?

Even though NN consists of many layers and its mathematical analysis is difficult, there are some researches to show some saliency map like above images to understand model’s behavior or to get new knowledge for the dataset.

These saliency map calculation methods are implemented in Chainer Chemistry (even though the name contains “chemistry”, saliency module is available in many domains, as explained below). I will briefly explain how these work, and how to use it. You can also show these visualization figures after read this (a little bit long) article, enjoy!

It starts from theoretical explanation, followed by the code to use the module. Please jump to the “Examples” section if you just want to use it.

The code in this article is uploaded on github


What is reasoning of NN?

3 saliency calculation methods are implemented so far in chainer chemistry.

– VanillaGrad
– IntegratedGradient
– Occlusion

These methods calculate the contribution to the model’s prediction for each data.

※ Note that feature importance used in Random forest or XGBoost are calculated for the model. There is a difference that it is not calculated for “each data”.

Brief introduction – VanillaGrad

This method calculates derivative of output y with respect to input x, as a input contribution to the output prediction.

$$ s_i = \frac{dy}{dx_i} $$

Here, \(s_i\) is the saliency score, which is calculated for each input data’s \(i\)-th element \(x_i\). When the value of gradient is large for some element, the value change of this element results in big change of output prediction. So this element should have larger saliency (importance).

In terms of implementation, it is simply written as follows with chainer.

y = model(x)
s = x.grad

Saliency module usage

Calculator class calculates saliency score, like VaillaGrad, IntegratedGradient, or Occlusion.

Visualizer class visualizes calculated saliency score.

Calculator can be used with various NN model, which does not restrict the domain or application. Visualizer can be implemented to adopt Application for the proper visualization for the domain.

Basic usage flow is to call Calculator computeaggregate -> Visualizer visualize 

# model is chainer.Chain, x is dataset
calculator = GradientCalculator(model)
saliency_samples = calculator.compute(x)
saliency = calculator.aggregate(saliency_samples)

visualizer = ImageVisualizer()

Calculator class

Here I use GradientCalculator as an example which calcultes VanillaGrad explained above. Let’s see how to call each method.


Instance with passing model, which is the target neural network to calculate saliency.

calculator = GradientCalculator(model)

compute method

compute method calculates “saliency samples” for each data x.

# x (bs, num_feature) -> saliency_samples (M, bs, num_feature)
saliency_samples = calculator.compute(x, M=1)

Here, M samples of saliency is calculated.

When calculating VanillaGrad, it suffices with M=1 since the calculation result of grad is always same. However, sampling is necessary when we consider SmoothGrad or BayesGrad.

I will explain SmoothGrad & BayesGrad to understand the notion of sampling.

– SmoothGrad –

Practically, VanillaGrad tends to show Noisy saliency map, so SmoothGrad suggests to change
input x to shift a small ϵ, resulting input x+ϵ and calculate grad. We can take the average as the final saliency score.

$$s_{mi} = \frac{dy}{dx_i} |_{x=x+\epsilon_m}$$

$$s_{i} = \frac{1}{M} \sum_{m=1}^{M}{s_{mi}}$$

In the library, compute method calculates saliency sample \(s_{mi}\), and aggregate method calculates saliency

$$s_i = \frac{1}{M} \sum_{m}^{M} s_{mi}$$

– project page:

– BayesGrad –

SmoothGrad changed input x by adding Gaussian noise, to take sampling. BayesGrad considers sampling along Neural Network parameter \(\theta\), trained with dataset D, to get prediction posterior distribution \(y_\theta \sim p(\theta|D)\) to take the sampling as follows:

$$ s_{mi} = \frac{dy_\theta}{dx_i} |_{\theta \sim p(\theta|D)} $$

$$ s_{i} = \frac{1}{M} \sum_{m=1}^{M}{s_{mi}} $$

– paper:

– code:

aggregate method

This method “aggregates” M saliency samples \(s_{mi}\) calculated by compute method, to obtain saliency \(s_i\). 

# saliency_samples (M, bs, num_feature) -> saliency (bs, num_feature)
saliency = calculator.aggregate(saliency_samples, method='raw')

Aggregation methods differ by paper by paper, aggregate method in the library supports following 3 method.

‘raw’: simply take average

$$ s_i = \frac{1}{M} \sum_{m}^{M} s_{mi} $$

‘abs’: take absolute average

$$ s_i = \frac{1}{M} \sum_{m}^{M} |s_{mi}| $$

‘square’: take squared average

$$ s_i = \frac{1}{M} \sum_{m}^{M} s_{mi}^2 $$

Visualizer class

It visualizes saliency from Calcualtor class.

– TableVisualizer: plot feature importance for each table data
– ImageVisualizer: plot saliency map of image 
– MolVisualizer: plot saliency map of molecule

As shown, Visualizer differs for each application.

visualize method

Visualizer plots figure with visualize method.

Note that Calculator class calcultes saliency with batch, but visualizer visualizes one data, so you need to specify it.

# Instantiation
visualizer = ImageVisualizer()

# Visualize `i`-th data
i = 0

The figure can be saved by setting save_filepath argument.

# Save saliency map
visualizer.visualize(saliency[i], save_filepath='saliency.png')


It was a long explanation,,, now let’s use it!

Table data application: calculate feature importance

Neural Network is MLP (Multi Layer Parceptron), Dataset is iris dataset provided by sklearn.

iris dataset is to classify 3 flower species ‘setosa’, ‘versicolor’, ‘virginica’, from 4 features ‘sepal length (cm)’, ‘sepal width (cm)’, ‘petal length (cm)’, ‘petal width (cm)’.

# model
from chainer.functions import relu, dropout
from chainer_chemistry.models.mlp import MLP
from chainer_chemistry.models.prediction.classifier import Classifier

def activation_relu_dropout(h):
return dropout(relu(h), ratio=0.5)

out_dim = len(iris.target_names)
predictor = MLP(out_dim=out_dim, hidden_dim=48, n_layers=2, activation=activation_relu_dropout)
classifier = Classifier(predictor)

# dataset
import sklearn
from sklearn import datasets
import numpy as np
from chainer_chemistry.datasets.numpy_tuple_dataset import NumpyTupleDataset

iris = datasets.load_iris()
# All dataset is to train for simplicity
dataset = NumpyTupleDataset(,
train = dataset

Model’s training code is omitted (please refer the code on github). After training the model, we can use saliency module.

First, use Calculator compute -> aggregate to calculate saliency.

from chainer_chemistry.saliency.calculator.gradient_calculator import GradientCalculator

# 1. instantiation
gradient_calculator = GradientCalculator(classifier)
# 2. compute
saliency_samples_vanilla = gradient_calculator.compute(train, M=1)
# 3. aggregate
saliency_vanilla = gradient_calculator.aggregate(
saliency_samples_vanilla, ch_axis=None, method='square')

Second, use Visualizer visualize method to plot figure.

from chainer_chemistry.saliency.visualizer.table_visualizer import TableVisualizer
from chainer_chemistry.saliency.visualizer.common import normalize_scaler

visualizer = TableVisualizer()
# Visualize saliency of `i`-th data
i = 0
visualizer.visualize(saliency_vanilla[i], feature_names=iris.feature_names,

We can see how the each feature contributes to the final output prediction loss.

We saw saliency for 0-th data above, now we can calculate average along dataset to show feature importance for all data (which roughly corresponds to model’s feature importance).

saliency_mean = np.mean(saliency_vanilla, axis=0)
visualizer.visualize(saliency_mean, feature_names=iris.feature_names, num_visualize=-1,

We can see “petal length” and “petal width” are more important. (note that the result differs according to the model’s training condition, be careful.)

To check above result is plausible, I tried to plot feature impotance of Random Forest from sklearn (code).

Even though the absolute importance value differs, its order is same. So I feel the saliency calculation of NN is also useful for feature selection etc 🙂

Image data: show saliency map for classification task

Training CNN takes time, so I will use pre-trained model. I will use VGG16 model provided by Chainer this time.

from import VGG16Layers

predictor = VGG16Layers()

It automatically download pretrained parameters, with only this code.

ImageNet correct label name is downloaded from here.

import numpy as np

with open('imagenet1000_clsid_to_human.txt') as f:
lines = f.readlines()

def extract_value(s):
quote_str = s[s.index(':') + 2]
return s[s.find(quote_str)+1:s.rfind(quote_str)]

classes = np.array([extract_value(line) for line in lines])

classes is 1000 class correct label as follows:

array(['tench, Tinca tinca', 'goldfish, Carassius auratus',
'great white shark, white shark, man-eater, man-eating shark, Carcharodon carcharias',
'tiger shark, Galeocerdo cuvieri', 'hammerhead, hammerhead shark',
'electric ray, crampfish, numbfish, torpedo', 'stingray', 'cock', ...

The images used in inference are downloaded from Pexels under CC0 license.

– Basketball image
– Bus image
– Dog image

Let’s try prediction at first.

from PIL import Image
import numpy as np

import chainer
import chainer.functions as F

# basketball, bus, dog
image_paths = ['./input/pexels-photo-945471.jpeg', './input/pexels-photo-45923.jpeg',

imgs = [ for fp in image_paths]
x = xp.asarray([ for img in imgs])
with chainer.using_config('train', False):
result = predictor.forward(x, layers=['prob'])
prob = result['prob']

lables_pred = np.argsort(cuda.to_cpu(prob.array), axis=1)[:, ::-1]

for i in range(len(lables_pred)):
print('i', i, 'labels_pred', lables_pred[i, :5], classes[lables_pred[i, :5]])
i 0 classes ['basketball' 'punching bag, punch bag, punching ball, punchball'
'rugby ball' 'barrel, cask' 'barbell']
i 1 classes ['trailer truck, tractor trailer, trucking rig, rig, articulated lorry, semi'
'passenger car, coach, carriage'
'streetcar, tram, tramcar, trolley, trolley car'
'fire engine, fire truck' 'trolleybus, trolley coach, trackless trolley']
i 2 classes ['basenji' 'Pembroke, Pembroke Welsh corgi' 'Ibizan hound, Ibizan Podenco'
'dingo, warrigal, warragal, Canis dingo' 'kelpie']

When we see the result, 1-st image is correctly predicted as Basketball, 2nd image is predicted as trailer truck though it is actually bus, 3rd image is predicted as basenji (ImageNet contains various dog’s species as label, I do not know this is indeed correct or not…).


So let’s proceed to saliency calculation. This time, I will calculate saliency for “why predicting the label of top prediction”, not for the ground truth label. For example in 2nd image, we calculate saliency for why the CNN model predicted “trailer truck”, so the ground truth label (and the model predicts correct label or not) is not related.

I can set output_var as “softmax cross entropy between top prediction label” (instead of ground truth label).

import chainer.functions as F
from chainer import cuda

def eval_fun(x):
result = predictor(x, layers=['fc8'])
out = result['fc8']
xp = cuda.get_array_module(out.array)
labels_pred = xp.argmax(out.array, axis=1).astype(xp.int32)
loss = F.softmax_cross_entropy(out, labels_pred)
return loss

Once eval_fun is defined, we can follow usual step: Calculator compute -> aggregate, ImageVisualizer visualize, to see the result.

from chainer_chemistry.saliency.calculator.gradient_calculator import GradientCalculator

# 1. instantiation
gradient_calculator = GradientCalculator(predictor, eval_fun=eval_fun, device=device)
# --- VanillaGrad ---
# 2. compute
saliency_samples_vanilla = gradient_calculator.compute(x)

# 3. aggregate
saliency_vanilla = gradient_calculator.aggregate(
saliency_samples_vanilla, ch_axis=2, method='abs')

# saliency_samples (1, 3, 3, 224, 224) -> M, minibatch, ch, h, w
print('saliency_samples', saliency_samples_vanilla.shape)
# saliency (3, 224, 224) -> minibatch, h, w
print('saliency', saliency_vanilla.shape)

We set ch_axis=2 in aggregate method, this is different from usual (minibatch, ch, h, w) image shape, because sampling_axis is added in front

ImageVisualizer visualization result is as follows:

from chainer_saliency.visualizer.image_visualizer import ImageVisualizer

visualizer = ImageVisualizer()

for index in range(len(saliency_vanilla)):
image = imgs[index].resize(saliency_vanilla[index].shape)
visualizer.visualize(saliency_vanilla[index], image, show_colorbar=False)

It looks the model focuses on right place,,, but it is too noisy to see the result.


Next, let’s calculate SmoothGrad. We can set noise_sampler argument in Calculator compute method.

from chainer_chemistry.saliency.calculator.common import GaussianNoiseSampler

M = 30

# --- SmoothGrad ---
# 2. compute
saliency_samples_smooth = gradient_calculator.compute(x, M=M, noise_sampler=GaussianNoiseSampler())

# 3. aggregate
saliency_smooth = gradient_calculator.aggregate(
saliency_samples_smooth, ch_axis=2, method='abs')

for index in range(len(saliency_vanilla)):
image = imgs[index].resize(saliency_smooth[index].shape)
visualizer.visualize(saliency_smooth[index], image, show_colorbar=False)

aggregatevisualize methods are same with VanillaGrad.

The figure looks much better, we can see model focuses on the edge of objects.


At last, we will try BayesGrad. It requires that the model has stochastic operation. This time, VGG16 has dropout operation so it is applicable.

To calculate BayesGrad, we only need to set train=True in Calculator compute method. Chainer automatically enables dropout so that output is different in each samples, results that we can calculate saliency samples (gradient) for prediction distribution.

M = 30
# --- BayesGrad ---
# 2. compute
saliency_samples_bayes = gradient_calculator.compute(x, M=M, train=True)

This time, the result is similar to VanillaGrad.

When I try combining both SmoothGrad & BayesGrad, the result are as follows:

Molecule data: plot property contribution map for regression task

For regression task, we can calculate saliency to consider its sign, to show that the input contributes to positive or negative to the prediction. 

In this last example, I will use Graph convolution model in Chainer Chemistry, to visualize water solubility contribution for each atom.

ESOL dataset is used for water solubility dataset.

import numpy as np
import chainer
from chainer.functions import relu, dropout

from chainer_chemistry.models.ggnn import GGNN
from chainer_chemistry.datasets.numpy_tuple_dataset import NumpyTupleDataset
from chainer_chemistry.datasets.zinc import get_zinc250k
from chainer_chemistry.dataset.preprocessors.ggnn_preprocessor import GGNNPreprocessor
from chainer_chemistry.models.mlp import MLP
from chainer_chemistry.models.prediction.regressor import Regressor

# Model
def activation_relu_dropout(h):
    return dropout(relu(h), ratio=0.25)

class GraphConvPredictor(chainer.Chain):
    def __init__(self, graph_conv, mlp=None):
        """Initializes the graph convolution predictor.
            graph_conv: The graph convolution network required to obtain
                        molecule feature representation.
            mlp: Multi layer perceptron; used as the final fully connected
                 layer. Set it to `None` if no operation is necessary
                 after the `graph_conv` calculation.
        super(GraphConvPredictor, self).__init__()
        with self.init_scope():
            self.graph_conv = graph_conv
            if isinstance(mlp, chainer.Link):
                self.mlp = mlp
        if not isinstance(mlp, chainer.Link):
            self.mlp = mlp

    def __call__(self, atoms, adjs):
        x = self.graph_conv(atoms, adjs)
        if self.mlp:
            x = self.mlp(x)
        return x

n_unit = 32
conv_layers = 4
class_num = 1
device = 0  # -1 for CPU

ggnn = GGNN(out_dim=n_unit, hidden_dim=n_unit, n_layers=conv_layers)
mlp = MLP(out_dim=class_num, hidden_dim=n_unit, activation=activation_relu_dropout)
predictor = GraphConvPredictor(ggnn, mlp)
regressor = Regressor(predictor, device=device)

# Dataset
preprocessor = GGNNPreprocessor()

result = get_molnet_dataset('delaney', preprocessor, labels=None, return_smiles=True)
train = result['dataset'][0]
smiles = result['smiles'][0]

After training the model (see repository for the code), we can proceed to visualization.

This time, we want to focus on contribution to the output prediction instead of loss. So we can define eval_fun to set output_var as predictor‘s output.

Also, we need to take care that input x is label of the node, gradient is not propagated until this input, we need to adopt gradient of the variable after embed layer, which is hidden layer’s variable.

In this kind of case, to set target_var as intermediate variable in the model, we can use VariableMonitorLinkHook.

I use IntegratedGradientsCalculator this time, to calculate saliency:

import chainer.functions as F

from chainer_chemistry.saliency.calculator.gradient_calculator import GradientCalculator
from chainer_chemistry.saliency.calculator.integrated_gradients_calculator import IntegratedGradientsCalculator
from chainer_chemistry.link_hooks.variable_monitor_link_hook import VariableMonitorLinkHook

def eval_fun(x, adj, t):
    pred = predictor(x, adj)
    pred_summed = F.sum(pred)
    return pred_summed

# 1. instantiation
calculator = IntegratedGradientsCalculator(
predictor, steps=5, eval_fun=eval_fun, target_extractor=VariableMonitorLinkHook(ggnn.embed, timing='post'),

Visualization results are as follows,

from chainer_chemistry.saliency.visualizer.mol_visualizer import SmilesVisualizer
from chainer_chemistry.saliency.visualizer.common import abs_max_scaler

visualizer = SmilesVisualizer()
# 2. compute
saliency_samples_vanilla = calculator.compute(
train, M=1, converter=concat_mols)
method = 'raw'
saliency_vanilla = calculator.aggregate(
saliency_samples_vanilla, ch_axis=3, method=method)
i = 153
visualizer.visualize(saliency_vanilla[i], smiles[i])

Red color shows the positive effect on solubility (Hydrophilic), blue color shows the negative effect on solubility (Hydrophobic).

Above figure matches the common sense of Hydrophilic effects usually occurs at polarization exists (OH), and we can see Hydrophobic effects where C-chain continues.


I introduced saliency module, which is highly flexible and applicable to any domain.

You can try all the examples with few machine resources, only with CPU, so please try!! (Saliency map visualization of image uses pre-trained model so only inference is necessary).

Leave a Comment

Your email address will not be published. Required fields are marked *