Understanding the convolutional layer

The source code is available on GitHub.
The sample image was obtained from Pexels.

What is the difference between a convolutional layer and a linear layer? What is the intuition behind using convolutional layers in a deep neural network?

This hands-on post shows some effects of a convolutional layer to give some intuition about what it does.





The type of diagram above often appears in the convolutional neural network field. The figure below explains its notation.


The cuboid represents an “image” array, although this image is not necessarily a meaningful picture. The horizontal axis represents the channel number, the vertical axis the image height, and the depth axis the image width.


Convolution layer – basic usage

The input of a convolutional layer is ordered as (batch index, channel, height, width). Since an OpenCV image is ordered as (height, width, channel), the dimensions need to be reordered before the image is fed to a convolution layer.

This can be done with the transpose method.
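The reordering described above can be sketched with NumPy as follows. This is a minimal illustration rather than the post's original code: the array is a zero-filled stand-in with the same shape as the sample image, and the variable names are assumptions made for this example.

```python
import numpy as np

# Stand-in for an image loaded with OpenCV: (height, width, channel).
# The shape matches the sample image used in this post.
image_hwc = np.zeros((380, 512, 3), dtype=np.float32)

# Move the channel axis to the front: (channel, height, width).
image_chw = image_hwc.transpose(2, 0, 1)

# Add a leading batch axis: (batch index, channel, height, width).
image_nchw = image_chw[np.newaxis, ...]

print(image_nchw.shape)  # (1, 3, 380, 512)

# Going back for display: drop the batch axis and restore (height, width, channel).
restored = image_nchw[0].transpose(1, 2, 0)
print(restored.shape)  # (380, 512, 3)
```

The inverse transpose at the end is the step needed before saving or showing the output of a convolution layer as an ordinary image.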

L.Convolution2D(in_channels, out_channels, ksize)

  • in_channels: input channel number.
  • out_channels: output channel number.
  • ksize: kernel size.

The following parameters are also often set:

  • pad: padding
  • stride: stride

To understand the behavior of the convolution layer, I recommend watching the animations in conv_arithmetic.

image.shape (Height, Width, Channel) = (380, 512, 3)
image shape (1, 3, 380, 512)
shape (1, 3, 376, 508)
shape 2 (376, 508, 3)

output_conv1


A Convolution2D layer takes a 4-dim array as input and outputs a 4-dim array. The graphical meaning of this input-output relationship is drawn in the figure below.


When in_channels is set to None, its size is determined the first time the layer is used, i.e., at out_image = conv1(image).data in the above code.

The internal parameter W is initialized randomly at that time. As you can see, output_conv1.jpg shows the result after a random filter is applied.

Some “features” can be extracted by applying a convolution layer.

For example, a random filter sometimes acts as a “blurring” or “edge extracting” filter on the image.

To understand the intuitive meaning of the convolutional layer in more detail, please see the example below.

gray_image.shape (Height, Width) = (380, 512)
[[[[-0.17837302 0.2948513 -0.0661072 ]
    [ 0.02076577 -0.14251317 -0.05151904]
    [ 0.01675515 0.07612066 0.37937522]]]]
image.shape (1, 3, 380, 512)
out_image_v.shape (1, 1, 378, 510)
out_image_v.shape (after transpose) (378, 510, 1)




As you can see from the result, each convolution filter emphasizes or extracts the color difference along a specific direction. In this way a “filter”, also called a “kernel”, can be considered a feature extractor.
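To make the "feature extractor" idea concrete, here is a small NumPy sketch of a hand-picked 3x3 kernel applied to a toy grayscale image. It is an illustration under stated assumptions, not the post's original code: the helper correlate2d_valid, the toy image, and the kernel values are all made up for this example. As in deep learning frameworks, the kernel is slid over the image without being flipped.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=np.float32)
    for y in range(out_h):
        for x in range(out_w):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

# Toy grayscale image: dark left half (0), bright right half (1).
gray = np.zeros((5, 6), dtype=np.float32)
gray[:, 3:] = 1.0

# Hand-picked kernel that responds to brightness changes from left to right.
kernel_v = np.array([[-1, 0, 1],
                     [-1, 0, 1],
                     [-1, 0, 1]], dtype=np.float32)

response = correlate2d_valid(gray, kernel_v)
print(response.shape)  # (3, 4)
print(response[0])     # [0. 3. 3. 0.]
```

The response is zero in the flat regions and large only where the window straddles the boundary between dark and bright, which is the sense in which the kernel extracts an edge along one direction.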

Convolution with stride

The default value of stride is 1. If a larger value is specified, the convolution layer reduces the output image size.

In practice, stride=2 is often used to generate an output image whose height and width are almost half those of the input image.



image.shape (Height, Width, Channel) = (1, 3, 380, 512)
input image.shape (1, 3, 380, 512)
out_image.shape (1, 5, 187, 253)




As written in the Chainer docs, the relation between the input and output shapes is given by the formula below:

$$ h_O = (h_I + 2h_P - h_K) / s_Y + 1 $$

$$ w_O = (w_I + 2w_P - w_K) / s_X + 1 $$

where the symbols mean:

  • \(h \): height
  • \(w \): width
  • \(I \): input
  • \(O \): output
  • \(P \): padding
  • \(K \): kernel size
  • \(s \): stride
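The relation above can be checked against the shapes printed earlier in this post with a few lines of plain Python. This is a sketch: the helper name conv_out_size is made up for this example, the division is assumed to round down (integer division), and the kernel sizes 5 and 3 are inferred from the printed shapes rather than taken from the original code. The stride-2 call uses hypothetical settings (ksize=3, pad=1) chosen only to show the "almost half" effect.

```python
def conv_out_size(size_in, ksize, stride=1, pad=0):
    """Output length along one axis: (in + 2*pad - kernel) // stride + 1."""
    return (size_in + 2 * pad - ksize) // stride + 1

# 380 x 512 input, stride 1, no padding.
# ksize=5 is consistent with the (1, 3, 376, 508) output shown earlier.
print(conv_out_size(380, 5), conv_out_size(512, 5))  # 376 508

# ksize=3 is consistent with the (1, 1, 378, 510) output shown earlier.
print(conv_out_size(380, 3), conv_out_size(512, 3))  # 378 510

# Hypothetical stride-2 setting: height and width are halved.
print(conv_out_size(380, 3, stride=2, pad=1),
      conv_out_size(512, 3, stride=2, pad=1))        # 190 256
```

The same function applies to height and width independently, so a non-square kernel or different strides per axis only change the arguments passed for each axis.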

Max pooling

A convolution layer with stride can be used to look at features over a wide range; another popular method is max pooling.

The max pooling function extracts the maximum value within the kernel window and discards the information in the remaining pixels.

This behavior is beneficial for imposing translational symmetry. For example, consider a picture of a dog. Even if every pixel is shifted by one pixel, it should still be recognized as a dog. So translational symmetry can be exploited to reduce the model’s calculation time and the number of internal parameters for image classification tasks.

image.shape (Height, Width, Channel) = (1, 3, 380, 512)
input image.shape (1, 3, 380, 512)
out_image.shape (1, 3, 190, 256)
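The halving of height and width shown above is consistent with a 2x2 pooling window moved with stride 2, although the original settings are not shown here, so treat that as an assumption. A minimal NumPy sketch of that case, with a made-up helper name max_pool_2x2 and a toy input:

```python
import numpy as np

def max_pool_2x2(image):
    """2x2 max pooling with stride 2 on a (height, width) array with even sides."""
    h, w = image.shape
    # Split the array into non-overlapping 2x2 blocks, then take each block's maximum.
    blocks = image.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

x = np.array([[1, 2, 0, 0],
              [3, 4, 0, 5],
              [0, 0, 9, 8],
              [0, 6, 7, 0]], dtype=np.float32)

pooled = max_pool_2x2(x)
print(pooled)
# [[4. 5.]
#  [6. 9.]]

# The same operation on one channel of the sample image size halves both sides.
print(max_pool_2x2(np.zeros((380, 512), dtype=np.float32)).shape)  # (190, 256)
```

Only the largest value in each window survives, so small shifts of a bright pixel inside a window leave the output unchanged.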


Convolutional neural network

By combining the above functions with non-linear activation units, a Convolutional Neural Network (CNN) can be constructed.

For the non-linear activation, relu, leaky_relu, sigmoid or tanh are often used.
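For reference, the four activations named above can be written for a single scalar in plain Python. This is only a sketch of the mathematical definitions, not the framework's implementation, and the leaky_relu slope of 0.2 is an illustrative choice.

```python
import math

def relu(x):
    """Pass positive values through, clamp negative values to zero."""
    return max(0.0, x)

def leaky_relu(x, slope=0.2):
    """Like relu, but negative values are scaled by `slope` instead of zeroed."""
    return x if x >= 0 else slope * x

def sigmoid(x):
    """Squash any real value into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Squash any real value into the range (-1, 1)."""
    return math.tanh(x)

for f in (relu, leaky_relu, sigmoid, tanh):
    print(f.__name__, f(-2.0), f(0.0), f(3.0))
```

In a CNN these functions are applied element-wise to the output of each convolution layer, which is what lets stacked layers represent more than a single linear filter.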

input image.shape (1, 3, 380, 512)
out_image.shape (1, 5, 47, 63)



Let’s see how this CNN can be used for image classification in the following posts. Before that, the next post explains the CIFAR-10 and CIFAR-100 datasets, which are well-known image classification datasets for research.

Next: CIFAR-10, CIFAR-100 dataset introduction

