Create dataset class from your own data with DatasetMixin

This tutorial corresponds to the 03_custom_dataset_mlp folder in the source code.

In the previous chapter we learned how to train a deep neural network using the MNIST handwritten digits dataset. However, the MNIST dataset was prepared for us by a Chainer utility library, so you might now wonder how to prepare a dataset when you want to use your own data for a regression or classification task.

Chainer provides the DatasetMixin class to let you define your own dataset class.

Prepare Data

In this tutorial, we will try a very simple regression task. You can generate your own dataset with create_my_dataset.py:

import os
import numpy as np
import pandas as pd


DATA_DIR = 'data'


def black_box_fn(x_data):
    return np.sin(x_data) + np.random.normal(0, 0.1, x_data.shape)


if __name__ == '__main__':
    if not os.path.exists(DATA_DIR):
        os.mkdir(DATA_DIR)

    x = np.arange(-5, 5, 0.01)
    t = black_box_fn(x)
    # Use a list (not a set) so that the column order is always x, t
    df = pd.DataFrame({'x': x, 't': t}, columns=['x', 't'])
    df.to_csv(os.path.join(DATA_DIR, 'my_data.csv'), index=False)


This script creates a very simple CSV file named “data/my_data.csv”, with the columns “x” and “t”. “x” is the input value and “t” is the target value to predict.

I adopted a simple sin function with a little Gaussian noise to generate “t” from “x”. (You may try modifying black_box_fn to change the function to estimate.)

Our task is to get a regression model of this black_box_fn.

Define MyDataset as a subclass of DatasetMixin

Now that you have your own data, let’s define a dataset class by inheriting from the DatasetMixin class provided by Chainer.

Implementation

We usually implement three methods:

  • __init__(self, *args)
    Write the initialization code here.
  • __len__(self)
    The Trainer module (Iterator) calls this method to calculate the training progress in epochs.
  • get_example(self, i)
    Return the i-th data here.

In our case, we can implement my_dataset.py as

import numpy as np
import pandas as pd

import chainer


class MyDataset(chainer.dataset.DatasetMixin):

    def __init__(self, filepath, debug=False):
        self.debug = debug
        # Load the data in initialization
        df = pd.read_csv(filepath)
        self.data = df.values.astype(np.float32)
        if self.debug:
            print('[DEBUG] data: \n{}'.format(self.data))

    def __len__(self):
        """return length of this dataset"""
        return len(self.data)

    def get_example(self, i):
        """Return i-th data"""
        x, t = self.data[i]
        return [x], [t]


The most important part is the overridden method get_example(self, i), which should be implemented to return only the i-th data.

※ We don’t need to care about minibatch concatenation; the Iterator will handle that. You only need to prepare a dataset that returns the i-th data :).
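To make clear what is being taken off your hands, here is a rough NumPy-only sketch of the concatenation step. It only illustrates the idea and is not Chainer's implementation; in Chainer this job is typically done by a converter such as chainer.dataset.concat_examples. The helper name concat_batch is hypothetical.

```python
import numpy as np


def concat_batch(examples):
    """Stack a list of (x, t) examples into two batch arrays.

    Hypothetical helper for illustration; it mimics the idea of a
    converter and is not part of Chainer.
    """
    xs = np.array([x for x, _ in examples], dtype=np.float32)
    ts = np.array([t for _, t in examples], dtype=np.float32)
    return xs, ts


# Three examples in the same form that get_example returns: ([x], [t])
examples = [([-5.0], [0.79]), ([-4.99], [1.04]), ([-4.98], [0.89])]
x_batch, t_batch = concat_batch(examples)
print(x_batch.shape, t_batch.shape)  # (3, 1) (3, 1)
```

Because each example is already a vector of size 1, stacking a minibatch of n examples gives arrays of shape (n, 1).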

The above code works as follows:

1. In __init__, we load the prepared data ‘data/my_data.csv’ (passed as filepath) and store it in self.data as an in-memory array (strictly, a numpy.ndarray converted from the pandas.DataFrame).

2. In get_example(self, i), we return the i-th data x_i and t_i, each as a vector of size 1.

How does it work

The idea is simple. You instantiate the dataset with MyDataset(), and then you can access the i-th data with dataset[i].

It is also possible to access the data with a slice or a one-dimensional vector of indices: dataset[i:j] returns [dataset[i], dataset[i+1], …, dataset[j-1]].

if __name__ == '__main__':
    # Test code
    dataset = MyDataset('data/my_data.csv', debug=True)

    print('Access by index dataset[1] = ', dataset[1])
    print('Access by slice dataset[:3] = ', dataset[:3])
    print('Access by list dataset[[3, 5]] = ', dataset[[3, 5]])
    index = np.arange(3)
    print('Access by numpy array dataset[[0, 1, 2]] = ', dataset[index])
    # Randomly take 3 data
    index = np.random.permutation(len(dataset))[:3]
    print('dataset[{}] = {}'.format(index, dataset[index]))

[DEBUG] data:
[[-5. 0.79404432]
[-4.98999977 1.03740847]
[-4.98000002 0.88521522]
...,
[ 4.96999979 -0.85200465]
[ 4.98000002 -1.10389316]
[ 4.98999977 -0.88174647]]
Access by index dataset[1] = ([-4.9899998], [1.0374085])
Access by slice dataset[:3] = [([-5.0], [0.79404432]), ([-4.9899998], [1.0374085]), ([-4.98], [0.88521522])]
Access by list dataset[[3, 5]] = [([-4.9699998], [1.0449667]), ([-4.9499998], [0.82551986])]
Access by numpy array dataset[[0, 1, 2]] = [([-5.0], [0.79404432]), ([-4.9899998], [1.0374085]), ([-4.98], [0.88521522])]
dataset[[602 377 525]] = [([1.02], [0.71344751]), ([-1.23], [-0.92034239]), ([0.25], [0.31516379])]
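
The dispatch behind these access patterns can be pictured with a simplified stand-in. This is not Chainer's source code; the class names IndexDispatchSketch and SquaresDataset are hypothetical, and the sketch only imitates how an integer, a slice, and an index list all end up calling get_example(i).

```python
import numpy as np


class IndexDispatchSketch:
    """Simplified stand-in for the indexing behaviour described above.

    This is NOT Chainer's source code; it only imitates how dataset[i],
    slices, and index lists all end up calling get_example(i).
    """

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Expand the slice into concrete indices
            start, stop, step = index.indices(len(self))
            return [self.get_example(i) for i in range(start, stop, step)]
        if isinstance(index, (list, np.ndarray)):
            # One get_example call per requested index
            return [self.get_example(int(i)) for i in index]
        return self.get_example(index)


class SquaresDataset(IndexDispatchSketch):
    """Toy dataset whose i-th example is (i, i * i)."""

    def __len__(self):
        return 10

    def get_example(self, i):
        return i, i * i


dataset = SquaresDataset()
print(dataset[2])        # (2, 4)
print(dataset[:3])       # [(0, 0), (1, 1), (2, 4)]
print(dataset[[3, 5]])   # [(3, 9), (5, 25)]
```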

Flexibility of DatasetMixin – dynamic load from storage, preprocess, data augmentation

(This may be an advanced topic for now. You may skip it and come back later.)

The nice part of the DatasetMixin class is its flexibility. You can implement basically anything in the get_example function, and get_example is called every time we access the data with dataset[i].

1. Data augmentation

This means we can write dynamic preprocessing. For example, data augmentation is a well-known and important technique for avoiding overfitting and achieving a high validation score, especially in image processing.

See the official Chainer imagenet example for reference.
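
For the regression data in this tutorial, a minimal sketch of the same idea might add small random noise to the input each time an example is requested. The class name NoisyDataset and the noise_std parameter are assumptions made for illustration, and jittering the input is only one possible kind of augmentation.

```python
import numpy as np


class NoisyDataset:
    """Sketch of on-the-fly augmentation inside get_example.

    In real code this class would inherit chainer.dataset.DatasetMixin;
    the base class is left out so the sketch runs with NumPy alone.
    The noise_std parameter is a hypothetical choice for illustration.
    """

    def __init__(self, data, noise_std=0.05, seed=0):
        self.data = np.asarray(data, dtype=np.float32)
        self.noise_std = noise_std
        self.rng = np.random.RandomState(seed)

    def __len__(self):
        return len(self.data)

    def get_example(self, i):
        x, t = self.data[i]
        # Perturb the input every time the example is requested
        x = x + self.rng.normal(0, self.noise_std)
        return [np.float32(x)], [t]


dataset = NoisyDataset([[0.0, 0.0], [1.0, 0.84]])
first = dataset.get_example(1)
second = dataset.get_example(1)
print(first, second)  # same target, slightly different inputs
```

Since the noise is drawn inside get_example, the stored data never changes; each epoch simply sees a slightly different version of every example.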

2. Dynamic load from storage

If you are dealing with very large data that cannot all be held in memory at once, the best practice is to load the data only when it is needed (when creating a minibatch).

We can achieve this easily with the DatasetMixin class. Simply write the loading code in the get_example function so that it loads the i-th data from storage. That’s all!
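
A minimal sketch of this pattern is shown here, under the assumption that each example is stored in its own .npy file. The class name LazyFileDataset and the file layout are hypothetical; adapt the loading code to however your data is actually stored.

```python
import os
import tempfile

import numpy as np


class LazyFileDataset:
    """Sketch of loading each example from storage on demand.

    In real code this class would inherit chainer.dataset.DatasetMixin;
    the base class is left out so the sketch runs with NumPy alone.
    The one-.npy-file-per-example layout is an assumption for illustration.
    """

    def __init__(self, filepaths):
        # Keep only the paths in memory, not the data itself
        self.filepaths = list(filepaths)

    def __len__(self):
        return len(self.filepaths)

    def get_example(self, i):
        # Read the i-th example from disk only when it is requested
        x, t = np.load(self.filepaths[i])
        return [x], [t]


# Write a few tiny example files to a temporary directory
tmp_dir = tempfile.mkdtemp()
paths = []
for k in range(3):
    path = os.path.join(tmp_dir, 'example_{}.npy'.format(k))
    np.save(path, np.array([k, k * 2], dtype=np.float32))
    paths.append(path)

dataset = LazyFileDataset(paths)
print(len(dataset), dataset.get_example(2))
```

Only the list of file paths lives in memory, so the memory footprint stays small no matter how many examples the dataset contains.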

Refer to dataset_introduction.ipynb if you want to know more about the dataset class.

Next: Training code for MyDataset
