This tutorial corresponds to the 03_custom_dataset_mlp folder in the source code.
In the previous chapter, we learned how to train a deep neural network using the MNIST handwritten digits dataset. However, the MNIST dataset was prepared by a Chainer utility, and you may now wonder how to prepare a dataset when you want to use your own data for a regression/classification task.
Chainer provides the DatasetMixin class to let you define your own dataset class.
Prepare Data
In this tutorial, we will try a very simple regression task. Our own dataset can be generated by create_my_dataset.py.
import os

import numpy as np
import pandas as pd

DATA_DIR = 'data'


def black_box_fn(x_data):
    return np.sin(x_data) + np.random.normal(0, 0.1, x_data.shape)


if __name__ == '__main__':
    if not os.path.exists(DATA_DIR):
        os.mkdir(DATA_DIR)
    x = np.arange(-5, 5, 0.01)
    t = black_box_fn(x)
    # Use a list (not a set) for columns so the column order is deterministic
    df = pd.DataFrame({'x': x, 't': t}, columns=['x', 't'])
    df.to_csv(os.path.join(DATA_DIR, 'my_data.csv'), index=False)
This script creates a very simple CSV file named "data/my_data.csv" with the columns "x" and "t". "x" is the input value and "t" is the target value to predict.
I adopted a simple sin function with a little Gaussian noise to generate "t" from "x". (You may try modifying the black_box_fn function to change the function to estimate.)
Our task is to build a regression model of this black_box_fn.
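If you want to peek at the generated file, a quick check like the following works (a minimal sketch, assuming create_my_dataset.py above has already been run):

import pandas as pd

df = pd.read_csv('data/my_data.csv')
print(df.shape)    # (1000, 2): np.arange(-5, 5, 0.01) yields 1000 samples
print(df.head(3))  # first few (x, t) pairs, t = sin(x) + noise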
Define MyDataset as a subclass of DatasetMixin
Now that you have your own data, let's define a dataset class by inheriting the DatasetMixin class provided by Chainer.
Implementation
We usually implement the following 3 methods:

__init__(self, *args)
Write initialization code here.

__len__(self)
The Trainer module (specifically, the Iterator) accesses this to calculate the training progress within an epoch.

get_example(self, i)
Return the i-th data here.
In our case, we can implement my_dataset.py as follows:
import numpy as np
import pandas as pd

import chainer


class MyDataset(chainer.dataset.DatasetMixin):

    def __init__(self, filepath, debug=False):
        self.debug = debug
        # Load the data in initialization
        df = pd.read_csv(filepath)
        self.data = df.values.astype(np.float32)
        if self.debug:
            print('[DEBUG] data: \n{}'.format(self.data))

    def __len__(self):
        """return length of this dataset"""
        return len(self.data)

    def get_example(self, i):
        """Return i-th data"""
        x, t = self.data[i]
        return [x], [t]
The most important part is the overridden method get_example(self, i), which should be implemented to return only the i-th data.
※ We don't need to care about minibatch concatenation; the Iterator will handle that. You only need to prepare a dataset that returns the i-th data :).
The above code works as follows:
1. In the __init__ initialization code, we load the prepared data 'data/my_data.csv' (passed as filepath) and store it in self.data as a numpy array (converted from the pandas.DataFrame via df.values).
2. get_example(self, i) returns the i-th data, xi and ti, each as a vector of size 1.
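To see what "the Iterator handles minibatch concatenation" means concretely, here is a minimal sketch (assuming data/my_data.csv has been generated and MyDataset is defined as above) using SerialIterator and chainer.dataset.concat_examples:

from chainer import iterators
from chainer.dataset import concat_examples

dataset = MyDataset('data/my_data.csv')
train_iter = iterators.SerialIterator(dataset, batch_size=4, repeat=False, shuffle=True)

batch = train_iter.next()  # a list of 4 examples, each obtained via get_example
x_batch, t_batch = concat_examples(batch)
print(x_batch.shape, t_batch.shape)  # -> (4, 1) (4, 1)

The dataset itself only ever returns one example at a time; the iterator samples indices and concat_examples stacks the per-example vectors into minibatch arrays.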
How does it work
The idea is simple. You can instantiate the dataset with MyDataset() and then access the i-th data with dataset[i].
It is also possible to access by slice or by a one-dimensional index array: dataset[i:j] returns [dataset[i], dataset[i+1], …, dataset[j-1]].
if __name__ == '__main__':
    # Test code
    dataset = MyDataset('data/my_data.csv', debug=True)
    print('Access by index dataset[1] = ', dataset[1])
    print('Access by slice dataset[:3] = ', dataset[:3])
    print('Access by list dataset[[3, 5]] = ', dataset[[3, 5]])
    index = np.arange(3)
    print('Access by numpy array dataset[[0, 1, 2]] = ', dataset[index])

    # Randomly take 3 data
    index = np.random.permutation(len(dataset))[:3]
    print('dataset[{}] = {}'.format(index, dataset[index]))
[DEBUG] data: 
[[-5.          0.79404432]
 [-4.98999977  1.03740847]
 [-4.98000002  0.88521522]
 ...
 [ 4.96999979 -0.85200465]
 [ 4.98000002 -1.10389316]
 [ 4.98999977 -0.88174647]]
Access by index dataset[1] =  ([-4.9899998], [1.0374085])
Access by slice dataset[:3] =  [([-5.0], [0.79404432]), ([-4.9899998], [1.0374085]), ([-4.98], [0.88521522])]
Access by list dataset[[3, 5]] =  [([-4.9699998], [1.0449667]), ([-4.9499998], [0.82551986])]
Access by numpy array dataset[[0, 1, 2]] =  [([-5.0], [0.79404432]), ([-4.9899998], [1.0374085]), ([-4.98], [0.88521522])]
dataset[[602 377 525]] = [([1.02], [0.71344751]), ([-1.23], [-0.92034239]), ([0.25], [0.31516379])]
Flexibility of DatasetMixin – dynamic load from storage, preprocessing, data augmentation
(This may be an advanced topic for now. You may skip it and come back later.)
The nice part of the DatasetMixin class is its flexibility. Basically, you can implement anything in the get_example function, and get_example is called every time the data is accessed with dataset[i].
1. Data augmentation
This means we can write dynamic preprocessing. For example, data augmentation is a well-known, important technique to avoid overfitting and achieve a higher validation score, especially in image processing.
See the Chainer official imagenet example for reference.
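As an illustration, here is a minimal sketch of augmentation inside get_example; the AugmentedImageDataset class and its random horizontal flip are my own illustrative assumptions, not code from this tutorial's repository:

import numpy as np

import chainer


class AugmentedImageDataset(chainer.dataset.DatasetMixin):
    """Hypothetical example: images is an array of shape (N, C, H, W)."""

    def __init__(self, images, labels):
        self.images = images
        self.labels = labels

    def __len__(self):
        return len(self.images)

    def get_example(self, i):
        img = self.images[i]
        # Executed on every access, so the same index can yield a
        # differently augmented image in every epoch.
        if np.random.rand() < 0.5:
            img = img[:, :, ::-1]  # flip along the width axis
        return img, self.labels[i]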
2. Dynamic load from storage
If you are dealing with very large data that cannot all be loaded into memory at once, the best practice is to load each piece of data only when necessary (i.e., when creating a minibatch).
We can achieve this easily with the DatasetMixin class: simply write the loading code in the get_example function to load the i-th data from storage, and that's all!
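For instance, here is a minimal sketch under an assumed (hypothetical) layout of one .npy file per example, named 0.npy, 1.npy, ...:

import os

import numpy as np

import chainer


class LazyLoadDataset(chainer.dataset.DatasetMixin):
    """Hypothetical example: the i-th example is stored as data_dir/<i>.npy."""

    def __init__(self, data_dir, size):
        self.data_dir = data_dir
        self.size = size  # total number of examples on disk

    def __len__(self):
        return self.size

    def get_example(self, i):
        # Only the i-th file is read, at the moment a minibatch needs it,
        # so the whole dataset never has to fit in memory at once.
        return np.load(os.path.join(self.data_dir, '{}.npy'.format(i)))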
Refer to dataset_introduction.ipynb if you want to know more about the dataset class.