This tutorial corresponds to the 03_custom_dataset_mlp folder in the source code.
In the previous chapter, we learned how to train a deep neural network using the MNIST handwritten digits dataset. However, the MNIST dataset was prepared by a Chainer utility function, and you might now wonder how to prepare a dataset when you want to use your own data for a regression/classification task.
Chainer provides the DatasetMixin class to let you define your own dataset class.
Prepare Data
In this task, we will try a very simple regression task. Our own dataset can be generated by create_my_dataset.py.
import os

import numpy as np
import pandas as pd

DATA_DIR = 'data'


def black_box_fn(x_data):
    # Noisy sin function: the "true" relation we want to estimate
    return np.sin(x_data) + np.random.normal(0, 0.1, x_data.shape)


if __name__ == '__main__':
    if not os.path.exists(DATA_DIR):
        os.mkdir(DATA_DIR)
    x = np.arange(-5, 5, 0.01)
    t = black_box_fn(x)
    # Use a list (not a set) for columns to keep the column order fixed
    df = pd.DataFrame({'x': x, 't': t}, columns=['x', 't'])
    df.to_csv(os.path.join(DATA_DIR, 'my_data.csv'), index=False)
This script will create a very simple CSV file named "data/my_data.csv" with the column names "x" and "t". "x" is the input value and "t" is the target value to predict.
I adopted a simple sin function with a little bit of Gaussian noise to generate "t" from "x". (You may try modifying the black_box_fn function to change the function to estimate.)
Our task is to obtain a regression model of this black_box_fn.
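For reference, the generated CSV starts with a header row followed by one (x, t) pair per line, roughly like this (the t values vary per run because of the random noise; these correspond to the debug output shown later):

x,t
-5.0,0.79404432
-4.99,1.03740847
-4.98,0.88521522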
Define MyDataset as a subclass of DatasetMixin
Now that you have your own data, let's define a dataset class by inheriting the DatasetMixin class provided by Chainer.
Implementation
We usually implement the following 3 methods:

- __init__(self, *args): write initialization code here.
- __len__(self): the trainer modules (Iterator) access this to calculate the training progress within an epoch.
- get_example(self, i): return the i-th data here.
In our case, we can implement my_dataset.py as follows,
import numpy as np
import pandas as pd

import chainer


class MyDataset(chainer.dataset.DatasetMixin):

    def __init__(self, filepath, debug=False):
        self.debug = debug
        # Load the data in initialization
        df = pd.read_csv(filepath)
        self.data = df.values.astype(np.float32)
        if self.debug:
            print('[DEBUG] data: \n{}'.format(self.data))

    def __len__(self):
        """return length of this dataset"""
        return len(self.data)

    def get_example(self, i):
        """Return i-th data"""
        x, t = self.data[i]
        return [x], [t]
The most important part is the overridden method get_example(self, i), which should be implemented to return only the i-th data.
※ We don't need to care about minibatch concatenation; the Iterator will handle it. You only need to prepare a dataset that returns the i-th data :).
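For instance, here is a minimal sketch of what happens under the hood when an Iterator builds a minibatch from this dataset (the batch size of 4 is an arbitrary choice, and the my_dataset import assumes the file defined above):

import chainer
from chainer.dataset import concat_examples

from my_dataset import MyDataset  # the class defined above

dataset = MyDataset('data/my_data.csv')
# The iterator repeatedly calls dataset[i] to collect examples...
it = chainer.iterators.SerialIterator(dataset, batch_size=4, shuffle=True)
batch = next(it)  # a list of 4 examples, e.g. [([x], [t]), ...]
# ...and concat_examples stacks them into minibatch arrays
x_batch, t_batch = concat_examples(batch)
print(x_batch.shape, t_batch.shape)  # -> (4, 1) (4, 1)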
The above code works as follows:

1. In the __init__ function, we load the prepared data 'data/my_data.csv' (passed as filepath) and set the expanded array (a numpy array converted from the pandas.DataFrame) into self.data.
2. In get_example(self, i), we return the i-th data x_i and t_i, each as a vector of size 1.
How does it work
The idea is simple. You can instantiate the dataset with MyDataset() and then access the i-th data by dataset[i].
It is also possible to access the data by a slice or a one-dimensional vector: dataset[i:j] returns [dataset[i], dataset[i+1], ..., dataset[j-1]].
if __name__ == '__main__':
    # Test code
    dataset = MyDataset('data/my_data.csv', debug=True)
    print('Access by index dataset[1] = ', dataset[1])
    print('Access by slice dataset[:3] = ', dataset[:3])
    print('Access by list dataset[[3, 5]] = ', dataset[[3, 5]])
    index = np.arange(3)
    print('Access by numpy array dataset[[0, 1, 2]] = ', dataset[index])
    # Randomly take 3 data
    index = np.random.permutation(len(dataset))[:3]
    print('dataset[{}] = {}'.format(index, dataset[index]))
[DEBUG] data:
[[-5. 0.79404432]
[-4.98999977 1.03740847]
[-4.98000002 0.88521522]
...,
[ 4.96999979 -0.85200465]
[ 4.98000002 -1.10389316]
[ 4.98999977 -0.88174647]]
Access by index dataset[1] = ([-4.9899998], [1.0374085])
Access by slice dataset[:3] = [([-5.0], [0.79404432]), ([-4.9899998], [1.0374085]), ([-4.98], [0.88521522])]
Access by list dataset[[3, 5]] = [([-4.9699998], [1.0449667]), ([-4.9499998], [0.82551986])]
Access by numpy array dataset[[0, 1, 2]] = [([-5.0], [0.79404432]), ([-4.9899998], [1.0374085]), ([-4.98], [0.88521522])]
dataset[[602 377 525]] = [([1.02], [0.71344751]), ([-1.23], [-0.92034239]), ([0.25], [0.31516379])]
Flexibility of DatasetMixin – dynamic load from storage, preprocessing, data augmentation
(This may be an advanced topic for now. You may skip it and come back later.)
The nice part of the DatasetMixin class is its flexibility. Basically, you can implement anything in the get_example function, and get_example is called every time the data is accessed with dataset[i].
1. Data augmentation
This means we can write dynamic preprocessing. For example, data augmentation is a well-known, important technique to avoid overfitting and achieve a higher validation score, especially in image processing.
See the official Chainer imagenet example for reference.
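As a toy illustration in terms of this tutorial's data (not the imagenet example's code; AugmentedDataset and the noise scale 0.01 are arbitrary choices for illustration), the get_example of MyDataset could perturb x on every access, so each epoch sees slightly different inputs:

import numpy as np
import pandas as pd

import chainer


class AugmentedDataset(chainer.dataset.DatasetMixin):
    """Toy example: perturb the input on every access."""

    def __init__(self, filepath):
        self.data = pd.read_csv(filepath).values.astype(np.float32)

    def __len__(self):
        return len(self.data)

    def get_example(self, i):
        x, t = self.data[i]
        # Dynamic augmentation: a slightly different x is returned
        # every time the same index i is accessed.
        x = x + np.float32(np.random.normal(0, 0.01))
        return [x], [t]

For images, a random crop or horizontal flip would go in the same place.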
2. Dynamic load from storage
If you are dealing with very large data that cannot all be expanded in memory at once, the best practice is to load each piece of data only when it is necessary (i.e. when creating a minibatch).
We can achieve this procedure easily with the DatasetMixin class. Simply write the loading code in the get_example function so that it loads the i-th data from storage, and that's all!
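Here is a minimal sketch of the idea (LazyLoadDataset and the one-example-per-.npy-file layout are hypothetical assumptions for illustration):

import os

import numpy as np
import chainer


class LazyLoadDataset(chainer.dataset.DatasetMixin):
    """Keep only file paths in memory; load actual data on demand."""

    def __init__(self, data_dir):
        # Only the (small) list of file paths is held in memory.
        self.paths = sorted(
            os.path.join(data_dir, name)
            for name in os.listdir(data_dir) if name.endswith('.npy'))

    def __len__(self):
        return len(self.paths)

    def get_example(self, i):
        # The i-th file is read from storage only at the moment
        # the Iterator actually requests this example.
        return np.load(self.paths[i])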
Refer to dataset_introduction.ipynb if you want to know more about dataset classes.