This post mainly explains train_ptb.py, which is uploaded on GitHub.
We have already learned the RNN and LSTM network architectures, so let's apply them to the PTB dataset.
It is quite similar to train_simple_sequence.py, explained in Training RNN with simple sequence dataset, so not much explanation should be necessary.
Train code
I will just paste the whole training code for PTB first:
"""
RNN Training code with Penn Treebank (ptb) dataset
Ref: https://github.com/chainer/chainer/blob/master/examples/ptb/train_ptb.py
"""
from __future__ import print_function
import os
import sys
import argparse
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import chainer
import chainer.functions as F
import chainer.links as L
from chainer import training, iterators, serializers, optimizers
from chainer.training import extensions
sys.path.append(os.pardir)
from RNN import RNN
from RNN2 import RNN2
from RNN3 import RNN3
from RNNForLM import RNNForLM
from parallel_sequential_iterator import ParallelSequentialIterator
from bptt_updater import BPTTUpdater
# Routine to rewrite the result dictionary of LogReport to add perplexity
# values
def compute_perplexity(result):
    result['perplexity'] = np.exp(result['main/loss'])
    if 'validation/main/loss' in result:
        result['val_perplexity'] = np.exp(result['validation/main/loss'])
def main():
    archs = {
        'rnn': RNN,
        'rnn2': RNN2,
        'rnn3': RNN3,
        'lstm': RNNForLM
    }

    parser = argparse.ArgumentParser(description='RNN example')
    parser.add_argument('--arch', '-a', choices=archs.keys(),
                        default='rnn', help='Net architecture')
    parser.add_argument('--unit', '-u', type=int, default=100,
                        help='Number of RNN units in each layer')
    parser.add_argument('--bproplen', '-l', type=int, default=20,
                        help='Number of words in each mini-batch '
                             '(= length of truncated BPTT)')
    parser.add_argument('--batchsize', '-b', type=int, default=10,
                        help='Number of examples in each mini-batch')
    parser.add_argument('--epoch', '-e', type=int, default=10,
                        help='Number of sweeps over the dataset to train')
    parser.add_argument('--gpu', '-g', type=int, default=-1,
                        help='GPU ID (negative value indicates CPU)')
    parser.add_argument('--out', '-o', default='result',
                        help='Directory to output the result')
    parser.add_argument('--resume', '-r', default='',
                        help='Resume the training from snapshot')
    args = parser.parse_args()

    print('GPU: {}'.format(args.gpu))
    print('# Architecture: {}'.format(args.arch))
    print('# Minibatch-size: {}'.format(args.batchsize))
    print('# epoch: {}'.format(args.epoch))

    # 1. Load dataset: Penn Tree Bank long word sequence dataset
    train, val, test = chainer.datasets.get_ptb_words()
    n_vocab = max(train) + 1  # train is just an array of integers
    print('# vocab: {}'.format(n_vocab))
    print('')

    # 2. Setup model
    model = archs[args.arch](n_vocab=n_vocab,
                             n_units=args.unit)  # , activation=F.tanh
    classifier_model = L.Classifier(model)
    classifier_model.compute_accuracy = False  # we only want the perplexity
    if args.gpu >= 0:
        chainer.cuda.get_device(args.gpu).use()  # Make a specified GPU current
        classifier_model.to_gpu()  # Copy the model to the GPU
    eval_classifier_model = classifier_model.copy()  # Model with shared params and distinct states
    eval_model = eval_classifier_model.predictor

    # 3. Setup an optimizer
    optimizer = optimizers.Adam(alpha=0.001)
    # optimizer = optimizers.MomentumSGD()
    optimizer.setup(classifier_model)

    # 4. Setup an Iterator
    train_iter = ParallelSequentialIterator(train, args.batchsize)
    val_iter = ParallelSequentialIterator(val, 1, repeat=False)
    test_iter = ParallelSequentialIterator(test, 1, repeat=False)

    # 5. Setup an Updater
    updater = BPTTUpdater(train_iter, optimizer, args.bproplen, args.gpu)

    # 6. Setup a trainer (and extensions)
    trainer = training.Trainer(updater, (args.epoch, 'epoch'), out=args.out)

    # Evaluate the model with the validation dataset for each epoch
    trainer.extend(extensions.Evaluator(
        val_iter, eval_classifier_model, device=args.gpu,
        # Reset the RNN state at the beginning of each evaluation
        eval_hook=lambda _: eval_model.reset_state()))
    trainer.extend(extensions.dump_graph('main/loss'))
    trainer.extend(extensions.snapshot(), trigger=(1, 'epoch'))
    interval = 500
    trainer.extend(extensions.LogReport(postprocess=compute_perplexity,
                                        trigger=(interval, 'iteration')))
    trainer.extend(extensions.PrintReport(
        ['epoch', 'iteration', 'perplexity', 'val_perplexity', 'elapsed_time']
    ), trigger=(interval, 'iteration'))
    trainer.extend(extensions.PlotReport(
        ['perplexity', 'val_perplexity'],
        x_key='epoch', file_name='perplexity.png'))
    trainer.extend(extensions.ProgressBar(update_interval=10))

    # Resume from a snapshot
    if args.resume:
        serializers.load_npz(args.resume, trainer)

    # Run the training
    trainer.run()

    # Save the trained model
    serializers.save_npz('{}/{}_ptb.model'.format(args.out, args.arch), model)

    # Evaluate the final model with the test dataset
    print('test')
    eval_model.reset_state()
    evaluator = extensions.Evaluator(test_iter, eval_classifier_model,
                                     device=args.gpu)
    result = evaluator()
    print('test perplexity:', np.exp(float(result['main/loss'])))


if __name__ == '__main__':
    main()
In the following, I will explain the points that differ from the simple_sequence dataset example.
PTB dataset preparation: train, validation and test
Dataset preparation is done by the get_ptb_words method provided by Chainer:
# 1. Load dataset: Penn Tree Bank long word sequence dataset
train, val, test = chainer.datasets.get_ptb_words()
n_vocab = max(train) + 1 # train is just an array of integers
print('# vocab: {}'.format(n_vocab))
print('')
Note that the PTB dataset consists of train, validation, and test data, while previous projects like MNIST, CIFAR-10, and CIFAR-100 consisted of only train and test data.
In the above training code, we use the train dataset to train the model, the validation dataset to monitor the validation loss during training (for example, you may tune hyperparameters based on the validation loss), and the test dataset only after training is completely finished, just to check/evaluate the model's performance.
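If you want to peek at the loaded data yourself, a small sketch like the following should work. Each split is simply a one-dimensional array of word IDs, and get_ptb_words_vocabulary returns the word-to-ID dictionary, which lets us map the IDs back to words:

import chainer

train, val, test = chainer.datasets.get_ptb_words()
print(train.dtype, train.shape, val.shape, test.shape)  # 1-D arrays of word IDs
print(train[:10])  # the first 10 word IDs of the training text

# Map IDs back to words to inspect the actual text
word_to_id = chainer.datasets.get_ptb_words_vocabulary()
id_to_word = {i: w for w, i in word_to_id.items()}
print(' '.join(id_to_word[int(i)] for i in train[:10]))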
Monitor the loss by perplexity
In NLP, it is common to measure a model's performance by perplexity instead of by the softmax cross entropy loss or the classification accuracy.
Perplexity of a probability distribution
The perplexity of a discrete probability distribution p is defined as

2^H(p) = 2^(-Σ_x p(x) log2 p(x))

where H(p) is the entropy (in bits) of the distribution and x ranges over the events.
Perplexity per word
In natural language processing, perplexity is a way of evaluating language models. A language model is a probability distribution over entire sentences or texts.
(cited from https://en.wikipedia.org/wiki/Perplexity)
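As a quick sanity check of this definition, a uniform distribution over k outcomes has entropy log2(k), so its perplexity is exactly k:

2^H(p) = 2^(-Σ_x (1/k) log2(1/k)) = 2^(log2 k) = k

Intuitively, a language model with per-word perplexity k is, on average, as uncertain about the next word as if it were choosing uniformly among k candidate words.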
It is calculated easily by just taking the exponential of the mean softmax cross entropy loss: the mean cross entropy is the average negative log-likelihood per word, so exponentiating it (in the same base as the logarithm, which is the natural logarithm in Chainer, hence np.exp) gives exactly the per-word perplexity,
result['perplexity'] = np.exp(result['main/loss'])
and in Chainer, we can display it with the LogReport extension.
This is done by passing the post-processing function compute_perplexity as LogReport's postprocess argument:
# Routine to rewrite the result dictionary of LogReport to add perplexity
# values
def compute_perplexity(result):
    result['perplexity'] = np.exp(result['main/loss'])
    if 'validation/main/loss' in result:
        result['val_perplexity'] = np.exp(result['validation/main/loss'])

...

interval = 500
trainer.extend(extensions.LogReport(postprocess=compute_perplexity,
                                    trigger=(interval, 'iteration')))
LogReport's postprocess argument takes a function, and this function receives the argument result, a dictionary containing the reported values.
Since 'main/loss' and 'validation/main/loss' are reported by the Classifier and the Evaluator respectively, we can extract these values from result to calculate perplexity and val_perplexity. Once they are set in the result dictionary, they can be shown by PrintReport under the same key names.
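To make this concrete, here is a minimal sketch (with made-up loss values, not actual training output) of what compute_perplexity does to the dictionary that LogReport passes in:

import numpy as np

def compute_perplexity(result):
    result['perplexity'] = np.exp(result['main/loss'])
    if 'validation/main/loss' in result:
        result['val_perplexity'] = np.exp(result['validation/main/loss'])

# Hypothetical reported values, just for illustration
result = {'epoch': 1, 'iteration': 500,
          'main/loss': 4.6, 'validation/main/loss': 5.0}
compute_perplexity(result)
print(result['perplexity'])      # exp(4.6) is roughly 99.5
print(result['val_perplexity'])  # exp(5.0) is roughly 148.4

Because 'perplexity' and 'val_perplexity' now exist as keys in the result dictionary, PrintReport can display them just like any other reported value.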