Penn Tree Bank (PTB) dataset introduction

This post is based on the jupyter notebook ptb_dataset_introduction.ipynb uploaded on github.

Penn Treebank dataset, known as PTB dataset, is widely used in machine learning of NLP (Natural Language Processing) research.

Dataset if provided by the official page: Treebank-3

In Chainer, PTB dataset can be obtained with build-in function.

Let’s see the dataset structure.

from __future__ import print_function
import os
import matplotlib.pyplot as plt
%matplotlib inline

import numpy as np

import chainer

Download PTB dataset

chainer.datasets.get_ptb_words method is prepared in Chainer to get PTB dataset. Dataset is automatically downloaded from https://github.com/tomsercu/lstm/tree/master/data only for the first time, and its cache is used from second time.

train, val, test = chainer.datasets.get_ptb_words()

The dataset structure is numpy.ndarray.

train[i] represents i-th word in integer, which represents word ID.

print('train type: ', type(train), train.shape, train)
print('val   type: ', type(val), val.shape, val)
print('test  type: ', type(test), test.shape, test)

train type: <class 'numpy.ndarray'> (929589,) [ 0 1 2 ..., 39 26 24]
val type: <class 'numpy.ndarray'> (73760,) [2211 396 1129 ..., 108 27 24]
test type: <class 'numpy.ndarray'> (82430,) [142 78 54 ..., 87 214 24]

Word ID and word correspondence

Each word ID corresponds to specific word or symbol.

Symbol includes following

  • <eos> : end of sequence
  • <unk> : unknown word (I guess it is the word which was not in the 10000 vocabulary).

The relation between word ID and actual word can be obtained as dictionary with chainer.datasets.get_ptb_words_vocabulary()method.

ptb_dict = chainer.datasets.get_ptb_words_vocabulary()
print('Number of vocabulary', len(ptb_dict))
print('ptb_dict', ptb_dict)

Number of vocabulary 10000
ptb_dict {'representation': 7975, 'competent': 9733, 'unusual': 2825, 're-election': 2672, 'brewing': 7045, 'stunning': 9451, 'distributed': 6252, 'percentage': 72, 'compare': 2549, 'laughing': 3407, 'sci': 3311, 'suggested': 2611, 'incompetent': 9769, 'sandinistas': 9108, 'werner': 8877, 'poison': 6210, 'salon': 3963, 'now': 145, 'crest': 8679, 'dairy': 2018, 'lineup': 9597, 'hills': 1264, 'chip': 1157, 'creditor': 1374, 'actor': 2315, 'specialist': 2737, "'s": 119, 'flooded': 3700, 'aba': 3364, ... }

Convert to word sequences

Check original sentense by converting back word ID to word using ptb dictionary.

Train text

It is same with https://raw.githubusercontent.com/tomsercu/lstm/master/data/ptb.train.txt

ptb_word_id_dict = ptb_dict
ptb_id_word_dict = dict((v,k) for k,v in ptb_word_id_dict.items())

# Same with https://raw.githubusercontent.com/tomsercu/lstm/master/data/ptb.train.txt
print([ptb_id_word_dict[i] for i in train[:30]])

['aer', 'banknote', 'berlitz', 'calloway', 'centrust', 'cluett', 'fromstein', 'gitano', 'guterman', 'hydro-quebec', 'ipo', 'kia', 'memotec', 'mlx', 'nahb', 'punts', 'rake', 'regatta', 'rubens', 'sim', 'snack-food', 'ssangyong', 'swapo', 'wachter', '<eos>', 'pierre', '<unk>', 'N', 'years', 'old']

Now you can see that the sequence of word id is indeed a list of word which forms a meaningful text.

But list representation is little bit difficult to read for human, let’s convert to natural text using ' '.join() method.

# ' '.join() will convert list representation more readable

' '.join([ptb_id_word_dict[i] for i in train[:300]])

"aer banknote berlitz calloway centrust cluett fromstein gitano guterman hydro-quebec ipo kia memotec mlx nahb punts rake regatta rubens sim snack-food ssangyong swapo wachter <eos> pierre <unk> N years old will join the board as a nonexecutive director nov. N <eos> mr. <unk> is chairman of <unk> n.v. the dutch publishing group <eos> rudolph <unk> N years old and former chairman of consolidated gold fields plc was named a nonexecutive director of this british industrial conglomerate <eos> a form of asbestos once used to make kent cigarette filters has caused a high percentage of cancer deaths among a group of workers exposed to it more than N years ago researchers reported <eos> the asbestos fiber <unk> is unusually <unk> once it enters the <unk> with even brief exposures to it causing symptoms that show up decades later researchers said <eos> <unk> inc. the unit of new york-based <unk> corp. that makes kent cigarettes stopped using <unk> in its <unk> cigarette filters in N <eos> although preliminary findings were reported more than a year ago the latest results appear in today 's new england journal of medicine a forum likely to bring new attention to the problem <eos> a <unk> <unk> said this is an old story <eos> we 're talking about years ago before anyone heard of asbestos having any questionable properties <eos> there is no asbestos in our products now <eos> neither <unk> nor the researchers who studied the workers were aware of any research on smokers of the kent cigarettes <eos> we have no useful information on whether users are at risk said james a. <unk> of boston 's <unk> cancer institute <eos> dr. <unk> led a team of researchers from the national cancer institute and the medical schools of harvard university and boston university <eos> the <unk>"

Validation data text

It is same with https://raw.githubusercontent.com/tomsercu/lstm/master/data/ptb.valid.txt

print(' '.join([ptb_id_word_dict[i] for i in val[:300]]))


consumers may want to move their telephones a little closer to the tv set <eos> <unk> <unk> watching abc 's monday night football can now vote during <unk> for the greatest play in N years from among four or five <unk> <unk> <eos> two weeks ago viewers of several nbc <unk> consumer segments started calling a N number for advice on various <unk> issues <eos> and the new syndicated reality show hard copy records viewers ' opinions for possible airing on the next day 's show <eos> interactive telephone technology has taken a new leap in <unk> and television programmers are racing to exploit the possibilities <eos> eventually viewers may grow <unk> with the technology and <unk> the cost <eos> but right now programmers are figuring that viewers who are busy dialing up a range of services may put down their <unk> control <unk> and stay <unk> <eos> we 've been spending a lot of time in los angeles talking to tv production people says mike parks president of call interactive which supplied technology for both abc sports and nbc 's consumer minutes <eos> with the competitiveness of the television market these days everyone is looking for a way to get viewers more excited <eos> one of the leaders behind the expanded use of N numbers is call interactive a joint venture of giants american express co. and american telephone & telegraph co <eos> formed in august the venture <unk> at&t 's newly expanded N service with N <unk> computers in american express 's omaha neb. service center <eos> other long-distance carriers have also begun marketing enhanced N service and special consultants are <unk> up to exploit the new tool <eos> blair entertainment a new york firm that advises tv stations and sells ads for them has just formed a

Test data text

It is same with https://raw.githubusercontent.com/tomsercu/lstm/master/data/ptb.test.txt

print(' '.join([ptb_id_word_dict[i] for i in test[:300]]))


no it was n't black monday <eos> but while the new york stock exchange did n't fall apart friday as the dow jones industrial average plunged N points most of it in the final hour it barely managed to stay this side of chaos <eos> some circuit breakers installed after the october N crash failed their first test traders say unable to cool the selling panic in both stocks and futures <eos> the N stock specialist firms on the big board floor the buyers and sellers of last resort who were criticized after the N crash once again could n't handle the selling pressure <eos> big investment banks refused to step up to the plate to support the beleaguered floor traders by buying big blocks of stock traders say <eos> heavy selling of standard & poor 's 500-stock index futures in chicago <unk> beat stocks downward <eos> seven big board stocks ual amr bankamerica walt disney capital cities\/abc philip morris and pacific telesis group stopped trading and never resumed <eos> the <unk> has already begun <eos> the equity market was <unk> <eos> once again the specialists were not able to handle the imbalances on the floor of the new york stock exchange said christopher <unk> senior vice president at <unk> securities corp <eos> <unk> james <unk> chairman of specialists henderson brothers inc. it is easy to say the specialist is n't doing his job <eos> when the dollar is in a <unk> even central banks ca n't stop it <eos> speculators are calling for a degree of liquidity that is not there in the market <eos> many money managers and some traders had already left their offices early friday afternoon on a warm autumn day because the stock market was so quiet <eos> then in a <unk> plunge the dow

Leave a Comment

Your email address will not be published. Required fields are marked *