Penn Treebank (PTB) dataset introduction

This post is based on the Jupyter notebook ptb_dataset_introduction.ipynb uploaded on GitHub.

The Penn Treebank dataset, known as the PTB dataset, is widely used in machine learning research for NLP (Natural Language Processing).

The dataset is provided by the official page: Treebank-3

In Chainer, the PTB dataset can be obtained with a built-in function.

Let’s see the dataset structure.

from __future__ import print_function
import os
import matplotlib.pyplot as plt
%matplotlib inline

import numpy as np

import chainer

Download PTB dataset

Chainer provides the chainer.datasets.get_ptb_words function to get the PTB dataset. The dataset is downloaded automatically on the first call only; later calls use the cached copy.

train, val, test = chainer.datasets.get_ptb_words()

Each of train, val and test is a numpy.ndarray.

train[i] is an integer word ID that represents the i-th word of the training text.

print('train type: ', type(train), train.shape, train)
print('val   type: ', type(val), val.shape, val)
print('test  type: ', type(test), test.shape, test)

train type: <class 'numpy.ndarray'> (929589,) [ 0 1 2 ..., 39 26 24]
val type: <class 'numpy.ndarray'> (73760,) [2211 396 1129 ..., 108 27 24]
test type: <class 'numpy.ndarray'> (82430,) [142 78 54 ..., 87 214 24]

Word ID and word correspondence

Each word ID corresponds to a specific word or symbol.

The symbols include the following:

  • <eos> : end of sequence
  • <unk> : unknown word (I guess it is a word that was not in the 10000-word vocabulary).
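
To illustrate the guess about <unk>, here is a minimal sketch of the general technique of limiting a vocabulary: keep the most frequent words and replace everything else with <unk>. It uses a toy sentence, and replace_rare_words is a hypothetical helper written for this post. It is not how the PTB files were actually prepared, which this post does not cover.

```python
from collections import Counter


def replace_rare_words(tokens, vocab_size):
    """Keep the `vocab_size` most frequent words, map the rest to <unk>.

    Hypothetical helper for illustration only, not part of Chainer.
    """
    counts = Counter(tokens)
    keep = {word for word, _ in counts.most_common(vocab_size)}
    return [word if word in keep else '<unk>' for word in tokens]


toy_tokens = 'the cat saw the dog and the cat ran'.split()
toy_result = replace_rare_words(toy_tokens, vocab_size=2)
print(' '.join(toy_result))
# the cat <unk> the <unk> <unk> the cat <unk>
```

With a vocabulary of only two words, "the" and "cat" survive and every other word becomes <unk>, which is the same pattern you see in the PTB text with a much larger vocabulary.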

The mapping from each word to its word ID can be obtained as a dictionary with the chainer.datasets.get_ptb_words_vocabulary() function.

ptb_dict = chainer.datasets.get_ptb_words_vocabulary()
print('Number of vocabulary', len(ptb_dict))
print('ptb_dict', ptb_dict)

Number of vocabulary 10000
ptb_dict {'representation': 7975, 'competent': 9733, 'unusual': 2825, 're-election': 2672, 'brewing': 7045, 'stunning': 9451, 'distributed': 6252, 'percentage': 72, 'compare': 2549, 'laughing': 3407, 'sci': 3311, 'suggested': 2611, 'incompetent': 9769, 'sandinistas': 9108, 'werner': 8877, 'poison': 6210, 'salon': 3963, 'now': 145, 'crest': 8679, 'dairy': 2018, 'lineup': 9597, 'hills': 1264, 'chip': 1157, 'creditor': 1374, 'actor': 2315, 'specialist': 2737, "'s": 119, 'flooded': 3700, 'aba': 3364, ... }
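
The train array printed earlier starts with 0 1 2, which suggests that word IDs are assigned in order of first appearance in the training text. Below is a minimal sketch under that assumption, using a toy two-line corpus instead of the real PTB files. encode_corpus is a hypothetical helper, not a Chainer function, and treating each line as one sentence closed by <eos> is also an assumption.

```python
def encode_corpus(lines):
    """Map each token to an integer ID in order of first appearance.

    Hypothetical helper for illustration only, not part of Chainer.
    """
    vocab = {}  # word -> word ID, same direction as ptb_dict
    ids = []
    for line in lines:
        # Assumption: one sentence per line, closed with an <eos> marker.
        for word in line.split() + ['<eos>']:
            if word not in vocab:
                vocab[word] = len(vocab)  # next unused ID
            ids.append(vocab[word])
    return ids, vocab


toy_lines = ['the cat sat', 'the dog sat']
toy_ids, toy_vocab = encode_corpus(toy_lines)
print(toy_ids)    # [0, 1, 2, 3, 0, 4, 2, 3]
print(toy_vocab)  # {'the': 0, 'cat': 1, 'sat': 2, '<eos>': 3, 'dog': 4}
```

A repeated word such as "the" reuses its existing ID, so the length of the vocabulary equals the number of distinct tokens, including <eos>.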

Convert to word sequences

Check the original sentences by converting word IDs back to words using the PTB dictionary.

Train text

It is same with

ptb_word_id_dict = ptb_dict
ptb_id_word_dict = dict((v,k) for k,v in ptb_word_id_dict.items())

# Convert the first 30 word IDs of train back to words
print([ptb_id_word_dict[i] for i in train[:30]])

['aer', 'banknote', 'berlitz', 'calloway', 'centrust', 'cluett', 'fromstein', 'gitano', 'guterman', 'hydro-quebec', 'ipo', 'kia', 'memotec', 'mlx', 'nahb', 'punts', 'rake', 'regatta', 'rubens', 'sim', 'snack-food', 'ssangyong', 'swapo', 'wachter', '<eos>', 'pierre', '<unk>', 'N', 'years', 'old']

Now you can see that the sequence of word IDs is indeed a list of words that forms meaningful text.

But the list representation is a little difficult for humans to read, so let’s convert it to natural text using the ' '.join() method.

# ' '.join() turns the list of words into a more readable string

' '.join([ptb_id_word_dict[i] for i in train[:300]])

"aer banknote berlitz calloway centrust cluett fromstein gitano guterman hydro-quebec ipo kia memotec mlx nahb punts rake regatta rubens sim snack-food ssangyong swapo wachter <eos> pierre <unk> N years old will join the board as a nonexecutive director nov. N <eos> mr. <unk> is chairman of <unk> n.v. the dutch publishing group <eos> rudolph <unk> N years old and former chairman of consolidated gold fields plc was named a nonexecutive director of this british industrial conglomerate <eos> a form of asbestos once used to make kent cigarette filters has caused a high percentage of cancer deaths among a group of workers exposed to it more than N years ago researchers reported <eos> the asbestos fiber <unk> is unusually <unk> once it enters the <unk> with even brief exposures to it causing symptoms that show up decades later researchers said <eos> <unk> inc. the unit of new york-based <unk> corp. that makes kent cigarettes stopped using <unk> in its <unk> cigarette filters in N <eos> although preliminary findings were reported more than a year ago the latest results appear in today 's new england journal of medicine a forum likely to bring new attention to the problem <eos> a <unk> <unk> said this is an old story <eos> we 're talking about years ago before anyone heard of asbestos having any questionable properties <eos> there is no asbestos in our products now <eos> neither <unk> nor the researchers who studied the workers were aware of any research on smokers of the kent cigarettes <eos> we have no useful information on whether users are at risk said james a. <unk> of boston 's <unk> cancer institute <eos> dr. <unk> led a team of researchers from the national cancer institute and the medical schools of harvard university and boston university <eos> the <unk>"

Validation data text

It is same with

print(' '.join([ptb_id_word_dict[i] for i in val[:300]]))

consumers may want to move their telephones a little closer to the tv set <eos> <unk> <unk> watching abc 's monday night football can now vote during <unk> for the greatest play in N years from among four or five <unk> <unk> <eos> two weeks ago viewers of several nbc <unk> consumer segments started calling a N number for advice on various <unk> issues <eos> and the new syndicated reality show hard copy records viewers ' opinions for possible airing on the next day 's show <eos> interactive telephone technology has taken a new leap in <unk> and television programmers are racing to exploit the possibilities <eos> eventually viewers may grow <unk> with the technology and <unk> the cost <eos> but right now programmers are figuring that viewers who are busy dialing up a range of services may put down their <unk> control <unk> and stay <unk> <eos> we 've been spending a lot of time in los angeles talking to tv production people says mike parks president of call interactive which supplied technology for both abc sports and nbc 's consumer minutes <eos> with the competitiveness of the television market these days everyone is looking for a way to get viewers more excited <eos> one of the leaders behind the expanded use of N numbers is call interactive a joint venture of giants american express co. and american telephone & telegraph co <eos> formed in august the venture <unk> at&t 's newly expanded N service with N <unk> computers in american express 's omaha neb. service center <eos> other long-distance carriers have also begun marketing enhanced N service and special consultants are <unk> up to exploit the new tool <eos> blair entertainment a new york firm that advises tv stations and sells ads for them has just formed a

Test data text

It is same with

print(' '.join([ptb_id_word_dict[i] for i in test[:300]]))

no it was n't black monday <eos> but while the new york stock exchange did n't fall apart friday as the dow jones industrial average plunged N points most of it in the final hour it barely managed to stay this side of chaos <eos> some circuit breakers installed after the october N crash failed their first test traders say unable to cool the selling panic in both stocks and futures <eos> the N stock specialist firms on the big board floor the buyers and sellers of last resort who were criticized after the N crash once again could n't handle the selling pressure <eos> big investment banks refused to step up to the plate to support the beleaguered floor traders by buying big blocks of stock traders say <eos> heavy selling of standard & poor 's 500-stock index futures in chicago <unk> beat stocks downward <eos> seven big board stocks ual amr bankamerica walt disney capital cities\/abc philip morris and pacific telesis group stopped trading and never resumed <eos> the <unk> has already begun <eos> the equity market was <unk> <eos> once again the specialists were not able to handle the imbalances on the floor of the new york stock exchange said christopher <unk> senior vice president at <unk> securities corp <eos> <unk> james <unk> chairman of specialists henderson brothers inc. it is easy to say the specialist is n't doing his job <eos> when the dollar is in a <unk> even central banks ca n't stop it <eos> speculators are calling for a degree of liquidity that is not there in the market <eos> many money managers and some traders had already left their offices early friday afternoon on a warm autumn day because the stock market was so quiet <eos> then in a <unk> plunge the dow
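
All three texts above use <eos> to mark the end of a sentence, so a word list can be cut back into sentences at those markers. Here is a minimal sketch; split_sentences is a hypothetical helper written for this post, and the toy word list is copied from the beginning of the training text shown earlier.

```python
def split_sentences(words, eos='<eos>'):
    """Split a flat list of words into sentences at each <eos> marker.

    Hypothetical helper for illustration only, not part of Chainer.
    """
    sentences = []
    current = []
    for word in words:
        if word == eos:
            sentences.append(current)
            current = []
        else:
            current.append(word)
    if current:
        # Trailing words that were cut off before their closing <eos>
        sentences.append(current)
    return sentences


toy_words = 'pierre <unk> N years old <eos> mr. <unk> is chairman <eos> the'.split()
toy_sentences = split_sentences(toy_words)
for sentence in toy_sentences:
    print(' '.join(sentence))
```

With the real data, the same helper should work on the converted word list, for example split_sentences([ptb_id_word_dict[i] for i in train[:300]]). The last item may be an incomplete sentence because the slice stops at a fixed word count.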
