<font size=4><b>Language Model on One Billion Word Benchmark</b></font>

<b>Authors:</b>

Oriol Vinyals ([email protected], github: OriolVinyals),
Xin Pan

<b>Paper Authors:</b>

Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, Yonghui Wu

<b>TL;DR</b>

This is a model pretrained on the One Billion Word Benchmark.
If you use this model in your publication, please cite the original paper:

@article{jozefowicz2016exploring,
  title={Exploring the Limits of Language Modeling},
  author={Jozefowicz, Rafal and Vinyals, Oriol and Schuster, Mike
          and Shazeer, Noam and Wu, Yonghui},
  journal={arXiv preprint arXiv:1602.02410},
  year={2016}
}

<b>Introduction</b>

In this release, we open source a model trained on the One Billion Word
Benchmark (http://arxiv.org/abs/1312.3005), a large English-language corpus
released in 2013. The dataset contains about one billion words and has a
vocabulary size of about 800K words. It consists mostly of news data. Since
sentences in the training set are shuffled, models can ignore the context and
focus on sentence-level language modeling.

Since the original release and in subsequent work, models have been trained on
this dataset and evaluated against its fixed test set, making it a standard
benchmark for language modeling. Recently, we wrote an article
(http://arxiv.org/abs/1602.02410) describing a model hybrid between character
CNN, a large and deep LSTM, and a specific Softmax architecture, which allowed
us to train the best model on this dataset thus far, almost halving the best
perplexity previously obtained by others.

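Throughout this page, model quality is reported as test perplexity: the
exponential of the average negative log-probability the model assigns to each
word of the held-out set, so lower is better. As a quick illustration (this
snippet is not part of the release), perplexity can be computed from per-word
log-probabilities as follows:

```python
import numpy as np

def perplexity(log_probs):
  """Perplexity from per-word natural-log probabilities: exp(-mean(log p))."""
  log_probs = np.asarray(log_probs, dtype=np.float64)
  return float(np.exp(-np.mean(log_probs)))

# Toy example: a model that assigns probability 0.1 to every word has
# perplexity 10.
print(perplexity(np.log([0.1, 0.1, 0.1])))  # -> ~10.0
```
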
<b>Code Release</b>

The open-sourced components include:

* TensorFlow GraphDef proto buffer text file.
* TensorFlow pre-trained checkpoint shards.
* Code used to evaluate the pre-trained model.
* Vocabulary file.
* Test set from LM-1B evaluation.

The code supports 4 evaluation modes:

* Given a provided dataset, calculate the model's perplexity.
* Given a prefix sentence, predict the next words.
* Dump the softmax embedding and the character-level CNN word embeddings.
* Given a sentence, dump the embedding from the LSTM state.

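Because the release is just a GraphDef text proto plus checkpoint shards, it
can also be loaded outside of the Bazel targets shown below. The following
sketch uses the TensorFlow 1.x-style API; the restore op and tensor names are
assumptions on our part, so consult lm_1b_eval.py for the exact loading code
used by this release.

```python
# Illustrative only: import the GraphDef text proto and restore the checkpoint
# shards. Op/tensor names ('save/restore_all', 'save/Const:0') are assumptions.
import tensorflow as tf
from google.protobuf import text_format

GRAPH_PBTXT = 'data/graph-2016-09-10.pbtxt'
CKPT_PATTERN = 'data/ckpt-*'

graph_def = tf.compat.v1.GraphDef()
with open(GRAPH_PBTXT) as f:
  text_format.Merge(f.read(), graph_def)

with tf.Graph().as_default():
  tf.compat.v1.import_graph_def(graph_def, name='')
  sess = tf.compat.v1.Session()
  # The released graph bundles its own Saver ops; restoring is done by feeding
  # the checkpoint path to the restore op baked into the graph.
  sess.run('save/restore_all', {'save/Const:0': CKPT_PATTERN})
```
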
<b>Results</b>

Model | Test Perplexity | Number of Params [billions]
------|-----------------|----------------------------
Sigmoid-RNN-2048 [Blackout] | 68.3 | 4.1
Interpolated KN 5-gram, 1.1B n-grams [chelba2013one] | 67.6 | 1.76
Sparse Non-Negative Matrix LM [shazeer2015sparse] | 52.9 | 33
RNN-1024 + MaxEnt 9-gram features [chelba2013one] | 51.3 | 20
LSTM-512-512 | 54.1 | 0.82
LSTM-1024-512 | 48.2 | 0.82
LSTM-2048-512 | 43.7 | 0.83
LSTM-8192-2048 (No Dropout) | 37.9 | 3.3
LSTM-8192-2048 (50% Dropout) | 32.2 | 3.3
2-Layer LSTM-8192-1024 (BIG LSTM) | 30.6 | 1.8
(THIS RELEASE) BIG LSTM+CNN Inputs | <b>30.0</b> | <b>1.04</b>

<b>How To Run</b>

Prerequisites:

* Install TensorFlow.
* Install Bazel.
* Download the data files (a scripted alternative is sketched right after this
  list):
  * Model GraphDef file:
    [link](http://download.tensorflow.org/models/LM_LSTM_CNN/graph-2016-09-10.pbtxt)
  * Model Checkpoint sharded file:
    [1](http://download.tensorflow.org/models/LM_LSTM_CNN/all_shards-2016-09-10/ckpt-base)
    [2](http://download.tensorflow.org/models/LM_LSTM_CNN/all_shards-2016-09-10/ckpt-char-embedding)
    [3](http://download.tensorflow.org/models/LM_LSTM_CNN/all_shards-2016-09-10/ckpt-lstm)
    [4](http://download.tensorflow.org/models/LM_LSTM_CNN/all_shards-2016-09-10/ckpt-softmax0)
    [5](http://download.tensorflow.org/models/LM_LSTM_CNN/all_shards-2016-09-10/ckpt-softmax1)
    [6](http://download.tensorflow.org/models/LM_LSTM_CNN/all_shards-2016-09-10/ckpt-softmax2)
    [7](http://download.tensorflow.org/models/LM_LSTM_CNN/all_shards-2016-09-10/ckpt-softmax3)
    [8](http://download.tensorflow.org/models/LM_LSTM_CNN/all_shards-2016-09-10/ckpt-softmax4)
    [9](http://download.tensorflow.org/models/LM_LSTM_CNN/all_shards-2016-09-10/ckpt-softmax5)
    [10](http://download.tensorflow.org/models/LM_LSTM_CNN/all_shards-2016-09-10/ckpt-softmax6)
    [11](http://download.tensorflow.org/models/LM_LSTM_CNN/all_shards-2016-09-10/ckpt-softmax7)
    [12](http://download.tensorflow.org/models/LM_LSTM_CNN/all_shards-2016-09-10/ckpt-softmax8)
  * Vocabulary file:
    [link](http://download.tensorflow.org/models/LM_LSTM_CNN/vocab-2016-09-10.txt)
  * Test dataset:
    [link](http://download.tensorflow.org/models/LM_LSTM_CNN/test/news.en.heldout-00000-of-00050)
* It is recommended to run on a modern desktop instead of a laptop.

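If you prefer to script the download step, the helper below fetches the files
listed above into a data/ directory using only the Python standard library. It
is a convenience sketch and not part of this release.

```python
# Convenience sketch (not part of the release): download the released files
# listed above into ./data using the Python 3 standard library.
import os
import urllib.request

BASE = 'http://download.tensorflow.org/models/LM_LSTM_CNN'
FILES = [
    'graph-2016-09-10.pbtxt',
    'vocab-2016-09-10.txt',
    'test/news.en.heldout-00000-of-00050',
    'all_shards-2016-09-10/ckpt-base',
    'all_shards-2016-09-10/ckpt-char-embedding',
    'all_shards-2016-09-10/ckpt-lstm',
] + ['all_shards-2016-09-10/ckpt-softmax%d' % i for i in range(9)]

os.makedirs('data', exist_ok=True)
for path in FILES:
    target = os.path.join('data', os.path.basename(path))
    print('Downloading %s -> %s' % (path, target))
    urllib.request.urlretrieve('%s/%s' % (BASE, path), target)
```
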
```shell
# 1. Clone the code to your workspace.
# 2. Download the data to your workspace.
# 3. Create an empty WORKSPACE file in your workspace.
# 4. Create an empty output directory in your workspace.
# Example directory structure below:
$ ls -R
.:
data  lm_1b  output  WORKSPACE

./data:
ckpt-base            ckpt-lstm      ckpt-softmax1  ckpt-softmax3  ckpt-softmax5
ckpt-softmax7        graph-2016-09-10.pbtxt        vocab-2016-09-10.txt
ckpt-char-embedding  ckpt-softmax0  ckpt-softmax2  ckpt-softmax4  ckpt-softmax6
ckpt-softmax8        news.en.heldout-00000-of-00050

./lm_1b:
BUILD  data_utils.py  lm_1b_eval.py  README.md

./output:

# Build the code.
$ bazel build -c opt lm_1b/...

# Run sample mode:
$ bazel-bin/lm_1b/lm_1b_eval --mode sample \
    --prefix "I love that I" \
    --pbtxt data/graph-2016-09-10.pbtxt \
    --vocab_file data/vocab-2016-09-10.txt \
    --ckpt 'data/ckpt-*'
...(omitted some TensorFlow output)
I love
I love that
I love that I
I love that I find
I love that I find that
I love that I find that amazing
...(omitted)

# Run eval mode:
$ bazel-bin/lm_1b/lm_1b_eval --mode eval \
    --pbtxt data/graph-2016-09-10.pbtxt \
    --vocab_file data/vocab-2016-09-10.txt \
    --input_data data/news.en.heldout-00000-of-00050 \
    --ckpt 'data/ckpt-*'
...(omitted some TensorFlow output)
Loaded step 14108582.
# Perplexity is high initially because words without context are harder to
# predict.
Eval Step: 0, Average Perplexity: 2045.512297.
Eval Step: 1, Average Perplexity: 229.478699.
Eval Step: 2, Average Perplexity: 208.116787.
Eval Step: 3, Average Perplexity: 338.870601.
Eval Step: 4, Average Perplexity: 228.950107.
Eval Step: 5, Average Perplexity: 197.685857.
Eval Step: 6, Average Perplexity: 156.287063.
Eval Step: 7, Average Perplexity: 124.866189.
Eval Step: 8, Average Perplexity: 147.204975.
Eval Step: 9, Average Perplexity: 90.124864.
Eval Step: 10, Average Perplexity: 59.897914.
Eval Step: 11, Average Perplexity: 42.591137.
...(omitted)
Eval Step: 4529, Average Perplexity: 29.243668.
Eval Step: 4530, Average Perplexity: 29.302362.
Eval Step: 4531, Average Perplexity: 29.285674.
...(omitted. At convergence, it should be around 30.)

# Run dump_emb mode:
$ bazel-bin/lm_1b/lm_1b_eval --mode dump_emb \
    --pbtxt data/graph-2016-09-10.pbtxt \
    --vocab_file data/vocab-2016-09-10.txt \
    --ckpt 'data/ckpt-*' \
    --save_dir output
...(omitted some TensorFlow output)
Finished softmax weights
Finished word embedding 0/793471
Finished word embedding 1/793471
Finished word embedding 2/793471
...(omitted)
$ ls output/
embeddings_softmax.npy ...

# Run dump_lstm_emb mode:
$ bazel-bin/lm_1b/lm_1b_eval --mode dump_lstm_emb \
    --pbtxt data/graph-2016-09-10.pbtxt \
    --vocab_file data/vocab-2016-09-10.txt \
    --ckpt 'data/ckpt-*' \
    --sentence "I love who I am ." \
    --save_dir output
$ ls output/
lstm_emb_step_0.npy  lstm_emb_step_2.npy  lstm_emb_step_4.npy
lstm_emb_step_6.npy  lstm_emb_step_1.npy  lstm_emb_step_3.npy
lstm_emb_step_5.npy
```
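
The arrays written by dump_emb and dump_lstm_emb are plain NumPy files, so they
can be inspected directly. The snippet below is only an illustration; the array
shapes and the cosine-similarity example are assumptions, not documented
behavior of this release.

```python
# Illustrative post-processing of the dumped .npy files with NumPy.
import numpy as np

# Softmax (output) word embeddings written by --mode dump_emb.
softmax_emb = np.load('output/embeddings_softmax.npy')
print(softmax_emb.shape)  # (vocabulary size, embedding dimension) -- assumed

# Per-timestep LSTM state embeddings written by --mode dump_lstm_emb for the
# sentence "I love who I am ." (one file per step, as listed above).
steps = [np.load('output/lstm_emb_step_%d.npy' % i) for i in range(7)]

# Example: cosine similarity between the first and last LSTM states.
a, b = steps[0].ravel(), steps[-1].ravel()
cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print('cosine similarity between first and last step:', cosine)
```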
|