|
 |
|
 |
|
 |
|
|
|
# Skip-Thought Vectors |
|
|
|
This is a TensorFlow implementation of the model described in: |
|
|
|
Jamie Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel,
Antonio Torralba, Raquel Urtasun, Sanja Fidler.
[Skip-Thought Vectors](https://papers.nips.cc/paper/5950-skip-thought-vectors.pdf).
*In NIPS, 2015.*
|
|
|
|
|
## Contact |
|
***Code author:*** Chris Shallue |
|
|
|
***Pull requests and issues:*** @cshallue |
|
|
|
## Contents |
|
* [Model Overview](#model-overview) |
|
* [Getting Started](#getting-started) |
|
* [Install Required Packages](#install-required-packages) |
|
* [Download Pretrained Models (Optional)](#download-pretrained-models-optional) |
|
* [Training a Model](#training-a-model) |
|
* [Prepare the Training Data](#prepare-the-training-data) |
|
* [Run the Training Script](#run-the-training-script) |
|
* [Track Training Progress](#track-training-progress) |
|
* [Expanding the Vocabulary](#expanding-the-vocabulary) |
|
* [Overview](#overview) |
|
* [Preparation](#preparation) |
|
* [Run the Vocabulary Expansion Script](#run-the-vocabulary-expansion-script) |
|
* [Evaluating a Model](#evaluating-a-model) |
|
* [Overview](#overview-1) |
|
* [Preparation](#preparation-1) |
|
* [Run the Evaluation Tasks](#run-the-evaluation-tasks) |
|
* [Encoding Sentences](#encoding-sentences) |
|
|
|
## Model Overview
|
|
|
The *Skip-Thoughts* model is a sentence encoder. It learns to encode input
sentences into a fixed-dimensional vector representation that is useful for many
tasks, such as detecting paraphrases or classifying whether a product review is
positive or negative. See the
[Skip-Thought Vectors](https://papers.nips.cc/paper/5950-skip-thought-vectors.pdf)
paper for details of the model architecture and more example applications.
|
|
|
A trained *Skip-Thoughts* model will encode similar sentences near each other
in the embedding vector space. The following examples show the nearest neighbor,
by cosine similarity, of some sentences from the
[movie review dataset](https://www.cs.cornell.edu/people/pabo/movie-review-data/).
|
|
|
|
|
| Input sentence | Nearest Neighbor |
|----------------|------------------|
| Simplistic, silly and tedious. | Trite, banal, cliched, mostly inoffensive. |
| Not so much farcical as sour. | Not only unfunny, but downright repellent. |
| A sensitive and astute first feature by Anne-Sophie Birot. | Absorbing character study by André Turpin. |
| An enthralling, entertaining feature. | A slick, engrossing melodrama. |
|
|
|
## Getting Started |
|
|
|
### Install Required Packages |
|
First ensure that you have installed the following required packages: |
|
|
|
* **Bazel** ([instructions](http://bazel.build/docs/install.html)) |
|
* **TensorFlow** ([instructions](https://www.tensorflow.org/install/)) |
|
* **NumPy** ([instructions](http://www.scipy.org/install.html)) |
|
* **scikit-learn** ([instructions](http://scikit-learn.org/stable/install.html)) |
|
* **Natural Language Toolkit (NLTK)** |
|
* First install NLTK ([instructions](http://www.nltk.org/install.html)) |
|
* Then install the NLTK data ([instructions](http://www.nltk.org/data.html)) |
|
* **gensim** ([instructions](https://radimrehurek.com/gensim/install.html)) |
|
* Only required if you will be expanding your vocabulary with the [word2vec](https://code.google.com/archive/p/word2vec/) model. |
|
|
|
|
|
### Download Pretrained Models (Optional) |
|
|
|
You can download model checkpoints pretrained on the
[BookCorpus](http://yknzhu.wixsite.com/mbweb) dataset in the following
configurations:
|
|
|
* Unidirectional RNN encoder ("uni-skip" in the paper) |
|
* Bidirectional RNN encoder ("bi-skip" in the paper) |
|
|
|
```shell |
|
# Directory to download the pretrained models to. |
|
PRETRAINED_MODELS_DIR="${HOME}/skip_thoughts/pretrained/" |
|
|
|
mkdir -p ${PRETRAINED_MODELS_DIR} |
|
cd ${PRETRAINED_MODELS_DIR} |
|
|
|
# Download and extract the unidirectional model. |
|
wget "http://download.tensorflow.org/models/skip_thoughts_uni_2017_02_02.tar.gz" |
|
tar -xvf skip_thoughts_uni_2017_02_02.tar.gz |
|
rm skip_thoughts_uni_2017_02_02.tar.gz |
|
|
|
# Download and extract the bidirectional model. |
|
wget "http://download.tensorflow.org/models/skip_thoughts_bi_2017_02_16.tar.gz" |
|
tar -xvf skip_thoughts_bi_2017_02_16.tar.gz |
|
rm skip_thoughts_bi_2017_02_16.tar.gz |
|
``` |
|
|
|
You can now skip to the sections [Evaluating a Model](#evaluating-a-model) and
[Encoding Sentences](#encoding-sentences).
|
|
|
|
|
## Training a Model |
|
|
|
### Prepare the Training Data |
|
|
|
To train a model you will need to provide training data in TFRecord format. The
TFRecord format consists of a set of sharded files containing serialized
`tf.Example` protocol buffers. Each `tf.Example` proto contains three
sentences:
|
|
|
* `encode`: The sentence to encode. |
|
* `decode_pre`: The sentence preceding `encode` in the original text. |
|
* `decode_post`: The sentence following `encode` in the original text. |
|
|
|
Each sentence is a list of words. During preprocessing, a dictionary is created
that assigns each word in the vocabulary to an integer-valued id. Each sentence
is encoded as a list of integer word ids in the `tf.Example` protos.
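
The sketch below illustrates this layout, assuming each sentence is stored as an
int64 feature list keyed by the names above (the preprocessing script's exact
feature keys and types may differ):

```python
import tensorflow as tf

def _int64_feature(word_ids):
  # One sentence = one list of integer word ids.
  return tf.train.Feature(int64_list=tf.train.Int64List(value=word_ids))

def make_example(encode_ids, decode_pre_ids, decode_post_ids):
  return tf.train.Example(features=tf.train.Features(feature={
      "encode": _int64_feature(encode_ids),
      "decode_pre": _int64_feature(decode_pre_ids),
      "decode_post": _int64_feature(decode_post_ids),
  }))

# Three consecutive sentences, already mapped to (hypothetical) word ids.
example = make_example([12, 7, 4], [3, 25, 9], [18, 2, 30])
print(example)
```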
|
|
|
We have provided a script to preprocess any set of text files into this format.
You may wish to use the [BookCorpus](http://yknzhu.wixsite.com/mbweb) dataset.
Note that the preprocessing script may take **12 hours** or more to complete
on this large dataset.
|
|
|
```shell |
|
# Comma-separated list of globs matching the input files. The format of
|
# the input files is assumed to be a list of newline-separated sentences, where |
|
# each sentence is already tokenized. |
|
INPUT_FILES="${HOME}/skip_thoughts/bookcorpus/*.txt" |
|
|
|
# Location to save the preprocessed training and validation data. |
|
DATA_DIR="${HOME}/skip_thoughts/data" |
|
|
|
# Build the preprocessing script. |
|
cd tensorflow-models/skip_thoughts |
|
bazel build -c opt //skip_thoughts/data:preprocess_dataset |
|
|
|
# Run the preprocessing script. |
|
bazel-bin/skip_thoughts/data/preprocess_dataset \
  --input_files=${INPUT_FILES} \
  --output_dir=${DATA_DIR}
|
``` |
|
|
|
When the script finishes you will find 100 training files and 1 validation file
in `DATA_DIR`. The files will match the patterns `train-?????-of-00100` and
`validation-00000-of-00001`, respectively.
|
|
|
The script will also produce a file named `vocab.txt`. The format of this file
is a list of newline-separated words, where the word id is the corresponding
0-based line index. Words are sorted in descending order of frequency in the
input data. Only the top 20,000 words are assigned unique ids; all other words
are assigned the "unknown id" of 1 in the processed data.
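
As a rough sketch (the path below is a placeholder), the vocabulary can be read
back into a word-to-id mapping like this:

```python
def load_vocab(vocab_file):
  # Word id = 0-based line index, as described above.
  with open(vocab_file) as f:
    return {word.strip(): idx for idx, word in enumerate(f)}

vocab = load_vocab("/path/to/vocab.txt")
unk_id = 1  # Words outside the top 20,000 map to the "unknown id" of 1.
word_ids = [vocab.get(w, unk_id) for w in "the movie was great".split()]
```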
|
|
|
### Run the Training Script |
|
|
|
Execute the following commands to start the training script. By default it will
run for 500k steps (around 9 days on a GeForce GTX 1080 GPU).
|
|
|
```shell |
|
# Directory containing the preprocessed data. |
|
DATA_DIR="${HOME}/skip_thoughts/data" |
|
|
|
# Directory to save the model. |
|
MODEL_DIR="${HOME}/skip_thoughts/model" |
|
|
|
# Build the model. |
|
cd tensorflow-models/skip_thoughts |
|
bazel build -c opt //skip_thoughts/... |
|
|
|
# Run the training script. |
|
bazel-bin/skip_thoughts/train \
  --input_file_pattern="${DATA_DIR}/train-?????-of-00100" \
  --train_dir="${MODEL_DIR}/train"
|
``` |
|
|
|
### Track Training Progress |
|
|
|
Optionally, you can run the `track_perplexity` script in a separate process.
This will log per-word perplexity on the validation set, which allows training
progress to be monitored on
[TensorBoard](https://www.tensorflow.org/get_started/summaries_and_tensorboard).
|
|
|
Note that you may run out of memory if you run this script on the same GPU as
the training script. You can set the environment variable
`CUDA_VISIBLE_DEVICES=""` to force the script to run on CPU. If it runs too
slowly on CPU, you can decrease the value of `--num_eval_examples`.
|
|
|
```shell |
|
DATA_DIR="${HOME}/skip_thoughts/data" |
|
MODEL_DIR="${HOME}/skip_thoughts/model" |
|
|
|
# Ignore GPU devices (only necessary if your GPU is currently memory |
|
# constrained, for example, by running the training script). |
|
export CUDA_VISIBLE_DEVICES="" |
|
|
|
# Run the evaluation script. This will run in a loop, periodically loading the |
|
# latest model checkpoint file and computing evaluation metrics. |
|
bazel-bin/skip_thoughts/track_perplexity \
  --input_file_pattern="${DATA_DIR}/validation-?????-of-00001" \
  --checkpoint_dir="${MODEL_DIR}/train" \
  --eval_dir="${MODEL_DIR}/val" \
  --num_eval_examples=50000
|
``` |
|
|
|
If you started the `track_perplexity` script, run a
[TensorBoard](https://www.tensorflow.org/get_started/summaries_and_tensorboard)
server in a separate process for real-time monitoring of training summaries and
validation perplexity.
|
|
|
```shell |
|
MODEL_DIR="${HOME}/skip_thoughts/model" |
|
|
|
# Run a TensorBoard server. |
|
tensorboard --logdir="${MODEL_DIR}" |
|
``` |
|
|
|
## Expanding the Vocabulary |
|
|
|
### Overview |
|
|
|
The vocabulary generated by the preprocessing script contains only 20,000
words, which is insufficient for many tasks. For example, a sentence from
Wikipedia might contain nouns that do not appear in this vocabulary.
|
|
|
A solution to this problem, described in the
[Skip-Thought Vectors](https://papers.nips.cc/paper/5950-skip-thought-vectors.pdf)
paper, is to learn a mapping that transfers word representations from one model
to another. This idea is based on the "Translation Matrix" method from the paper
[Exploiting Similarities Among Languages for Machine Translation](https://arxiv.org/abs/1309.4168).
|
|
|
|
|
Specifically, we will load the word embeddings from a trained *Skip-Thoughts*
model and from a trained [word2vec model](https://arxiv.org/pdf/1301.3781.pdf)
(which has a much larger vocabulary). We will train a linear regression model
without regularization to learn a linear mapping from the word2vec embedding
space to the *Skip-Thoughts* embedding space. We will then apply the linear
model to all words in the word2vec vocabulary, yielding vectors in the
*Skip-Thoughts* word embedding space for the union of the two vocabularies.
|
|
|
The linear regression task is to learn a parameter matrix *W* to minimize
*|| X - Y \* W ||<sup>2</sup>*, where *X* is a matrix of *Skip-Thoughts*
embeddings of shape `[num_words, dim1]`, *Y* is a matrix of word2vec embeddings
of shape `[num_words, dim2]`, and *W* is a matrix of shape `[dim2, dim1]`.
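
The following is a minimal NumPy sketch of that regression (not the actual
`vocabulary_expansion` implementation; the matrix sizes are illustrative only):

```python
import numpy as np

num_words, dim1, dim2 = 10000, 620, 300  # Illustrative sizes only.
X = np.random.randn(num_words, dim1)     # Skip-Thoughts embeddings of shared words.
Y = np.random.randn(num_words, dim2)     # word2vec embeddings of the same words.

# Ordinary least squares solution for W of shape [dim2, dim1].
W, _, _, _ = np.linalg.lstsq(Y, X, rcond=None)

# Any word2vec vector y (shape [dim2]) can now be mapped into the
# Skip-Thoughts embedding space as y.dot(W).
```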
|
|
|
### Preparation |
|
|
|
First you will need to download and unpack a pretrained
[word2vec model](https://arxiv.org/pdf/1301.3781.pdf) from
[this website](https://code.google.com/archive/p/word2vec/)
([direct download link](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing)).
This model was trained on the Google News dataset (about 100 billion words).
|
|
|
|
|
Also ensure that you have already [installed gensim](https://radimrehurek.com/gensim/install.html). |
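
As an optional sanity check (a sketch; replace the placeholder path with
wherever you unpacked the model), you can load the binary file with gensim and
inspect a vector:

```python
from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format(
    "/path/to/GoogleNews-vectors-negative300.bin", binary=True)
print(w2v["movie"].shape)  # Expect (300,).
```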
|
|
|
### Run the Vocabulary Expansion Script |
|
|
|
```shell |
|
# Path to checkpoint file or a directory containing checkpoint files (the script |
|
# will select the most recent). |
|
CHECKPOINT_PATH="${HOME}/skip_thoughts/model/train" |
|
|
|
# Vocabulary file generated by the preprocessing script. |
|
SKIP_THOUGHTS_VOCAB="${HOME}/skip_thoughts/data/vocab.txt" |
|
|
|
# Path to downloaded word2vec model. |
|
WORD2VEC_MODEL="${HOME}/skip_thoughts/googlenews/GoogleNews-vectors-negative300.bin" |
|
|
|
# Output directory. |
|
EXP_VOCAB_DIR="${HOME}/skip_thoughts/exp_vocab" |
|
|
|
# Build the vocabulary expansion script. |
|
cd tensorflow-models/skip_thoughts |
|
bazel build -c opt //skip_thoughts:vocabulary_expansion |
|
|
|
# Run the vocabulary expansion script. |
|
bazel-bin/skip_thoughts/vocabulary_expansion \
  --skip_thoughts_model=${CHECKPOINT_PATH} \
  --skip_thoughts_vocab=${SKIP_THOUGHTS_VOCAB} \
  --word2vec_model=${WORD2VEC_MODEL} \
  --output_dir=${EXP_VOCAB_DIR}
|
``` |
|
|
|
## Evaluating a Model |
|
|
|
### Overview |
|
|
|
The model can be evaluated using the benchmark tasks described in the
[Skip-Thought Vectors](https://papers.nips.cc/paper/5950-skip-thought-vectors.pdf)
paper. The following tasks are supported (refer to the paper for full details):
|
|
|
* **SICK** semantic relatedness task. |
|
* **MSRP** (Microsoft Research Paraphrase Corpus) paraphrase detection task. |
|
* Binary classification tasks: |
|
* **MR** movie review sentiment task. |
|
* **CR** customer product review task. |
|
* **SUBJ** subjectivity/objectivity task. |
|
* **MPQA** opinion polarity task. |
|
* **TREC** question-type classification task. |
|
|
|
### Preparation |
|
|
|
You will need to clone or download the
[skip-thoughts GitHub repository](https://github.com/ryankiros/skip-thoughts) by
[ryankiros](https://github.com/ryankiros) (the first author of the Skip-Thoughts
paper):
|
|
|
```shell |
|
# Folder to clone the repository to. |
|
ST_KIROS_DIR="${HOME}/skip_thoughts/skipthoughts_kiros" |
|
|
|
# Clone the repository. |
|
git clone [email protected]:ryankiros/skip-thoughts.git "${ST_KIROS_DIR}/skipthoughts" |
|
|
|
# Make the package importable. |
|
export PYTHONPATH="${ST_KIROS_DIR}/:${PYTHONPATH}" |
|
``` |
|
|
|
You will also need to download the data needed for each evaluation task. See
the instructions [here](https://github.com/ryankiros/skip-thoughts).

For example, the CR (customer review) dataset is found
[here](http://nlp.stanford.edu/~sidaw/home/projects:nbsvm). For this task we
want the files `custrev.pos` and `custrev.neg`.
|
|
|
### Run the Evaluation Tasks |
|
|
|
In the following example we will evaluate a unidirectional model ("uni-skip" in
the paper) on the CR task. To use a bidirectional model ("bi-skip" in the
paper), simply pass the flags `--bi_vocab_file`, `--bi_embeddings_file` and
`--bi_checkpoint_path` instead. To use the "combine-skip" model described in
the paper, you will need to pass both the unidirectional and bidirectional
flags.
|
|
|
```shell |
|
# Path to checkpoint file or a directory containing checkpoint files (the script |
|
# will select the most recent). |
|
CHECKPOINT_PATH="${HOME}/skip_thoughts/model/train" |
|
|
|
# Vocabulary file generated by the vocabulary expansion script. |
|
VOCAB_FILE="${HOME}/skip_thoughts/exp_vocab/vocab.txt" |
|
|
|
# Embeddings file generated by the vocabulary expansion script. |
|
EMBEDDINGS_FILE="${HOME}/skip_thoughts/exp_vocab/embeddings.npy" |
|
|
|
# Directory containing files custrev.pos and custrev.neg. |
|
EVAL_DATA_DIR="${HOME}/skip_thoughts/eval_data" |
|
|
|
# Build the evaluation script. |
|
cd tensorflow-models/skip_thoughts |
|
bazel build -c opt //skip_thoughts:evaluate |
|
|
|
# Run the evaluation script. |
|
bazel-bin/skip_thoughts/evaluate \
  --eval_task=CR \
  --data_dir=${EVAL_DATA_DIR} \
  --uni_vocab_file=${VOCAB_FILE} \
  --uni_embeddings_file=${EMBEDDINGS_FILE} \
  --uni_checkpoint_path=${CHECKPOINT_PATH}
|
``` |
|
|
|
Output: |
|
|
|
```python |
|
[0.82539682539682535, 0.84084880636604775, 0.83023872679045096, |
|
0.86206896551724133, 0.83554376657824936, 0.85676392572944293, |
|
0.84084880636604775, 0.83023872679045096, 0.85145888594164454, |
|
0.82758620689655171] |
|
``` |
|
|
|
The output is a list of accuracies of 10 cross-validation classification
models. To get a single number, simply take the average:
|
|
|
```python |
|
ipython  # Launch IPython.
|
|
|
In [0]: |
|
import numpy as np |
|
np.mean([0.82539682539682535, 0.84084880636604775, 0.83023872679045096, |
|
0.86206896551724133, 0.83554376657824936, 0.85676392572944293, |
|
0.84084880636604775, 0.83023872679045096, 0.85145888594164454, |
|
0.82758620689655171]) |
|
|
|
Out [0]: 0.84009936423729525 |
|
``` |
|
|
|
## Encoding Sentences |
|
|
|
In this example we will encode data from the |
|
[movie review dataset](https://www.cs.cornell.edu/people/pabo/movie-review-data/) |
|
(specifically the [sentence polarity dataset v1.0](https://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz)). |
|
|
|
```python |
|
ipython  # Launch IPython.
|
|
|
In [0]: |
|
|
|
# Imports. |
|
from __future__ import absolute_import |
|
from __future__ import division |
|
from __future__ import print_function |
|
import numpy as np |
|
import os.path |
|
import scipy.spatial.distance as sd |
|
from skip_thoughts import configuration |
|
from skip_thoughts import encoder_manager |
|
|
|
In [1]: |
|
# Set paths to the model. |
|
VOCAB_FILE = "/path/to/vocab.txt" |
|
EMBEDDING_MATRIX_FILE = "/path/to/embeddings.npy" |
|
CHECKPOINT_PATH = "/path/to/model.ckpt-9999" |
|
# The following directory should contain files rt-polarity.neg and |
|
# rt-polarity.pos. |
|
MR_DATA_DIR = "/dir/containing/mr/data" |
|
|
|
In [2]: |
|
# Set up the encoder. Here we are using a single unidirectional model. |
|
# To use a bidirectional model as well, call load_model() again with |
|
# configuration.model_config(bidirectional_encoder=True) and paths to the |
|
# bidirectional model's files. The encoder will use the concatenation of |
|
# all loaded models. |
|
encoder = encoder_manager.EncoderManager() |
|
encoder.load_model(configuration.model_config(), |
|
vocabulary_file=VOCAB_FILE, |
|
embedding_matrix_file=EMBEDDING_MATRIX_FILE, |
|
checkpoint_path=CHECKPOINT_PATH) |
|
|
|
In [3]: |
|
# Load the movie review dataset. |
|
data = [] |
|
with open(os.path.join(MR_DATA_DIR, 'rt-polarity.neg'), 'rb') as f: |
|
data.extend([line.decode('latin-1').strip() for line in f]) |
|
with open(os.path.join(MR_DATA_DIR, 'rt-polarity.pos'), 'rb') as f: |
|
data.extend([line.decode('latin-1').strip() for line in f]) |
|
|
|
In [4]: |
|
# Generate Skip-Thought Vectors for each sentence in the dataset. |
|
encodings = encoder.encode(data) |
|
|
|
In [5]: |
|
# Define a helper function to generate nearest neighbors. |
|
def get_nn(ind, num=10): |
|
encoding = encodings[ind] |
|
scores = sd.cdist([encoding], encodings, "cosine")[0] |
|
sorted_ids = np.argsort(scores) |
|
print("Sentence:") |
|
print("", data[ind]) |
|
print("\nNearest neighbors:") |
|
for i in range(1, num + 1): |
|
print(" %d. %s (%.3f)" % |
|
(i, data[sorted_ids[i]], scores[sorted_ids[i]])) |
|
|
|
In [6]: |
|
# Compute nearest neighbors of the first sentence in the dataset. |
|
get_nn(0) |
|
``` |
|
|
|
Output: |
|
|
|
``` |
|
Sentence: |
|
simplistic , silly and tedious . |
|
|
|
Nearest neighbors: |
|
1. trite , banal , cliched , mostly inoffensive . (0.247) |
|
2. banal and predictable . (0.253) |
|
3. witless , pointless , tasteless and idiotic . (0.272) |
|
4. loud , silly , stupid and pointless . (0.295) |
|
5. grating and tedious . (0.299) |
|
6. idiotic and ugly . (0.330) |
|
7. black-and-white and unrealistic . (0.335) |
|
8. hopelessly inane , humorless and under-inspired . (0.335) |
|
9. shallow , noisy and pretentious . (0.340) |
|
10. . . . unlikable , uninteresting , unfunny , and completely , utterly inept . (0.346) |
|
``` |
|
|