|
# Cross-View Training |
|
|
|
This repository contains code for [*Semi-Supervised Sequence Modeling with Cross-View Training*](https://arxiv.org/abs/1809.08370) (Clark et al., EMNLP 2018). Sequence tagging and dependency parsing tasks are currently supported.
|
|
|
## Requirements |
|
* [TensorFlow](https://www.tensorflow.org/)

* [NumPy](http://www.numpy.org/)
|
|
|
This code has been run with TensorFlow 1.10.1 and NumPy 1.14.5; other versions may work, but have not been tested.
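If needed, the tested versions can be installed with pip, e.g., `pip install tensorflow==1.10.1 numpy==1.14.5` (substitute `tensorflow-gpu==1.10.1` for GPU support).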
|
|
|
## Fetching and Preprocessing Data |
|
Run `fetch_data.sh` to download and extract pretrained [GloVe](https://nlp.stanford.edu/projects/glove/) vectors, the [1 Billion Word Language Model Benchmark](http://www.statmt.org/lm-benchmark/) corpus of unlabeled data, and the CoNLL-2000 [text chunking](https://www.clips.uantwerpen.be/conll2000/chunking/) dataset. Unfortunately the other datasets from our paper are not freely available and so can't be included in this repository. |
|
|
|
To apply CVT to other datasets, the data should be placed in `data/raw_data/<task_name>/(train|dev|test).txt`. For sequence tagging data, each line should contain a word followed by a space followed by that word's tag. Sentences should be separated by empty lines. For dependency parsing, each tag should be of the form `<index_of_head>-<relation>` (e.g., `0-root`); see the examples below.
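For example, a sequence tagging `train.txt` for chunking might look like this (an illustrative sentence, using CoNLL-2000-style BIO tags):

```
He B-NP
reckons B-VP
the B-NP
deficit I-NP
will B-VP
narrow I-VP
. O
```

For dependency parsing, head indices appear to be 1-based, with `0` reserved for the root; the relation names below are illustrative:

```
The 2-det
cat 3-nsubj
sleeps 0-root
. 3-punct
```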
|
|
|
After all of the data has been downloaded, run `python preprocessing.py`.
|
|
|
## Training a Model |
|
Run `python cvt.py --mode=train --model_name=chunking_model`. By default this trains a model on the chunking data downloaded with `fetch_data.sh`. To change which task(s) are trained on or to adjust model hyperparameters, modify [base/configure.py](base/configure.py). Models are automatically checkpointed every 1000 steps; if training is interrupted and restarted, it will continue from the latest checkpoint. Model checkpoints and other data, such as dev set accuracy over time, are stored in `data/models/<model_name>`.
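The checkpoint-and-resume behavior is handled inside `cvt.py`; the snippet below is only a minimal sketch of the underlying TF 1.x pattern (with hypothetical variable names), not the repository's actual training loop:

```python
import os
import tensorflow as tf

# Sketch of TF 1.x checkpoint resumption: restore the latest checkpoint if
# one exists, otherwise initialize from scratch; save every 1000 steps.
checkpoint_dir = "data/models/chunking_model"  # data/models/<model_name>
os.makedirs(checkpoint_dir, exist_ok=True)

global_step = tf.Variable(0, name="global_step", trainable=False)
increment_step = tf.assign_add(global_step, 1)
saver = tf.train.Saver()

with tf.Session() as sess:
    latest = tf.train.latest_checkpoint(checkpoint_dir)
    if latest is not None:
        saver.restore(sess, latest)  # pick up where the last run left off
    else:
        sess.run(tf.global_variables_initializer())
    for _ in range(5000):
        step = sess.run(increment_step)  # stands in for a real training step
        if step % 1000 == 0:
            saver.save(sess, os.path.join(checkpoint_dir, "model"),
                       global_step=step)
```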
|
|
|
## Evaluating a Model |
|
Run `python cvt.py --mode=eval --model_name=chunking_model`. A CVT model trained on the chunking data for 200k steps should get at least 97.1 F1 on the dev set and 96.6 F1 on the test set. |
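For reference, chunking F1 is the standard CoNLL span-level metric (exact match on chunk spans). The sketch below is an independent illustration of that metric, not the evaluation code used by `cvt.py`:

```python
# Sketch of CoNLL-style span-level F1 for BIO-tagged chunks.
def extract_spans(tags):
    """Return the set of (start, end, type) chunks in a BIO tag sequence."""
    spans, start, chunk_type = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # "O" sentinel flushes final chunk
        continuing = tag.startswith("I-") and tag[2:] == chunk_type
        if start is not None and not continuing:
            spans.add((start, i, chunk_type))
            start, chunk_type = None, None
        if tag.startswith("B-"):
            start, chunk_type = i, tag[2:]
    return spans

def span_f1(gold_tags, pred_tags):
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

print(span_f1(["B-NP", "I-NP", "B-VP", "O"],
              ["B-NP", "I-NP", "O", "O"]))  # 2/3: one of two gold chunks found
```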
|
|
|
## Citation |
|
If you use this code in your publication, please cite the original paper:
|
```
@inproceedings{clark2018semi,
  title = {Semi-Supervised Sequence Modeling with Cross-View Training},
  author = {Kevin Clark and Minh-Thang Luong and Christopher D. Manning and Quoc V. Le},
  booktitle = {EMNLP},
  year = {2018}
}
```
|
|
|
## Contact |
|
* [Kevin Clark](https://cs.stanford.edu/~kevclark/) (@clarkkev). |
|
* [Thang Luong](https://nlp.stanford.edu/~lmthang/) (@lmthang). |
|
|
|
|