|
# Multi-doc News Headline Generation Model: NHNet |
|
|
|
This repository contains a TensorFlow 2.x implementation of NHNet [[1]](#1), as
well as instructions for producing the dataset described in the paper.
|
|
|
## Introduction |
|
|
|
NHNet is a multi-doc news headline generation model. It extends a standard
Transformer-based encoder-decoder model to the multi-doc setting and relies on
an article-level attention layer to capture information common to most (if not
all) input news articles in a news cluster or story, and to provide robustness
against potential outliers in the input caused by imperfect clustering quality.
|
|
|
Our academic paper [[1]](#1), which describes NHNet in detail, can be found
here: https://arxiv.org/abs/2001.09386.
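
To make the idea concrete, below is a minimal, hypothetical sketch of
article-level attention: each article in a cluster is scored and
softmax-normalized so that outliers receive low weight before the decoder
attends to the concatenated encoder memory. The shapes and the `Dense` scorer
are illustrative assumptions, not the actual NHNet implementation.

```python
import tensorflow as tf

# Illustrative sketch only (not the actual NHNet code). Shapes are assumptions:
# per-article encoder outputs of shape [batch, num_articles, seq_len, hidden].
batch, num_articles, seq_len, hidden = 2, 5, 200, 768
encoder_outputs = tf.random.normal([batch, num_articles, seq_len, hidden])

# Score each article from a pooled representation, then normalize across articles.
pooled = tf.reduce_mean(encoder_outputs, axis=2)   # [batch, num_articles, hidden]
scores = tf.keras.layers.Dense(1)(pooled)          # [batch, num_articles, 1]
article_weights = tf.nn.softmax(scores, axis=1)    # attention over articles

# Down-weight outlier articles before the decoder attends to the joint memory.
weighted = encoder_outputs * article_weights[..., tf.newaxis]
memory = tf.reshape(weighted, [batch, num_articles * seq_len, hidden])
print(memory.shape)  # (2, 1000, 768)
```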
|
|
|
## Dataset |
|
|
|
**Raw Data:** One can [download](https://github.com/google-research-datasets/NewSHead)
our multi-doc headline dataset, which contains 369,940 news stories and 932,571
unique URLs. We split these stories into train (359,940 stories), validation
(5,000 stories), and test (5,000 stories) sets by timestamp.
|
|
|
For more information, please check out
https://github.com/google-research-datasets/NewSHead
|
|
|
### Crawling |
|
|
|
Unfortunately, we are not able to release the pre-processed dataset exactly as
used in the paper. Users need to crawl the URLs themselves, and the recommended
pre-processing is to use an open-source library to download and parse the news
content, including the title and leading paragraphs. To ease this process, we
provide a config for [news-please](https://github.com/fhamborg/news-please)
that will crawl and extract news articles on a local machine.
|
|
|
First, install the `news-please` CLI (requires Python 3.x):
|
```shell
$ pip3 install news-please
```
|
|
|
Next, run the crawler with our provided [config and URL list](https://github.com/google-research-datasets/NewSHead/releases):
|
|
|
```shell
# Set to the path of the downloaded data folder.
$ DATA_FOLDER=/path/to/downloaded_dataset

# Use the CLI to crawl. We assume the news_please subfolder contains the
# decompressed config.cfg and sitelist.hjson.
$ news-please -c $DATA_FOLDER/news_please
```
|
By default, it will store crawled articles under `/tmp/nhnet/`. To terminate
the process, press `CTRL+C`.
|
|
|
Crawling may take a few days (48 hours in our test), depending on the network
environment and the number of threads set in the config. As the crawling tool
won't stop automatically, it is not straightforward to check the progress. We
suggest terminating the job once no new articles have been crawled for a short
period of time (e.g., 10 minutes), which you can check by running:
|
```shell
$ find /tmp/nhnet -type f | wc -l
```
|
Please note that some URLs are expected to become unavailable on the web as
time goes by.
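
To spot-check the crawl output, the sketch below peeks at one crawled article.
It assumes the provided config writes per-article JSON files under
`/tmp/nhnet/`; the exact directory layout and field names depend on the
news-please version and config.

```python
import glob
import json

# Sketch: inspect one crawled article. The JSON layout and field names are
# assumptions based on news-please's local storage; adjust the glob if needed.
files = glob.glob("/tmp/nhnet/**/*.json", recursive=True)
if files:
    with open(files[0], encoding="utf-8") as f:
        article = json.load(f)
    print(article.get("title"))
    print((article.get("maintext") or "")[:200])
else:
    print("No crawled articles found yet.")
```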
|
|
|
### Data Processing |
|
|
|
Given the crawled articles under `/tmp/nhnet/`, we would like to transform
these textual articles into a set of `TFRecord` files containing serialized
tensorflow.Example protocol buffers, with feature keys following the BERT
[[2]](#2) convention but extended to multiple text segments. We will later use
these processed TFRecords for training and evaluation.
|
|
|
To do this, please first download a [BERT pretrained checkpoint](https://github.com/tensorflow/models/tree/master/official/nlp/bert#access-to-pretrained-checkpoints)
(`BERT-Base, Uncased` is preferred for efficiency) and decompress the `tar.gz`
file. We need the vocabulary file now and will use the checkpoint later for
NHNet initialization.
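
As an optional sanity check, the sketch below confirms that the decompressed
checkpoint contains the vocabulary file used in the following steps; the path
is a placeholder for wherever you extracted the `tar.gz` file.

```python
# Sketch: verify the vocabulary file exists and count its word pieces.
vocab_path = "/path/to/bert_checkpoint/vocab.txt"  # placeholder path
with open(vocab_path, encoding="utf-8") as f:
    vocab = [line.rstrip("\n") for line in f]
print(f"{len(vocab)} word pieces")  # BERT-Base, Uncased ships ~30,522 pieces.
```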
|
|
|
Next, run the following data preprocessing script, which may take a few hours
to read the files and tokenize the article content.
|
|
|
|
|
```shell
# Recall that we use DATA_FOLDER=/path/to/downloaded_dataset.
$ python3 raw_data_preprocess.py \
    -crawled_articles=/tmp/nhnet \
    -vocab=/path/to/bert_checkpoint/vocab.txt \
    -do_lower_case=True \
    -len_title=15 \
    -len_passage=200 \
    -max_num_articles=5 \
    -data_folder=$DATA_FOLDER
```
|
|
|
This Python script will export the processed train/valid/eval files under
`$DATA_FOLDER/processed/`.
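
If you want to verify the output before training, the sketch below prints the
feature keys of the first serialized tensorflow.Example in a training shard.
The glob pattern mirrors the `--train_file_pattern` flag used in the training
commands below; the shard names themselves are whatever the script produced.

```python
import glob

import tensorflow as tf

# Sketch: list the feature keys of one processed training example.
data_folder = "/path/to/downloaded_dataset"  # same as $DATA_FOLDER
shards = sorted(glob.glob(f"{data_folder}/processed/train.tfrecord*"))
for raw in tf.data.TFRecordDataset(shards[:1]).take(1):
    example = tf.train.Example.FromString(raw.numpy())
    print(sorted(example.features.feature.keys()))
```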
|
|
|
## Training |
|
|
|
Please first install TensorFlow 2 and the TensorFlow Model Garden following the
[requirements section](https://github.com/tensorflow/models/tree/master/official#requirements).
|
|
|
### CPU/GPU |
|
```shell
$ python3 trainer.py \
    --mode=train_and_eval \
    --vocab=/path/to/bert_checkpoint/vocab.txt \
    --init_checkpoint=/path/to/bert_checkpoint/bert_model.ckpt \
    --params_override='init_from_bert2bert=false' \
    --train_file_pattern=$DATA_FOLDER/processed/train.tfrecord* \
    --model_dir=/path/to/output/model \
    --len_title=15 \
    --len_passage=200 \
    --max_num_articles=5 \
    --model_type=nhnet \
    --train_batch_size=16 \
    --train_steps=10000 \
    --steps_per_loop=1 \
    --checkpoint_interval=100
```
|
|
|
### TPU |
|
```shell
$ python3 trainer.py \
    --mode=train_and_eval \
    --vocab=/path/to/bert_checkpoint/vocab.txt \
    --init_checkpoint=/path/to/bert_checkpoint/bert_model.ckpt \
    --params_override='init_from_bert2bert=false' \
    --train_file_pattern=$DATA_FOLDER/processed/train.tfrecord* \
    --model_dir=/path/to/output/model \
    --len_title=15 \
    --len_passage=200 \
    --max_num_articles=5 \
    --model_type=nhnet \
    --train_batch_size=1024 \
    --train_steps=10000 \
    --steps_per_loop=1000 \
    --checkpoint_interval=1000 \
    --distribution_strategy=tpu \
    --tpu=grpc://${TPU_IP_ADDRESS}:8470
```
|
In the paper, we train for more than 10k steps with a batch size of 1024 on a
TPU v3-64.
|
|
|
Note that `trainer.py` also supports a `train`-only mode and a continuous
`eval` mode. For large-scale TPU training, we recommend having one process run
the `train` mode and another process run the continuous `eval` mode (which can
run on GPUs). This is the setting we commonly use for large-scale experiments,
because `eval` is then non-blocking to the expensive training load.
|
|
|
### Metrics |
|
**Note: the metrics reported by `evaluation.py` are approximated at the
word-piece level rather than on the real string tokens. Some metrics, such as
BLEU scores, can be off.**
|
|
|
We will release a Colab notebook to evaluate results at the string level soon.
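
In the meantime, string-level metrics can be approximated with an open-source
scorer. The snippet below uses the `rouge-score` package as an example; it is
not the official evaluation used in the paper.

```python
# Sketch: string-level ROUGE with the open-source rouge-score package
# (pip install rouge-score). Not the paper's official scorer.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
reference = "giant panda cub born at national zoo"
prediction = "national zoo announces birth of giant panda cub"
for name, score in scorer.score(reference, prediction).items():
    print(name, round(score.fmeasure, 4))
```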
|
|
|
## References |
|
|
|
<a id="1">[1]</a> Xiaotao Gu, Yuning Mao, Jiawei Han, Jialu Liu, You Wu, Cong |
|
Yu, Daniel Finnie, Hongkun Yu, Jiaqi Zhai and Nicholas Zukoski "Generating |
|
Representative Headlines for News Stories": https://arxiv.org/abs/2001.09386. |
|
World Wide Web Conf. (WWW’2020). |
|
|
|
<a id="2">[2]</a> Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina |
|
Toutanova "BERT: Pre-training of Deep Bidirectional Transformers for Language |
|
Understanding": https://arxiv.org/abs/1810.04805. |
|
|