## End-to-End Training of Neural Retrievers for Open-Domain Question Answering

Below we present the steps to run unsupervised and supervised training and evaluation of the retriever for [open-domain question answering](https://arxiv.org/abs/2101.00408).

## Retriever Training

#### Unsupervised pretraining

1. Use [`tools/preprocess_data.py`](../../tools/preprocess_data.py) to preprocess the dataset for the Inverse Cloze Task (ICT), which we refer to as unsupervised pretraining. This script takes as input a corpus in loose JSON format and creates fixed-size blocks of text as the fundamental units of data. For a corpus like Wikipedia, this means multiple sentences per block and multiple blocks per document. Pass the `--split-sentences` argument so that sentences are the basic unit when constructing the indexed datasets. We construct two datasets, one with the title of every document and another with the body (a sketch of the expected input format follows the command below).

<pre>
python tools/preprocess_data.py \
    --input /path/to/corpus.json \
    --json-keys text title \
    --split-sentences \
    --tokenizer-type BertWordPieceLowerCase \
    --vocab-file /path/to/vocab.txt \
    --output-prefix corpus_indexed \
    --workers 10
</pre>
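
The input corpus is in loose JSON format, i.e. one JSON object per line, with keys matching the `--json-keys` argument above. Below is a minimal, hypothetical sketch of how such a file could be produced; the two example documents are placeholders, not real data, and the output path mirrors the placeholder in the command.

<pre>
# Minimal sketch: write a corpus in the loose JSON format expected above,
# i.e. one JSON object per line carrying the keys listed in --json-keys.
import json

docs = [
    {"title": "Example document 1", "text": "First sentence of the body. Second sentence of the body."},
    {"title": "Example document 2", "text": "Another document with a few sentences of text."},
]

with open("/path/to/corpus.json", "w") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")
</pre>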

2. The [`examples/pretrain_ict.sh`](../../examples/pretrain_ict.sh) script runs single-GPU ICT training of a 217M-parameter biencoder retriever. Single-GPU training is intended primarily for debugging, as the code is developed for distributed training. The script initializes the biencoder from a pretrained BERT model, and we use a total batch size of 4096 for ICT training.
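
For reference, ICT trains the biencoder with in-batch negatives: a sentence sampled from a block serves as the query and the remainder of the block as the context, and each query is trained to score its own block higher than the other blocks in the batch. The sketch below illustrates this loss in PyTorch; the tensor names and shapes are illustrative and do not come from the actual implementation.

<pre>
# Illustrative sketch of the ICT in-batch softmax loss (not the actual implementation).
# query_emb:   [batch, hidden]  embeddings of the sampled sentences (queries)
# context_emb: [batch, hidden]  embeddings of the remaining text of each block
import torch
import torch.nn.functional as F

def ict_loss(query_emb, context_emb):
    # Similarity of every query against every context in the batch.
    scores = torch.matmul(query_emb, context_emb.t())           # [batch, batch]
    # The i-th query's positive is the i-th context; the rest are negatives.
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)
</pre>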

3. Evaluate the pretrained ICT model on [Google's Natural Questions Open dataset](https://arxiv.org/pdf/1906.00300.pdf) using [`examples/evaluate_retriever_nq.sh`](../../examples/evaluate_retriever_nq.sh).
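
The evaluation reports top-k retrieval accuracy, i.e. the fraction of questions for which an answer-bearing evidence block appears among the k highest-scoring blocks. Below is a hedged sketch of that metric, assuming precomputed question and evidence embeddings; in the real evaluation a retrieved block counts as a hit if it contains the answer string, whereas here a single gold block index stands in for that check, and all names are placeholders.

<pre>
# Illustrative sketch of top-k retrieval accuracy (names and shapes are assumptions).
# question_emb: [num_questions, hidden], evidence_emb: [num_blocks, hidden]
# gold_idx:     [num_questions] index of an answer-bearing block per question
import torch

def top_k_accuracy(question_emb, evidence_emb, gold_idx, k=20):
    scores = torch.matmul(question_emb, evidence_emb.t())       # [num_questions, num_blocks]
    topk = scores.topk(k, dim=1).indices                        # [num_questions, k]
    hits = (topk == gold_idx.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()
</pre>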

#### Supervised finetuning

1. Use the above pretrained ICT model for finetuning on [Google's Natural Questions Open dataset](https://github.com/google-research/language/tree/master/language/orqa). The script [`examples/finetune_retriever_distributed.sh`](../../examples/finetune_retriever_distributed.sh) provides an example of how to perform this training. Our finetuning process includes retriever score scaling and longer training (80 epochs) on top of [DPR training](https://arxiv.org/abs/2004.04906).
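
Retriever score scaling divides the question-context similarity score by the square root of the hidden size before the softmax, analogous to scaled dot-product attention, so that the logits do not grow with the embedding dimension. A minimal sketch of how such a scaled loss could look is shown below; it differs from the ICT sketch above only in the scaling step, and the names are again illustrative rather than taken from the actual training script.

<pre>
# Illustrative sketch of the finetuning loss with retriever score scaling
# (assumes scaling by sqrt(hidden size); not the actual implementation).
import math
import torch
import torch.nn.functional as F

def scaled_retriever_loss(question_emb, context_emb):
    hidden_size = question_emb.size(-1)
    scores = torch.matmul(question_emb, context_emb.t())        # [batch, batch]
    scores = scores / math.sqrt(hidden_size)                    # retriever score scaling
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)
</pre>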

2. Evaluate the finetuned model using the same evaluation script as used above for the unsupervised model.

More details on the retriever are available in [our paper](https://arxiv.org/abs/2101.00408).

## Reader Training

The reader component will be available soon.