# Finetuning RoBERTa on a custom classification task

This example shows how to finetune RoBERTa on the IMDB dataset, but should illustrate the process for most classification tasks.

### 1) Get the data

```bash
wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
tar zxvf aclImdb_v1.tar.gz
```
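
As an optional sanity check, the archive should extract to `aclImdb/{train,test}/{pos,neg}`, with 12,500 single-review text files per class in each split. A minimal sketch to confirm this, assuming the default extraction path used above:
```python
# Optional sanity check of the extracted IMDB layout (assumes ./aclImdb).
import os
from glob import glob

for split in ['train', 'test']:
    for class_label in ['pos', 'neg']:
        n_files = len(glob(os.path.join('aclImdb', split, class_label, '*.txt')))
        print(split, class_label, n_files)  # expect 12500 files per class per split
```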

### 2) Format data

The `IMDB` data has one sample per file. The Python snippet below gathers the samples into a single input file and a single label file per split (the IMDB test split is used as the dev set), which is easier to process downstream.
```python
import argparse
import os
import random
from glob import glob

random.seed(0)

def main(args):
    for split in ['train', 'test']:
        samples = []
        for class_label in ['pos', 'neg']:
            fnames = glob(os.path.join(args.datadir, split, class_label) + '/*.txt')
            for fname in fnames:
                with open(fname) as fin:
                    line = fin.readline()
                    # label 1 = positive review, 0 = negative review
                    samples.append((line, 1 if class_label == 'pos' else 0))
        random.shuffle(samples)
        # the IMDB "test" split becomes our dev set
        out_fname = 'train' if split == 'train' else 'dev'
        f1 = open(os.path.join(args.datadir, out_fname + '.input0'), 'w')
        f2 = open(os.path.join(args.datadir, out_fname + '.label'), 'w')
        for sample in samples:
            f1.write(sample[0] + '\n')
            f2.write(str(sample[1]) + '\n')
        f1.close()
        f2.close()

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--datadir', default='aclImdb')
    args = parser.parse_args()
    main(args)
```
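
If the script ran correctly, each split now has an `.input0` file and a `.label` file with one sample per line and matching line counts (25,000 each). A quick, optional check, assuming the default `--datadir`:
```python
# Optional check: inputs and labels should align, one sample per line.
for split in ['train', 'dev']:
    with open('aclImdb/%s.input0' % split) as f:
        n_inputs = sum(1 for _ in f)
    with open('aclImdb/%s.label' % split) as f:
        labels = [line.strip() for line in f]
    print(split, n_inputs, len(labels))  # expect 25000 and 25000
    assert n_inputs == len(labels)
    assert set(labels) <= {'0', '1'}     # binary labels only
```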

### 3) BPE encode

Run `multiprocessing_bpe_encoder`. You could also BPE-encode each sample in the previous step, but that is likely to be slower.
```bash
# Download encoder.json and vocab.bpe
wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json'
wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe'

for SPLIT in train dev; do
    python -m examples.roberta.multiprocessing_bpe_encoder \
        --encoder-json encoder.json \
        --vocab-bpe vocab.bpe \
        --inputs "aclImdb/$SPLIT.input0" \
        --outputs "aclImdb/$SPLIT.input0.bpe" \
        --workers 60 \
        --keep-empty
done
```
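
Each `.bpe` file should contain one line of space-separated GPT-2 BPE token ids per input line. A quick way to eyeball the result (the exact ids and lengths depend on the reviews):
```python
# Peek at the BPE output: space-separated token ids, one line per input line.
with open('aclImdb/train.input0.bpe') as f:
    first_line = f.readline().split()
print(len(first_line), first_line[:10])  # token count and the first few ids
assert all(tok.isdigit() for tok in first_line)
```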

### 4) Preprocess data

```bash
# Download fairseq dictionary.
wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt'

fairseq-preprocess \
    --only-source \
    --trainpref "aclImdb/train.input0.bpe" \
    --validpref "aclImdb/dev.input0.bpe" \
    --destdir "IMDB-bin/input0" \
    --workers 60 \
    --srcdict dict.txt

fairseq-preprocess \
    --only-source \
    --trainpref "aclImdb/train.label" \
    --validpref "aclImdb/dev.label" \
    --destdir "IMDB-bin/label" \
    --workers 60
```
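
`fairseq-preprocess` writes the binarized data and a copy of the dictionary into each destination directory (typically `train.bin`/`train.idx`, `valid.bin`/`valid.idx`, `dict.txt`, and a `preprocess.log`, though the exact names can vary between fairseq versions). A rough check:
```python
# Rough check of the binarized output; exact file names may vary by fairseq version.
import os
for subdir in ['IMDB-bin/input0', 'IMDB-bin/label']:
    print(subdir, sorted(os.listdir(subdir)))
```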

### 5) Run training

```bash
TOTAL_NUM_UPDATES=7812  # 10 epochs through IMDB for bsz 32
WARMUP_UPDATES=469      # 6 percent of the number of updates
LR=1e-05                # Peak LR for polynomial LR scheduler.
HEAD_NAME=imdb_head     # Custom name for the classification head.
NUM_CLASSES=2           # Number of classes for the classification task.
MAX_SENTENCES=8         # Batch size.
ROBERTA_PATH=/path/to/roberta.large/model.pt

CUDA_VISIBLE_DEVICES=0 fairseq-train IMDB-bin/ \
    --restore-file $ROBERTA_PATH \
    --max-positions 512 \
    --batch-size $MAX_SENTENCES \
    --max-tokens 4400 \
    --task sentence_prediction \
    --reset-optimizer --reset-dataloader --reset-meters \
    --required-batch-size-multiple 1 \
    --init-token 0 --separator-token 2 \
    --arch roberta_large \
    --criterion sentence_prediction \
    --classification-head-name $HEAD_NAME \
    --num-classes $NUM_CLASSES \
    --dropout 0.1 --attention-dropout 0.1 \
    --weight-decay 0.1 --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 \
    --clip-norm 0.0 \
    --lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_NUM_UPDATES --warmup-updates $WARMUP_UPDATES \
    --fp16 --fp16-init-scale 4 --threshold-loss-scale 1 --fp16-scale-window 128 \
    --max-epoch 10 \
    --best-checkpoint-metric accuracy --maximize-best-checkpoint-metric \
    --shorten-method "truncate" \
    --find-unused-parameters \
    --update-freq 4
```

The above command will finetune RoBERTa-large with an effective batch size of 32
sentences (`--batch-size=8 --update-freq=4`). The expected
`best-validation-accuracy` after 10 epochs is ~96.5%.

If you run out of GPU memory, try decreasing `--batch-size` and increasing
`--update-freq` to compensate.
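
The schedule constants are derived from the dataset size. As a rough sketch (assuming the 25,000-review IMDB training split; the original values may round slightly differently), `TOTAL_NUM_UPDATES` is 10 epochs at an effective batch size of 32 and `WARMUP_UPDATES` is about 6% of that, so if you change `--batch-size` or `--update-freq`, keep these constants consistent:
```python
# Rough derivation of TOTAL_NUM_UPDATES and WARMUP_UPDATES
# (assumes the 25,000-review IMDB training split).
train_examples = 25000
batch_size = 8      # --batch-size
update_freq = 4     # --update-freq
epochs = 10         # --max-epoch

effective_batch = batch_size * update_freq                      # 32 sentences per update
total_num_updates = train_examples * epochs // effective_batch  # 7812 -> TOTAL_NUM_UPDATES
warmup_updates = round(0.06 * total_num_updates)                # 469  -> WARMUP_UPDATES
print(effective_batch, total_num_updates, warmup_updates)
```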

### 6) Load model using hub interface

Now we can load the trained model checkpoint using the RoBERTa hub interface.

Assuming your checkpoints are stored in `checkpoints/`:
```python
from fairseq.models.roberta import RobertaModel
roberta = RobertaModel.from_pretrained(
    'checkpoints',
    checkpoint_file='checkpoint_best.pt',
    data_name_or_path='IMDB-bin'
)
roberta.eval()  # disable dropout
```

Finally you can make predictions using the `imdb_head` (or whatever you set
`--classification-head-name` to during training):
```python
label_fn = lambda label: roberta.task.label_dictionary.string(
    [label + roberta.task.label_dictionary.nspecial]
)

tokens = roberta.encode('Best movie this year')
pred = label_fn(roberta.predict('imdb_head', tokens).argmax().item())
assert pred == '1'  # positive

tokens = roberta.encode('Worst movie ever')
pred = label_fn(roberta.predict('imdb_head', tokens).argmax().item())
assert pred == '0'  # negative
```
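
To double-check the reported validation accuracy outside of training, you can run the same prediction loop over the dev files from step 2, reusing `roberta` and `label_fn` from above. This is only a rough sketch: it truncates long reviews to 512 tokens, loosely mirroring the `--shorten-method "truncate"` setting used during training:
```python
# Sketch: score the dev split created in step 2 with the finetuned head.
ncorrect, nsamples = 0, 0
with open('aclImdb/dev.input0') as f_in, open('aclImdb/dev.label') as f_label:
    for text, target in zip(f_in, f_label):
        tokens = roberta.encode(text.strip())[:512]  # truncate long reviews
        pred = label_fn(roberta.predict('imdb_head', tokens).argmax().item())
        ncorrect += int(pred == target.strip())
        nsamples += 1
print('Accuracy:', float(ncorrect) / nsamples)
```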