|
# Finetuning RoBERTa on RACE tasks |
|
|
|
### 1) Download the data from the RACE website (http://www.cs.cmu.edu/~glai1/data/race/)
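
The dataset is distributed as a single archive (you request a download link through the form on that page). A minimal sketch for unpacking it, assuming the archive is named `RACE.tar.gz` (verify against the file you actually receive):

```bash
# Assumed archive name and layout; adjust to match the file you download.
mkdir -p ~/data
tar -xzvf RACE.tar.gz -C ~/data
ls ~/data/RACE    # typically contains train/ dev/ test/, each with high/ and middle/ subdirectories
```

The extracted directory is the `<input-dir>` used in the next step.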
|
|
|
### 2) Preprocess RACE data: |
|
```bash
python ./examples/roberta/preprocess_RACE.py --input-dir <input-dir> --output-dir <extracted-data-dir>
./examples/roberta/preprocess_RACE.sh <extracted-data-dir> <output-dir>
```
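
For example, with the placeholders filled in (the directory names here are arbitrary):

```bash
# Hypothetical paths; substitute your own locations.
python ./examples/roberta/preprocess_RACE.py \
    --input-dir ~/data/RACE \
    --output-dir ~/data/RACE-extracted
./examples/roberta/preprocess_RACE.sh ~/data/RACE-extracted ~/data/RACE-bin
```

The final output directory (`~/data/RACE-bin` above) is what `DATA_DIR` should point to in the fine-tuning and evaluation steps below.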
|
|
|
### 3) Fine-tuning on RACE: |
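
Fine-tuning starts from a pretrained RoBERTa checkpoint, referenced below via `ROBERTA_PATH`. A sketch for fetching `roberta.large` (the URL is taken from the fairseq pre-trained models list; check that it is still current):

```bash
# Download and unpack the pretrained roberta.large checkpoint (assumed URL).
wget https://dl.fbaipublicfiles.com/fairseq/models/roberta.large.tar.gz
tar -xzvf roberta.large.tar.gz
ROBERTA_PATH=$PWD/roberta.large/model.pt
```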
|
|
|
```bash
MAX_EPOCH=5        # Number of training epochs.
LR=1e-05           # Peak LR for fixed LR scheduler.
NUM_CLASSES=4
MAX_SENTENCES=1    # Batch size per GPU.
UPDATE_FREQ=8      # Accumulate gradients to simulate training on 8 GPUs.
DATA_DIR=/path/to/race-output-dir
ROBERTA_PATH=/path/to/roberta/model.pt

CUDA_VISIBLE_DEVICES=0,1 fairseq-train $DATA_DIR --ddp-backend=legacy_ddp \
    --restore-file $ROBERTA_PATH \
    --reset-optimizer --reset-dataloader --reset-meters \
    --best-checkpoint-metric accuracy --maximize-best-checkpoint-metric \
    --task sentence_ranking \
    --num-classes $NUM_CLASSES \
    --init-token 0 --separator-token 2 \
    --max-option-length 128 \
    --max-positions 512 \
    --shorten-method "truncate" \
    --arch roberta_large \
    --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
    --criterion sentence_ranking \
    --optimizer adam --adam-betas '(0.9, 0.98)' --adam-eps 1e-06 \
    --clip-norm 0.0 \
    --lr-scheduler fixed --lr $LR \
    --fp16 --fp16-init-scale 4 --threshold-loss-scale 1 --fp16-scale-window 128 \
    --batch-size $MAX_SENTENCES \
    --required-batch-size-multiple 1 \
    --update-freq $UPDATE_FREQ \
    --max-epoch $MAX_EPOCH
```
|
|
|
**Note:** |
|
|
|
a) Since contexts in RACE are relatively long, we use a smaller batch size per GPU and increase `--update-freq` to achieve a larger effective batch size.
|
|
|
b) The above command-line args and hyperparameters were tested on a single Nvidia `V100` GPU with `32GB` of memory for each task. Depending on the GPU memory available to you, you can increase `--update-freq` and reduce `--batch-size`, as illustrated in the sketch after these notes.
|
|
|
c) The settings in the above command come from our hyperparameter search within a fixed search space (for careful comparison across models). You might be able to find better metrics with a wider hyperparameter search.
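
As a rough guide for adapting the settings in b), the effective batch size is approximately `num_gpus * MAX_SENTENCES * UPDATE_FREQ`. A sketch (not an officially tuned configuration) for keeping it constant when training on a single GPU instead of two:

```bash
# Assumed single-GPU variant of the command above:
# effective batch size = 1 GPU x 1 sentence x 16 = 16,
# the same as 2 GPUs x 1 sentence x 8 in the two-GPU command.
MAX_SENTENCES=1    # per-GPU batch size; kept minimal because RACE contexts are long
UPDATE_FREQ=16     # doubled to compensate for using one GPU instead of two
```

Then run the same `fairseq-train` command as above with `CUDA_VISIBLE_DEVICES=0` and these values.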
|
|
|
### 4) Evaluation: |
|
|
|
```bash
DATA_DIR=/path/to/race-output-dir         # data directory used during training
MODEL_PATH=/path/to/checkpoint_best.pt    # path to the finetuned model checkpoint
PREDS_OUT=preds.tsv                       # output file path to save predictions
TEST_SPLIT=test                           # can be test (Middle) or test1 (High)
fairseq-validate \
    $DATA_DIR \
    --valid-subset $TEST_SPLIT \
    --path $MODEL_PATH \
    --batch-size 1 \
    --task sentence_ranking \
    --criterion sentence_ranking \
    --save-predictions $PREDS_OUT
```
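
The predictions file is mainly useful for error analysis. A rough sketch for recomputing accuracy from it, assuming each line of `preds.tsv` is tab-separated as `<example-id> <predicted-class> <gold-label>` (inspect the file first, since the exact column layout may differ across fairseq versions):

```bash
# Rough sketch, assuming columns: example id, predicted class, gold label.
awk -F'\t' '{ total++; if ($2 == $3) correct++ } END { printf "accuracy: %.4f\n", correct / total }' preds.tsv
```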
|
|