Finetune RoBERTa

#4
by humza-sami - opened

Hi, is there any codebase or guidance I can follow to fine-tune RoBERTa on my own dataset?

Hi @humza-sami - I haven't published one, but there should be some good ones online. Try googling for AutoModelForSequenceClassification.from_pretrained with problem_type=multi_label_classification or similar (as that's the method I used to generate this one), and I think you'll find guides in Medium articles and notebooks on GitHub that should get you on the right track.

For example, https://github.com/NielsRogge/Transformers-Tutorials/blob/master/BERT/Fine_tuning_BERT_(and_friends)_for_multi_label_text_classification.ipynb from @nielsr, which you could fork and modify to use RoBERTa and your own data.
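
For reference, a minimal sketch of that loading step (the checkpoint name and label count below are illustrative, not necessarily what was used for this model):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "roberta-base"  # assumption: any RoBERTa checkpoint would work here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    problem_type="multi_label_classification",  # trains with BCEWithLogitsLoss, one sigmoid per label
    num_labels=28,  # go_emotions has 28 labels (27 emotions + neutral)
)
```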

Regards,
Sam.

Thanks @SamLowe, I am basically trying to fine-tune it on a three-class problem. I will give it a shot as well.
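
(For a single-label, three-class setup the same loading call just changes the problem type and label count - a minimal sketch, with the checkpoint name assumed:)

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base",  # assumption: swap in whichever RoBERTa checkpoint you are starting from
    problem_type="single_label_classification",  # standard cross-entropy over the 3 classes
    num_labels=3,
)
```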

Hello, can you please tell me which hyperparameters were used?
Did you use the AdamW optimizer? And what was the max sequence length?

Owner

Hi @tamara08, apologies for the slow reply.

I don't have a record of the exact hyperparameters used for the run that generated this model, but yes, all the runs I did used the AdamW optimizer. I then searched over different learning rates and LR schedules.

Some base models I found really needed variable learning rates with restarts to get them out of local minima (particularly the newer variants of the BERT family, e.g. DeBERTa), but RoBERTa is of course very close to the original BERT and I found it pretty forgiving on that - it would train a classifier well with a wide variety of schedules and initial learning rates. Still, cosine_with_restarts is the one I tended to use most often, IIRC.
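
For illustration, a minimal TrainingArguments sketch along those lines - the values below are placeholders, not the hyperparameters actually used for this model:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="roberta-go-emotions",          # assumption: any output path
    learning_rate=2e-5,                        # illustrative starting point for the LR search
    lr_scheduler_type="cosine_with_restarts",  # the schedule mentioned above
    num_train_epochs=3,
    per_device_train_batch_size=16,
    optim="adamw_torch",                       # AdamW, the Trainer's default optimizer family
)
```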

For max sequence length I believe I stuck to the default for the RoBERTa tokenizer, but shorter would do for this task given the shape of the go_emotions dataset and the intent of the classification - classifying single sentences, or very short sequences of related sentences. For a private model it would be logical to remove any longer examples from the dataset before training and eval, as they are outliers and would not help subsequent usage, so a smaller max length such as 128 should be plenty. I did not remove any for this model, however, as it is deliberately coupled to the dataset, faults and all (of which there are quite a few in go_emotions!).
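
A minimal sketch of the tokenization step with the shorter cap mentioned above (the dataset lines in the comments are assumptions based on the public go_emotions dataset, not the exact preprocessing used here):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def tokenize(batch):
    return tokenizer(
        batch["text"],
        truncation=True,        # clips the few outlier examples longer than max_length
        max_length=128,         # the smaller cap suggested above
        padding="max_length",
    )

# e.g. with the `datasets` library:
# dataset = load_dataset("go_emotions", "simplified")
# encoded = dataset.map(tokenize, batched=True)
```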
