Fine-tuning on SNLI-VE (visual entailment) with transformers models & Trainer
Hi,
Thank you for publishing your high-performing models in the 'transformers' library format, a great step toward making them usable as foundation models!
Has anyone succeeded in reproducing the accuracy reported in the OFA paper on SNLI-VE (visual entailment), using from_pretrained() models and fine-tuning with the Hugging Face Trainer()? I tried with OFA-tiny and OFA-base, but although validation accuracy progresses normally during training and the confusion matrix looks normal, the final accuracy ends up about 10 points below the expected performance. I tried to match all the relevant parameters (same prompt including spaces and quotation marks, image mean and std of 0.5, the different image sizes between models, encoder_drop_path_rate = 0.1, decoder_drop_path_rate = 0.1, 5 epochs, warmup ratio = 0.06, peak lr = 3e-5 then decreasing, AdamW weight decay = 1e-2, ...), but I may have missed something.
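For reference, here is a minimal sketch of the kind of setup described above, not my exact code. It assumes the OFA-Sys fork of transformers, which exposes OFAModel / OFATokenizer, and uses placeholder names (`train_ds`, `val_ds`, `collate_fn`) for the encoded SNLI-VE splits and the batching function; the batch size, fp16 flag, and resolution value are assumptions, not values from the paper.

```python
# Sketch of a Trainer run with the hyper-parameters listed above.
# Assumes the OFA-Sys transformers fork (OFAModel / OFATokenizer).
from torchvision import transforms
from transformers import OFAModel, OFATokenizer, Trainer, TrainingArguments

ckpt = "OFA-Sys/ofa-base"
tokenizer = OFATokenizer.from_pretrained(ckpt)
# drop-path rates passed as config overrides (names assumed to be supported by the fork)
model = OFAModel.from_pretrained(
    ckpt,
    encoder_drop_path_rate=0.1,
    decoder_drop_path_rate=0.1,
)

# Image preprocessing: mean/std of 0.5; resolution depends on the model size.
resolution = 384  # assumption for OFA-base; OFA-tiny uses a smaller size
patch_resize_transform = transforms.Compose([
    transforms.Resize((resolution, resolution),
                      interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])

training_args = TrainingArguments(
    output_dir="ofa-base-snli-ve",
    num_train_epochs=5,
    learning_rate=3e-5,             # peak lr
    lr_scheduler_type="linear",     # decreasing after warmup
    warmup_ratio=0.06,
    weight_decay=1e-2,              # AdamW weight decay
    per_device_train_batch_size=8,  # assumption: adjust to your hardware
    evaluation_strategy="epoch",
    save_strategy="epoch",
    fp16=True,                      # assumption
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,   # placeholder: encoded SNLI-VE train split
    eval_dataset=val_ds,      # placeholder: encoded dev split
    data_collator=collate_fn, # placeholder: batches text ids + patch images
)
trainer.train()
```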
Thanks for your interest :-)
François
Sorry, I have never tried the HF Trainer. But my colleagues recently added support for training OFA with our HF-compatible code; see this repo: https://github.com/OFA-Sys/OFA-Compress