bhadresh-savani/distilbert-base-uncased-emotion · How to evaluate performance of this model?

Hello,

I am just starting to learn NLP and deep learning. How is the performance of this model measured? Can someone share the code used to do so?

The model card claims performance above 0.9, but when I test it I get an accuracy of about 0.88 on the validation and test sets (about 0.92 on the training set). Here is the code I used to measure performance: https://gist.github.com/siddhsql/41977a67e77470418b7971b6db70b61e

training set...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 250/250 [06:26<00:00, 1.55s/it]
{'eval_loss': 0.21858924627304077, 'eval_accuracy': 0.919, 'eval_f1': 0.9157014585098928, 'eval_runtime': 387.9855, 'eval_samples_per_second': 41.239, 'eval_steps_per_second': 0.644}
validation set...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:41<00:00, 1.29s/it]
{'eval_loss': 0.3313751816749573, 'eval_accuracy': 0.884, 'eval_f1': 0.8801879230740504, 'eval_runtime': 42.4234, 'eval_samples_per_second': 47.144, 'eval_steps_per_second': 0.754}
test set...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:35<00:00, 1.11s/it]
{'eval_loss': 0.31994158029556274, 'eval_accuracy': 0.887, 'eval_f1': 0.8811062509304752, 'eval_runtime': 36.7283, 'eval_samples_per_second': 54.454, 'eval_steps_per_second': 0.871}
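
For anyone comparing numbers, here is a minimal sketch of how values like eval_accuracy and eval_f1 above can be computed from true and predicted label ids. The function names and the toy labels are hypothetical, and the choice of weighted averaging is an assumption on my part; I do not know which averaging mode the model card used, and macro versus weighted averaging gives different F1 values on an imbalanced dataset.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_per_class(y_true, y_pred):
    """Return {label: F1} using F1 = 2*TP / (2*TP + FP + FN)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def averaged_f1(y_true, y_pred, average="weighted"):
    """Combine per-class F1 scores.

    "macro" is the plain mean over classes; "weighted" (assumed default
    here) weights each class by its number of true examples.
    """
    scores = f1_per_class(y_true, y_pred)
    if average == "macro":
        return sum(scores.values()) / len(scores)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Toy, made-up labels (not taken from the emotion dataset).
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]
print("accuracy:   ", accuracy(y_true, y_pred))
print("macro F1:   ", averaged_f1(y_true, y_pred, average="macro"))
print("weighted F1:", averaged_f1(y_true, y_pred, average="weighted"))
```

If the reported score and my script use different averaging modes, or different splits, that alone could explain part of the gap between the two numbers.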