T-Systems-onsite/cross-en-de-roberta-sentence-transformer

Sep 18


Hello, I've had some problems using this model, mainly because the sentence-transformers library is quite sensitive to missing files. I've added the missing files from the base model (https://huggingface.co/sentence-transformers/xlm-r-distilroberta-base-paraphrase-v1) and added the "clean_up_tokenization_spaces": true parameter to tokenizer_config.json.

The warnings this caused were escalated to an error by the prompt benchmark tool I'm using, so I thought you could make use of this change to avoid future issues.
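For anyone who wants to apply the tokenizer change to a local copy, here is a minimal sketch using only the Python standard library. The helper name `set_cleanup_flag` is my own and not part of any library, and it assumes the tokenizer config is a plain JSON object on disk.

```python
import json
from pathlib import Path


def set_cleanup_flag(config_path: str, value: bool = True) -> dict:
    """Set clean_up_tokenization_spaces in a tokenizer config file.

    Hypothetical helper for illustration; returns the updated config.
    """
    path = Path(config_path)
    # Load the existing config so all other keys are preserved.
    config = json.loads(path.read_text(encoding="utf-8"))
    # Set the flag explicitly instead of relying on the library default.
    config["clean_up_tokenization_spaces"] = value
    # Write the config back as indented JSON with a trailing newline.
    path.write_text(
        json.dumps(config, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return config
```

Calling `set_cleanup_flag("tokenizer_config.json")` inside a local clone of the model repository should add the key and leave everything else in the file untouched.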

The benchmark scores changed minimally:
Original DE: 0.8549768717756436
Updated DE: 0.8549777340634312 (slight increase)
Original EN: 0.8660333530928567
Updated EN: 0.8660334102061337 (slight increase)
Original Cross: 0.8525445612883897
Updated Cross: 0.8525444308395488 (slight decrease)
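To put the size of these changes in perspective, here is a small sketch that computes the difference for each score pair. The numbers are copied from the list above; the `score_deltas` helper and the dictionary layout are just my own illustration.

```python
# (original, updated) score pairs copied from the benchmark results above.
scores = {
    "DE": (0.8549768717756436, 0.8549777340634312),
    "EN": (0.8660333530928567, 0.8660334102061337),
    "Cross": (0.8525445612883897, 0.8525444308395488),
}


def score_deltas(pairs: dict) -> dict:
    """Return updated minus original for every benchmark entry."""
    return {name: updated - original for name, (original, updated) in pairs.items()}


deltas = score_deltas(scores)
for name, delta in deltas.items():
    # Scientific notation makes the tiny differences readable.
    print(f"{name}: {delta:+.2e}")
```

All three differences are below 1e-6 in absolute value, so the change should not matter for any practical use of the model.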

fix: add missing files from base-model and set clean_up_tokenization_spaces = True to fix the warnings 1c87fa13

jimmymeister changed pull request status to open Sep 18

T-Systems-onsite / cross-en-de-roberta-sentence-transformer / warnings-fix