Correct language

FremyCompany changed pull request status to merged

Thank you @aari1995 , good catch!
A small copy-paste error indeed. I fixed it in all models now.

perfect! :-) good model(s)!

For me, I get a problem with the padding token (due to ModernBERT), even after updating the libraries. Maybe it is a technical issue for everyone? In my version I just assigned the pad token to the eos token. Maybe that makes sense for you too, @FremyCompany, or am I missing something?

Parallia org

Oh, right. You have to assign a token to be the padding token if you run batched inference. Normally ModernBERT doesn't require one, but I couldn't get it working with SentenceTransformers. Cc @tomaarsen.

The fix is:

tokenizer.pad_token = tokenizer.eos_token
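As a sketch of how you could apply this defensively (the helper name `ensure_pad_token` is made up for illustration; it just wraps the one-liner above in a guard so it is a no-op when a pad token already exists):

```python
def ensure_pad_token(tokenizer):
    """Workaround sketch: if the tokenizer has no pad token, reuse the
    EOS token as PAD so batched encoding can pad sequences to equal
    length. Returns the (possibly modified) tokenizer."""
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer
```

With a SentenceTransformer model you would call it on the underlying tokenizer before encoding, e.g. `ensure_pad_token(model.tokenizer)` followed by `model.encode(sentences)`.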

I'll update the tokenizer_config.json to have a padding token when I get back to my desk.

Could either of you provide a small snippet that breaks due to the pad_token? Then I can try and chase it down.

  • Tom Aarsen

@tomaarsen

broken:


from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("Parallia/Fairly-Multilingual-ModernBERT-Embed-BE-DE")

# Run inference
sentences = [
    'Diese drei geheimnisvollen Männer kamen uns dann zu Hilfe.',
    'Drei ziemlich seltsame Typen halfen uns danach.',
    'Diese drei schwarzen Vögel sahen dann in unseren Garten.',
    'Einige Leute sind hilfsbereit.',
    'Un, zwei, drei... Wer kann die nächsten Zahlen erraten?',
]

embeddings = model.encode(sentences)
print(embeddings.shape)
# [5, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
Parallia org

@aari1995 FWIW, this works now. What is going on is that sentence_transformers pads all sequences in a batch to the same length using [PAD] tokens, but from what I understand, ModernBERT supports "unpadded" inference, where the sequences in the batch are instead concatenated and a smart attention mask is used to prevent contamination between them. I don't think SentenceTransformers supports that yet, which is why encoding failed when the [PAD] token was not defined. I did not notice it because I trained the models with 4 tokenizers which inherited the PAD token from the first tokenizer.
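To make the padded-batching requirement concrete, here is a minimal plain-Python sketch (the token ids and pad id are invented for illustration): variable-length sequences are padded to a rectangular batch, and an attention mask flags which positions are real tokens. Without a defined pad id there is nothing to pad with, which is the failure described above; ModernBERT's unpadded mode avoids this step entirely by concatenating sequences instead.

```python
PAD_ID = 0  # assumed pad id for this sketch

def pad_batch(sequences, pad_id=PAD_ID):
    """Pad variable-length token-id lists to a rectangular batch and
    build the matching attention mask (1 = real token, 0 = padding)."""
    max_len = max(len(s) for s in sequences)
    padded = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return padded, mask

batch = [[101, 7, 42, 102], [101, 5, 102]]
padded, mask = pad_batch(batch)
print(padded)  # [[101, 7, 42, 102], [101, 5, 102, 0]]
print(mask)    # [[1, 1, 1, 1], [1, 1, 1, 0]]
```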
