Correct language

FremyCompany changed pull request status to merged

Thank you @aari1995 , good catch!
A small copy-paste error indeed. I fixed it in all models now.

perfect! :-) good model(s)!

For me, I get a problem with the padding token (due to ModernBERT), even after updating the libraries. Maybe it is a technical issue for everyone? In my version I just assigned the pad token to the eos token. Maybe that makes sense for you too, @FremyCompany, or am I missing something?

Parallia org

Oh, right. You have to assign a token to be the padding token if you run batched inference. Normally ModernBERT doesn't require one, but I couldn't get it working with SentenceTransformers. Cc @tomaarsen.

The fix is:

tokenizer.pad_token = tokenizer.eos_token
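As a sketch of how you could apply this defensively (the helper name `ensure_pad_token` is made up for illustration; it just wraps the one-liner above in a guard so it is a no-op when a pad token already exists):

```python
def ensure_pad_token(tokenizer):
    """Workaround sketch: if the tokenizer has no pad token, reuse the
    EOS token as PAD so batched encoding can pad sequences to equal
    length. Returns the (possibly modified) tokenizer."""
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer
```

With a SentenceTransformer model you would call it on the underlying tokenizer before encoding, e.g. `ensure_pad_token(model.tokenizer)` followed by `model.encode(sentences)`.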

I'll update the tokenizer_config.json to have a padding token when I get back to my desk.

Could either of you provide a small snippet that breaks due to the pad_token? Then I can try and chase it down.

  • Tom Aarsen

@tomaarsen

broken:


from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("Parallia/Fairly-Multilingual-ModernBERT-Embed-BE-DE")

# Run inference
sentences = [
    'Diese drei geheimnisvollen Männer kamen uns dann zu Hilfe.',
    'Drei ziemlich seltsame Typen halfen uns danach.',
    'Diese drei schwarzen Vögel sahen dann in unseren Garten.',
    'Einige Leute sind hilfsbereit.',
    'Un, zwei, drei... Wer kann die nächsten Zahlen erraten?',
]

embeddings = model.encode(sentences)
print(embeddings.shape)
# [5, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
Parallia org

@aari1995 FWIW, this works now. What is going on is that sentence_transformers pads all sequences in a batch to the same length using [PAD] tokens, but from what I understand, ModernBERT supports "unpadded" inference, where the sequences in the batch are instead concatenated and a smart attention mask is used to prevent contamination between them. I don't think SentenceTransformers supports that yet, which is why encoding failed when the [PAD] token was not defined. I did not notice it because I trained the models with 4 tokenizers which inherited the PAD token from the first tokenizer.
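To make the padded-batching requirement concrete, here is a minimal plain-Python sketch (the token ids and pad id are invented for illustration): variable-length sequences are padded to a rectangular batch, and an attention mask flags which positions are real tokens. Without a defined pad id there is nothing to pad with, which is the failure described above; ModernBERT's unpadded mode avoids this step entirely by concatenating sequences instead.

```python
PAD_ID = 0  # assumed pad id for this sketch

def pad_batch(sequences, pad_id=PAD_ID):
    """Pad variable-length token-id lists to a rectangular batch and
    build the matching attention mask (1 = real token, 0 = padding)."""
    max_len = max(len(s) for s in sequences)
    padded = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return padded, mask

batch = [[101, 7, 42, 102], [101, 5, 102]]
padded, mask = pad_batch(batch)
print(padded)  # [[101, 7, 42, 102], [101, 5, 102, 0]]
print(mask)    # [[1, 1, 1, 1], [1, 1, 1, 0]]
```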
