FremyCompany committed on
Update README.md

README.md CHANGED
@@ -33,7 +33,7 @@ base_model:
 
 π§πͺ The Fairly Multilingual ModernBERT Embedding Model (Belgian Edition) is the perfect model for embedding texts of up to 8192 tokens, written in French, Dutch, German, or English, at the speed of light. It produces very similar embeddings across languages.
 
-π For each input text, the FMMB model autodetects the most efficient tokenizer (English, French, Dutch, or German) and
+π For each input text, the FMMB model autodetects the most efficient tokenizer (English, French, Dutch, or German) and routes the input text to that tokenizer. Each tokenizer uses its own language-specific token embeddings, reducing the risk of language interference. Because all the other weights are shared, the FMMB models can mix and match languages in the same batch without loading 4 different models in memory. That said, if you already know which tokenizer you want to use, you can pick the monolingual variant for [French](https://huggingface.co/Parallia/Fairly-Multilingual-ModernBERT-Embed-BE-FR), [Dutch](https://huggingface.co/Parallia/Fairly-Multilingual-ModernBERT-Embed-BE-NL), [German](https://huggingface.co/Parallia/Fairly-Multilingual-ModernBERT-Embed-BE-DE), or [English](https://huggingface.co/Parallia/Fairly-Multilingual-ModernBERT-Embed-BE-EN) for faster tokenization and a lower memory footprint.
 
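The tokenizer routing described in the added line can be sketched in a few lines. This is purely illustrative: the real FMMB routing and its language detection happen inside the model, and the toy stopword detector and stand-in tokenizers below are assumptions, not the actual implementation.

```python
# Illustrative sketch of per-language tokenizer routing (NOT the actual FMMB code).
# A toy stopword-overlap detector picks a language, then the matching tokenizer
# is used. In FMMB only the token embeddings differ per language; all other
# weights are shared, so mixed-language batches need just one model in memory.
TOKENIZERS = {
    "en": lambda text: text.lower().split(),  # stand-ins for real subword tokenizers
    "fr": lambda text: text.lower().split(),
    "nl": lambda text: text.lower().split(),
    "de": lambda text: text.lower().split(),
}

STOPWORDS = {
    "en": {"the", "and", "is", "of"},
    "fr": {"le", "la", "et", "est"},
    "nl": {"de", "het", "en", "is"},
    "de": {"der", "die", "und", "ist"},
}

def detect_language(text: str) -> str:
    """Pick the language whose stopwords overlap the text the most."""
    words = set(text.lower().split())
    return max(STOPWORDS, key=lambda lang: len(words & STOPWORDS[lang]))

def route_and_tokenize(text: str) -> tuple[str, list[str]]:
    """Detect the language, then tokenize with that language's tokenizer."""
    lang = detect_language(text)
    return lang, TOKENIZERS[lang](text)

print(route_and_tokenize("le chat est sur la table"))  # routed to "fr"
```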
 π This [sentence-transformers](https://www.SBERT.net) model was trained on a small parallel corpus containing English-French, English-Dutch, and English-German sentence pairs. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more. The input texts can be used as-is; there is no need to add prefixes.
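Once encoded, the 768-dimensional vectors are used like any other sentence embeddings. A minimal semantic-search sketch follows; the commented-out loading line infers the repo ID from the variant URLs above (an assumption), and random stand-in vectors keep the sketch runnable offline in place of real `model.encode` output.

```python
import numpy as np

# In practice the vectors come from the model, e.g. (repo ID inferred from the
# monolingual variant URLs above, so treat it as an assumption):
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("Parallia/Fairly-Multilingual-ModernBERT-Embed-BE")
#   embeddings = model.encode(["Le chat dort.", "The cat sleeps.", "Es regnet."])
# Random stand-ins keep this sketch runnable offline; the shape matches the
# 768-dimensional space described above.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(3, 768))

def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between two batches of embeddings."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# Semantic search: rank all texts against the first one as the query.
scores = cosine_sim(embeddings[:1], embeddings)[0]
ranking = np.argsort(-scores)  # best match first; index 0 is the query itself
print(scores.shape, ranking[0])
```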