---
tags:
  - sentence-transformers
  - sentence-similarity
  - dataset_size:120000
  - multilingual
base_model: Alibaba-NLP/gte-multilingual-base
widget:
  - source_sentence: Who is filming along?
    sentences:
      - Wién filmt mat?
      - >-
        Weider huet den Tatarescu drop higewisen, datt Rumänien durch seng
        krichsbedélegong op de 6eite vun den allie'erten 110.000 mann verluer
        hätt.
      - Brambilla 130.08.03 St.
  - source_sentence: 'Four potential scenarios could still play out: Jean Asselborn.'
    sentences:
      - >-
        Dann ass nach eng Antenne hei um Kierchbierg virgesi Richtung RTL Gebai,
        do gëtt jo een ganz neie Wunnquartier gebaut.
      - >-
        D'bedélegong un de wählen wir ganz stärk gewiéscht a munche ge'genden
        wor re eso'gucr me' we' 90 prozent.
      - Jean Asselborn gesäit 4 Méiglechkeeten, wéi et kéint virugoen.
  - source_sentence: >-
      Non-profit organisation Passerell, which provides legal council to
      refugees in Luxembourg, announced that it has to make four employees
      redundant in August due to a lack of funding.
    sentences:
      - Oetringen nach Remich....8.20» 215»
      - >-
        D'ASBL Passerell, déi sech ëm d'Berodung vu Refugiéeën a Saache
        Rechtsfroe këmmert, wäert am August mussen hir véier fix Salariéen
        entloossen.
      - D'Regierung huet allerdéngs "just" 180.041 Doudeger verzeechent.
  - source_sentence: This regulation was temporarily lifted during the Covid pandemic.
    sentences:
      - Six Jours vu New-York si fir d’équipe Girgetti  Debacco
      - Dës Reegelung gouf wärend der Covid-Pandemie ausgesat.
      - ING-Marathon ouni gréisser Tëschefäll ofgelaf - 18 Leit hospitaliséiert.
  - source_sentence: The cross-border workers should also receive more wages.
    sentences:
      - D'grenzarbechetr missten och me' lo'n kre'en.
      - >-
        De Néckel: Firun! Dât ass jo ailes, wèll 't get dach neischt un der
        Bréck gemâcht!
      - >-
        D'Grande-Duchesse Josephine Charlotte an hir Ministeren hunn d'Land
        verlooss, et war den Optakt vun der Zäit am Exil.
pipeline_tag: sentence-similarity
library_name: sentence-transformers
model-index:
  - name: SentenceTransformer based on Alibaba-NLP/gte-multilingual-base
    results:
      - task:
          type: contemporary-lb
          name: Contemporary-lb
        dataset:
          name: Contemporary-lb
          type: contemporary-lb
        metrics:
          - type: accuracy
            value: 0.6216
            name: SIB-200(LB) accuracy
          - type: accuracy
            value: 0.6282
            name: ParaLUX accuracy
      - task:
          type: bitext-mining
          name: LBHistoricalBitextMining
        dataset:
          name: LBHistoricalBitextMining
          type: lb-en
        metrics:
          - type: accuracy
            value: 0.9683
            name: LB<->FR accuracy
          - type: accuracy
            value: 0.9715
            name: LB<->EN accuracy
          - type: mean_accuracy
            value: 0.9793
            name: LB<->DE accuracy
license: agpl-3.0
datasets:
  - impresso-project/HistLuxAlign
  - fredxlpy/LuxAlign
language:
  - lb
---

Luxembourgish adaptation of Alibaba-NLP/gte-multilingual-base

This is a sentence-transformers model, finetuned from Alibaba-NLP/gte-multilingual-base and further adapted to support Historical and Contemporary Luxembourgish. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for (cross-lingual) semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

This model is specialised for cross-lingual semantic search to and from Historical/Contemporary Luxembourgish. It is particularly useful for libraries and archives that want to perform semantic search and longitudinal studies within their collections.

This is an Alibaba-NLP/gte-multilingual-base model that was further adapted by Michail et al. (2025).

Limitations

We also release a model that performs better (by 18 percentage points) on ParaLUX. If finding monolingual exact matches within adversarial collections is of utmost importance, please use histlux-paraphrase-multilingual-mpnet-base-v2.

Model Description

  • Model Type: GTE-Multilingual-Base
  • Base model: Alibaba-NLP/gte-multilingual-base
  • Maximum Sequence Length: 8192 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset:
    • LB-EN (Historical, Modern)

Usage (Sentence-Transformers)

Using this model is straightforward once you have sentence-transformers installed:

```shell
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

# trust_remote_code=True is required because the GTE architecture ships
# custom modeling code on the Hugging Face Hub
model = SentenceTransformer('impresso-project/histlux-gte-multilingual-base', trust_remote_code=True)
embeddings = model.encode(sentences)  # shape: (2, 768)
print(embeddings)
```
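Once sentences are embedded, cross-lingual semantic search reduces to ranking a collection by cosine similarity against a query embedding. The sketch below shows this ranking step with small placeholder vectors standing in for the model's 768-dimensional outputs; the function name `cosine_rank` is illustrative, not part of the sentence-transformers API.

```python
import numpy as np

def cosine_rank(query_emb, corpus_embs):
    """Rank corpus rows by cosine similarity to the query (best first)."""
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    scores = c @ q                      # cosine similarity per corpus row
    order = np.argsort(-scores)         # descending
    return order, scores[order]

# Placeholder 4-dim embeddings standing in for real model outputs
query = np.array([1.0, 0.0, 0.0, 0.0])
corpus = np.array([
    [0.9, 0.1, 0.0, 0.0],   # near-paraphrase of the query
    [0.0, 1.0, 0.0, 0.0],   # unrelated
    [0.7, 0.7, 0.0, 0.0],   # partially related
])
order, scores = cosine_rank(query, corpus)
print(order)  # [0 2 1] — best match first
```

The same ranking works across languages because the model maps Luxembourgish, French, German, and English sentences into a shared vector space.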

Evaluation Results

Metrics

(see introducing paper)

Historical Bitext Mining (Accuracy):

  • LB -> FR: 96.8
  • FR -> LB: 96.9
  • LB -> EN: 97.2
  • EN -> LB: 97.2
  • LB -> DE: 98.0
  • DE -> LB: 91.8

Contemporary LB (Accuracy):

  • SIB-200 (LB): 62.16
  • ParaLUX: 62.82
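Bitext mining accuracy, as used above, measures how often the nearest neighbour of a source sentence (by cosine similarity) is its aligned translation. A minimal sketch of that computation on toy embedding matrices, assuming row i of the source aligns with row i of the target; the helper name is illustrative:

```python
import numpy as np

def bitext_mining_accuracy(src_embs, tgt_embs):
    """Fraction of source rows whose nearest target row (cosine) is the aligned one."""
    s = src_embs / np.linalg.norm(src_embs, axis=1, keepdims=True)
    t = tgt_embs / np.linalg.norm(tgt_embs, axis=1, keepdims=True)
    sims = s @ t.T                        # pairwise cosine similarity matrix
    preds = sims.argmax(axis=1)           # nearest target for each source row
    return (preds == np.arange(len(s))).mean()

# Toy aligned pairs: row i of src should retrieve row i of tgt
src = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
tgt = np.array([[0.9, 0.1], [0.1, 0.9], [1.0, 0.9]])
print(bitext_mining_accuracy(src, tgt))  # 1.0 on this toy example
```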

Training Details

Training Dataset

The parallel sentences data mix is the following:

impresso-project/HistLuxAlign:

  • LB-FR (x20,000)
  • LB-EN (x20,000)
  • LB-DE (x20,000)

fredxlpy/LuxAlign:

  • LB-FR (x40,000)
  • LB-EN (x20,000)

Total: 120,000 sentence pairs, trained in mixed batches of size 8.

Contrastive Training

The model was trained with the parameters:

**Loss**:

`sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss` with parameters:

```python
{'scale': 20.0, 'similarity_fct': 'cos_sim'}
```


Parameters of the fit()-Method:

```python
{
    "epochs": 1,
    "evaluation_steps": 520,
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {"lr": 2e-05},
    "scheduler": "WarmupLinear"
}
```
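MultipleNegativesRankingLoss treats each in-batch pair (anchor i, positive i) as the correct match and the other positives in the batch as negatives: scaled cosine similarities over the batch feed a cross-entropy whose target is the diagonal. A minimal NumPy sketch of that computation (an illustration of the loss, not the library's implementation):

```python
import numpy as np

def mnr_loss(anchors, positives, scale=20.0):
    """Cross-entropy over scaled cosine similarities; the correct
    'class' for anchor i is positive i (the diagonal)."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (a @ p.T)                      # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs.diagonal().mean()

rng = np.random.default_rng(0)
a = rng.normal(size=(8, 16))
well_aligned = mnr_loss(a, a + 0.01 * rng.normal(size=(8, 16)))  # near-translations
random_pairs = mnr_loss(a, rng.normal(size=(8, 16)))             # unrelated rows
print(well_aligned, random_pairs)  # aligned pairs yield a much lower loss
```

Intuitively, training pushes aligned Luxembourgish/French/German/English pairs toward the low-loss regime while pushing apart the other sentences in each batch.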


Citation

BibTeX

Adapting Multilingual Embedding Models to Historical Luxembourgish (introducing paper)

@misc{michail2025adaptingmultilingualembeddingmodels,
      title={Adapting Multilingual Embedding Models to Historical Luxembourgish}, 
      author={Andrianos Michail and Corina Julia Raclé and Juri Opitz and Simon Clematide},
      year={2025},
      eprint={2502.07938},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.07938}, 
}

Original Multilingual GTE Model

@inproceedings{zhang2024mgte,
  title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
  author={Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Wen and Dai, Ziqi and Tang, Jialong and Lin, Huan and Yang, Baosong and Xie, Pengjun and Huang, Fei and others},
  booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track},
  pages={1393--1412},
  year={2024}
}