---
library_name: peft
datasets:
- xnli
license: cc-by-nc-4.0
pipeline_tag: sentence-similarity
---

These are LoRA adaptation weights for the [mT5](https://huggingface.co/google/mt5-xxl) encoder.

## Multilingual Sentence T5 (m-ST5)

This model is a multilingual extension of Sentence T5, built on the [mT5](https://huggingface.co/google/mt5-xxl) encoder. It was proposed in this [paper](https://arxiv.org/abs/2403.17528).

m-ST5 is a sentence embedding encoder whose performance has been verified on cross-lingual semantic textual similarity (STS) and sentence retrieval tasks.

### Training Data

The model was trained on the XNLI dataset.

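For reference, XNLI can be inspected with the `datasets` library. This is a minimal sketch, not part of the original card, and assumes the `datasets` package is installed:

```
from datasets import load_dataset

# One language configuration of XNLI; an "all_languages"
# configuration covering all 15 languages is also available.
xnli = load_dataset("xnli", "en", split="train")
print(xnli[0])  # {'premise': ..., 'hypothesis': ..., 'label': 0 | 1 | 2}
```
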
### Framework versions

- PEFT 0.4.0.dev0

## How to use

0. If you have not installed peft and transformers, do so first.

```
pip install -q git+https://github.com/huggingface/transformers.git@main git+https://github.com/huggingface/peft.git
```

1. Load the model.

```
from transformers import MT5EncoderModel
from peft import PeftModel

# Load the mT5-xxl encoder as the base model.
model = MT5EncoderModel.from_pretrained("google/mt5-xxl")
# These two calls support fine-tuning with gradient checkpointing;
# they can be skipped for inference-only use.
model.enable_input_require_grads()
model.gradient_checkpointing_enable()
# Attach the m-ST5 LoRA adapter weights to the encoder.
model: PeftModel = PeftModel.from_pretrained(model, "pkshatech/m-ST5")
```

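mt5-xxl is a very large model, so loading it in full float32 precision on a single device may be impractical. As a minimal sketch, not part of the original card, and assuming a bfloat16-capable GPU with the `accelerate` package installed, the base model can instead be loaded in half precision with automatic device placement:

```
import torch
from transformers import MT5EncoderModel
from peft import PeftModel

# Assumption: bfloat16-capable GPU and `accelerate` installed.
model = MT5EncoderModel.from_pretrained(
    "google/mt5-xxl",
    torch_dtype=torch.bfloat16,  # roughly halves memory vs. float32
    device_map="auto",           # let accelerate place the weights
)
model = PeftModel.from_pretrained(model, "pkshatech/m-ST5")
```

If you load the model this way, move the tokenized inputs in the next step to the model's device (e.g. `inputs = inputs.to(model.device)`) before the forward pass.
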
2. To obtain sentence embeddings, apply mean pooling to the encoder output.

```
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/mt5-xxl", use_fast=False)
model.eval()

texts = ["I am a dog.", "You are a cat."]
inputs = tokenizer(
    texts,
    padding=True,
    truncation=True,
    return_tensors="pt",
)
with torch.no_grad():
    outputs = model(**inputs)
last_hidden_state = outputs.last_hidden_state
# Zero out the hidden states at padding positions so they do not
# contribute to the average.
last_hidden_state[inputs.attention_mask == 0, :] = 0
# Sum over tokens and divide by each sentence's true length.
sent_len = inputs.attention_mask.sum(dim=1, keepdim=True)
sent_emb = last_hidden_state.sum(dim=1) / sent_len
```

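Since this card's pipeline tag is sentence-similarity, the usual next step is to compare the pooled embeddings. The snippet below is a small illustrative addition, not from the original card, using PyTorch's built-in cosine similarity:

```
import torch.nn.functional as F

# Cosine similarity between the two sentences embedded above.
score = F.cosine_similarity(sent_emb[0], sent_emb[1], dim=0)
print(f"similarity: {score.item():.3f}")
```
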
## Benchmarks

- Tatoeba: a sentence retrieval task over pairs of English sentences and sentences in other languages (see the sketch after this list).
- BUCC: a bitext mining task pairing English with one of four languages (German, French, Russian, and Chinese).
- XSTS: a cross-lingual semantic textual similarity task.

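As a rough illustration of the retrieval setting used in Tatoeba and BUCC, the sketch below ranks a small pool of candidate sentences by cosine similarity to an English query. It is an illustrative example, not from the paper; the hypothetical `encode` helper simply wraps the mean-pooling code from step 2 above.

```
import torch
import torch.nn.functional as F

def encode(texts):
    # Hypothetical helper: mean pooling exactly as in step 2.
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    hidden[inputs.attention_mask == 0, :] = 0
    return hidden.sum(dim=1) / inputs.attention_mask.sum(dim=1, keepdim=True)

query_emb = encode(["I am a dog."])                   # English query
candidates = ["Je suis un chien.", "Tu es un chat."]  # candidate pool
cand_emb = encode(candidates)

# Rank candidates by cosine similarity to the query.
scores = F.cosine_similarity(query_emb, cand_emb)  # shape: (len(candidates),)
best = scores.argmax().item()
print(candidates[best], scores[best].item())
```
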
Please see the paper for details and further results.

|       | Tatoeba-14 | Tatoeba-36 | BUCC | XSTS<br>(ar-ar) | XSTS<br>(ar-en) | XSTS<br>(es-es) | XSTS<br>(es-en) | XSTS<br>(tr-en) |
| ----- | :--------: | :--------: | :--: | :-------------: | :-------------: | :-------------: | :-------------: | :-------------: |
| m-ST5 | 96.3       | 94.7       | 97.6 | 76.2            | 78.6            | 84.4            | 76.2            | 75.1            |
| LaBSE | 95.3       | 95.0       | 93.5 | 69.1            | 74.5            | 80.8            | 65.5            | 72.0            |