|
--- |
|
library_name: transformers |
|
datasets: |
|
- s-nlp/EverGreen-Multilingual |
|
language: |
|
- ru |
|
- en |
|
- fr |
|
- de |
|
- he |
|
- ar |
|
- zh |
|
base_model: |
|
- intfloat/multilingual-e5-large-instruct |
|
pipeline_tag: text-classification |
|
--- |
|
# E5-EG-large |
|
|
|
A multilingual model for temporal classification of questions, fine-tuned from [intfloat/multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct).
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
E5-EG-large (E5 EverGreen - Large) is a multilingual text classification model that determines whether questions have temporally mutable or immutable answers. It is the larger of the two E5-EG variants and trades inference speed for higher accuracy (see the model comparison below).
|
|
|
- **Model type:** Text Classification |
|
- **Base model:** [intfloat/multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct)
|
- **Language(s):** Russian, English, French, German, Hebrew, Arabic, Chinese |
|
- **License:** MIT |
|
|
|
### Model Sources |
|
|
|
- **Repository:** [GitHub](https://github.com/s-nlp/Evergreen-classification) |
|
- **Paper:** [Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA](https://arxiv.org/abs/2505.21115) |
|
|
|
|
|
## How to Get Started with the Model |
|
|
|
```python
from transformers import pipeline

# Load the classification pipeline
model_name = "s-nlp/E5-EverGreen-Multilingual-Large"
pipe = pipeline("text-classification", model=model_name)

# Batch classification example
questions = [
    "What is the capital of France?",
    "Who won the latest World Cup?",
    "What is the speed of light?",
    "What is the current Bitcoin price?",
    "How old is Elon Musk?",
    "How old was Leo Tolstoy when he died?",
]

# Classify and print the predicted label and confidence for each question
results = pipe(questions)
for question, result in zip(questions, results):
    print(f"{question} -> {result['label']} ({result['score']:.2f})")
```
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
Trained on the same multilingual dataset as E5-EG-small, [s-nlp/EverGreen-Multilingual](https://huggingface.co/datasets/s-nlp/EverGreen-Multilingual) (loading sketch below):
|
- ~4,000 questions per language |
|
- Balanced class distribution |
|
- Augmented with synthetic and translated data |
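
The dataset is hosted on the Hugging Face Hub. Below is a minimal loading sketch; it assumes the default configuration and a `train` split, so check the dataset card for the exact split and column names.

```python
from datasets import load_dataset

# Load the multilingual EverGreen dataset (default configuration assumed)
dataset = load_dataset("s-nlp/EverGreen-Multilingual")
print(dataset)

# Inspect one example; the split and field names here are assumptions,
# see the dataset card for the actual schema
print(dataset["train"][0])
```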
|
|
|
### Training Procedure |
|
|
|
#### Preprocessing |
|
- Identical to E5-EG-small |
|
- Maximum sequence length: 64 tokens |
|
- Multilingual tokenization (see the sketch below)
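
As an illustration of the preprocessing above, the sketch below tokenizes a small multilingual batch with the base model's tokenizer, truncating and padding to 64 tokens. This is a sketch of the general setup, not necessarily the exact training preprocessing code.

```python
from transformers import AutoTokenizer

# Tokenizer of the multilingual base model
tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-large-instruct")

# English and Russian questions tokenized to a fixed length of 64 tokens
batch = tokenizer(
    ["Who won the latest World Cup?", "Какова скорость света?"],  # "What is the speed of light?"
    max_length=64,
    truncation=True,
    padding="max_length",
    return_tensors="pt",
)
print(batch["input_ids"].shape)  # torch.Size([2, 64])
```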
|
|
|
#### Training Hyperparameters |
|
- **Training regime:** fp16 mixed precision |
|
- **Epochs:** 10 |
|
- **Batch size:** 32 |
|
- **Learning rate:** 5e-05 |
|
- **Warmup steps:** 300 |
|
- **Weight decay:** 0.01 |
|
- **Optimizer:** AdamW |
|
- **Loss function:** Focal Loss (γ=2.0, α=0.25) with class weighting (sketched below)
|
- **Gradient accumulation steps:** 1 |
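
For reference, here is a minimal PyTorch sketch of focal loss with class weighting. It illustrates the general technique with the hyperparameters listed above (γ=2.0, α=0.25); it is not necessarily the exact training implementation, and the `class_weights` values in the toy usage are hypothetical.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25, class_weights=None):
    # Per-example cross-entropy: -log p_t for the true class
    ce = F.cross_entropy(logits, targets, reduction="none")
    p_t = torch.exp(-ce)  # probability the model assigns to the true class
    # (1 - p_t)^gamma down-weights easy examples; alpha is a fixed balancing factor
    loss = alpha * (1.0 - p_t) ** gamma * ce
    if class_weights is not None:
        # Optional per-class weighting on top of the focal term
        loss = loss * class_weights[targets]
    return loss.mean()

# Toy usage: 2-class logits for a batch of 8 questions
logits = torch.randn(8, 2)
targets = torch.randint(0, 2, (8,))
weights = torch.tensor([1.0, 1.2])  # hypothetical class weights
print(focal_loss(logits, targets, class_weights=weights))
```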
|
|
|
#### Hardware |
|
- **GPU:** 1× NVIDIA V100
|
- **Training time:** ~8 hours |
|
|
|
## Evaluation |
|
|
|
### Testing Data |
|
|
|
Same test sets as E5-EG-small (2,100 samples per language).
|
|
|
|
|
### Metrics |
|
|
|
#### Overall Performance |
|
| Metric | Score | |
|
|--------|-------| |
|
| Overall F1 | 0.89 | |
|
| Overall Accuracy | 0.88 | |
|
|
|
#### Per-Language F1 Scores |
|
| Language | F1 Score | |
|
|----------|----------| |
|
| English | 0.92 | |
|
| Chinese | 0.91 | |
|
| French | 0.90 | |
|
| German | 0.89 | |
|
| Russian | 0.88 | |
|
| Hebrew | 0.87 | |
|
| Arabic | 0.86 | |
|
|
|
#### Class-wise Performance |
|
| Class | Precision | Recall | F1 | |
|
|-------|-----------|--------|-----| |
|
| Immutable | 0.87 | 0.90 | 0.88 | |
|
| Mutable | 0.90 | 0.87 | 0.88 | |
|
|
|
### Model Comparison |
|
|
|
| Model | Parameters | Overall F1 | Inference Time (ms) | |
|
|-------|------------|------------|---------------------| |
|
| E5-EG-large | 560M | 0.89 | 45 | |
|
| E5-EG-small | 118M | 0.85 | 12 | |
|
| mDeBERTa-base | 278M | 0.87 | 28 | |
|
| mBERT | 177M | 0.85 | 20 | |
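
The inference times above depend heavily on hardware, sequence length, and batching. A rough way to measure per-question latency on your own setup (a sketch using the `pipeline` API with single-question, unbatched calls):

```python
import time
from transformers import pipeline

pipe = pipeline("text-classification", model="s-nlp/E5-EverGreen-Multilingual-Large")
question = "Who won the latest World Cup?"

pipe(question)  # warm-up call
start = time.perf_counter()
for _ in range(100):
    pipe(question)
elapsed_ms = (time.perf_counter() - start) / 100 * 1000
print(f"{elapsed_ms:.1f} ms per question")
```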
|
|
|
## Citation |
|
|
|
**BibTeX:** |
|
|
|
```bibtex |
|
@misc{pletenev2025truetomorrowmultilingualevergreen, |
|
title={Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA}, |
|
author={Sergey Pletenev and Maria Marina and Nikolay Ivanov and Daria Galimzianova and Nikita Krayko and Mikhail Salnikov and Vasily Konovalov and Alexander Panchenko and Viktor Moskvoretskii}, |
|
year={2025}, |
|
eprint={2505.21115}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CL}, |
|
url={https://arxiv.org/abs/2505.21115}, |
|
} |
|
``` |