---
pipeline_tag: text-classification
language:
- multilingual
license: apache-2.0
library_name: transformers
---

# Model Description

This model was built by translating the FineWeb-Edu annotations into 15 languages using TowerLLM 70B, a state-of-the-art proprietary LLM for translation.

TowerLLM 70B excels at translating entire documents, which makes it a natural fit for translating the texts used to train our classifier.

The classifier is trained for English, German, Spanish, Japanese, Chinese, Russian, Hindi, Czech, Ukrainian, Icelandic, Portuguese, French, Dutch, Italian, and Korean. Since it is built on top of [mdeberta-v3-base](https://huggingface.co/microsoft/mdeberta-v3-base), it should also generalize to other languages.

## Running the Model:

To run inference, install the following dependencies:

```
pip install "transformers[torch]"
pip install datasets
pip install pandas
pip install tqdm
```

After installing these libraries, you can run the following code:

```python
import pandas as pd
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from tqdm import tqdm


device = "cuda"
path = "utter-project/EuroFilter-v1"
model = AutoModelForSequenceClassification.from_pretrained(
    path,
    device_map=device,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(path, use_fast=True)

def get_model_outputs(texts):
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt", max_length=512).to(model.device)
    with torch.no_grad():
        outputs = model(**inputs)
    score = outputs.logits
    prob = torch.nn.functional.sigmoid(outputs.binary_logits)
    return score.cpu(), prob.cpu()

def batchify_texts(texts, batch_size):
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size]

# TODO: replace the next line with the texts you want to classify
texts = LIST_WITH_TEXTS_TO_CLASSIFY
batch_size = 64  # Adjust based on your available memory and model capacity
num_batches = (len(texts) + batch_size - 1) // batch_size

all_scores = []
all_probs = []
with tqdm(total=num_batches, dynamic_ncols=True) as pbar:
    for batch_num, batch in enumerate(batchify_texts(texts, batch_size), 1):
        score, probs = get_model_outputs(batch)
        all_scores.append(score)
        all_probs.append(probs)
        pbar.set_description(f"Processing Batch {batch_num}/{num_batches}")
        pbar.update(1)

# SCORES is the output of the regression head and should reflect the
# educational score of the text!
scores = torch.cat(all_scores, dim=0).squeeze()

# BINARY_PRED is the output of the classification head that tells
# if a text has an acceptable educational score or not.
# NOTE: Converting the scores into binary predictions is also possible
all_probs = torch.cat(all_probs, dim=0).squeeze()
binary_pred = (all_probs >= 0.5).numpy().astype(int)
```
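
The NOTE above refers to thresholding the regression scores directly. A minimal sketch of that, assuming the FineWeb-Edu convention of rounding to an integer grade in [0, 5] and treating grades of 3 or higher as educational (the threshold of 3 is an assumption, not part of this card):

```python
# Continues from the snippet above: `scores` holds the regression-head outputs
# and `binary_pred` the predictions of the binary head.
# Round to the nearest integer grade and clamp to the 0-5 annotation range.
int_scores = scores.float().round().clamp(0, 5).long()

# Derive a binary label from the regression head by thresholding.
# NOTE: the cut-off of 3 mirrors the FineWeb-Edu convention and is an
# assumption here; adjust it to your filtering needs.
binary_from_scores = (int_scores >= 3).numpy().astype(int)

# Collect everything in a DataFrame for inspection or export.
df = pd.DataFrame({
    "text": texts,
    "score": scores.float().numpy(),
    "int_score": int_scores.numpy(),
    "educational": binary_pred,
    "educational_from_score": binary_from_scores,
})
```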

## English Results:

When testing the model on an English partition with 37,537 samples, the results are comparable to those of the original FineWeb-Edu classifier.

Regression head results:
```
              precision    recall  f1-score   support

           0       0.80      0.53      0.64      5130
           1       0.80      0.88      0.83     21602
           2       0.63      0.58      0.61      7849
           3       0.54      0.62      0.58      2310
           4       0.62      0.48      0.54       645
           5       0.00      0.00      0.00         1

    accuracy                           0.74     37537
   macro avg       0.56      0.51      0.53     37537
weighted avg       0.74      0.74      0.74     37537
```

Binary head results:
```
              precision    recall  f1-score   support

           0       0.98      0.97      0.98     34581
           1       0.71      0.74      0.73      2956

    accuracy                           0.96     37537
   macro avg       0.85      0.86      0.85     37537
weighted avg       0.96      0.96      0.96     37537
```
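
For reference, reports in this format come from scikit-learn's `classification_report`. Below is a minimal sketch of how to reproduce them for your own labelled data, assuming hypothetical gold annotations `gold_scores` (integer grades 0-5) and `gold_binary` (0/1 labels) aligned with the predictions from the snippet above; scikit-learn is not in the install list and would need `pip install scikit-learn`:

```python
from sklearn.metrics import classification_report

# `gold_scores` and `gold_binary` are hypothetical gold labels you must supply;
# they are not shipped with this model card.

# Regression head: round the predicted scores to integer grades before comparing.
pred_int = scores.float().round().clamp(0, 5).long().numpy()
print(classification_report(gold_scores, pred_int))

# Binary head: compare the thresholded probabilities against gold binary labels.
print(classification_report(gold_binary, binary_pred))
```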

## Multilingual Results:

If we evaluate on the same texts translated into the 15 different languages, the results are almost identical!

Regression head results:
```
              precision    recall  f1-score   support

           0       0.80      0.50      0.61      5130
           1       0.79      0.87      0.83     21602
           2       0.61      0.58      0.59      7849
           3       0.52      0.61      0.56      2310
           4       0.61      0.38      0.47       645
           5       0.00      0.00      0.00         1

    accuracy                           0.73     37537
   macro avg       0.55      0.49      0.51     37537
weighted avg       0.73      0.73      0.73     37537
```

Binary head results:
```
              precision    recall  f1-score   support

           0       0.98      0.97      0.97     34581
           1       0.70      0.71      0.71      2956

    accuracy                           0.95     37537
   macro avg       0.84      0.84      0.84     37537
weighted avg       0.95      0.95      0.95     37537
```
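
The install list above includes `datasets` even though the inference snippet does not use it. Below is a minimal sketch of how the binary head could be used to filter a Hugging Face dataset, assuming a placeholder dataset name, a `text` column, and the same 0.5 threshold as the snippet above (reusing its `get_model_outputs` helper):

```python
from datasets import load_dataset

# Placeholder dataset; substitute your own corpus with a "text" column.
ds = load_dataset("YOUR_DATASET", split="train")

def add_edu_scores(batch):
    # `get_model_outputs` is the helper defined in the inference snippet above.
    score, prob = get_model_outputs(batch["text"])
    return {
        "edu_score": score.float().squeeze(-1).tolist(),
        "edu_prob": prob.float().squeeze(-1).tolist(),
    }

ds = ds.map(add_edu_scores, batched=True, batch_size=64)
# Keep only documents the binary head considers educational.
ds = ds.filter(lambda example: example["edu_prob"] >= 0.5)
```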

## Citation

If you use our work, please cite:
```
@misc{martins2025eurollm9B,
      title={EuroLLM-9B: Technical Report},
      author={Pedro Henrique Martins and João Alves and Patrick Fernandes and Nuno M. Guerreiro and Ricardo Rei and Amin Farajian and Mateusz Klimaszewski and Duarte M. Alves and José Pombal and Manuel Faysse and Pierre Colombo and François Yvon and Barry Haddow and José G. C. de Souza and Alexandra Birch and André F. T. Martins},
      year={2025},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
}
```