|
--- |
|
language: |
|
- en |
|
- es |
|
--- |
|
|
|
# UPB's Multi-task Learning model for AuTexTification |
|
|
|
This is a model for classifying text as human- or LLM-generated. |
|
|
|
This model was trained for one of University Politehnica of Bucharest's (UPB) |
|
submissions to the [AuTexTification shared |
|
task](https://sites.google.com/view/autextification/home). |
|
|
|
The model was trained with multi-task learning to predict whether a text
|
document was written by a human or a large language model, and whether it was |
|
written in English or Spanish. |
|
|
|
The model outputs a probability for each task, and it also makes a binary
prediction for the synthetic-text task by thresholding that probability.
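
As a rough illustration of the multi-task setup described above, here is a minimal sketch of one plausible structure (an assumption, not the released architecture): a shared encoder feeds two independent binary heads, and the synthetic-text probability is thresholded into a binary prediction. The layer sizes and the 0.5 threshold are placeholders.

```python
import torch
from torch import nn


class MultiTaskSketch(nn.Module):
    """Illustrative multi-task model: one shared encoder, two task heads."""

    def __init__(self, in_features=32, hidden_size=16, threshold=0.5):
        super().__init__()
        # Stand-in for a transformer encoder's pooled output.
        self.encoder = nn.Linear(in_features, hidden_size)
        self.bot_head = nn.Linear(hidden_size, 1)   # human vs. LLM-generated
        self.lang_head = nn.Linear(hidden_size, 1)  # English vs. Spanish
        self.threshold = threshold                  # assumed decision threshold

    def forward(self, features):
        shared = torch.relu(self.encoder(features))
        bot_prob = torch.sigmoid(self.bot_head(shared)).squeeze(-1)
        english_prob = torch.sigmoid(self.lang_head(shared)).squeeze(-1)
        return {
            "bot_prob": bot_prob,
            "english_prob": english_prob,
            # Binary prediction only for the synthetic-text task.
            "is_bot": bot_prob > self.threshold,
        }


model = MultiTaskSketch()
out = model(torch.randn(2, 32))  # a batch of two pooled feature vectors
```

Sharing the encoder lets both tasks regularize each other, while separate heads keep the per-task outputs independent.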
|
|
|
## Training data |
|
|
|
The model was trained on 33,845 English documents and 32,062 Spanish
documents, covering five domains, including legal texts and social
media. The dataset is available on Zenodo (download instructions
|
[here](https://sites.google.com/view/autextification/data)). |
|
|
|
## Evaluation results |
|
|
|
These results were computed as part of the [AuTexTification shared |
|
task](https://sites.google.com/view/autextification/results): |
|
|
|
| Language | Macro F1 | Confidence Interval |
|
|:---------|:--------:|:------------------:| |
|
| English | 65.53 | (64.92, 66.23) | |
|
| Spanish | 65.01 | (64.58, 65.64) | |
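
Macro F1, the metric reported above, averages the per-class F1 scores with equal weight, so both classes count equally regardless of class balance. A small self-contained sketch of the computation, using toy labels (illustrative data, not actual task outputs):

```python
def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Average the F1 score of each class with equal weight."""
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)


# Toy labels: 1 = generated, 0 = human.
score = macro_f1([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 0, 1])
print(round(score, 4))  # → 0.6667
```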
|
|
|
## Using the model |
|
|
|
You can load the model and its tokenizer using `AutoModel` and `AutoTokenizer`. |
|
|
|
This is an example of using the model for inference: |
|
|
|
```python |
|
import torch |
|
from transformers import AutoModel, AutoTokenizer |
|
|
|
checkpoint = "pandrei7/autextification-upb-mtl" |
|
tokenizer = AutoTokenizer.from_pretrained(checkpoint) |
|
# trust_remote_code is needed because the checkpoint ships a custom architecture.
model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True)
|
|
|
texts = ["Enter your text here."] |
|
tokenized_batch = tokenizer( |
|
texts, |
|
padding=True, |
|
truncation=True, |
|
max_length=512, |
|
return_tensors="pt", |
|
) |
|
|
|
# Run in inference mode (no dropout, no gradient tracking).
model.eval()
with torch.no_grad():
    preds = model(tokenized_batch)
|
|
|
print("Bot?\t", preds["is_bot"][0].item()) |
|
print("Bot score\t", preds["bot_prob"][0].item()) |
|
print("English score\t", preds["english_prob"][0].item()) |
|
``` |
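
The raw outputs can be mapped to human-readable labels. The helper below is hypothetical post-processing, not part of the model: the field names match the example above, but the 0.5 language cutoff and the mocked `fake_preds` dict are assumptions.

```python
import torch


def label_outputs(preds, idx=0):
    """Map the raw outputs for one document to human-readable labels."""
    is_bot = bool(preds["is_bot"][idx])
    # Assumed cutoff: english_prob >= 0.5 means the text is English.
    language = "en" if float(preds["english_prob"][idx]) >= 0.5 else "es"
    return {"source": "generated" if is_bot else "human", "language": language}


# Mocked outputs shaped like the model's return value, for illustration only.
fake_preds = {
    "is_bot": torch.tensor([True]),
    "bot_prob": torch.tensor([0.91]),
    "english_prob": torch.tensor([0.12]),
}
print(label_outputs(fake_preds))  # → {'source': 'generated', 'language': 'es'}
```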
|
|