pandrei7/autextification-upb-mtl

pandrei7 committed on Jul 23, 2023

Commit 80a4791 · 1 Parent(s): 56dfd9c

docs: Create a model card

Files changed (1)

README.md +69 -0

	@@ -0,0 +1,69 @@

+---
+language:
+- en
+- es
+---
+# UPB's Multi-task Learning model for AuTexTification
+This model classifies text as human- or LLM-generated.
+It was trained for one of University Politehnica of Bucharest's (UPB)
+submissions to the [AuTexTification shared
+task](https://sites.google.com/view/autextification/home).
+It was trained using multi-task learning to predict whether a text
+document was written by a human or a large language model, and whether it was
+written in English or Spanish.
+The model outputs a score (probability) for each task. For synthetic-text
+detection, it also makes a binary prediction based on a threshold.
+## Training data
+The model was trained on approximately 33,845 English documents and 32,062
+Spanish documents, covering five different domains, such as legal or social
+media. The dataset is available on Zenodo; see the shared task's
+[data page](https://sites.google.com/view/autextification/data) for instructions.
+## Evaluation results
+These results were computed as part of the [AuTexTification shared
+task](https://sites.google.com/view/autextification/results):
+| Language | Macro F1 | Confidence Interval |
+|:---------|:--------:|:------------------:|
+| English  | 65.53    | (64.92, 66.23)     |
+| Spanish  | 65.01    | (64.58, 65.64)     |
+## Using the model
+You can load the model and its tokenizer using `AutoModel` and `AutoTokenizer`.
+This is an example of using the model for inference:
+```python
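+import torch
+from transformers import AutoModel, AutoTokenizer
+checkpoint = "pandrei7/autextification-upb-mtl"
+tokenizer = AutoTokenizer.from_pretrained(checkpoint)
+# trust_remote_code=True lets transformers run the model code stored in the repository.
+model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True)
+texts = ["Enter your text here."]
+# Pad and truncate so that all texts in the batch fit in one tensor.
+tokenized_batch = tokenizer(
+    texts,
+    padding=True,
+    truncation=True,
+    max_length=512,
+    return_tensors="pt",
+)
+# Switch to inference mode and skip gradient tracking.
+model.eval()
+with torch.no_grad():
+    # The whole tokenized batch is passed as a single argument.
+    preds = model(tokenized_batch)
+# Each output holds one value per input text; index 0 is the first text.
+print("Bot?\t", preds["is_bot"][0].item())
+print("Bot score\t", preds["bot_prob"][0].item())
+print("English score\t", preds["english_prob"][0].item())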
+```
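
The model card says the binary prediction is derived from a score and a threshold, but it does not give the threshold value. The sketch below shows how such a decision could be reproduced or re-tuned from the returned scores. The name `binarize`, the cut-off `ASSUMED_THRESHOLD = 0.5`, and the use of `>=` are assumptions for illustration, not details taken from the model's code.

```python
from typing import List

# Hypothetical cut-off: the model card does not state the value the model uses.
ASSUMED_THRESHOLD = 0.5


def binarize(bot_probs: List[float], threshold: float = ASSUMED_THRESHOLD) -> List[bool]:
    """Flag a text as machine-generated when its score reaches the threshold.

    Whether the actual model uses >= or > is unknown; >= is an assumption.
    """
    return [p >= threshold for p in bot_probs]


if __name__ == "__main__":
    # Stand-in scores; in practice these would come from preds["bot_prob"].
    scores = [0.12, 0.50, 0.93]
    print(binarize(scores))
    # A stricter cut-off flags fewer texts.
    print(binarize(scores, threshold=0.9))
```

Raising the threshold trades recall for precision on the synthetic class, which may matter if false accusations of machine authorship are costly.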