---
language:
- en
- es
---

# UPB's Multi-task Learning model for AuTexTification

This is a model for classifying text as human- or LLM-generated.

This model was trained for one of University Politehnica of Bucharest's (UPB)
submissions to the [AuTexTification shared
task](https://sites.google.com/view/autextification/home).

It was trained with multi-task learning to predict whether a text document was
written by a human or by a large language model, and whether it was written in
English or Spanish.

The model outputs a score (probability) for each task, and it also makes a
binary prediction for detecting synthetic text based on a threshold.

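A minimal sketch of how such a thresholded decision works (the 0.5 value below
is an illustrative assumption, not necessarily the threshold used by the
released model):

```python
# Illustrative only: turning a synthetic-text probability into a binary label.
# The 0.5 threshold is an assumption; the model applies its own threshold internally.
bot_prob = 0.87            # example score: probability that the text is LLM-generated
is_bot = bot_prob >= 0.5   # binary prediction derived from the score
print(is_bot)              # True -> predicted to be LLM-generated
```
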
## Training data

The model was trained on approximately 33,845 English documents and 32,062
Spanish documents, covering five different domains, such as legal or social
media. The dataset is available on Zenodo (more instructions
[here](https://sites.google.com/view/autextification/data)).

## Evaluation results

These results were computed as part of the [AuTexTification shared
task](https://sites.google.com/view/autextification/results):

| Language | Macro F1 | Confidence Interval |
|:---------|:--------:|:-------------------:|
| English  | 65.53    | (64.92, 66.23)      |
| Spanish  | 65.01    | (64.58, 65.64)      |

## Using the model

You can load the model and its tokenizer using `AutoModel` and `AutoTokenizer`.

This is an example of using the model for inference:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load the tokenizer and the custom multi-task model.
# `trust_remote_code=True` is needed because the model class lives in this repository.
checkpoint = "pandrei7/autextification-upb-mtl"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True)

# Tokenize the input texts, truncating to at most 512 tokens.
texts = ["Enter your text here."]
tokenized_batch = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)

# Run inference without gradient tracking.
model.eval()
with torch.no_grad():
    preds = model(tokenized_batch)

# The model returns a binary prediction and per-task scores for each input.
print("Bot?\t", preds["is_bot"][0].item())
print("Bot score\t", preds["bot_prob"][0].item())
print("English score\t", preds["english_prob"][0].item())
```
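
To label several documents at once, the same outputs can be read per item. The
snippet below is a small sketch building on the example above; it assumes that
a positive `is_bot` value means the text is predicted to be machine-generated,
and the "generated"/"human" label names are only for illustration.

```python
import torch
from transformers import AutoModel, AutoTokenizer

checkpoint = "pandrei7/autextification-upb-mtl"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True)

texts = [
    "First document to classify.",
    "Second document to classify.",
]
batch = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")

model.eval()
with torch.no_grad():
    preds = model(batch)

for text, is_bot, bot_prob in zip(texts, preds["is_bot"], preds["bot_prob"]):
    # Assumption: a positive `is_bot` value marks the text as machine-generated.
    label = "generated" if bool(is_bot.item()) else "human"
    print(f"{label}\t{bot_prob.item():.3f}\t{text}")
```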