NiGuLa committed on
Commit
6d7c08c
1 Parent(s): 8bcf4a8

Create README.md

Files changed (1)
  1. README.md +121 -0
README.md ADDED
@@ -0,0 +1,121 @@
+ ---
+ library_name: transformers
+ tags:
+ - detoxification
+ - style_transfer
+ license: mit
+ datasets:
+ - textdetox/multilingual_paradetox
+ language:
+ - en
+ - ar
+ - am
+ - zh
+ - uk
+ - hi
+ - es
+ - ru
+ - de
+ metrics:
+ - chrf
+ pipeline_tag: text2text-generation
+ ---
+
+ # mBART-Large multilingual detoxification model
+
+ This is a detoxification model trained on the dev part of the released parallel corpus of toxic texts, [MultiParadetox](https://huggingface.co/datasets/textdetox/multilingual_paradetox).
+
+ ## Model Details
+
+ The base model for this fine-tune is [mbart-large-50](https://huggingface.co/facebook/mbart-large-50).
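A minimal inference sketch, assuming the checkpoint loads through the standard `transformers` seq2seq classes and follows the mBART-50 language-code convention of its base model. The card does not state the repository id, so `model_id` is left as a parameter. The names `detoxify`, `mbart_code`, and `MBART50_CODES` are illustrative, and the generation settings are arbitrary defaults, not the authors' configuration. mBART-50 has no Amharic code and the card does not say how Amharic was handled, so the sketch sets no language code in that case.

```python
from typing import List, Optional

# mBART-50 codes for the languages listed in this card.
# Amharic ("am") is absent because mBART-50 defines no code for it.
MBART50_CODES = {
    "en": "en_XX", "ar": "ar_AR", "zh": "zh_CN", "uk": "uk_UA",
    "hi": "hi_IN", "es": "es_XX", "ru": "ru_RU", "de": "de_DE",
}


def mbart_code(lang: str) -> Optional[str]:
    """Return the mBART-50 code for an ISO 639-1 code, or None if unknown."""
    return MBART50_CODES.get(lang)


def detoxify(texts: List[str], lang: str, model_id: str,
             max_new_tokens: int = 128) -> List[str]:
    """Generate detoxified rewrites of `texts` (hypothetical helper)."""
    # Imported lazily so the mapping helper works without transformers.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

    gen_kwargs = {"max_new_tokens": max_new_tokens, "num_beams": 4}
    code = mbart_code(lang)
    if code is not None:
        # Assumption: source and target share the same language code,
        # since detoxification rewrites text within one language.
        tokenizer.src_lang = code
        gen_kwargs["forced_bos_token_id"] = tokenizer.convert_tokens_to_ids(code)

    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    outputs = model.generate(**batch, **gen_kwargs)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

Calling `detoxify(["some toxic sentence"], lang="en", model_id=...)` with the repository id of this model would download the weights on first use.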
+
+ The model achieves the following scores on the test set:
+
+ | Language | STA | SIM | CHRF | J |
+ |---|---|---|---|---|
+ | Amharic | 0.51 | 0.91 | 0.41 | 0.20 |
+ | Arabic | 0.56 | 0.95 | 0.74 | 0.40 |
+ | Chinese | 0.17 | 0.96 | 0.43 | 0.07 |
+ | English | 0.49 | 0.93 | 0.70 | 0.34 |
+ | German | 0.53 | 0.97 | 0.79 | 0.41 |
+ | Hindi | 0.23 | 0.94 | 0.70 | 0.17 |
+ | Russian | 0.45 | 0.94 | 0.71 | 0.32 |
+ | Spanish | 0.47 | 0.93 | 0.64 | 0.29 |
+ | Ukrainian | 0.46 | 0.94 | 0.75 | 0.35 |
+
+ **STA** - style transfer accuracy
+
+ **SIM** - content similarity
+
+ **CHRF** - fluency
+
+ **J** - joint score
+
+ For more details about the metrics and data, refer to the shared task page and the papers listed in the Citation section.
+
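The card does not define how **J** combines the three scores. A minimal sketch, assuming J is the per-sample product of STA, SIM, and CHRF averaged over the test set, as is usual for joint detoxification metrics; `joint_score` is an illustrative name and not part of any evaluation package.

```python
from statistics import fmean
from typing import Sequence


def joint_score(sta: Sequence[float], sim: Sequence[float],
                chrf: Sequence[float]) -> float:
    """Mean over samples of STA * SIM * CHRF (assumed definition of J)."""
    if not (len(sta) == len(sim) == len(chrf)):
        raise ValueError("all score lists must have the same length")
    if len(sta) == 0:
        raise ValueError("at least one sample is required")
    # Multiply the three scores sample by sample, then average.
    return fmean([s * m * c for s, m, c in zip(sta, sim, chrf)])


# Two toy samples: one fully detoxified, one still toxic.
print(joint_score([1.0, 0.0], [0.5, 0.5], [0.5, 0.5]))  # 0.125
```

Under this definition J need not equal the product of the column averages. For Amharic, 0.51 × 0.91 × 0.41 ≈ 0.19, while the table reports 0.20.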
+ ## Citation
+
+ The model was developed as a baseline for the [TextDetox CLEF-2024](https://pan.webis.de/clef24/pan24-web/text-detoxification.html) shared task.
+
+ If you would like to acknowledge our work, please cite the following manuscripts:
+
+ ```
+ @inproceedings{dementieva2024overview,
+   title={Overview of the Multilingual Text Detoxification Task at PAN 2024},
+   author={Dementieva, Daryna and Moskovskiy, Daniil and Babakov, Nikolay and Ayele, Abinew Ali and Rizwan, Naquee and Schneider, Florian and Wang, Xintong and Yimam, Seid Muhie and Ustalov, Dmitry and Stakovskii, Elisei and Smirnova, Alisa and Elnagar, Ashraf and Mukherjee, Animesh and Panchenko, Alexander},
+   booktitle={Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum},
+   editor={Guglielmo Faggioli and Nicola Ferro and Petra Galu{\v{s}}{\v{c}}{\'a}kov{\'a} and Alba Garc{\'i}a Seco de Herrera},
+   year={2024},
+   organization={CEUR-WS.org}
+ }
+ ```
+
+ ```
+ @inproceedings{DBLP:conf/ecir/BevendorffCCDEFFKMMPPRRSSSTUWZ24,
+   author = {Janek Bevendorff and
+             Xavier Bonet Casals and
+             Berta Chulvi and
+             Daryna Dementieva and
+             Ashraf Elnagar and
+             Dayne Freitag and
+             Maik Fr{\"{o}}be and
+             Damir Korencic and
+             Maximilian Mayerl and
+             Animesh Mukherjee and
+             Alexander Panchenko and
+             Martin Potthast and
+             Francisco Rangel and
+             Paolo Rosso and
+             Alisa Smirnova and
+             Efstathios Stamatatos and
+             Benno Stein and
+             Mariona Taul{\'{e}} and
+             Dmitry Ustalov and
+             Matti Wiegmann and
+             Eva Zangerle},
+   editor = {Nazli Goharian and
+             Nicola Tonellotto and
+             Yulan He and
+             Aldo Lipani and
+             Graham McDonald and
+             Craig Macdonald and
+             Iadh Ounis},
+   title = {Overview of {PAN} 2024: Multi-author Writing Style Analysis, Multilingual
+            Text Detoxification, Oppositional Thinking Analysis, and Generative
+            {AI} Authorship Verification - Extended Abstract},
+   booktitle = {Advances in Information Retrieval - 46th European Conference on Information
+                Retrieval, {ECIR} 2024, Glasgow, UK, March 24-28, 2024, Proceedings,
+                Part {VI}},
+   series = {Lecture Notes in Computer Science},
+   volume = {14613},
+   pages = {3--10},
+   publisher = {Springer},
+   year = {2024},
+   url = {https://doi.org/10.1007/978-3-031-56072-9\_1},
+   doi = {10.1007/978-3-031-56072-9\_1},
+   timestamp = {Fri, 29 Mar 2024 23:01:36 +0100},
+   biburl = {https://dblp.org/rec/conf/ecir/BevendorffCCDEFFKMMPPRRSSSTUWZ24.bib},
+   bibsource = {dblp computer science bibliography, https://dblp.org}
+ }
+ ```