vukbatanovic committed · verified · Commit 13e40a9 · 1 Parent(s): 4887e68

Create README.md

Files changed (1): README.md (+69 -0)

README.md ADDED
---
license: apache-2.0
language:
- sr
metrics:
- f1
- accuracy
base_model:
- classla/bcms-bertic
pipeline_tag: token-classification
library_name: transformers
tags:
- legal
---

# BERTić-COMtext-SR-legal-NER-ijekavica

**BERTić-COMtext-SR-legal-NER-ijekavica** is a variant of the [BERTić](https://huggingface.co/classla/bcms-bertic) model, fine-tuned on the task of named entity recognition in Serbian legal texts written in the Ijekavian pronunciation.
The model was fine-tuned for 20 epochs on the Ijekavian variant of the [COMtext.SR.legal](https://github.com/ICEF-NLP/COMtext.SR) dataset.

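Given the `pipeline_tag: token-classification` and `library_name: transformers` declared above, the model should be usable through the standard `pipeline` API. Below is a minimal inference sketch; the repository id and the example sentence are placeholders I introduce for illustration (check the model page for the actual id), not details taken from this card.

```python
# Minimal inference sketch. The repo id below is a placeholder assumption;
# substitute the actual Hugging Face repo id of this model.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="ICEF-NLP/BERTic-COMtext-SR-legal-NER-ijekavica",  # placeholder id
    aggregation_strategy="simple",  # merge B-/I- pieces into whole entity spans
)

# Illustrative Ijekavian legal-style sentence (made up for this example).
text = "Osnovni sud u Nikšiću donio je rješenje 15. marta 2023. godine."
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(float(entity["score"]), 3))
```
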
# Benchmarking

This model was evaluated on the task of named entity recognition in Serbian legal texts.
The model uses a newly developed named entity schema consisting of 21 entity types, tailored to the domain of Serbian legal texts and encoded according to the IOB2 standard.
The full entity list is available on the [COMtext.SR GitHub repository](https://github.com/ICEF-NLP/COMtext.SR).

This model was compared with [SrBERTa](http://huggingface.co/nemanjaPetrovic/SrBERTa), a model specially trained on Serbian legal texts, which was likewise fine-tuned for 20 epochs for named entity recognition on the Ijekavian variant of the [COMtext.SR.legal](https://github.com/ICEF-NLP/COMtext.SR) corpus of legal texts. Token-level accuracy and F1 (macro-averaged and per-class) were used as evaluation metrics, and gold-tokenized text was taken as input.

Two evaluation settings were considered for both models:
* Default - only the entity type portion of the NE tag is considered, effectively ignoring the "B-" and "I-" prefixes
* Strict - the entire NE tag is considered

For the strict setting, per-class results are given separately for each B-CLASS and I-CLASS tag.
In addition, macro-averaged F1 scores are presented in two variants - one where the O (outside) class is ignored, and another where it is treated equally to other named entity classes.

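To make the two settings concrete, here is a small self-contained sketch using scikit-learn on made-up IOB2 tag sequences. It is not the evaluation code from the COMtext.SR repository, only an illustration of how prefix-stripping (default setting) and the with/without-O macro F1 variants behave.

```python
# Illustrative sketch of the two evaluation settings (tag sequences are
# invented for demonstration; not the authors' actual evaluation script).
from sklearn.metrics import accuracy_score, f1_score

gold = ["B-COURT", "I-COURT", "O", "B-DATE", "I-DATE", "O"]
pred = ["B-COURT", "B-COURT", "O", "B-DATE", "I-DATE", "O"]

def default_setting(tags):
    # Keep only the entity type, ignoring the "B-"/"I-" prefixes.
    return [t.split("-", 1)[-1] for t in tags]

# Strict: the full IOB2 tag must match, so "B-COURT" vs "I-COURT" is an error.
print("strict accuracy:", accuracy_score(gold, pred))
# Default: both tags reduce to "COURT", so the same token counts as correct.
print("default accuracy:", accuracy_score(default_setting(gold), default_setting(pred)))

# Macro F1 without the O class: exclude "O" from the labels being averaged.
labels_without_o = sorted({t for t in default_setting(gold) if t != "O"})
print("macro F1 (without O):",
      f1_score(default_setting(gold), default_setting(pred),
               labels=labels_without_o, average="macro"))
```
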
BERTić-COMtext-SR-legal-NER-ijekavica and SrBERTa were fine-tuned and evaluated on the COMtext.SR.legal.ijekavica corpus using 10-fold cross-validation.

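The cross-validation protocol itself is standard; the sketch below shows the general shape of a 10-fold split over placeholder documents, with the per-fold fine-tuning and scoring left as a comment. This is an assumed outline of the procedure, not the experiment code, which lives in the COMtext.SR repository.

```python
# Assumed outline of the 10-fold cross-validation protocol (placeholder data;
# see the COMtext.SR GitHub repository for the actual experiment code).
from sklearn.model_selection import KFold

documents = [f"doc_{i}" for i in range(50)]  # placeholder corpus documents

kfold = KFold(n_splits=10, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(kfold.split(documents)):
    train_docs = [documents[i] for i in train_idx]
    test_docs = [documents[i] for i in test_idx]
    # In the real experiments, a fresh model would be fine-tuned for 20 epochs
    # on train_docs, evaluated on test_docs, and scores averaged over 10 folds.
    print(f"fold {fold}: {len(train_docs)} train docs, {len(test_docs)} test docs")
```
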
The code and data needed to run these experiments are available on the [COMtext.SR GitHub repository](https://github.com/ICEF-NLP/COMtext.SR).

## Results

| Metrics              | BERTić-COMtext-SR-legal-NER-ijekavica (default) | BERTić-COMtext-SR-legal-NER-ijekavica (strict) | SrBERTa (default) | SrBERTa (strict) |
| -------------------- | ----------------------------------------------- | ---------------------------------------------- | ----------------- | ---------------- |
| Accuracy             | **0.9839**                                      | 0.9828                                         | 0.9688            | 0.9672           |
| Macro F1 (with O)    | **0.8563**                                      | 0.8474                                         | 0.7479            | 0.7225           |
| Macro F1 (without O) | **0.8403**                                      | 0.8396                                         | 0.7328            | 0.7128           |
| *Per-class F1 (strict: B-tag / I-tag)* | | | | |
| PER                  | 0.9856                                          | 0.9780 / 0.9765                                | 0.8720            | 0.8177 / 0.9068  |
| LOC                  | 0.8933                                          | 0.9003 / 0.8134                                | 0.6670            | 0.6666 / 0.5995  |
| ADR                  | 0.9253                                          | 0.9132 / 0.9161                                | 0.8554            | 0.7806 / 0.8393  |
| COURT                | 0.9427                                          | 0.9515 / 0.9340                                | 0.8488            | 0.8417 / 0.8524  |
| INST                 | 0.8044                                          | 0.8152 / 0.8261                                | 0.6793            | 0.6376 / 0.6420  |
| COM                  | 0.7225                                          | 0.7326 / 0.6782                                | 0.4815            | 0.3632 / 0.4767  |
| OTHORG               | 0.4670                                          | 0.3436 / 0.6080                                | 0.2557            | 0.0609 / 0.3664  |
| LAW                  | 0.9523                                          | 0.9463 / 0.9511                                | 0.9147            | 0.8868 / 0.9128  |
| REF                  | 0.8125                                          | 0.7602 / 0.7939                                | 0.7564            | 0.6246 / 0.7485  |
| IDPER                | 1.0000                                          | 1.0000 / N/A                                   | 1.0000            | 1.0000 / N/A     |
| IDCOM                | 0.9722                                          | 0.9722 / N/A                                   | 0.9667            | 0.9667 / N/A     |
| IDTAX                | 1.0000                                          | 1.0000 / N/A                                   | 0.9815            | 0.9815 / N/A     |
| NUMACC               | 1.0000                                          | 1.0000 / N/A                                   | 0.6667            | 0.6667 / N/A     |
| NUMDOC               | 0.8148                                          | 0.8148 / N/A                                   | 0.3333            | 0.3333 / N/A     |
| NUMCAR               | 0.6222                                          | 0.5397 / 0.5000                                | 0.4545            | 0.5000 / 0.0000  |
| NUMPLOT              | 0.7088                                          | 0.7088 / N/A                                   | 0.5479            | 0.5479 / N/A     |
| IDOTH                | 0.5949                                          | 0.5949 / N/A                                   | 0.4776            | 0.4776 / N/A     |
| CONTACT              | 0.8000                                          | 0.8000 / N/A                                   | 0.0000            | 0.0000 / N/A     |
| DATE                 | 0.9664                                          | 0.9378 / 0.9615                                | 0.9547            | 0.9104 / 0.9480  |
| MONEY                | 0.9741                                          | 0.9613 / 0.9715                                | 0.8825            | 0.8854 / 0.8851  |
| MISC                 | 0.4183                                          | 0.4213 / 0.3874                                | 0.1814            | 0.1492 / 0.1694  |
| O                    | 0.9942                                          | 0.9942                                         | 0.9872            | 0.9872           |