---
license: apache-2.0
language:
- sr
metrics:
- f1
- accuracy
base_model:
- classla/bcms-bertic
pipeline_tag: token-classification
library_name: transformers
tags:
- legal
---

# BERTić-COMtext-SR-legal-NER-ijekavica

**BERTić-COMtext-SR-legal-NER-ijekavica** is a variant of the [BERTić](https://huggingface.co/classla/bcms-bertic) model, fine-tuned on the task of named entity recognition in Serbian legal texts written in the Ijekavian pronunciation.
The model was fine-tuned for 20 epochs on the Ijekavian variant of the [COMtext.SR.legal](https://github.com/ICEF-NLP/COMtext.SR) dataset.

# Benchmarking

This model was evaluated on the task of named entity recognition in Serbian legal texts.
The model uses a newly developed named entity schema consisting of 21 entity types, tailored to the domain of Serbian legal texts and encoded according to the IOB2 standard.
The full entity list is available on the [COMtext.SR GitHub repository](https://github.com/ICEF-NLP/COMtext.SR).

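The 21 entity types are encoded with IOB2 "B-"/"I-" prefixes, so a tagged token sequence can be decoded back into entity spans. A minimal sketch of that decoding step (the tags and helper below are illustrative, not part of the released corpus tooling):

```python
def iob2_spans(tags):
    """Decode an IOB2 tag sequence into (entity_type, start, end) spans.

    "B-X" opens a new span of type X, "I-X" continues the current span,
    and "O" closes any open span. The end index is exclusive.
    """
    spans, current = [], None  # current = [type, start, end]
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if current:
                spans.append(tuple(current))
            current = [tag[2:], i, i + 1]
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[2] = i + 1
        else:  # "O", or an I- tag with no matching open span
            if current:
                spans.append(tuple(current))
            current = None
    if current:
        spans.append(tuple(current))
    return spans

# Hypothetical sequence using two of the schema's types (PER, COURT):
tags = ["B-PER", "I-PER", "O", "B-COURT", "I-COURT", "I-COURT", "O"]
print(iob2_spans(tags))  # [('PER', 0, 2), ('COURT', 3, 6)]
```
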
This model was compared with [SrBERTa](http://huggingface.co/nemanjaPetrovic/SrBERTa), a model specially trained on Serbian legal texts, fine-tuned for 20 epochs for named entity recognition using the Ijekavian variant of the [COMtext.SR.legal](https://github.com/ICEF-NLP/COMtext.SR) corpus of legal texts. Token-level accuracy and F1 (macro-averaged and per-class) were used as evaluation metrics, with gold tokenized text taken as input.

Two evaluation settings were considered for both models:
* Default - only the entity type portion of the NE tag is considered, effectively ignoring the "B-" and "I-" prefixes
* Strict - the entire NE tag is considered

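The difference between the two settings amounts to a tag-normalization step applied before scoring. A small sketch of the idea, using hypothetical tags rather than corpus data:

```python
def normalize(tag, strict):
    """Strict setting: keep the full IOB2 tag.
    Default setting: strip the "B-"/"I-" prefix, keeping only the entity type."""
    if strict or tag == "O":
        return tag
    return tag.split("-", 1)[1]

gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "B-PER", "O", "B-LOC"]

for strict in (False, True):
    matches = sum(normalize(g, strict) == normalize(p, strict)
                  for g, p in zip(gold, pred))
    print(f"strict={strict}: accuracy {matches / len(gold):.2f}")
```

Here the second token is scored as correct in the default setting (both tags reduce to `PER`) but as an error in the strict setting (`I-PER` vs. `B-PER`), which is why strict scores are never higher than default ones.
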
For the strict setting, per-class results are given separately for each B-CLASS and I-CLASS tag.
In addition, macro-averaged F1 scores are presented in two variants - one where the O (outside) class is ignored, and another where it is treated the same as the named entity classes.

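The two macro-F1 variants simply average per-class F1 over different class sets. A plain-Python sketch of the computation (the toy tags below are illustrative, not the reported results):

```python
from collections import Counter

def per_class_f1(gold, pred):
    """Per-class F1 from parallel gold/predicted tag sequences."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    scores = {}
    for c in set(gold) | set(pred):
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        scores[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return scores

def macro_f1(gold, pred, ignore_o=False):
    """Macro-average per-class F1, optionally excluding the O class."""
    scores = per_class_f1(gold, pred)
    if ignore_o:
        scores.pop("O", None)
    return sum(scores.values()) / len(scores)

gold = ["PER", "O", "O", "LOC", "O"]
pred = ["PER", "O", "LOC", "LOC", "O"]
print(macro_f1(gold, pred))                 # averaged over PER, LOC, and O
print(macro_f1(gold, pred, ignore_o=True))  # O class excluded
```
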
BERTić-COMtext-SR-legal-NER-ijekavica and SrBERTa were fine-tuned and evaluated on the COMtext.SR.legal.ijekavica corpus using 10-fold cross-validation.

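As a rough illustration of the 10-fold protocol, each fold holds out one tenth of the corpus for evaluation while the rest is used for fine-tuning. A contiguous split is sketched below with stdlib Python; the actual fold assignment used in the experiments is defined in the COMtext.SR repository, not here:

```python
def kfold_indices(n_items, n_folds=10):
    """Yield (train_idx, test_idx) pairs for simple contiguous K-fold CV."""
    fold_sizes = [n_items // n_folds + (1 if i < n_items % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_items) if i < start or i >= start + size]
        yield train, test
        start += size

# Toy corpus of 25 documents split into 10 folds:
folds = list(kfold_indices(25, n_folds=10))
print(len(folds))   # 10
print(folds[0][1])  # [0, 1, 2]
```
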
The code and data needed to run these experiments are available on the [COMtext.SR GitHub repository](https://github.com/ICEF-NLP/COMtext.SR).

## Results

In the strict columns, per-class values are listed as B-CLASS F1 / I-CLASS F1.

| Metrics | BERTić-COMtext-SR-legal-NER-ijekavica (default) | BERTić-COMtext-SR-legal-NER-ijekavica (strict) | SrBERTa (default) | SrBERTa (strict) |
| -------------------- | ----------------------------------------------- | ---------------------------------------------- | ----------------- | ---------------- |
| Accuracy | **0.9839** | 0.9828 | 0.9688 | 0.9672 |
| Macro F1 (with O) | **0.8563** | 0.8474 | 0.7479 | 0.7225 |
| Macro F1 (without O) | **0.8403** | 0.8396 | 0.7328 | 0.7128 |
| *Per-class F1* | | | | |
| PER | 0.9856 | 0.9780 / 0.9765 | 0.8720 | 0.8177 / 0.9068 |
| LOC | 0.8933 | 0.9003 / 0.8134 | 0.6670 | 0.6666 / 0.5995 |
| ADR | 0.9253 | 0.9132 / 0.9161 | 0.8554 | 0.7806 / 0.8393 |
| COURT | 0.9427 | 0.9515 / 0.9340 | 0.8488 | 0.8417 / 0.8524 |
| INST | 0.8044 | 0.8152 / 0.8261 | 0.6793 | 0.6376 / 0.6420 |
| COM | 0.7225 | 0.7326 / 0.6782 | 0.4815 | 0.3632 / 0.4767 |
| OTHORG | 0.4670 | 0.3436 / 0.6080 | 0.2557 | 0.0609 / 0.3664 |
| LAW | 0.9523 | 0.9463 / 0.9511 | 0.9147 | 0.8868 / 0.9128 |
| REF | 0.8125 | 0.7602 / 0.7939 | 0.7564 | 0.6246 / 0.7485 |
| IDPER | 1.0000 | 1.0000 / N/A | 1.0000 | 1.0000 / N/A |
| IDCOM | 0.9722 | 0.9722 / N/A | 0.9667 | 0.9667 / N/A |
| IDTAX | 1.0000 | 1.0000 / N/A | 0.9815 | 0.9815 / N/A |
| NUMACC | 1.0000 | 1.0000 / N/A | 0.6667 | 0.6667 / N/A |
| NUMDOC | 0.8148 | 0.8148 / N/A | 0.3333 | 0.3333 / N/A |
| NUMCAR | 0.6222 | 0.5397 / 0.5000 | 0.4545 | 0.5000 / 0.0000 |
| NUMPLOT | 0.7088 | 0.7088 / N/A | 0.5479 | 0.5479 / N/A |
| IDOTH | 0.5949 | 0.5949 / N/A | 0.4776 | 0.4776 / N/A |
| CONTACT | 0.8000 | 0.8000 / N/A | 0.0000 | 0.0000 / N/A |
| DATE | 0.9664 | 0.9378 / 0.9615 | 0.9547 | 0.9104 / 0.9480 |
| MONEY | 0.9741 | 0.9613 / 0.9715 | 0.8825 | 0.8854 / 0.8851 |
| O | 0.9942 | 0.9942 | 0.9872 | 0.9872 |