A named entity recognition (NER) system was trained on text extracted from the _Oberdeutsche Allgemeine Litteraturzeitung_ (OALZ) of 1788.

## Annotations

Each text passage was annotated in [doccano](https://github.com/doccano/doccano) by two or three annotators, and their annotations were cleaned and merged into one dataset. For details on how this was done, see [`LelViLamp/kediff-doccano-postprocessing`](https://github.com/LelViLamp/kediff-doccano-postprocessing). In total, the text consists of about 1.7 million characters. The resulting annotation datasets were published on the Hugging Face Hub as [`oalz-1788-q1-ner-annotations`](https://huggingface.co/datasets/LelViLamp/oalz-1788-q1-ner-annotations).

There are two versions of the dataset:

- [`5a-generate-union-dataset`](https://huggingface.co/datasets/LelViLamp/oalz-1788-q1-ner-annotations/tree/main/5a-generate-union-dataset) contains the texts split into chunks, as they were presented in the annotation application doccano.
- [`5b-merge-documents`](https://huggingface.co/datasets/LelViLamp/oalz-1788-q1-ner-annotations/tree/main/5b-merge-documents) does not retain this split. The texts were merged into one long document and the annotation indices were adjusted accordingly.
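
The index adjustment behind the merged variant can be sketched as follows. This is an illustrative sketch only; the `(start, end, label)` tuples and the chunk structure are assumptions, not the repository's actual schema.

```python
# Sketch: re-indexing span annotations when chunked texts are
# concatenated into one long document.

def merge_chunks(chunks):
    """chunks: list of (text, annotations) pairs,
    where each annotation is a (start, end, label) tuple."""
    merged_text = ""
    merged_annotations = []
    for text, annotations in chunks:
        offset = len(merged_text)  # shift by everything merged so far
        merged_text += text
        for start, end, label in annotations:
            merged_annotations.append((start + offset, end + offset, label))
    return merged_text, merged_annotations

# Made-up example chunks with character-level spans
chunks = [
    ("Salzburg, im Jahre 1788. ", [(0, 8, "LOC")]),
    ("Herr Mozart reiste nach Wien.", [(5, 11, "PER"), (24, 28, "LOC")]),
]
text, annotations = merge_chunks(chunks)
print(annotations)  # [(0, 8, 'LOC'), (30, 36, 'PER'), (49, 53, 'LOC')]
```

The spans of the second chunk are shifted by the length of the first chunk (25 characters), so they still point at the same surface strings in the merged text.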

Note that each of these directories contains the same data in three equivalent formats:

- a Hugging Face/Arrow dataset,<sup>*</sup>
- a CSV,<sup>*</sup> and
- a JSONL file.

<sup>*</sup> The former two should be used together with `text.csv` to recover the context of the annotations; the JSONL file already contains the full text.
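
As a rough sketch of what reading the JSONL export might look like: the record below uses doccano's typical span-export shape, but the field names (`text`, `label`) and the example line are assumptions — inspect the actual files in the repository for the real schema.

```python
import json

# A made-up record in doccano's typical span-export shape; the real
# files in the dataset repository may use different field names.
line = '{"text": "Wien, den 3. Januar 1788", "label": [[0, 4, "LOC"], [10, 24, "TIME"]]}'

record = json.loads(line)
# Resolve each (start, end, tag) span against the full text
spans = [(record["text"][start:end], tag) for start, end, tag in record["label"]]
print(spans)  # [('Wien', 'LOC'), ('3. Januar 1788', 'TIME')]
```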

The following categories were included in the annotation process:

| Tag     | Label         | Count | Total Length | Median Annotation Length | Mean Annotation Length |    SD |
|:--------|:--------------|------:|-------------:|-------------------------:|-----------------------:|------:|
| `EVENT` | Event         |   294 |        6,090 |                       18 |                  20.71 | 13.24 |
| `LOC`   | Location      | 2,449 |       24,417 |                        9 |                   9.97 |  6.21 |
| `MISC`  | Miscellaneous | 2,585 |       50,654 |                       14 |                  19.60 | 19.63 |

The models are based on [`dbmdz/bert-base-historic-multilingual-cased`](https://huggingface.co/dbmdz/bert-base-historic-multilingual-cased). Their performance measures are as follows:

| Model                                                              | Selected Epoch | Checkpoint | Validation Loss | Precision |  Recall | F<sub>1</sub> | Accuracy |
|:-------------------------------------------------------------------|:--------------:|-----------:|----------------:|----------:|--------:|--------------:|---------:|
| [`EVENT`](https://huggingface.co/LelViLamp/oalz-1788-q1-ner-event) |       1        |     `1393` |         .021957 |   .665233 | .343066 |       .351528 |  .995700 |
| [`LOC`](https://huggingface.co/LelViLamp/oalz-1788-q1-ner-loc)     |       1        |     `1393` |         .033602 |   .829535 | .803648 |       .814146 |  .990999 |
| [`MISC`](https://huggingface.co/LelViLamp/oalz-1788-q1-ner-misc)   |       2        |     `2786` |         .123994 |   .739221 | .503677 |       .571298 |  .968697 |
| [`ORG`](https://huggingface.co/LelViLamp/oalz-1788-q1-ner-org)     |       1        |     `1393` |         .062769 |   .744259 | .709738 |       .726212 |  .980288 |
| [`PER`](https://huggingface.co/LelViLamp/oalz-1788-q1-ner-per)     |       2        |     `2786` |         .059186 |   .914037 | .849048 |       .879070 |  .983253 |
| [`TIME`](https://huggingface.co/LelViLamp/oalz-1788-q1-ner-time)   |       1        |     `1393` |         .016120 |   .866866 | .724958 |       .783099 |  .994631 |
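
Since each tag has its own model, tagging a passage for all categories means running it through each model separately. A minimal inference sketch for one of them, using the standard `transformers` token-classification pipeline (the example sentence is made up):

```python
from transformers import pipeline

# Load the PER model from the table above; repeat with the other five
# model ids to cover all tag categories.
ner = pipeline(
    "token-classification",
    model="LelViLamp/oalz-1788-q1-ner-per",
    aggregation_strategy="simple",  # merge subword pieces into whole spans
)

entities = ner("Herr Mozart reiste von Salzburg nach Wien.")
for entity in entities:
    print(entity["word"], entity["entity_group"], round(float(entity["score"]), 3))
```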

## Acknowledgements

The dataset and models were created in the project _Kooperative Erschließung diffusen Wissens_ ([KEDiff](https://uni-salzburg.elsevierpure.com/de/projects/kooperative-erschließung-diffusen-wissens-ein-literaturwissenscha)), funded by the [State of Salzburg](https://salzburg.gv.at), Austria 🇦🇹, and carried out at [Paris Lodron Universität Salzburg](https://plus.ac.at).