
## Czech PDT-C 1.0 Model #czech_pdtc1.0_model

The PDT-C 1.0 model is distributed under the
[CC BY-NC-SA](https://creativecommons.org/licenses/by-nc-sa/4.0/) licence.

The model is trained on the [PDT-C 1.0 treebank](https://hdl.handle.net/11234/1-3185)
using the [RobeCzech model](https://hdl.handle.net/11234/1-3691), and performs
morphological analysis with the [MorfFlex CZ 2.0](https://hdl.handle.net/11234/1-4794)
morphological dictionary via [MorphoDiTa](https://ufal.mff.cuni.cz/morphodita).

The model requires [UDPipe 2.1](https://ufal.mff.cuni.cz/udpipe/2), together
with the Python packages [ufal.udpipe](https://pypi.org/project/ufal.udpipe/)
(version at least 1.3.1.1) and [ufal.morphodita](https://pypi.org/project/ufal.morphodita/)
(version at least 1.11.2.1).

### Download

The latest version 231116 of the Czech PDT-C 1.0 model
can be downloaded from the [LINDAT/CLARIN repository](http://hdl.handle.net/11234/1-5293).

The model is also available in the [REST service](https://lindat.mff.cuni.cz/services/udpipe/).
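
For quick experiments, the REST service can also be queried programmatically. The sketch below uses the UDPipe REST API `process` endpoint; the model identifier `czech-pdtc1.0` is an assumption, so consult the service's `api/models` endpoint for the exact name.

```python
# A minimal sketch of querying the public UDPipe REST service.
# The endpoint and parameter names follow the UDPipe REST API; the
# model identifier "czech-pdtc1.0" is an assumption (list the
# available models via the service's "api/models" endpoint).
import json
import urllib.parse
import urllib.request

API_URL = "https://lindat.mff.cuni.cz/services/udpipe/api/process"

def build_request(text, model="czech-pdtc1.0"):
    """Build a POST request asking for tokenization, tagging, and parsing."""
    params = {
        "data": text,
        "model": model,
        "tokenizer": "",  # empty value = use default options
        "tagger": "",
        "parser": "",
    }
    data = urllib.parse.urlencode(params).encode("utf-8")
    return urllib.request.Request(API_URL, data=data)

def process(text, model="czech-pdtc1.0"):
    """Send the request and return the processed document as a string."""
    with urllib.request.urlopen(build_request(text, model)) as response:
        return json.loads(response.read())["result"]

if __name__ == "__main__":
    print(process("Děti šly do školy."))
```

The `result` field of the JSON response contains the processed document, in CoNLL-U format by default.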

### PDT-C 1.0 Morphological System

PDT-C 1.0 uses the _PDT-C tag set_ from MorfFlex CZ 2.0, which is an evolution
of the original _PDT tag set_ devised by Jan Hajič
([Hajič, 2004](https://books.google.cz/books?id=sB63AAAACAAJ)).
The tags are positional, with 15 positions corresponding to part of speech,
detailed part of speech, gender, number, case, etc. (e.g. `NNFS1-----A----`).
Different meanings of the same lemma are distinguished, and additional comments
can be provided for every lemma meaning. The complete reference can be found in
the [Manual for Morphological Annotation, Revision for the Prague Dependency
Treebank - Consolidated 2020 release](https://ufal.mff.cuni.cz/techrep/tr64.pdf),
and a quick reference is available in the [PDT-C positional morphological tags
overview](https://ufal.mff.cuni.cz/pdt-c/publications/Appendix_M_Tags_2020.pdf).
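
As an illustration, the 15 positions of a tag can be pulled apart with a few lines of Python. The position names below follow the morphological annotation manual; the sketch itself is not part of the model distribution.

```python
# Illustrative sketch: splitting a PDT-C positional tag into its 15
# named positions. A "-" means the position is not applicable.
POSITIONS = [
    "POS", "SubPOS", "Gender", "Number", "Case", "PossGender",
    "PossNumber", "Person", "Tense", "Grade", "Negation", "Voice",
    "Reserve1", "Reserve2", "Variant",
]

def decode_tag(tag):
    """Map a 15-character positional tag to {position name: value}."""
    if len(tag) != 15:
        raise ValueError(f"expected a 15-character tag, got {tag!r}")
    return dict(zip(POSITIONS, tag))

# `NNFS1-----A----`: noun (POS=N, SubPOS=N), feminine (F),
# singular (S), nominative (Case=1), affirmative (Negation=A).
tag = decode_tag("NNFS1-----A----")
```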

PDT-C 1.0 employs the dependency relations of the [PDT analytical
level](https://ufal.mff.cuni.cz/pdt-c/publications/PDT20-a-man-en.pdf), with
a quick reference available in the [PDT-C analytical functions and clause
segmentation overview](http://ufal.mff.cuni.cz/pdt-c/publications/Appendix_A_Tags_2020.pdf).

In the CoNLL-U format,
- the tags are filled in the `XPOS` column, and
- the dependency relations are filled in the `DEPREL` column, even though they
  differ from the universal dependency relations.
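
For example, both values can be read off a CoNLL-U token line with the standard ten-column layout (the sample line below is illustrative, not actual model output):

```python
# A minimal sketch of reading XPOS and DEPREL from CoNLL-U output.
# CoNLL-U token lines have ten tab-separated columns; XPOS is the
# 5th and DEPREL the 8th (1-based).
def xpos_and_deprel(conllu_line):
    cols = conllu_line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError("not a CoNLL-U token line")
    return cols[4], cols[7]  # XPOS, DEPREL

# An illustrative token line: "Ženy" tagged as a feminine plural
# nominative noun, attached as a subject (PDT analytical function Sb).
line = "1\tŽeny\tžena\tNOUN\tNNFP1-----A----\tCase=Nom\t2\tSb\t_\t_"
xpos, deprel = xpos_and_deprel(line)
```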

### PDT-C 1.0 Train/Dev/Test Splits

The PDT-C corpus consists of four datasets, but some of them do not have
an official train/dev/test split. We therefore used the following split:

- The PDT dataset is already split into train, dev (`dtest`), and test (`etest`).
- The PCEDT dataset is a translated version of the Wall Street Journal, so we
  used the usual split into train (sections 0-18), dev (sections 19-21), and
  test (sections 22-24).
- The PDTSC and FAUST datasets have no split, so we split them into dev
  (documents with identifiers ending with 6), test (documents with identifiers
  ending with 7), and train (all the remaining documents).
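
The identifier-based rule for PDTSC and FAUST can be sketched as follows (the document identifiers shown are hypothetical):

```python
# The PDTSC/FAUST split rule: identifiers ending with 6 go to dev,
# with 7 to test, and everything else to train.
def split_of(document_id):
    if document_id.endswith("6"):
        return "dev"
    if document_id.endswith("7"):
        return "test"
    return "train"

# Hypothetical document identifiers, for illustration only.
assert split_of("pdtsc_016") == "dev"
assert split_of("pdtsc_027") == "test"
assert split_of("pdtsc_123") == "train"
```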

### Acknowledgements #czech_pdtc1.0_model_acknowledgements

This work has been supported by the LINDAT/CLARIAH-CZ project funded by the
Ministry of Education, Youth and Sports of the Czech Republic (project LM2023062).

### Publications

- Milan Straka, Jakub Náplava, Jana Straková, David Samuel (2021): [RobeCzech: Czech RoBERTa, a monolingual contextualized language representation model](https://doi.org/10.1007/978-3-030-83527-9_17). In: Text, Speech, and Dialogue. TSD 2021. Lecture Notes in Computer Science, vol 12848. Springer, Cham.
- Jan Hajič, Eduard Bejček, Jaroslava Hlavačová, Marie Mikulová, Milan Straka, Jan Štěpánek, Barbora Štěpánková (2020): [Prague Dependency Treebank - Consolidated 1.0](https://aclanthology.org/2020.lrec-1.641.pdf). In: Proceedings of the 12th Language Resources and Evaluation Conference, pages 5208–5218, Marseille, France. European Language Resources Association.
- Milan Straka (2018): [UDPipe 2.0 Prototype at CoNLL 2018 UD Shared Task](https://www.aclweb.org/anthology/K18-2020/). In: Proceedings of CoNLL 2018: The SIGNLL Conference on Computational Natural Language Learning, pp. 197-207, Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-948087-72-8.
- Jana Straková, Milan Straka, Jan Hajič (2014): [Open-Source Tools for Morphology, Lemmatization, POS Tagging and Named Entity Recognition](https://aclanthology.org/P14-5003/). In: Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 13-18, Baltimore, Maryland, June 2014. Association for Computational Linguistics.

### Model Performance

#### Tagging and Lemmatization

We evaluate tagging and lemmatization on the four datasets of PDT-C 1.0,
and we also compute a macro-average. For lemmatization, we use the following
metrics:
- `Lemmas`: the primary metric comparing the _lemma proper_, which is the lemma
  with an optional lemma number (ignoring the additional lemma comments
  like “this is a given name”);
- `LemmasEM`: an exact match comparing also the lemma comments. This metric is
  less than or equal to `Lemmas`. Our model directly predicts only the lemma
  proper (no additional comments) and relies on the morphological dictionary to
  supply the comments, so it fails to generate comments for unknown words (such
  as an unknown given name).
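
The distinction between the two metrics can be illustrated with a small helper that strips the lemma comments. This is a hedged approximation of the MorfFlex CZ lemma structure, assuming comments are appended after the lemma proper and introduced by an underscore; the gloss in the second example is hypothetical.

```python
# Approximate sketch: separating the lemma proper (lemma plus an
# optional homonym number) from the MorfFlex CZ lemma comments,
# assuming comments start at the first underscore.
def lemma_proper(morfflex_lemma):
    """Strip the lemma comments, keeping lemma plus homonym number."""
    return morfflex_lemma.split("_", 1)[0]

assert lemma_proper("Jan_;Y") == "Jan"            # given-name comment dropped
assert lemma_proper("stát-2_^(gloss)") == "stát-2"  # homonym number kept
assert lemma_proper("žena") == "žena"             # no comments to strip
```

A model prediction scoring on `Lemmas` but not on `LemmasEM` is thus one whose lemma proper matches the gold lemma while its comments (typically supplied from the dictionary) do not.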

We perform the evaluation using
[udpipe2_eval.py](https://github.com/ufal/udpipe/blob/udpipe-2/udpipe2_eval.py),
a minor extension of the [CoNLL 2018 Shared
Task](https://universaldependencies.org/conll18/evaluation.html) evaluation
script.

Because the model also includes a rule-based tokenizer and sentence splitter,
we evaluate in two modes:
- using raw input text, which must first be tokenized and split into sentences;
  the resulting scores are in fact F1-scores. Note that the FAUST dataset does
  not contain any discernible sentence boundaries.
- using gold tokenization.

| Treebank | Mode | Tokens | Sents | XPOS | Lemma | LemmaEM |
|:---------|:------------------|-------:|------:|------:|------:|--------:|
| PDT | Raw text | 99.91 | 88.00 | 98.69 | 99.10 | 98.86 |
| PDT | Gold tokenization | — | — | 98.78 | 99.19 | 98.96 |
| PCEDT | Raw text | 99.97 | 94.06 | 98.77 | 99.36 | 98.75 |
| PCEDT | Gold tokenization | — | — | 98.80 | 99.40 | 98.78 |
| PDTSC | Raw text | 100.0 | 98.31 | 98.77 | 99.23 | 99.16 |
| PDTSC | Gold tokenization | — | — | 98.77 | 99.23 | 99.16 |
| FAUST | Raw text | 100.0 | 10.98 | 97.05 | 98.88 | 98.43 |
| FAUST | Gold tokenization | — | — | 97.42 | 98.78 | 98.30 |
| MacroAvg | Gold tokenization | — | — | 98.44 | 99.15 | 98.80 |
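
The `MacroAvg` row is the unweighted mean of the four datasets' gold-tokenization scores; for instance, for the `XPOS` column:

```python
# Recomputing the MacroAvg XPOS score from the per-dataset
# gold-tokenization scores in the table above.
xpos = {"PDT": 98.78, "PCEDT": 98.80, "PDTSC": 98.77, "FAUST": 97.42}
macro_avg = sum(xpos.values()) / len(xpos)
print(round(macro_avg, 2))  # 98.44
```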

#### Dependency Parsing

In PDT-C 1.0, the only manually annotated dependency parsing dataset is a subset
of the PDT dataset. We perform the evaluation as in the previous section.

| Treebank | Mode | Tokens | Sents | XPOS | Lemma | LemmaEM | UAS | LAS |
|:-----------|:------------------|-------:|------:|------:|------:|--------:|------:|------:|
| PDT subset | Raw text | 99.94 | 88.49 | 98.74 | 99.16 | 98.97 | 93.45 | 90.32 |
| PDT subset | Gold tokenization | — | — | 98.81 | 99.23 | 99.03 | 94.41 | 91.48 |