## Czech PDT-C 1.0 Model #czech_pdtc1.0_model
PDT-C 1.0 Model is distributed under the
[CC BY-NC-SA](https://creativecommons.org/licenses/by-nc-sa/4.0/) licence.
The model is trained on [PDT-C 1.0 treebank](https://hdl.handle.net/11234/1-3185)
using [RobeCzech model](https://hdl.handle.net/11234/1-3691), and performs
morphological analysis using the [MorfFlex CZ 2.0](https://hdl.handle.net/11234/1-4794)
morphological dictionary via [MorphoDiTa](https://ufal.mff.cuni.cz/morphodita).
The model requires [UDPipe 2.1](https://ufal.mff.cuni.cz/udpipe/2), together
with the Python packages [ufal.udpipe](https://pypi.org/project/ufal.udpipe/)
(version at least 1.3.1.1) and [ufal.morphodita](https://pypi.org/project/ufal.morphodita/)
(version at least 1.11.2.1).
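As a quick sanity check of an installation, here is a minimal sketch, assuming
the packages were installed from PyPI under the names above:

```python
from importlib.metadata import version

def installed(pkg: str) -> tuple[int, ...]:
    """Return the installed version of pkg as a comparable tuple of integers."""
    return tuple(int(part) for part in version(pkg).split("."))

# Minimum versions required by the Czech PDT-C 1.0 model, as stated above.
assert installed("ufal.udpipe") >= (1, 3, 1, 1)
assert installed("ufal.morphodita") >= (1, 11, 2, 1)
```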
### Download
The latest version 231116 of the Czech PDT-C 1.0 model
can be downloaded from the [LINDAT/CLARIN repository](http://hdl.handle.net/11234/1-5293).
The model is also available in the [REST service](https://lindat.mff.cuni.cz/services/udpipe/).
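A minimal sketch of calling the REST service from Python follows, using the
public UDPipe REST API; the model identifier below is an assumption and should
be checked against the list the service returns at its `models` endpoint:

```python
import requests

URL = "https://lindat.mff.cuni.cz/services/udpipe/api/process"

response = requests.post(URL, data={
    "model": "czech-pdtc1.0-231116",  # hypothetical identifier; list the real
                                      # ones at .../services/udpipe/api/models
    "tokenizer": "",                  # run the tokenizer,
    "tagger": "",                     # the tagger,
    "parser": "",                     # and the parser
    "data": "Děti pily mléko.",
})
response.raise_for_status()
print(response.json()["result"])      # CoNLL-U output
```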
### PDT-C 1.0 Morphological System
PDT-C 1.0 uses the _PDT-C tag set_ from MorfFlex CZ 2.0, which is an evolution
of the original _PDT tag set_ devised by Jan Hajič
([Hajič, 2004](https://books.google.cz/books?id=sB63AAAACAAJ)).
The tags are positional with 15 positions corresponding to part of speech,
detailed part of speech, gender, number, case, etc. (e.g. `NNFS1-----A----`).
Different meanings of the same lemma are distinguished, and additional comments can
be provided for every lemma meaning. The complete reference can be found in the
[Manual for Morphological Annotation, Revision for the Prague Dependency
Treebank - Consolidated 2020 release](https://ufal.mff.cuni.cz/techrep/tr64.pdf)
and a quick reference is available in the [PDT-C positional morphological tags
overview](https://ufal.mff.cuni.cz/pdt-c/publications/Appendix_M_Tags_2020.pdf).
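To make the positional encoding concrete, here is a small sketch that splits
a tag into its positions. The first five position names follow the description
above; the remaining names are our reading of the cited manual, so double-check
them against it:

```python
POSITIONS = [
    "POS", "SubPOS", "Gender", "Number", "Case",           # as described above
    "PossGender", "PossNumber", "Person", "Tense", "Grade",
    "Negation", "Voice", "Reserve1", "Reserve2", "Variant",
]

def decode_tag(tag: str) -> dict[str, str]:
    """Map a PDT-C positional tag to {position name: value}, skipping '-'."""
    assert len(tag) == 15, "PDT-C positional tags have exactly 15 positions"
    return {name: value for name, value in zip(POSITIONS, tag) if value != "-"}

print(decode_tag("NNFS1-----A----"))
# {'POS': 'N', 'SubPOS': 'N', 'Gender': 'F', 'Number': 'S', 'Case': '1',
#  'Negation': 'A'}
```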
PDT-C 1.0 employs dependency relations from the [PDT analytical
level](https://ufal.mff.cuni.cz/pdt-c/publications/PDT20-a-man-en.pdf), with
a quick reference available in the [PDT-C analytical functions and clause
segmentation overview](http://ufal.mff.cuni.cz/pdt-c/publications/Appendix_A_Tags_2020.pdf).
In the CoNLL-U format,
- the tags are filled in the `XPOS` column, and
- the dependency relations are filled in the `DEPREL` column, even though they
differ from the universal dependency relations (see the sketch below).
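For illustration, a minimal sketch that extracts these two columns from
CoNLL-U output, relying only on the standard ten-column CoNLL-U layout
(with `XPOS` fifth and `DEPREL` eighth):

```python
def xpos_and_deprel(conllu: str):
    """Yield (FORM, XPOS, DEPREL) for every token line of a CoNLL-U document."""
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue  # skip sentence breaks and comment lines
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multi-word token ranges and empty nodes
        # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        yield cols[1], cols[4], cols[7]
```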
### PDT-C 1.0 Train/Dev/Test Splits
The PDT-C corpus consists of four datasets, but some of them do not have
an official train/dev/test split. We therefore used the following splits:
- The PDT dataset is already split into train, dev (`dtest`), and test (`etest`).
- The PCEDT dataset is a translated version of the Wall Street Journal, so we used
the usual split into train (sections 0-18), dev (sections 19-21), and test
(sections 22-24).
- The PDTSC and FAUST datasets have no official split, so we split them into dev
(documents with identifiers ending with 6), test (documents with identifiers ending with 7),
and train (all the remaining documents), as sketched below.
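The PDTSC/FAUST rule is simple enough to state as code; a sketch, assuming
document identifiers are available as strings:

```python
def split_of(document_id: str) -> str:
    """Assign a PDTSC or FAUST document to a split by its last identifier digit."""
    if document_id.endswith("6"):
        return "dev"
    if document_id.endswith("7"):
        return "test"
    return "train"
```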
### Acknowledgements #czech_pdtc1.0_model_acknowledgements
This work has been supported by the LINDAT/CLARIAH-CZ project funded by the Ministry
of Education, Youth and Sports of the Czech Republic (project LM2023062).
### Publications
- Milan Straka, Jakub Náplava, Jana Straková, David Samuel (2021): [RobeCzech: Czech RoBERTa, a monolingual contextualized language representation model](https://doi.org/10.1007/978-3-030-83527-9_17). In: Text, Speech, and Dialogue. TSD 2021. Lecture Notes in Computer Science, vol 12848. Springer, Cham.
- Jan Hajič, Eduard Bejček, Jaroslava Hlavačová, Marie Mikulová, Milan Straka, Jan Štěpánek, and Barbora Štěpánková (2020): [Prague Dependency Treebank - Consolidated 1.0](https://aclanthology.org/2020.lrec-1.641.pdf). In: Proceedings of the 12th Language Resources and Evaluation Conference, pages 5208-5218, Marseille, France. European Language Resources Association.
- Milan Straka (2018): [UDPipe 2.0 Prototype at CoNLL 2018 UD Shared Task](https://www.aclweb.org/anthology/K18-2020/). In: Proceedings of CoNLL 2018: The SIGNLL Conference on Computational Natural Language Learning, pp. 197-207, Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-948087-72-8
- Jana Straková, Milan Straka, and Jan Hajič (2014): [Open-Source Tools for Morphology, Lemmatization, POS Tagging and Named Entity Recognition](https://aclanthology.org/P14-5003/). In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 13-18, Baltimore, Maryland, June 2014. Association for Computational Linguistics.
### Model Performance
#### Tagging and Lemmatization
We evaluate tagging and lemmatization on the four datasets of PDT-C 1.0,
and we also compute a macro-average. For lemmatization, we use the following
metrics:
- `Lemmas`: a primary metric comparing the _lemma proper_, which is the lemma
with an optional lemma number (but we ignore the additional lemma comments
like “this is a given name”);
- `LemmasEM`: an exact match that also compares the lemma comments. This metric
is less than or equal to `Lemmas`. Our model directly predicts only the lemma
proper (no additional comments) and relies on the morphological dictionary to
supply the comments, so it fails to generate comments for unknown words (like an
unknown given name); see the sketch after this list.
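A sketch of the distinction between the two metrics, assuming the MorfFlex CZ
convention that lemma comments are technical suffixes introduced by an
underscore; the helper names and the example lemma are ours:

```python
def lemma_proper(lemma: str) -> str:
    """Strip lemma comments, keeping the lemma with its optional number,
    e.g. the made-up 'lemma-1_^(comment)' -> 'lemma-1'. The underscore
    convention is an assumption based on MorfFlex CZ lemma suffixes."""
    return lemma.split("_", 1)[0]

def lemma_accuracies(gold: list[str], system: list[str]) -> tuple[float, float]:
    """Return (Lemmas, LemmasEM) accuracies over aligned gold/system lemmas."""
    lemmas = sum(lemma_proper(g) == lemma_proper(s) for g, s in zip(gold, system))
    exact = sum(g == s for g, s in zip(gold, system))
    return lemmas / len(gold), exact / len(gold)
```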
We perform the evaluation using the
[udpipe2_eval.py](https://github.com/ufal/udpipe/blob/udpipe-2/udpipe2_eval.py) script,
which is a minor extension of the [CoNLL 2018 Shared
Task](https://universaldependencies.org/conll18/evaluation.html) evaluation
script.
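A sketch of invoking it, assuming the script keeps the gold-file-then-system-file
command-line interface of the CoNLL 2018 evaluation script; the file names are
placeholders:

```python
import subprocess

# Placeholder file names; assumed interface: gold CoNLL-U file, then system output.
subprocess.run(
    ["python", "udpipe2_eval.py", "gold.conllu", "system.conllu"],
    check=True,
)
```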
Because the model also includes a rule-based tokenizer and sentence splitter,
we evaluate in two modes:
- using raw input text, which must first be tokenized and split into sentences.
The resulting scores are in fact F1-scores. Note that the FAUST dataset does
not contain any discernible sentence boundaries.
- using gold tokenization.
| Treebank | Mode              | Tokens | Sents | XPOS  | Lemmas | LemmasEM |
|:---------|:------------------|-------:|------:|------:|-------:|---------:|
| PDT      | Raw text          |  99.91 | 88.00 | 98.69 |  99.10 |    98.86 |
| PDT      | Gold tokenization |      — |     — | 98.78 |  99.19 |    98.96 |
| PCEDT    | Raw text          |  99.97 | 94.06 | 98.77 |  99.36 |    98.75 |
| PCEDT    | Gold tokenization |      — |     — | 98.80 |  99.40 |    98.78 |
| PDTSC    | Raw text          | 100.00 | 98.31 | 98.77 |  99.23 |    99.16 |
| PDTSC    | Gold tokenization |      — |     — | 98.77 |  99.23 |    99.16 |
| FAUST    | Raw text          | 100.00 | 10.98 | 97.05 |  98.88 |    98.43 |
| FAUST    | Gold tokenization |      — |     — | 97.42 |  98.78 |    98.30 |
| MacroAvg | Gold tokenization |      — |     — | 98.44 |  99.15 |    98.80 |
#### Dependency Parsing
In PDT-C 1.0, the only manually annotated dependency parsing dataset is a subset
of the PDT dataset. We perform the evaluation as in the previous section.
| Treebank   | Mode              | Tokens | Sents | XPOS  | Lemmas | LemmasEM | UAS   | LAS   |
|:-----------|:------------------|-------:|------:|------:|-------:|---------:|------:|------:|
| PDT subset | Raw text          |  99.94 | 88.49 | 98.74 |  99.16 |    98.97 | 93.45 | 90.32 |
| PDT subset | Gold tokenization |      — |     — | 98.81 |  99.23 |    99.03 | 94.41 | 91.48 |