	---
	license: lgpl-3.0
	base_model: sdadas/polish-roberta-base-v2
	tags:
	- generated_from_trainer
	datasets:
	- nkjp1m
	metrics:
	- precision
	- recall
	- f1
	- accuracy
	model-index:
	- name: polish-roberta-base-v2-cposes-tagging
	  results:
	  - task:
	      name: Token Classification
	      type: token-classification
	    dataset:
	      name: nkjp1m
	      type: nkjp1m
	      config: nkjp1m
	      split: test
	      args: nkjp1m
	    metrics:
	    - name: Precision
	      type: precision
	      value: 0.9913009231909743
	    - name: Recall
	      type: recall
	      value: 0.9912435137138621
	    - name: F1
	      type: f1
	      value: 0.9912722176212015
	    - name: Accuracy
	      type: accuracy
	      value: 0.9889172310669364
	widget:
	- text: "Niosę dwa miedziane leje"
	- text: "Ale dzisiaj leje"
	language:
	- pl
	---


	# polish-roberta-base-v2-cposes-tagging

	This model is a fine-tuned version of [sdadas/polish-roberta-base-v2](https://huggingface.co/sdadas/polish-roberta-base-v2) on the nkjp1m dataset.
	It achieves the following results on the evaluation set:
	- Loss: 0.0458
	- Precision: 0.9913
	- Recall: 0.9912
	- F1: 0.9913
	- Accuracy: 0.9889
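F1 is the harmonic mean of precision and recall, so the reported value can be checked against the unrounded precision and recall listed in this card's metadata. The sketch below is only a sanity check; the helper name `f1_score` is illustrative and is not taken from the training notebook.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Unrounded values from the model card metadata.
precision = 0.9913009231909743
recall = 0.9912435137138621

f1 = f1_score(precision, recall)
print(f"{f1:.4f}")  # 0.9913
```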

	You can find the training notebook here: https://github.com/WikKam/roberta-pos-finetuning

	## Usage

	```python
	from transformers import pipeline

	nlp = pipeline("token-classification", "wkaminski/polish-roberta-base-v2-cposes-tagging")

	nlp("Ale dzisiaj leje")
	```
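The pipeline returns one prediction per sub-word token, each a dictionary with keys such as `entity`, `word`, `start`, and `end`. A minimal sketch of collapsing these into one tag per whitespace-separated word follows. The helper `tags_per_word` is hypothetical, and `example_predictions` is hand-written to illustrate the output shape; it is not real model output, so the tags and sub-word splits shown are assumptions.

```python
import re


def tags_per_word(text, predictions):
    """Give each whitespace-separated word the tag of its first overlapping sub-token."""
    result = []
    for match in re.finditer(r"\S+", text):
        w_start, w_end = match.span()
        # First prediction whose character span overlaps this word, if any.
        tag = next(
            (p["entity"] for p in predictions
             if p["start"] < w_end and p["end"] > w_start),
            None,
        )
        result.append((match.group(), tag))
    return result


# Hand-written stand-in for `nlp("Ale dzisiaj leje")` (an assumption, not real output).
example_predictions = [
    {"entity": "Conj", "word": "Ale", "start": 0, "end": 3},
    {"entity": "Adv", "word": "dzi", "start": 4, "end": 7},
    {"entity": "Adv", "word": "siaj", "start": 7, "end": 11},
    {"entity": "V", "word": "leje", "start": 12, "end": 16},
]

print(tags_per_word("Ale dzisiaj leje", example_predictions))
```

Because the helper splits on whitespace only, punctuation attached to a word (for example `leje.`) receives the tag of the word's first sub-token. The pipeline also accepts an `aggregation_strategy` argument that can merge sub-word tokens for you.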

	## Model description

	This model is a coarse part-of-speech tagger for the Polish language, based on sdadas/polish-roberta-base-v2.
	It supports 13 classes, each representing a coarse part of speech:
	```
	{
	0: 'A',
	1: 'Adv',
	2: 'Comp',
	3: 'Conj',
	4: 'Dig',
	5: 'Interj',
	6: 'N',
	7: 'Num',
	8: 'Part',
	9: 'Prep',
	10: 'Punct',
	11: 'V',
	12: 'X'
	}
	```
	The tags have the same meaning as in the nkjp1m dataset:

	| Tag    | Description in English       | Description in Polish                   | Example in Polish |
	|--------|------------------------------|-----------------------------------------|-------------------|
	| A      | Adjective                    | przymiotnik                             | szybki            |
	| Adv    | Adverb                       | przysłówek                              | szybko            |
	| Comp   | Comparative / Complementizer | stopień porównawczy / spójnik podrzędny | lepszy / że       |
	| Conj   | Conjunction                  | spójnik                                 | i                 |
	| Dig    | Digit                        | cyfra                                   | 5, 3              |
	| Interj | Interjection                 | wykrzyknik                              | och!              |
	| N      | Noun                         | rzeczownik                              | dom               |
	| Num    | Numeral                      | liczebnik                               | jeden             |
	| Part   | Particle                     | partykuła                               | by                |
	| Prep   | Preposition                  | przyimek                                | w                 |
	| Punct  | Punctuation                  | interpunkcja                            | ., !, ?           |
	| V      | Verb                         | czasownik                               | biegać            |
	| X      | Unknown / Other              | niesklasyfikowane                       | xxx               |
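When working with raw model outputs instead of the pipeline, the id-to-tag mapping listed above can be inverted and used to decode the highest-scoring class for a token. This is a minimal sketch in plain Python; the names `ID2LABEL`, `LABEL2ID`, and `decode` are illustrative, and the logits in the example are invented.

```python
# Mapping copied from the model description above.
ID2LABEL = {
    0: "A", 1: "Adv", 2: "Comp", 3: "Conj", 4: "Dig", 5: "Interj", 6: "N",
    7: "Num", 8: "Part", 9: "Prep", 10: "Punct", 11: "V", 12: "X",
}
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}


def decode(token_logits):
    """Return the tag whose logit is largest for a single token."""
    best_id = max(range(len(token_logits)), key=token_logits.__getitem__)
    return ID2LABEL[best_id]


# Invented logits for one token: class 11 ("V") has the highest score.
example_logits = [0.1] * 11 + [4.2, 0.3]
print(decode(example_logits))  # V
```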

	## Intended uses & limitations

	Good tools for Polish POS tagging already exist (for example Morfeusz, http://morfeusz.sgjp.pl/), but I needed a Polish POS tagger that could be loaded easily in the browser. Hugging Face supports this, which is why I created this model.

	## Training and evaluation data

	The model was trained on half of the test data of the nkjp1m dataset (~0.5 million tokens).

	## Training procedure

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 2e-05
	- train_batch_size: 16
	- eval_batch_size: 16
	- seed: 42
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: linear
	- num_epochs: 3

	### Training results

	| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1     | Accuracy |
	|:-------------:|:-----:|:----:|:---------------:|:---------:|:------:|:------:|:--------:|
	| 0.0471        | 1.0   | 2155 | 0.0491          | 0.9896    | 0.9900 | 0.9898 | 0.9873   |
	| 0.0291        | 2.0   | 4310 | 0.0467          | 0.9901    | 0.9905 | 0.9903 | 0.9884   |
	| 0.0191        | 3.0   | 6465 | 0.0458          | 0.9913    | 0.9912 | 0.9913 | 0.9889   |


	### Framework versions

	- Transformers 4.35.2
	- PyTorch 2.1.0+cu118
	- Datasets 2.15.0
	- Tokenizers 0.15.0