---
license: lgpl-3.0
base_model: sdadas/polish-roberta-base-v2
tags:
  - generated_from_trainer
datasets:
  - nkjp1m
metrics:
  - precision
  - recall
  - f1
  - accuracy
model-index:
  - name: polish-roberta-base-v2-cposes-tagging
    results:
      - task:
          name: Token Classification
          type: token-classification
        dataset:
          name: nkjp1m
          type: nkjp1m
          config: nkjp1m
          split: test
          args: nkjp1m
        metrics:
          - name: Precision
            type: precision
            value: 0.9913009231909743
          - name: Recall
            type: recall
            value: 0.9912435137138621
          - name: F1
            type: f1
            value: 0.9912722176212015
          - name: Accuracy
            type: accuracy
            value: 0.9889172310669364
widget:
  - text: Niosę dwa miedziane leje
  - text: Ale dzisiaj leje
language:
  - pl
---

polish-roberta-base-v2-cposes-tagging

This model is a fine-tuned version of sdadas/polish-roberta-base-v2 on the nkjp1m dataset. It achieves the following results on the evaluation set:

  • Loss: 0.0458
  • Precision: 0.9913
  • Recall: 0.9912
  • F1: 0.9913
  • Accuracy: 0.9889

You can find the training notebook here: https://github.com/WikKam/roberta-pos-finetuning

Usage

from transformers import pipeline

# load the tagger straight from the Hugging Face Hub
nlp = pipeline("token-classification", "wkaminski/polish-roberta-base-v2-cposes-tagging")

nlp("Ale dzisiaj leje")
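
The pipeline returns one prediction per (sub)token. If you want word-level tags, you can let the pipeline aggregate subword pieces. A minimal sketch follows; aggregation_strategy and the printed fields are standard transformers pipeline options, not something specific to this model:

from transformers import pipeline

nlp = pipeline(
    "token-classification",
    "wkaminski/polish-roberta-base-v2-cposes-tagging",
    aggregation_strategy="simple",  # merge subword pieces into word-level spans
)

for pred in nlp("Ale dzisiaj leje"):
    # each entry holds the text span, its coarse POS tag and a confidence score
    print(pred["word"], pred["entity_group"], round(pred["score"], 3))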

Model description

This model is a coarse part-of-speech tagger for Polish based on sdadas/polish-roberta-base-v2. It supports 13 classes representing coarse parts of speech:

{
 0: 'A',
 1: 'Adv',
 2: 'Comp',
 3: 'Conj',
 4: 'Dig',
 5: 'Interj',
 6: 'N',
 7: 'Num',
 8: 'Part',
 9: 'Prep',
 10: 'Punct',
 11: 'V',
 12: 'X'
}
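
If you need this mapping programmatically, it is stored in the model config on the Hub and can be read with the standard AutoConfig API:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("wkaminski/polish-roberta-base-v2-cposes-tagging")
print(config.id2label)  # should match the mapping above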

The tag meanings are the same as in the nkjp1m dataset:

| Tag | Description in English | Description in Polish | Example in Polish |
|-----|------------------------|-----------------------|-------------------|
| A | Adjective | przymiotnik | szybki |
| Adv | Adverb | przysłówek | szybko |
| Comp | Comparative / Complementizer | stopień porównawczy / spójnik podrzędny | lepszy / że |
| Conj | Conjunction | spójnik | i |
| Dig | Digit | cyfra | 5, 3 |
| Interj | Interjection | wykrzyknik | och! |
| N | Noun | rzeczownik | dom |
| Num | Numeral | liczebnik | jeden |
| Part | Particle | partykuła | by |
| Prep | Preposition | przyimek | w |
| Punct | Punctuation | interpunkcja | ., !, ? |
| V | Verb | czasownik | biegać |
| X | Unknown / Other | niesklasyfikowane | xxx |

Intended uses & limitations

Even though good tools for POS tagging in Polish already exist (http://morfeusz.sgjp.pl/), I needed a Polish POS tagger that could easily be loaded inside the browser. Hugging Face supports such functionality, which is why I created this model.

Training and evaluation data

The model was trained on half of the test data of the nkjp1m dataset (~0.5 million tokens).
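
For illustration, a minimal sketch of such a split with the datasets library; the dataset id follows the metadata above, while the exact split recipe and seed are assumptions, with the authoritative version in the linked notebook:

from datasets import load_dataset

nkjp = load_dataset("nkjp1m", split="test")
halves = nkjp.train_test_split(test_size=0.5, seed=42)  # seed is illustrative
train_data, eval_data = halves["train"], halves["test"]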

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-05
  • train_batch_size: 16
  • eval_batch_size: 16
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 3
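
For reference, a sketch of how these values map onto transformers TrainingArguments; the output_dir name is illustrative, and Adam betas=(0.9, 0.999) with epsilon=1e-08 are the library defaults, so they need no explicit arguments:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="polish-roberta-base-v2-cposes-tagging",  # illustrative name
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=3,
)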

Training results

| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy |
|---------------|-------|------|-----------------|-----------|--------|--------|----------|
| 0.0471 | 1.0 | 2155 | 0.0491 | 0.9896 | 0.9900 | 0.9898 | 0.9873 |
| 0.0291 | 2.0 | 4310 | 0.0467 | 0.9901 | 0.9905 | 0.9903 | 0.9884 |
| 0.0191 | 3.0 | 6465 | 0.0458 | 0.9913 | 0.9912 | 0.9913 | 0.9889 |

Framework versions

  • Transformers 4.35.2
  • Pytorch 2.1.0+cu118
  • Datasets 2.15.0
  • Tokenizers 0.15.0