XLM-RoBERTa for Kabardian Part-of-Speech Tagging

Model description

This model is a fine-tuned version of panagoa/xlm-roberta-base-kbd on the panagoa/kbd-pos-tags dataset. It is designed to perform Part-of-Speech (POS) tagging for text in the Kabardian language (kbd).

The model identifies 17 different POS tags:

Tag      Description                  Examples
ADJ      Adjective                    хужь (white), къабзэ (clean)
ADP      Adposition                   щхьэкIэ (for), папщIэ (because of)
ADV      Adverb                       псынщIэу (quickly), жыжьэу (far)
AUX      Auxiliary                    хъунщ (will be), щытащ (was)
CCONJ    Coordinating conjunction     икIи (and), ауэ (but)
DET      Determiner                   мо (that), мыпхуэдэ (this kind)
INTJ     Interjection                 уэлэхьи (by God), зиунагъуэрэ (oh my)
NOUN     Noun                         унэ (house), щIалэ (boy)
NUM      Numeral                      зы (one), тIу (two)
PART     Particle                     мы (this), а (that)
PRON     Pronoun                      сэ (I), уэ (you)
PROPN    Proper noun                  Мурат (Murat), Налшык (Nalchik)
PUNCT    Punctuation                  . (period), , (comma)
SCONJ    Subordinating conjunction    щхьэкIэ (because), щыгъуэ (when)
SYM      Symbol                       % (percent), $ (dollar)
VERB     Verb                         мэкIуэ (goes), матхэ (writes)
X        Other                        -
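
The same tag inventory is stored in the fine-tuned checkpoint's configuration and can be listed programmatically. A minimal sketch, assuming only the standard id2label mapping that token-classification checkpoints carry:

from transformers import AutoConfig

# Load just the configuration of the fine-tuned checkpoint
config = AutoConfig.from_pretrained("panagoa/xlm-roberta-base-kbd-pos-tagger")

# id2label maps class indices to the 17 POS tags listed above
for idx, label in sorted(config.id2label.items()):
    print(idx, label)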

Intended Use

This model is intended for:

  • Linguistic analysis of Kabardian text
  • Natural language processing pipelines for Kabardian
  • Research on low-resource languages
  • Educational purposes for teaching Kabardian grammar

Training Data

The model was trained on the panagoa/kbd-pos-tags dataset, which contains 82,925 tagged sentences in Kabardian. The dataset shows the following tag distribution:

  • VERB: 116,377 (30.0%)
  • NOUN: 115,232 (29.7%)
  • PRON: 63,827 (16.5%)
  • ADV: 35,036 (9.0%)
  • ADJ: 20,817 (5.4%)
  • PROPN: 18,692 (4.8%)
  • DET: 6,830 (1.8%)
  • CCONJ: 6,098 (1.6%)
  • ADP: 4,793 (1.2%)
  • PUNCT: 4,752 (1.2%)
  • NUM: 4,741 (1.2%)
  • INTJ: 2,787 (0.7%)
  • PART: 2,241 (0.6%)
  • SCONJ: 1,206 (0.3%)
  • AUX: 560 (0.1%)
  • X: 273 (0.1%)
  • SYM: 7 (<0.1%)
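
For reference, a hedged sketch of loading the dataset with the datasets library; the split and column names are not specified in this card, so check the dataset card for the actual schema:

from datasets import load_dataset

# Load the POS-tagging dataset from the Hugging Face Hub
dataset = load_dataset("panagoa/kbd-pos-tags")

# Inspect the available splits and columns (the "train" split name is an
# assumption, not a detail confirmed by this card)
print(dataset)
print(dataset["train"][0])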

Training Procedure

The model was trained with the following configuration:

  • Base model: panagoa/xlm-roberta-base-kbd
  • Learning rate: 2e-5
  • Batch size: 32
  • Epochs: 3
  • Weight decay: 0.01
  • Class weights: Applied to handle class imbalance
  • Maximum sequence length: 128
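
A compact sketch of the same setup expressed with the Trainer API; the hyperparameters mirror the list above, while the output path and anything not listed there are illustrative assumptions:

from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("panagoa/xlm-roberta-base-kbd")
model = AutoModelForTokenClassification.from_pretrained(
    "panagoa/xlm-roberta-base-kbd",
    num_labels=17,  # the 17 POS tags listed above
)

training_args = TrainingArguments(
    output_dir="xlm-roberta-base-kbd-pos-tagger",  # illustrative path
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.01,
)

These arguments can be combined with the weighted-loss Trainer sketched after the next paragraph.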

Class weights were calculated inversely proportional to the class frequencies to address the imbalance in the dataset, with rare tags given higher importance during training.
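
A sketch of how such inverse-frequency weights could be derived from the counts above and applied through a Trainer subclass; the ordering of tag_counts is illustrative and would need to match model.config.label2id, and the -100 ignore index is the usual token-classification padding convention, assumed here:

import torch
from torch import nn
from transformers import Trainer

# Token counts per tag, taken from the dataset statistics above.
# NOTE: the order shown here simply mirrors that list and must be
# rearranged to match model.config.label2id in practice.
tag_counts = torch.tensor([
    116377, 115232, 63827, 35036, 20817, 18692, 6830, 6098,
    4793, 4752, 4741, 2787, 2241, 1206, 560, 273, 7
], dtype=torch.float)

# Inverse-frequency weights, normalized so the average weight is 1.0
class_weights = tag_counts.sum() / (len(tag_counts) * tag_counts)

class WeightedLossTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        # Weighted cross-entropy; -100 marks padding/subword positions to ignore
        loss_fct = nn.CrossEntropyLoss(
            weight=class_weights.to(logits.device), ignore_index=-100
        )
        loss = loss_fct(logits.view(-1, model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss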

Evaluation Results

The model achieved the following performance on a validation set (20% of the data):

  • Overall accuracy: ~85%
  • Performance varies across different POS tags, with better results on common tags like NOUN and VERB.
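
Per-tag results can be broken out with a standard classification report once gold and predicted tags are aligned word by word; scikit-learn is not part of this card and is used here purely as an illustration:

from sklearn.metrics import accuracy_score, classification_report

# gold_tags and predicted_tags are flat, word-aligned lists of tag strings
# collected over the held-out 20% split; the values below are placeholders
gold_tags = ["NOUN", "VERB", "NOUN", "VERB"]
predicted_tags = ["NOUN", "VERB", "ADJ", "VERB"]

print("accuracy:", accuracy_score(gold_tags, predicted_tags))
print(classification_report(gold_tags, predicted_tags, zero_division=0))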

Limitations

  • The model may struggle with rare POS tags (such as SYM) because they have very few examples in the training data
  • Performance may vary on dialectal variation or non-standard Kabardian text
  • The model was fine-tuned with a maximum sequence length of 128 tokens, so longer inputs are truncated (a chunking workaround is sketched after this list)
  • Ambiguous words may be tagged incorrectly when the surrounding context is not sufficient to disambiguate them
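
One workaround for the 128-token window is to tag long inputs in chunks and concatenate the per-chunk results. A minimal sketch that reuses the predict_pos_tags helper from the Usage Example section below; the 100-word chunk size is an arbitrary margin left for subword expansion, not a value from the training setup:

def tag_long_text(words, model, tokenizer, chunk_size=100):
    # Tag a long word list chunk by chunk so each chunk fits the 128-token
    # window; chunk_size is measured in words and kept below 128 to leave
    # room for subword splitting (100 is an illustrative choice)
    tags = []
    for start in range(0, len(words), chunk_size):
        chunk = words[start:start + chunk_size]
        tags.extend(predict_pos_tags(chunk, model, tokenizer))
    return tags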

Usage Example

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("panagoa/xlm-roberta-base-kbd-pos-tagger")
model = AutoModelForTokenClassification.from_pretrained("panagoa/xlm-roberta-base-kbd-pos-tagger")

# Define function for prediction
def predict_pos_tags(text, model, tokenizer):
    # Split text into words if it's a string
    if isinstance(text, str):
        text = text.split()
        
    # Determine device
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    model = model.to(device)
    
    # Tokenize input text
    encoded_input = tokenizer(
        text,
        truncation=True,
        is_split_into_words=True,
        return_tensors="pt"
    )
    
    # Move inputs to the same device
    inputs = {k: v.to(device) for k, v in encoded_input.items()}
    
    # Get predictions
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.argmax(outputs.logits, dim=2)
    
    # Map to POS tags
    word_ids = encoded_input.word_ids()
    previous_word_idx = None
    predicted_tags = []
    
    for idx, word_idx in enumerate(word_ids):
        # Special tokens (<s>, </s>) have no word index; skip them and keep
        # only the prediction for the first subword of each word
        if word_idx is not None and word_idx != previous_word_idx:
            predicted_tags.append(model.config.id2label[predictions[0][idx].item()])
        previous_word_idx = word_idx
    
    return predicted_tags

# Example usage
text = "Хъыджэбзыр щIэкIри фошыгъу къыхуихьащ"
words = text.split()
tags = predict_pos_tags(words, model, tokenizer)

# Print results
for word, tag in zip(words, tags):
    print(f"{word}: {tag}")

Expected output:

Хъыджэбзыр: NOUN
щIэкIри: VERB
фошыгъу: NOUN
къыхуихьащ: VERB

Author

This model was trained by panagoa and contributed to the Hugging Face community to support NLP research and applications for the Kabardian language.

Citation

If you use this model in your research, please cite:

@misc{panagoa2025kabardianpos,
  author = {Panagoa},
  title = {XLM-RoBERTa for Kabardian Part-of-Speech Tagging},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/panagoa/xlm-roberta-base-kbd-pos-tagger}}
}