# XLM-RoBERTa for Kabardian Part-of-Speech Tagging

## Model description
This model is a fine-tuned version of panagoa/xlm-roberta-base-kbd on the panagoa/kbd-pos-tags dataset. It is designed to perform Part-of-Speech (POS) tagging for text in the Kabardian language (kbd).
The model identifies 17 different POS tags, listed below (a snippet for inspecting this mapping in the model configuration follows the table):
| Tag | Description | Examples |
|---|---|---|
| ADJ | Adjective | хужь (white), къабзэ (clean) |
| ADP | Adposition | щхьэкIэ (for), папщIэ (because of) |
| ADV | Adverb | псынщIэу (quickly), жыжьэу (far) |
| AUX | Auxiliary | хъунщ (will be), щытащ (was) |
| CCONJ | Coordinating conjunction | икIи (and), ауэ (but) |
| DET | Determiner | мо (that), мыпхуэдэ (this kind) |
| INTJ | Interjection | уэлэхьи (by God), зиунагъуэрэ (oh my) |
| NOUN | Noun | унэ (house), щIалэ (boy) |
| NUM | Numeral | зы (one), тIу (two) |
| PART | Particle | мы (this), а (that) |
| PRON | Pronoun | сэ (I), уэ (you) |
| PROPN | Proper noun | Мурат (Murat), Налшык (Nalchik) |
| PUNCT | Punctuation | . (period), , (comma) |
| SCONJ | Subordinating conjunction | щхьэкIэ (because), щыгъуэ (when) |
| SYM | Symbol | % (percent), $ (dollar) |
| VERB | Verb | мэкIуэ (goes), матхэ (writes) |
| X | Other | - |
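The same inventory is stored in the model's configuration, so the mapping can be read directly from the checkpoint; this snippet assumes it ships the standard `id2label` mapping used by `transformers` token-classification models:

```python
from transformers import AutoModelForTokenClassification

# Load the fine-tuned tagger and print its index-to-tag mapping.
model = AutoModelForTokenClassification.from_pretrained(
    "panagoa/xlm-roberta-base-kbd-pos-tagger"
)
for idx, tag in sorted(model.config.id2label.items()):
    print(idx, tag)
```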
## Intended Use
This model is intended for:
- Linguistic analysis of Kabardian text
- Natural language processing pipelines for Kabardian
- Research on low-resource languages
- Educational purposes for teaching Kabardian grammar
## Training Data
The model was trained on the panagoa/kbd-pos-tags dataset, which contains 82,925 tagged sentences in Kabardian. The dataset has the following tag distribution (a sketch for recomputing these counts follows the list):
- VERB: 116,377 (30.0%)
- NOUN: 115,232 (29.7%)
- PRON: 63,827 (16.5%)
- ADV: 35,036 (9.0%)
- ADJ: 20,817 (5.4%)
- PROPN: 18,692 (4.8%)
- DET: 6,830 (1.8%)
- CCONJ: 6,098 (1.6%)
- ADP: 4,793 (1.2%)
- PUNCT: 4,752 (1.2%)
- NUM: 4,741 (1.2%)
- INTJ: 2,787 (0.7%)
- PART: 2,241 (0.6%)
- SCONJ: 1,206 (0.3%)
- AUX: 560 (0.1%)
- X: 273 (0.1%)
- SYM: 7 (<0.1%)
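These counts can be recomputed with the `datasets` library. The sketch below is illustrative only; in particular, the column name `tags` is an assumption and may differ from the actual dataset schema:

```python
from collections import Counter
from datasets import load_dataset

# Load the tagged corpus; the "tags" column name is assumed here.
ds = load_dataset("panagoa/kbd-pos-tags", split="train")

counts = Counter()
for example in ds:
    counts.update(example["tags"])  # one POS tag per word

total = sum(counts.values())
for tag, n in counts.most_common():
    print(f"{tag}: {n} ({n / total:.1%})")
```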
## Training Procedure
The model was trained with the following configuration (a Trainer-style sketch follows the list):
- Base model: panagoa/xlm-roberta-base-kbd
- Learning rate: 2e-5
- Batch size: 32
- Epochs: 3
- Weight decay: 0.01
- Class weights: Applied to handle class imbalance
- Maximum sequence length: 128
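A minimal sketch of how these hyperparameters map onto `TrainingArguments`; the original training script is not published, so treat this as an approximation rather than the exact setup:

```python
from transformers import TrainingArguments

# Hyperparameters from the list above; remaining arguments keep their defaults.
# The 128-token maximum sequence length is applied at tokenization time
# (tokenizer(..., truncation=True, max_length=128)), not here.
training_args = TrainingArguments(
    output_dir="xlm-roberta-base-kbd-pos-tagger",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.01,
)
```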
Class weights were computed in inverse proportion to class frequencies to address the dataset imbalance, so that rare tags carry more weight during training.
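One common way to apply such weights with the `Trainer` API is to override `compute_loss` and pass the weights to the cross-entropy loss. This is a hedged sketch of that pattern, not necessarily the exact procedure used for this model:

```python
import torch
from torch import nn
from transformers import Trainer

def inverse_frequency_weights(tag_counts):
    """Weight each class by total / (num_classes * count), so rare tags count more."""
    counts = torch.tensor(tag_counts, dtype=torch.float)
    return counts.sum() / (len(counts) * counts)

class WeightedLossTrainer(Trainer):
    def __init__(self, *args, class_weights=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.class_weights = class_weights

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss_fct = nn.CrossEntropyLoss(
            weight=self.class_weights.to(outputs.logits.device),
            ignore_index=-100,  # padding and non-first subword positions
        )
        loss = loss_fct(
            outputs.logits.view(-1, outputs.logits.size(-1)),
            labels.view(-1),
        )
        return (loss, outputs) if return_outputs else loss
```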
## Evaluation Results
The model achieved the following performance on a validation set (20% of the data):
- Overall accuracy: ~85%
- Performance varies across POS tags, with better results on common tags like NOUN and VERB (see the sketch below for computing per-tag metrics)
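Per-tag precision, recall, and F1 can be inspected with, for example, scikit-learn's `classification_report` on word-aligned gold and predicted tags; the tiny lists below are placeholders, not real evaluation data:

```python
from sklearn.metrics import classification_report

# In practice these would be flat, word-aligned tag lists collected by running
# the tagger over the 20% validation split; the values here are illustrative.
gold_tags = ["NOUN", "VERB", "NOUN", "VERB", "ADJ"]
pred_tags = ["NOUN", "VERB", "ADJ", "VERB", "ADJ"]

print(classification_report(gold_tags, pred_tags, zero_division=0))
```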
## Limitations
- The model may struggle with rare POS tags (like SYM) due to limited examples in the training data
- Performance may vary with dialectal variations or non-standard Kabardian text
- The model has a context window limitation of 128 tokens
- Ambiguous words may be tagged incorrectly when the surrounding context does not disambiguate them
## Usage Example
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("panagoa/xlm-roberta-base-kbd-pos-tagger")
model = AutoModelForTokenClassification.from_pretrained("panagoa/xlm-roberta-base-kbd-pos-tagger")


def predict_pos_tags(text, model, tokenizer):
    # Split text into words if it's a string
    if isinstance(text, str):
        text = text.split()

    # Determine device and move the model to it
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)

    # Tokenize the pre-split words into subword tokens
    encoded_input = tokenizer(
        text,
        truncation=True,
        is_split_into_words=True,
        return_tensors="pt",
    )

    # Move inputs to the same device as the model
    inputs = {k: v.to(device) for k, v in encoded_input.items()}

    # Get predictions
    with torch.no_grad():
        outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=2)

    # Keep one tag per word: take the prediction of each word's first subword
    word_ids = encoded_input.word_ids()
    previous_word_idx = None
    predicted_tags = []
    for idx, word_idx in enumerate(word_ids):
        if word_idx is None:  # skip special tokens (<s>, </s>, padding)
            continue
        if word_idx != previous_word_idx:
            predicted_tags.append(model.config.id2label[predictions[0][idx].item()])
        previous_word_idx = word_idx

    return predicted_tags[:len(text)]


# Example usage
text = "Хъыджэбзыр щIэкIри фошыгъу къыхуихьащ"
words = text.split()
tags = predict_pos_tags(words, model, tokenizer)

# Print results
for word, tag in zip(words, tags):
    print(f"{word}: {tag}")
```

Expected output:

```
Хъыджэбзыр: NOUN
щIэкIри: VERB
фошыгъу: NOUN
къыхуихьащ: VERB
```
## Author
This model was trained by panagoa and contributed to the Hugging Face community to support NLP research and applications for the Kabardian language.
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{panagoa2025kabardianpos,
  author       = {Panagoa},
  title        = {XLM-RoBERTa for Kabardian Part-of-Speech Tagging},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/panagoa/xlm-roberta-base-kbd-pos-tagger}}
}
```