metadata

library_name: transformers
language:
  - uz
license: mit
base_model: FacebookAI/xlm-roberta-large
tags:
  - generated_from_trainer
datasets:
  - risqaliyevds/uzbek_ner
metrics:
  - precision
  - recall
  - f1
  - accuracy
model-index:
  - name: Uzbek NER model
    results: []

Uzbek NER model

This model is a fine-tuned version of FacebookAI/xlm-roberta-large on the risqaliyevds/uzbek_ner dataset for Uzbek named entity recognition (NER). It achieves the following results on the evaluation set:

Loss: 0.1754
Precision: 0.5848
Recall: 0.6313
F1: 0.6071
Accuracy: 0.9386
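
As a quick sanity check, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded values reported above. Because the inputs are rounded to four decimals, the result may differ from the reported F1 in the last digit. The helper name `f1_from_pr` is illustrative and not part of any library.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values copied from the evaluation results above.
reported_precision = 0.5848
reported_recall = 0.6313

f1 = f1_from_pr(reported_precision, reported_recall)
print(f"{f1:.3f}")  # 0.607
```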

Model description

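XLM-RoBERTa-large with a token-classification head, fine-tuned for Uzbek NER. The label mapping in the Usage section follows the BIO scheme and covers the entity types CARDINAL, DATE, EVENT, GPE, LOC, MONEY, ORDINAL, ORG, PERCENT, PERSON, and TIME. More information needed.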

Intended uses & limitations

More information needed

Training and evaluation data

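The card metadata lists risqaliyevds/uzbek_ner as the dataset. Split sizes and preprocessing are not documented; more information needed.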

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 1e-05
train_batch_size: 8
eval_batch_size: 8
seed: 42
gradient_accumulation_steps: 8
total_train_batch_size: 64
optimizer: AdamW (OptimizerNames.ADAMW_TORCH) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
lr_scheduler_type: cosine_with_restarts
lr_scheduler_warmup_ratio: 0.08
num_epochs: 3
mixed_precision_training: Native AMP
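
The hyperparameters above can be written as keyword arguments for `transformers.TrainingArguments`. This is a hypothetical reconstruction, since the original training script is not part of this card. The `fp16` flag and the single-device setup are assumptions inferred from "Native AMP" and from the reported total batch size. The sketch also shows how the total train batch size of 64 follows from the per-device batch size and gradient accumulation.

```python
# Hypothetical reconstruction of the training configuration; pass these as
# TrainingArguments(output_dir=..., **training_kwargs) in a real script.
training_kwargs = {
    "learning_rate": 1e-5,
    "per_device_train_batch_size": 8,
    "per_device_eval_batch_size": 8,
    "seed": 42,
    "gradient_accumulation_steps": 8,
    "optim": "adamw_torch",
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "cosine_with_restarts",
    "warmup_ratio": 0.08,
    "num_train_epochs": 3,
    "fp16": True,  # assumption: "Native AMP" was enabled through fp16
}

num_devices = 1  # assumption: training ran on a single device

# Effective batch size per optimizer step.
effective_batch_size = (
    training_kwargs["per_device_train_batch_size"]
    * training_kwargs["gradient_accumulation_steps"]
    * num_devices
)
print(effective_batch_size)  # 64
```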

Training results

Training Loss	Epoch	Step	Validation Loss	Precision	Recall	F1	Accuracy
0.2474	0.4662	100	0.2283	0.4911	0.5164	0.5035	0.9284
0.2039	0.9324	200	0.1942	0.5495	0.5836	0.5661	0.9345
0.1949	1.3963	300	0.1855	0.5591	0.6348	0.5945	0.9359
0.19	1.8625	400	0.1800	0.5604	0.6279	0.5922	0.9361
0.1769	2.3263	500	0.1761	0.5806	0.6262	0.6025	0.9381
0.1765	2.7925	600	0.1754	0.5849	0.6311	0.6071	0.9386
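
To compare checkpoints, the rows of the table above can be loaded into a small list and ranked. The sketch below copies the step, validation loss, precision, recall, and F1 columns and picks the best step by F1 and by validation loss. Whether the published weights correspond to that step is not stated in this card.

```python
# (step, validation_loss, precision, recall, f1) copied from the table above.
rows = [
    (100, 0.2283, 0.4911, 0.5164, 0.5035),
    (200, 0.1942, 0.5495, 0.5836, 0.5661),
    (300, 0.1855, 0.5591, 0.6348, 0.5945),
    (400, 0.1800, 0.5604, 0.6279, 0.5922),
    (500, 0.1761, 0.5806, 0.6262, 0.6025),
    (600, 0.1754, 0.5849, 0.6311, 0.6071),
]

best_by_f1 = max(rows, key=lambda r: r[4])      # highest F1
best_by_loss = min(rows, key=lambda r: r[1])    # lowest validation loss
best_by_recall = max(rows, key=lambda r: r[3])  # highest recall

print(best_by_f1[0], best_by_loss[0], best_by_recall[0])  # 600 600 300
```

Recall peaks at step 300, while F1 and validation loss keep improving through step 600.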

Framework versions

Transformers 4.49.0
PyTorch 2.5.1+cu124
Datasets 3.3.2
Tokenizers 0.21.0

Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Label mapping used to decode predictions (ids 0-21, BIO scheme).
custom_id2label = {
    0: "O", 1: "B-CARDINAL", 2: "I-CARDINAL", 3: "B-DATE", 4: "I-DATE",
    5: "B-EVENT", 6: "I-EVENT", 7: "B-GPE", 8: "I-GPE", 9: "B-LOC", 10: "I-LOC",
    11: "B-MONEY", 12: "I-MONEY", 13: "B-ORDINAL", 14: "B-ORG", 15: "I-ORG",
    16: "B-PERCENT", 17: "I-PERCENT", 18: "B-PERSON", 19: "I-PERSON",
    20: "B-TIME", 21: "I-TIME",
}
custom_label2id = {v: k for k, v in custom_id2label.items()}

model_name = "mustafoyev202/roberta-uz"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=23)
model.config.id2label = custom_id2label
model.config.label2id = custom_label2id

text = "Tesla kompaniyasi AQSHda joylashgan."
words = text.split()  # Whitespace split, for simplicity
tokens = tokenizer(words, return_tensors="pt", is_split_into_words=True)

with torch.no_grad():
    logits = model(**tokens).logits
predicted_token_class_ids = logits.argmax(-1)[0].tolist()

# Keep the label predicted for the first sub-token of each word.
word_predictions = {}
previous_word_id = None
for i, word_id in enumerate(tokens.word_ids()):
    if word_id is not None:
        if word_id != previous_word_id:  # New word
            # The mapping covers ids 0-21 while the model is loaded with
            # num_labels=23, so fall back to "O" for any unmapped id.
            word_predictions[word_id] = custom_id2label.get(predicted_token_class_ids[i], "O")
        previous_word_id = word_id

final_predictions = [(word, word_predictions.get(i, "O")) for i, word in enumerate(words)]
print("Predictions:")
for word, label in final_predictions:
    print(f"{word}: {label}")

# Loss against the model's own predictions: a smoke test of the forward pass
# with labels, not an evaluation metric.
labels = torch.tensor([predicted_token_class_ids])  # Shape: (1, sequence_length)
with torch.no_grad():
    loss = model(**tokens, labels=labels).loss
print("\nLoss:", round(loss.item(), 2))
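
The usage code prints one BIO label per word. A common follow-up step is to merge consecutive B-/I- labels into entity spans. The helper below is a minimal sketch of that step, assuming `(word, label)` pairs shaped like `final_predictions`. The name `group_entities` is illustrative, and the example labels are made up for demonstration; they are not actual model output.

```python
def group_entities(tagged_words):
    """Merge (word, BIO-label) pairs into (entity_text, entity_type) spans."""
    entities = []
    current_words, current_type = [], None

    def flush():
        # Store the entity collected so far, if any.
        if current_words:
            entities.append((" ".join(current_words), current_type))

    for word, label in tagged_words:
        prefix, _, ent_type = label.partition("-")
        if prefix == "B" or (prefix == "I" and ent_type != current_type):
            # Start a new entity; a stray I- tag is treated as a beginning.
            flush()
            current_words, current_type = [word], ent_type
        elif prefix == "I":
            # Continue the current entity.
            current_words.append(word)
        else:
            # "O" (or anything unexpected) closes the current entity.
            flush()
            current_words, current_type = [], None
    flush()
    return entities


# Made-up labels for illustration only.
example = [
    ("Tesla", "B-ORG"),
    ("kompaniyasi", "O"),
    ("AQSHda", "B-GPE"),
    ("joylashgan.", "O"),
]
print(group_entities(example))  # [('Tesla', 'ORG'), ('AQSHda', 'GPE')]
```

Treating a stray I- tag as the start of a new entity is a lenient design choice; a stricter variant could drop such tags instead.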