A RoBERTa base model trained on the Azerbaijani subset of the OSCAR corpus as part of research on the application of text augmentation to low-resource languages. It was developed to enhance text classification in Azerbaijani, a low-resource language in the NLP domain, and was further fine-tuned on a labeled news dataset.

Training Data

The model was pre-trained on the Azerbaijani subset of the OSCAR corpus and fine-tuned on approximately 3 million sentences from the Azertag News Agency, covering diverse topics such as politics, economy, culture, sports, technology, and health.
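
For reference, the pre-training corpus can be pulled through the Hugging Face datasets library; the sketch below assumes the deduplicated Azerbaijani split of OSCAR available on the Hub.

from datasets import load_dataset

# Assumed identifier for the deduplicated Azerbaijani split of the OSCAR corpus on the Hub
oscar_az = load_dataset("oscar", "unshuffled_deduplicated_az", split="train")
print(oscar_az[0]["text"][:200])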

Citation

@article{ziyaden2024augmentation,
    title        = {Text data augmentation and pre-trained Language Model for enhancing text classification of low-resource languages},
    author       = {Ziyaden, Atabay and Yelenov, Amir and Hajiyev, Fuad and Rustamov, Samir and Pak, Alexandr},
    year         = 2024,
    journal      = {PeerJ Computer Science},
    doi          = {10.7717/peerj-cs.1974},
    url          = {https://doi.org/10.7717/peerj-cs.1974}
}

Usage

from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# AutoModelWithLMHead is deprecated; AutoModelForMaskedLM is the masked-LM equivalent for this checkpoint
tokenizer = AutoTokenizer.from_pretrained("iamdenay/roberta-azerbaijani")
model = AutoModelForMaskedLM.from_pretrained("iamdenay/roberta-azerbaijani")

# Fill-mask pipeline; the input sentence corresponds to the example output shown below
model_mask = pipeline('fill-mask', model='iamdenay/roberta-azerbaijani')
model_mask("azərtac xəbər <mask> ki")

Output

[{'sequence': 'azərtac xəbər verir ki',
  'score': 0.9791,
  'token': 1053,
  'token_str': 'verir'},
 {'sequence': 'azərtac xəbər verib ki',
  'score': 0.0044,
  'token': 2313,
  'token_str': 'verib'},
 ... ]
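
Because the model targets downstream text classification, it can also be loaded with a freshly initialized classification head; a minimal sketch follows, where the label count and the input sentence are placeholders.

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("iamdenay/roberta-azerbaijani")
# num_labels is a placeholder; set it to the number of classes in your labeled dataset
classifier = AutoModelForSequenceClassification.from_pretrained(
    "iamdenay/roberta-azerbaijani", num_labels=5
)

inputs = tokenizer("Azərbaycan iqtisadiyyatı inkişaf edir", return_tensors="pt")
logits = classifier(**inputs).logits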

Limitations

  • Language Specificity: The model is trained exclusively on Azerbaijani and may not generalize well to other languages.
  • Data Bias: The fine-tuning data is sourced from news articles, which may contain biases or specific journalistic styles.
  • Agglutinative Language Challenges: Azerbaijani's agglutinative nature can lead to sparsity in the word space due to numerous morphological variations; the tokenizer sketch after this list shows how such forms are split into subwords.
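
As a rough illustration of how the byte-pair tokenizer handles inflected forms, one can inspect the subword split of a heavily suffixed word; the word below is only an example, and the actual split depends on the learned vocabulary.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("iamdenay/roberta-azerbaijani")
# "kitablarımızdan" ("from our books") carries several suffixes; the split depends on the trained vocabulary
print(tokenizer.tokenize("kitablarımızdan"))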

Ethical Considerations

  • Content Sensitivity: The dataset may include sensitive topics. Users should ensure compliance with ethical standards when deploying the model.
  • Bias and Fairness: Be aware of potential biases in the training data that could affect model predictions.

Config

attention_probs_dropout_prob: 0.1
bos_token_id: 0
classifier_dropout: null
eos_token_id: 2
gradient_checkpointing: false
hidden_act: "gelu"
hidden_dropout_prob: 0.1
hidden_size: 768
initializer_range: 0.02
intermediate_size: 3072
layer_norm_eps: 1e-12
max_position_embeddings: 514
model_type: "roberta"
num_attention_heads: 12
num_hidden_layers: 6
pad_token_id: 1
position_embedding_type: "absolute"
torch_dtype: "float32"
transformers_version: "4.10.0"
type_vocab_size: 1
use_cache: true
vocab_size: 52000
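
These fields correspond to a standard RobertaConfig; a minimal sketch of rebuilding the configuration object from the values above is shown below.

from transformers import RobertaConfig

# Reconstructed from the values listed above
config = RobertaConfig(
    vocab_size=52000,
    hidden_size=768,
    num_hidden_layers=6,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=514,
    type_vocab_size=1,
    layer_norm_eps=1e-12,
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    pad_token_id=1,
    bos_token_id=0,
    eos_token_id=2,
)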