---
datasets:
- oscar-corpus/OSCAR-2301
language:
- az
library_name: transformers
---
RoBERTa base model trained on the Azerbaijani subset of the OSCAR corpus as part of [research](https://peerj.com/articles/cs-1974/) on the application of text augmentation to low-resource languages.
It was developed to enhance text classification in Azerbaijani, a low-resource language in the NLP domain, and was further fine-tuned on a labeled news dataset.
## Training Data
The model was pre-trained on the Azerbaijani subset of the OSCAR corpus, and fine-tuned on approximately 3 million sentences from Azertag News Agency covering diverse topics such as politics, economy, culture, sports, technology, and health.
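Fine-tuning followed the standard `transformers` sequence-classification setup. The sketch below is only illustrative: the number of labels and the `train_dataset`/`eval_dataset` objects are placeholders, since the Azertag news data is not shipped with this checkpoint.
```python
# Hypothetical fine-tuning sketch; the news dataset and label set are placeholders.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("iamdenay/roberta-azerbaijani")
model = AutoModelForSequenceClassification.from_pretrained(
    "iamdenay/roberta-azerbaijani",
    num_labels=6,  # e.g. politics, economy, culture, sports, technology, health
)

def tokenize(batch):
    # Truncate to the model's 512-token window
    return tokenizer(batch["text"], truncation=True, max_length=512)

# train_dataset / eval_dataset are assumed to be `datasets.Dataset` objects
# with "text" and "label" columns prepared by the user.
args = TrainingArguments(output_dir="roberta-az-news",
                         num_train_epochs=3,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=train_dataset.map(tokenize, batched=True),
                  eval_dataset=eval_dataset.map(tokenize, batched=True),
                  tokenizer=tokenizer)
trainer.train()
```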
## Citation
```bibtex
@article{ziyaden2024augmentation,
  title   = {Text data augmentation and pre-trained Language Model for enhancing text classification of low-resource languages},
  author  = {Ziyaden, Atabay and Yelenov, Amir and Hajiyev, Fuad and Rustamov, Samir and Pak, Alexandr},
  year    = {2024},
  journal = {PeerJ Computer Science},
  doi     = {10.7717/peerj-cs.1974},
  url     = {https://doi.org/10.7717/peerj-cs.1974}
}
```
## Usage
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("iamdenay/roberta-azerbaijani")
model = AutoModelForMaskedLM.from_pretrained("iamdenay/roberta-azerbaijani")
```
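The mask can also be filled without the pipeline; a minimal sketch using the tokenizer and model loaded above (the example sentence is the one that produces the output shown below):
```python
import torch

text = "azərtac xəbər <mask> ki"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the <mask> position and take the top-5 predictions for it
mask_idx = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top5 = torch.topk(logits[0, mask_idx].softmax(dim=-1), k=5)
for score, token_id in zip(top5.values[0], top5.indices[0]):
    print(f"{tokenizer.decode(int(token_id)):>12s}  {score:.4f}")
```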
```python
from transformers import pipeline
model_mask = pipeline('fill-mask', model='iamdenay/roberta-azerbaijani')
model_mask("Le tweet <mask>.")
```
## Output
```python
[{'sequence': 'azərtac xəbər verir ki',
  'score': 0.9791,
  'token': 1053,
  'token_str': 'verir'},
 {'sequence': 'azərtac xəbər verib ki',
  'score': 0.0044,
  'token': 2313,
  'token_str': 'verib'},
 ...]
```
## Limitations
- Language Specificity: The model is trained exclusively on Azerbaijani and may not generalize well to other languages.
- Data Bias: The fine-tuning data is sourced from news articles, which may contain biases or specific journalistic styles.
- Agglutinative Language Challenges: Azerbaijani's agglutinative nature can lead to sparsity in the word space due to numerous morphological variations.
## Ethical Considerations
- Content Sensitivity: The dataset may include sensitive topics. Users should ensure compliance with ethical standards when deploying the model.
- Bias and Fairness: Be aware of potential biases in the training data that could affect model predictions.
## Config
```json
{
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.10.0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 52000
}
```
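For reference, a minimal sketch of rebuilding the same architecture from these values (only relevant when pre-training from scratch; `from_pretrained` already restores this configuration with the checkpoint):
```python
from transformers import RobertaConfig, RobertaForMaskedLM

config = RobertaConfig(
    vocab_size=52000,
    hidden_size=768,
    num_hidden_layers=6,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=514,
    layer_norm_eps=1e-12,
    type_vocab_size=1,
    pad_token_id=1,
    bos_token_id=0,
    eos_token_id=2,
)
model = RobertaForMaskedLM(config)  # randomly initialized weights
```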