Iranian Azerbaijani NLP Models
GitHub Repository: iranian-azerbaijani-nlp
Overview
This model card describes the natural language processing (NLP) models released with the paper accepted for publication at AACL 2023. The models support NLP tasks for the Iranian Azerbaijani language (ISO 639-3 code: azb). The repository includes the following models:
AzerBERT
- Type: BERT-based transformer language model
- Description: AzerBERT is a language model pre-trained specifically on Iranian Azerbaijani. It can be fine-tuned for downstream NLP tasks such as text classification and named entity recognition.
- Model Link: AzerBERT Model
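As a rough illustration of querying a masked language model such as AzerBERT, the sketch below uses the Hugging Face `transformers` fill-mask pipeline. It assumes the checkpoint is published in a `transformers`-compatible format; `model_id` and both helper names are placeholders introduced here, not identifiers taken from the repository.

```python
def format_predictions(predictions, top_k=3):
    """Reduce fill-mask pipeline output to (token, score) pairs."""
    return [
        (p["token_str"].strip(), round(float(p["score"]), 3))
        for p in predictions[:top_k]
    ]


def fill_mask(model_id, text, top_k=3):
    """Predict the masked token in `text`.

    `model_id` is a placeholder: pass the hub name or local path of the
    AzerBERT checkpoint. `text` must contain the tokenizer's mask token
    (assumed here to be "[MASK]", the BERT default).
    """
    # Imported lazily so the helper above works without the dependency.
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model=model_id)
    return format_predictions(unmasker(text, top_k=top_k), top_k)
```

Calling `fill_mask(model_id, "... [MASK] ...")` with a real checkpoint name returns the highest-scoring candidate tokens with their probabilities.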
Language Model-based Embedding (FastText)
- Type: FastText-based word embedding model
- Description: This model generates word embeddings for Iranian Azerbaijani words and phrases using the FastText framework.
- Model Link: FastText Embedding Model
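A common use of word embeddings is ranking words by cosine similarity. The sketch below shows that computation on made-up three-dimensional vectors. With the actual model, the vectors would come from the `fasttext` Python package (for example via `load_model` and `get_word_vector`), using whatever file the download provides. The placeholder words and values are illustrative only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for zero vectors
    return dot / (norm_u * norm_v)


def nearest(query, vectors, k=2):
    """Return the k words most similar to `query`, best first."""
    scores = [
        (word, cosine(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:k]


# Toy vectors standing in for real embeddings (hypothetical values).
toy_vectors = {
    "word_a": [1.0, 0.0, 0.0],
    "word_b": [0.9, 0.1, 0.0],
    "word_c": [0.0, 1.0, 0.0],
}

neighbours = nearest("word_a", toy_vectors, k=2)
```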
Text Classification Model (Fine-tuned with AzerBERT)
- Type: Fine-tuned BERT-based text classification model
- Description: This model is AzerBERT fine-tuned for text classification. It assigns a text to one of four categories: literature, sports, history, or geography.
- Model Link: Text Classification Model
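The classifier's output can be read as one score (logit) per category. A minimal sketch of turning logits into a label follows. The label order in `LABELS` is an assumption made for illustration, so check the `id2label` mapping in the model's configuration before relying on it.

```python
import math

# Assumed label order; the real mapping lives in the model's config.
LABELS = ["literature", "sports", "history", "geography"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits, labels=LABELS):
    """Return the most probable label and its probability."""
    if len(logits) != len(labels):
        raise ValueError("expected one logit per label")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Hypothetical logits for a single input text.
label, confidence = predict_label([0.1, 2.0, -1.0, 0.3])
```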
POS Tagger (Fine-tuned with AzerBERT)
- Type: Fine-tuned BERT-based Part-of-Speech (POS) tagging model
- Description: This model is AzerBERT fine-tuned for part-of-speech tagging of Iranian Azerbaijani text. It labels each token with one of 11 POS tags, a common preprocessing step for downstream NLP applications.
- Model Link: POS Tagger Model
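BERT-style tokenizers split words into subword pieces, so a token-classification model emits one tag per piece rather than one per word. One common convention, sketched below, keeps the tag predicted for the first piece of each word. The `word_ids` list mirrors what Hugging Face fast tokenizers return from `word_ids()` (with `None` for special tokens). The tags shown are invented, and whether the authors used this convention is not stated in this card.

```python
def first_subword_tags(word_ids, subword_tags):
    """Collapse per-subword tags to per-word tags (first piece wins)."""
    if len(word_ids) != len(subword_tags):
        raise ValueError("word_ids and subword_tags must align")
    word_tags = []
    previous = None
    for word_id, tag in zip(word_ids, subword_tags):
        if word_id is None:
            # Special token such as [CLS] or [SEP]: skip it.
            previous = None
            continue
        if word_id != previous:
            # First piece of a new word: keep its tag.
            word_tags.append(tag)
        previous = word_id
    return word_tags


# Hypothetical example: three words split into 2, 1, and 3 pieces.
word_ids = [None, 0, 0, 1, 2, 2, 2, None]
subword_tags = ["X", "NOUN", "VERB", "ADJ", "VERB", "NOUN", "NOUN", "X"]
tags = first_subword_tags(word_ids, subword_tags)
```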
Translation Models (Persian to Azerbaijani and Vice Versa)
- Type: Machine translation models
- Description: These models translate between Persian (fa) and Iranian Azerbaijani (azb) in both directions, which makes them useful for cross-language communication.
- Model Link: Translation Models
Model Training Data
The details about the training data used to pre-train and fine-tune these models can be found in the associated research paper. Please refer to the paper for comprehensive information about the data sources and preprocessing steps.
Model Performance Summary
The following table summarizes the models' performance, with one or more evaluation metrics reported per task.
Task | Model | Evaluation Metric | Performance |
---|---|---|---|
Language model-based Embedding | FastText | MRR | 0.46 |
Language Model | BERT | Perplexity | 48.05 |
Text Classification | TF-IDF + SVM | Accuracy | 0.79 |
Text Classification | TF-IDF + SVM | F1-score | 0.78 |
Text Classification | FastText + SVM | Accuracy | 0.86 |
Text Classification | FastText + SVM | F1-score | 0.86 |
Text Classification | BERT | Accuracy | 0.89 |
Text Classification | BERT | F1-score | 0.89 |
Token Classification | BERT POS-tagger | Accuracy | 0.86 |
Token Classification | BERT POS-tagger | Macro F1-score | 0.67 |
Machine Translation | Text Translation azb2fa | SacreBLEU | 10.34 |
Machine Translation | Text Translation fa2azb | SacreBLEU | 8.07 |
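For readers unfamiliar with the metrics in the table, the sketch below gives their standard textbook definitions: mean reciprocal rank (MRR), perplexity as the exponential of the mean per-token negative log-likelihood, and macro-averaged F1. The exact evaluation protocol is described in the paper and may differ in detail; all inputs here are toy values. The last example also suggests why macro F1 can sit well below accuracy, as in the POS-tagging results: every tag counts equally, so errors on rare tags weigh heavily.

```python
import math


def mean_reciprocal_rank(ranks):
    """ranks: 1-based rank of the first correct answer per query,
    or None when no correct answer was retrieved."""
    return sum(0.0 if r is None else 1.0 / r for r in ranks) / len(ranks)


def perplexity(token_nlls):
    """Exponential of the mean negative log-likelihood (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy tagging run: the rare tag "ADV" is always missed.
gold = ["NOUN"] * 9 + ["ADV"]
pred = ["NOUN"] * 10
toy_accuracy = accuracy(gold, pred)   # 9 of 10 tokens correct
toy_macro_f1 = macro_f1(gold, pred)   # mean of F1(NOUN) and F1(ADV)
```

In the toy run, accuracy is 0.9 while macro F1 is below 0.5, because the F1 for the missed tag is zero.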
Acknowledgments
Please acknowledge the authors and cite the associated research paper when using these models in your work. Proper attribution helps recognize the effort and contributions of the researchers involved in model development.
Citation
If you use these models in your research or applications, please cite the following paper:
@inproceedings{azbpipeline,
title = "The Language Model, Resources, and Computational Pipelines for the Under-Resourced Iranian Azerbaijani",
author = "Nouri, Marzia and
Amani, Mahsa and
Zohrabi, Reihaneh and
Asgari, Ehsaneddin",
booktitle = "Findings of the Association for Computational Linguistics: AACL-IJCNLP 2023",
month = nov,
year = "2023",
address = "",
publisher = "Association for Computational Linguistics",
url = "",
pages = "",
abstract = "",
}
Contact Information
For questions, issues, or inquiries related to these models, please contact inquiries[AT]language.ml, marziehnouri1999[AT]gmail.com, or mahsa.ama1391[AT]gmail.com