Iranian Azerbaijani NLP Models

GitHub Repository: iranian-azerbaijani-nlp

Overview

This model card describes the NLP models released with a paper accepted for publication at AACL 2023. The models support Natural Language Processing (NLP) tasks for the Iranian Azerbaijani language (ISO 639-3 code: azb). The repository includes the following models:

AzerBERT
- Type: BERT-based transformer language model
- Description: AzerBERT is a language model pre-trained specifically for Iranian Azerbaijani. It can serve as the base for downstream NLP tasks such as text classification and named entity recognition.
- Model Link: AzerBERT Model
Language Model-based Embedding (FastText)
- Type: FastText-based word embedding model
- Description: This model provides FastText word embeddings for Iranian Azerbaijani text. It can generate embeddings for Iranian Azerbaijani words and phrases.
- Model Link: FastText Embedding Model
Text Classification Model (Fine-tuned with AzerBERT)
- Type: Fine-tuned BERT-based text classification model
- Description: This model is AzerBERT fine-tuned for text classification. It assigns a text to one of four categories: literature, sports, history, or geography.
- Model Link: Text Classification Model
POS Tagger (Fine-tuned with AzerBERT)
- Type: Fine-tuned BERT-based Part-of-Speech (POS) tagging model
- Description: This model is AzerBERT fine-tuned for part-of-speech tagging of Iranian Azerbaijani text. It annotates each token with one of 11 POS tags, which downstream NLP applications can build on.
- Model Link: POS Tagger Model
Translation Models (Persian to Azerbaijani and Vice Versa)
- Type: Machine translation models
- Description: These models translate between Persian (fa) and Iranian Azerbaijani (azb) in both directions, supporting cross-language communication between speakers of the two languages.
- Model Link: Translation Models
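
As a rough illustration of how the fine-tuned classifier might be called, the sketch below wraps the Hugging Face `transformers` pipeline API. Two things here are assumptions rather than facts from this card: that the checkpoint loads through `transformers.pipeline`, and the model identifier, which is a hypothetical placeholder to be replaced with the one behind the model link above. The helper names `top_prediction` and `classify` are also made up for this example.

```python
from typing import Any, Dict, List

# Hypothetical placeholder: substitute the identifier from the model link above.
CLASSIFIER_ID = "<text-classification-model-id>"

# The four categories named in this card. How the checkpoint spells its labels
# is not stated here, so no id-to-name mapping is assumed.
CATEGORIES = ("literature", "sports", "history", "geography")


def top_prediction(scores: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Return the highest-scoring entry from a list of {"label", "score"} dicts."""
    if not scores:
        raise ValueError("no predictions to rank")
    return max(scores, key=lambda item: item["score"])


def classify(text: str, model_id: str = CLASSIFIER_ID) -> Dict[str, Any]:
    """Classify one text with a Hugging Face text-classification pipeline."""
    # Imported lazily so top_prediction works without transformers installed.
    from transformers import pipeline

    # top_k=None asks the pipeline for scores over every label.
    classifier = pipeline("text-classification", model=model_id, top_k=None)
    scores = classifier(text)
    # Depending on the library version, a single input may come back nested.
    if scores and isinstance(scores[0], list):
        scores = scores[0]
    return top_prediction(scores)
```

Under the same assumptions, the other checkpoints would be loaded by swapping the task name (for example `"fill-mask"` for AzerBERT or `"token-classification"` for the POS tagger) and the model identifier.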

Model Training Data

The details about the training data used to pre-train and fine-tune these models can be found in the associated research paper. Please refer to the paper for comprehensive information about the data sources and preprocessing steps.

Model Performance Summary

The following table summarizes the models' performance on each task, together with the evaluation metric used.

Task	Model	Evaluation Metric	Performance
Language model-based Embedding	FastText	MRR	0.46
Language Model	BERT	Perplexity	48.05
Text Classification	TF-IDF + SVM	Accuracy	0.79
Text Classification	TF-IDF + SVM	F1-score	0.78
Text Classification	FastText + SVM	Accuracy	0.86
Text Classification	FastText + SVM	F1-score	0.86
Text Classification	BERT	Accuracy	0.89
Text Classification	BERT	F1-score	0.89
Token Classification	BERT POS-tagger	Accuracy	0.86
Token Classification	BERT POS-tagger	Macro F1-score	0.67
Machine Translation	Text Translation azb2fa	SacreBLEU	10.34
Machine Translation	Text Translation fa2azb	SacreBLEU	8.07
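
For readers less familiar with the metrics in the table, the sketch below gives textbook definitions of three of them: mean reciprocal rank (MRR), perplexity as the exponential of the mean negative log-likelihood per token, and accuracy versus macro-averaged F1. The function names and toy inputs are invented for illustration, and the paper's exact evaluation protocol may differ.

```python
import math
from typing import Optional, Sequence


def mean_reciprocal_rank(ranks: Sequence[Optional[int]]) -> float:
    """Average of 1/rank over queries; ranks are 1-based, None means not retrieved."""
    return sum(1.0 / r if r else 0.0 for r in ranks) / len(ranks)


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Exponential of the mean negative natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of positions where the prediction equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1, so rare classes count as much as frequent ones."""
    scores = []
    for label in sorted(set(gold) | set(pred)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy, invented data: one frequent tag (NOUN) and one rare tag (INTJ)
# that the tagger never predicts.
gold = ["NOUN"] * 9 + ["INTJ"]
pred = ["NOUN"] * 10

print(f"accuracy = {accuracy(gold, pred):.2f}")
print(f"macro F1 = {macro_f1(gold, pred):.2f}")
```

On the toy data, accuracy stays high while macro F1 drops sharply, because the missed rare tag contributes an F1 of zero to the average. A gap such as the one in the POS tagging rows (0.86 accuracy, 0.67 macro F1) is the pattern that imbalanced tag frequencies tend to produce, although this card does not state that as the cause.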

Acknowledgments

Please acknowledge the authors and cite the associated research paper when using these models in your work. Proper attribution helps recognize the effort and contributions of the researchers involved in model development.

Citation

If you use these models in your research or applications, please cite the following paper:

@inproceedings{azbpipeline,
    title = "The Language Model, Resources, and Computational Pipelines for the Under-Resourced Iranian Azerbaijani",
    author = "Nouri, Marzia and
                  Amani, Mahsa and
                  Zohrabi, Reihaneh and
                  Asgari, Ehsaneddin",
    booktitle = "Findings of the Association for Computational Linguistics: AACL-IJCNLP 2023",
    month = nov,
    year = "2023",
    address = "",
    publisher = "Association for Computational Linguistics",
    url = "",
    pages = "",
    abstract = "",
}

Contact Information

For questions or issues related to these models, please contact inquiries[AT]language.ml, marziehnouri1999[AT]gmail.com, or mahsa.ama1391[AT]gmail.com.