---
datasets:
- SKNahin/bengali-transliteration-data
base_model:
- facebook/mbart-large-50-many-to-many-mmt
tags:
- nlp
- seq2seq
---

# Model Card for Banglish to Bengali Transliteration using mBART

This model performs transliteration from Banglish (Romanized Bengali) to Bengali script, fine-tuned from the [facebook/mbart-large-50-many-to-many-mmt](https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt) model. Training was conducted on the dataset [SKNahin/bengali-transliteration-data](https://huggingface.co/datasets/SKNahin/bengali-transliteration-data).

The notebook used for training can be found here: [Kaggle Notebook](https://www.kaggle.com/code/shadabtanjeed/mbart-banglish-to-bengali-transliteration).

## Model Details

### Model Description

- **Developed by:** Shadab Tanjeed
- **Model type:** Sequence-to-sequence (Seq2Seq) Transformer model
- **Language(s) (NLP):** Bengali, Banglish (Romanized Bengali)
- **Finetuned from model:** [facebook/mbart-large-50-many-to-many-mmt](https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt)

### Model Sources

- **Repository:** [facebook/mbart-large-50-many-to-many-mmt](https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt)

## Uses

### Direct Use

The model is intended for direct transliteration of Banglish text to Bengali script.

### Downstream Use

It can be integrated into NLP applications that require Banglish-to-Bengali transliteration, such as chatbots, text normalization, and digital content processing.

### Out-of-Scope Use

The model is not designed for translation beyond transliteration, and it may not perform well on text containing mixed languages or code-switching.

## Bias, Risks, and Limitations

- The model may struggle with ambiguous words that have multiple possible transliterations.
- It may not perform well on informal or highly stylized text.
- Limited dataset coverage can lead to errors when transliterating uncommon words.

### Recommendations

Users should validate outputs, especially for critical applications, and consider further fine-tuning if necessary.

## How to Get Started with the Model

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

# Point this at the fine-tuned checkpoint from this repository; loading only the
# base mBART-50 model will not perform Banglish-to-Bengali transliteration.
model_name = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(model_name)
model = MBartForConditionalGeneration.from_pretrained(model_name)

# The language codes below are assumptions; use the same codes as during fine-tuning.
tokenizer.src_lang = "en_XX"  # Banglish input is written in Latin script

text = "ami tomake bhalobashi"
inputs = tokenizer(text, return_tensors="pt")
translated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["bn_IN"],  # generate Bengali script
)
output = tokenizer.decode(translated_tokens[0], skip_special_tokens=True)
print(output)  # Expected Bengali transliteration
```

## Training Details

### Training Data

The training data is [SKNahin/bengali-transliteration-data](https://huggingface.co/datasets/SKNahin/bengali-transliteration-data), which contains pairs of Banglish (Romanized Bengali) text and the corresponding Bengali script.

### Training Procedure

#### Preprocessing

- Tokenization was performed with the mBART tokenizer.
- Text normalization was applied to remove noise.

#### Training Hyperparameters

- **Batch size:** 8
- **Learning rate:** 3e-5
- **Epochs:** 5

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

- The same dataset, [SKNahin/bengali-transliteration-data](https://huggingface.co/datasets/SKNahin/bengali-transliteration-data), was used for evaluation.

## Technical Specifications

### Model Architecture and Objective

The model follows the Transformer-based Seq2Seq architecture of mBART and is fine-tuned with the standard sequence-to-sequence generation objective.
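
The snippet below is a minimal sketch of how a fine-tuning run with the hyperparameters listed above (batch size 8, learning rate 3e-5, 5 epochs) could look using `Seq2SeqTrainer`. It is not the exact notebook code (see the Kaggle link above); the column names `rm`/`bn`, the `train` split, and the `en_XX`/`bn_IN` language codes are assumptions and should be adjusted to the actual dataset and training setup.

```python
from datasets import load_dataset
from transformers import (
    DataCollatorForSeq2Seq,
    MBart50TokenizerFast,
    MBartForConditionalGeneration,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

base = "facebook/mbart-large-50-many-to-many-mmt"
# Language codes are assumptions; Banglish is treated as Latin-script input here.
tokenizer = MBart50TokenizerFast.from_pretrained(base, src_lang="en_XX", tgt_lang="bn_IN")
model = MBartForConditionalGeneration.from_pretrained(base)

# Assumes the dataset exposes a "train" split with Banglish/Bengali text columns.
dataset = load_dataset("SKNahin/bengali-transliteration-data")

def preprocess(batch):
    # "rm" (Romanized/Banglish) and "bn" (Bengali) are hypothetical column names;
    # replace them with the actual field names in the dataset.
    return tokenizer(batch["rm"], text_target=batch["bn"], max_length=128, truncation=True)

tokenized = dataset.map(preprocess, batched=True)

args = Seq2SeqTrainingArguments(
    output_dir="mbart-banglish-to-bengali",
    per_device_train_batch_size=8,  # batch size from the card
    learning_rate=3e-5,             # learning rate from the card
    num_train_epochs=5,             # epochs from the card
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
)
trainer.train()
```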
#### Software

- **Framework:** Hugging Face Transformers

## Citation

If you use this model, please cite the dataset and the base model:

```bibtex
@misc{SKNahin2023,
  author    = {SK Nahin},
  title     = {Bengali Transliteration Dataset},
  year      = {2023},
  publisher = {Hugging Face Datasets},
  url       = {https://huggingface.co/datasets/SKNahin/bengali-transliteration-data}
}

@article{liu2020mbart,
  title   = {Multilingual Denoising Pre-training for Neural Machine Translation},
  author  = {Liu, Yinhan and Gu, Jiatao and Goyal, Naman and Li, Xian and Edunov, Sergey and Ghazvininejad, Marjan and Lewis, Mike and Zettlemoyer, Luke},
  journal = {arXiv preprint arXiv:2001.08210},
  year    = {2020}
}
```