---
datasets:
- atlasia/AL-Atlas-Moroccan-Darija-Pretraining-Dataset
language:
- ar
pipeline_tag: feature-extraction
---
# Moroccan Darija Embedding Models

This repository contains word embedding models for Moroccan Darija, the Arabic dialect widely spoken in Morocco. It currently includes FastText embeddings trained on the curated [Al Atlas dataset](https://huggingface.co/datasets/atlasia/AL-Atlas-Moroccan-Darija-Pretraining-Dataset) of Moroccan Darija text.

## Features
- **FastText embeddings**: Pre-trained word vectors using FastText, which supports subword information and works well with dialectal and morphologically rich languages.
- **Efficient training pipeline**: Code for training FastText embeddings on Moroccan Darija datasets.
- **Pre-trained models**: Ready-to-use embeddings for downstream NLP tasks are available on the [Hugging Face Hub](https://huggingface.co/atlasia/Moroccan-Darija-Embedding).
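
To illustrate what "subword information" means in the first bullet: FastText represents a word through its character n-grams, taken after wrapping the word in `<` and `>` boundary markers. The sketch below only enumerates those n-grams. It is not the library's implementation (which also hashes n-grams into buckets), and the helper name `char_ngrams` is hypothetical. The defaults of 3 and 6 mirror the `minn`/`maxn` defaults of FastText's unsupervised training; the values used for the models in this repository are not stated here.

```python
from typing import List


def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> List[str]:
    """Enumerate character n-grams of a word, FastText-style.

    Hypothetical helper for illustration only: the word is wrapped in
    '<' and '>' so that prefixes and suffixes get distinct n-grams.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of length n over the wrapped word.
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


# Example: inspect the trigrams of a Darija word.
trigrams = char_ngrams("كلمة", min_n=3, max_n=3)
```

Two spellings of the same Darija word usually share many of these n-grams, which is why a subword model can still return a useful vector for a variant or unseen spelling.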

## Installation
Clone the [GitHub repository](https://github.com/BounharAbdelaziz/Moroccan-Darija-Embedding.git) and install the required dependencies:

```bash
git clone https://github.com/BounharAbdelaziz/Moroccan-Darija-Embedding.git
cd Moroccan-Darija-Embedding
pip install -r requirements.txt
```

## Usage
### Loading Pre-trained Embeddings
You can load the trained FastText model using the `fasttext` Python package:

```python
import fasttext

# Download the model file from the Hub first:
# https://huggingface.co/atlasia/Moroccan-Darija-Embedding
model = fasttext.load_model("fasttext_cbow_v0.bin")

# Returns the embedding of the word as a NumPy array.
word_vector = model.get_word_vector("كلمة")
```
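
Once a model is loaded, a common next step is to compare two words by the cosine similarity of their vectors. The following is a minimal sketch, not part of this repository's code: the helper names are hypothetical, the download step assumes the `huggingface_hub` package is installed, and it assumes that `fasttext_cbow_v0.bin` is the file name in the Hub repository (taken from the snippet above). The similarity helpers work with any object that exposes `get_word_vector`.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return float(dot / (norm_u * norm_v))


def word_similarity(model, word_a: str, word_b: str) -> float:
    """Compare two words using a model that exposes get_word_vector(word)."""
    return cosine_similarity(
        model.get_word_vector(word_a),
        model.get_word_vector(word_b),
    )


def download_and_load(
    repo_id: str = "atlasia/Moroccan-Darija-Embedding",
    filename: str = "fasttext_cbow_v0.bin",  # assumed file name, see above
):
    """Fetch the model file from the Hub and load it with fasttext.

    Imports are done lazily so the similarity helpers above can be used
    without fasttext or huggingface_hub installed.
    """
    import fasttext
    from huggingface_hub import hf_hub_download

    local_path = hf_hub_download(repo_id=repo_id, filename=filename)
    return fasttext.load_model(local_path)
```

For example, `word_similarity(download_and_load(), "كلمة", "كلمات")` would return a score between -1 and 1, with higher values indicating words that appear in more similar contexts.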

## Roadmap
- ✅ FastText embeddings
- ⏳ Word2Vec and GloVe embeddings
- ⏳ Transformer-based contextual embeddings (e.g., BERT, RoBERTa)
- ⏳ Sentence embeddings: continue training the [MorDern-Bert](https://github.com/BounharAbdelaziz/MorDern-Bert) model.

## Contributing
Contributions are welcome! Feel free to open issues or submit pull requests to improve the models and codebase.