---
datasets:
- atlasia/AL-Atlas-Moroccan-Darija-Pretraining-Dataset
language:
- ar
pipeline_tag: feature-extraction
---

# Moroccan Darija Embedding Models

This repository contains word embedding models trained for Moroccan Darija, the Arabic dialect widely spoken in Morocco. It currently includes FastText-based embeddings trained on the curated [Al Atlas dataset](https://huggingface.co/datasets/atlasia/AL-Atlas-Moroccan-Darija-Pretraining-Dataset) of Moroccan Darija text.

## Features

- **FastText embeddings**: Pre-trained word vectors using FastText, which supports subword information and works well with dialectal and morphologically rich languages.
- **Efficient training pipeline**: Code for training FastText embeddings on Moroccan Darija datasets.
- **Pre-trained models**: Ready-to-use embeddings for downstream NLP tasks are available on the [Hugging Face Hub](https://huggingface.co/atlasia/Moroccan-Darija-Embedding).

## Installation

Clone the [GitHub repository](https://github.com/BounharAbdelaziz/Moroccan-Darija-Embedding.git) and install the required dependencies:

```bash
git clone https://github.com/BounharAbdelaziz/Moroccan-Darija-Embedding.git
cd Moroccan-Darija-Embedding
pip install -r requirements.txt
```

## Usage

### Loading Pre-trained Embeddings

You can load the trained FastText model using the `fasttext` Python package:

```python
import fasttext

# Download the model file from the Hub first:
# https://huggingface.co/atlasia/Moroccan-Darija-Embedding
model = fasttext.load_model("fasttext_cbow_v0.bin")

# Look up the embedding vector for a single word
word_vector = model.get_word_vector("كلمة")
```

## Roadmap

- ✅ FastText embeddings
- ⏳ Word2Vec and GloVe embeddings
- ⏳ Transformer-based contextual embeddings (e.g., BERT, RoBERTa)
- ⏳ Sentence embeddings: Continue training the [MorDern-Bert](https://github.com/BounharAbdelaziz/MorDern-Bert) model.

## Contributing

Contributions are welcome! Feel free to open issues or submit pull requests to improve the models and codebase.
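
## Example: Comparing Word Vectors

As a follow-up to the Usage section, here is a minimal sketch of how two word vectors returned by `get_word_vector` might be compared with cosine similarity. The helper names `cosine_similarity` and `word_similarity` are illustrative and not part of this repository; the sketch only assumes a model object that exposes `get_word_vector(word)`, as the loaded FastText model does.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(float(a) * float(b) for a, b in zip(u, v))
    norm_u = math.sqrt(sum(float(a) ** 2 for a in u))
    norm_v = math.sqrt(sum(float(b) ** 2 for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def word_similarity(model, word_a: str, word_b: str) -> float:
    """Hypothetical helper: compare two words via model.get_word_vector()."""
    return cosine_similarity(
        model.get_word_vector(word_a),
        model.get_word_vector(word_b),
    )
```

With the model loaded as in the Usage section, `word_similarity(model, "كلمة", "كلام")` would return a score between -1 and 1, where higher values suggest the two words appear in more similar contexts in the training data.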