MArSum: Moroccan Articles Summarization dataset

Description

This dataset contains 19,806 news articles written in the Moroccan Arabic dialect, along with their titles. The articles were crawled from the Goud.ma website between 01/01/2018 and 12/31/2020. They are written mainly in the Moroccan Arabic dialect (Darija), but some contain Modern Standard Arabic (MSA) passages. All titles are written in Darija. The following table summarizes some statistics of the MArSum dataset.

Size      Titles length             Articles length
          Min.    Max.    Avg.      Min.    Max.    Avg.
19,806    2       74      14.6      30      2964    140.7
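
As a rough illustration of how minimum, maximum, and average lengths like these could be computed, here is a minimal sketch. It assumes lengths are counted in whitespace-separated tokens, which this description does not state; the helper name `length_stats` and the sample strings are hypothetical and are not taken from MArSum.

```python
def length_stats(texts):
    """Return (min, max, mean) of whitespace-token counts for a list of strings.

    Assumption: length is measured in whitespace-separated tokens.
    The dataset description does not say which unit was actually used.
    """
    lengths = [len(text.split()) for text in texts]
    mean = round(sum(lengths) / len(lengths), 1)
    return min(lengths), max(lengths), mean


# Hypothetical toy titles, for illustration only.
sample_titles = ["a b", "a b c d", "a b c"]
print(length_stats(sample_titles))
```

Applying the same helper to the titles and then to the article bodies would produce one row of statistics per column group, under the tokenization assumption above.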

The following figure describes the creation process of MArSum:

[Figure: MArSum creation process]

You may refer to our paper, cited below, for more details on this process.

Dataset

The dataset is split into Train/Test subsets using a 90/10 split strategy. Both subsets are available for direct download.
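
For readers who want to see what a 90/10 split looks like in practice, the sketch below shuffles a list of records and cuts it at the 90% mark. It is not the authors' procedure: the shuffling, the seed, and the function name `split_90_10` are assumptions made for illustration, and the published subsets may have been produced differently and may have different sizes.

```python
import random


def split_90_10(records, seed=0):
    """Shuffle records deterministically and return (train, test) at a 90/10 ratio.

    Assumption: a simple random split. The source only says "90/10 split
    strategy" and does not describe shuffling or seeding.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 9 // 10  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


# Stand-in record IDs; the real dataset rows are not reproduced here.
train, test = split_90_10(range(19806))
print(len(train), len(test))
```

Since the ready-made Train/Test subsets are already provided, re-splitting is only needed when a different ratio or an extra validation set is wanted.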

Citation

Please cite the following paper if you decide to use the dataset:

Gaanoun, K., Naira, A. M., Allak, A., & Benelallam, I. (2022). Automatic Text Summarization for Moroccan Arabic Dialect Using an Artificial Intelligence Approach. In International Conference on Business Intelligence (pp. 158-177). Springer, Cham.

License

The dataset is distributed under the CC BY 4.0 license.
