|
--- |
|
language: |
|
- ar |
|
metrics: |
|
- bleu |
|
- accuracy |
|
library_name: transformers |
|
pipeline_tag: text-classification |
|
tags: |
|
- t5 |
|
- Classification |
|
- ArabicT5 |
|
- Text Classification |
|
widget: |
|
- example_title: > |
|
الديني |
|
- text: > |
|
الحمد لله رب العالمين والصلاة والسلام على سيد المرسلين نبينا محمد وآله وصحبه أجمعين،وبعد:فإنه يجب على العبد أن يتجنب الذنوب كلها دقها وجلها صغيرها وكبيرها وأن يتعاهد نفسه بالتوبة الصادقة والإنابة إلى ربه. قال تعالى: (وَتُوبُوا إِلَى اللَّهِ جَمِيعًا أَيُّهَا الْمُؤْمِنُونَ لَعَلَّكُمْ تُفْلِحُونَ)النور 31. |
|
--- |
|
|
|
# # Arabic text classification using deep learning (ArabicT5) |
|
|
|
# # Our experiment |
|
|
|
- The category mapping |
|
category_mapping = { |
|
|
|
'Politics':1, |
|
'Finance':2, |
|
'Medical':3, |
|
'Sports':4, |
|
'Culture':5, |
|
'Tech':6, |
|
'Religion':7 |
|
} |
|
|
|
- Training parameters |
|
|
|
| | | |
|
| :-------------------: | :-----------:| |
|
| Training batch size | `8` | |
|
| Evaluation batch size | `8` | |
|
| Learning rate | `1e-4` | |
|
| Max length input | `200` | |
|
| Max length target | `3` | |
|
| Number workers | `4` | |
|
| Epoch | `2` | |
|
| | | |
|
|
|
- Results |
|
|
|
| | | |
|
| :---------------------: | :-----------: | |
|
| Validation Loss | `0.0479` | |
|
| Accuracy | `96.49%` | |
|
| BLeU | `96.49%` | |
|
|
|
# # SANAD: Single-label Arabic News Articles Dataset for automatic text categorization |
|
|
|
- Paper |
|
[https://www.researchgate.net/publication/333605992_SANAD_Single-Label_Arabic_News_Articles_Dataset_for_Automatic_Text_Categorization] |
|
|
|
- Dataset |
|
[https://data.mendeley.com/datasets/57zpx667y9/2] |
|
|
|
# # Arabic text classification using deep learning models |
|
|
|
- Paper |
|
[https://www.sciencedirect.com/science/article/abs/pii/S0306457319303413] |
|
|
|
- Their experiment' |
|
"Our experimental results showed that all models did very well on SANAD corpus with a minimum accuracy of 93.43%, achieved by CGRU, and top performance of 95.81%, achieved by HANGRU." |
|
| Model | Accuracy | |
|
| :---------------------: | :---------------------: | |
|
| CGRU | 93.43% | |
|
| HANGRU | 95.81% | |
|
|
|
|
|
# # Example usage |
|
```python |
|
from transformers import T5ForConditionalGeneration, T5Tokenizer |
|
|
|
model_name="Hezam/ArabicT5_Classification" |
|
model = T5ForConditionalGeneration.from_pretrained(model_name) |
|
tokenizer = T5Tokenizer.from_pretrained(model_name) |
|
|
|
text = "الزين فيك القناه الاولي المغربيه الزين فيك القناه الاولي المغربيه اخبارنا المغربيه متابعه تفاجا زوار موقع القناه الاولي المغربي" |
|
tokens=tokenizer(text, max_length=200, |
|
truncation=True, |
|
padding="max_length", |
|
return_tensors="pt" |
|
) |
|
|
|
output= model.generate(tokens['input_ids'], |
|
max_length=3, |
|
length_penalty=10) |
|
|
|
output = [tokenizer.decode(ids, skip_special_tokens=True,clean_up_tokenization_spaces=True)for ids in output] |
|
output |
|
|
|
``` |
|
```bash |
|
['5'] |
|
``` |