---
language:
- ar
metrics:
- bleu
- accuracy
library_name: transformers
pipeline_tag: text-classification
tags:
- t5
- Classification
- ArabicT5
- Text Classification
widget:
- example_title: الديني
  text: >
    الحمد لله رب العالمين والصلاة والسلام على سيد المرسلين نبينا محمد وآله
    وصحبه أجمعين، وبعد: فإنه يجب على العبد أن يتجنب الذنوب كلها دقها وجلها
    صغيرها وكبيرها وأن يتعاهد نفسه بالتوبة الصادقة والإنابة إلى ربه. قال
    تعالى: (وَتُوبُوا إِلَى اللَّهِ جَمِيعًا أَيُّهَا الْمُؤْمِنُونَ
    لَعَلَّكُمْ تُفْلِحُونَ) النور 31.
---
# Arabic text classification using deep learning (ArabicT5)
# Our experiment
The category mapping used for the seven classes:

```python
category_mapping = {
    'Politics': 1, 'Finance': 2, 'Medical': 3, 'Sports': 4,
    'Culture': 5, 'Tech': 6, 'Religion': 7
}
```
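Because classification is cast as text-to-text generation, the class ids serve as the target strings the model learns to emit. The sketch below is illustrative only (the helper function and the sample article are not from the card or the SANAD corpus) and shows how labelled articles could be turned into (source, target) pairs:

```python
# Illustrative sketch (not from the card): build (input, target) pairs where the
# target is the class id rendered as a short string for text-to-text training.
category_mapping = {
    'Politics': 1, 'Finance': 2, 'Medical': 3, 'Sports': 4,
    'Culture': 5, 'Tech': 6, 'Religion': 7
}

def to_t5_pair(article_text, category_name):
    """Return a (source text, target string) pair for one labelled article."""
    return article_text, str(category_mapping[category_name])

# Hypothetical labelled example (not taken from the SANAD corpus):
src, tgt = to_t5_pair("تعادل الفريقان في المباراة النهائية", "Sports")
print(tgt)  # '4'
```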
- Training parameters

| Parameter | Value |
| :---------------------: | :---------------------: |
| Training batch size | 8 |
| Evaluation batch size | 8 |
| Learning rate | 1e-4 |
| Max input length | 200 |
| Max target length | 3 |
| Number of workers | 4 |
| Epochs | 2 |
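For reference, these hyperparameters could be expressed with the Hugging Face `Seq2SeqTrainingArguments` class. This is a hedged sketch, not the author's actual training script; `output_dir` is a placeholder, and the maximum input/target lengths (200 / 3) are applied when tokenizing inputs and targets rather than here:

```python
from transformers import Seq2SeqTrainingArguments

# Sketch only: mapping the reported hyperparameters onto Seq2SeqTrainingArguments.
training_args = Seq2SeqTrainingArguments(
    output_dir="arabict5-classification",  # placeholder path
    per_device_train_batch_size=8,         # Training batch size
    per_device_eval_batch_size=8,          # Evaluation batch size
    learning_rate=1e-4,                    # Learning rate
    num_train_epochs=2,                    # Epochs
    dataloader_num_workers=4,              # Number of workers
    predict_with_generate=True,            # decode the class ids as text
)
```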
- Results

| Metric | Value |
| :---------------------: | :---------------------: |
| Validation loss | 0.0479 |
| Accuracy | 96.49% |
| BLEU | 96.49% |
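Since the model emits the class id as a short string, accuracy amounts to exact match between generated and reference ids. A toy sketch of that computation follows; the prediction and reference lists are made up for illustration and this is not the card's evaluation code:

```python
# Illustrative only: accuracy as exact match on generated class-id strings.
predictions = ["5", "4", "7", "1"]   # decoded model outputs (illustrative)
references  = ["5", "4", "7", "2"]   # gold class ids as strings (illustrative)

accuracy = sum(p == r for p, r in zip(predictions, references)) / len(references)
print(f"accuracy: {accuracy:.2%}")   # 75.00% on this toy example
```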
# SANAD: Single-label Arabic News Articles Dataset for automatic text categorization
# Arabic text classification using deep learning models
Paper: https://www.sciencedirect.com/science/article/abs/pii/S0306457319303413

From their experiments: "Our experimental results showed that all models did very well on SANAD corpus with a minimum accuracy of 93.43%, achieved by CGRU, and top performance of 95.81%, achieved by HANGRU."

| Model | Accuracy |
| :---------------------: | :---------------------: |
| CGRU | 93.43% |
| HANGRU | 95.81% |
# Example usage
```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "Hezam/ArabicT5_Classification"
model = T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer = T5Tokenizer.from_pretrained(model_name)

text = "الزين فيك القناه الاولي المغربيه الزين فيك القناه الاولي المغربيه اخبارنا المغربيه متابعه تفاجا زوار موقع القناه الاولي المغربي"

# Tokenize with the same maximum input length used in training (200).
tokens = tokenizer(text, max_length=200, truncation=True,
                   padding="max_length", return_tensors="pt")

# Generate the class id as text (targets are at most 3 tokens long).
output = model.generate(tokens["input_ids"], max_length=3, length_penalty=10)
output = [tokenizer.decode(ids, skip_special_tokens=True,
                           clean_up_tokenization_spaces=True) for ids in output]
print(output)
# ['5']  -> class id 5, i.e. 'Culture' in the mapping above
```
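The generated id can be turned back into a readable label by inverting the mapping from the experiment section. This small helper is an illustration added here, not part of the original card; it reuses the `output` list produced by the example above:

```python
# Invert the category mapping from the experiment section (illustrative helper).
category_mapping = {
    'Politics': 1, 'Finance': 2, 'Medical': 3, 'Sports': 4,
    'Culture': 5, 'Tech': 6, 'Religion': 7
}
id_to_category = {v: k for k, v in category_mapping.items()}

print(id_to_category[int(output[0])])  # 'Culture' for the output ['5'] above
```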