# Arabic text classification using deep learning (ArabicT5)

# Our experiment

  • The category mapping: `category_mapping = {'Politics': 1, 'Finance': 2, 'Medical': 3, 'Sports': 4, 'Culture': 5, 'Tech': 6, 'Religion': 7}`

  • Training parameters

| Parameter | Value |
| :-------------------: | :-----------: |
| Training batch size | 8 |
| Evaluation batch size | 8 |
| Learning rate | 1e-4 |
| Max length input | 200 |
| Max length target | 3 |
| Number workers | 4 |
| Epoch | 2 |

  • Results

| Metric | Value |
| :---------------------: | :-----------: |
| Validation Loss | 0.0479 |
| Accuracy | 96.49% |
| BLEU | 96.49% |
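Because the model emits each class as a short text label, one plausible way to score it is exact string match between the generated label and the reference label. The snippet below is a minimal sketch under that assumption; the function name `exact_match_accuracy` and the toy labels are hypothetical, and this is not necessarily how the figure reported above was computed.

```python
def exact_match_accuracy(predictions, references):
    """Return the fraction of predictions equal to their reference label.

    Both arguments are lists of decoded label strings; surrounding
    whitespace is ignored when comparing.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)


# Toy values for illustration only, not data from the experiment.
preds = ["5", "1", "4", "2"]
refs = ["5", "1", "3", "2"]
accuracy = exact_match_accuracy(preds, refs)
print(f"{accuracy:.2%}")
```

Comparing strings rather than integers keeps the metric well defined even if the model generates something that is not a valid label.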

# SANAD: Single-label Arabic News Articles Dataset for automatic text categorization

# Arabic text classification using deep learning models

| Model | Reported score |
| :---------------------: | :---------------------: |
| CGRU | 93.43% |
| HANGRU | 95.81% |

# Example usage

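```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load the fine-tuned classification checkpoint and its tokenizer.
model_name = "Hezam/ArabicT5_Classification"
model = T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer = T5Tokenizer.from_pretrained(model_name)

text = "الزين فيك القناه الاولي المغربيه الزين فيك القناه الاولي المغربيه اخبارنا المغربيه  متابعه تفاجا زوار موقع القناه الاولي المغربي"

# Tokenize with the same maximum input length used in training (200).
tokens = tokenizer(
    text,
    max_length=200,
    truncation=True,
    padding="max_length",
    return_tensors="pt",
)

# Generate the class label as text; the target is capped at 3 tokens.
output = model.generate(
    tokens["input_ids"],
    max_length=3,
    length_penalty=10,
)

# Decode the generated ids into label strings.
output = [
    tokenizer.decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    for ids in output
]
print(output)  # reported output: ['5'] (label 5 is 'Culture' in the category mapping)
```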
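The generated label is a numeric string, so it still has to be turned back into a category name. The following is a minimal sketch that inverts the `category_mapping` from the experiment section; the helper `label_to_category` and the `"Unknown"` fallback are assumptions added for illustration, not part of the published model.

```python
# Category mapping as given in the experiment section.
category_mapping = {
    'Politics': 1, 'Finance': 2, 'Medical': 3, 'Sports': 4,
    'Culture': 5, 'Tech': 6, 'Religion': 7,
}

# Invert it so that the generated text label (e.g. '5') looks up a name.
id_to_category = {str(idx): name for name, idx in category_mapping.items()}


def label_to_category(label):
    """Map a decoded label string to its category name.

    Returns 'Unknown' (a hypothetical fallback) when the model generates
    text that is not one of the seven known labels.
    """
    return id_to_category.get(label.strip(), "Unknown")


# Using the decoded output of the example above.
predicted = [label_to_category(label) for label in ['5']]
print(predicted)  # ['Culture']
```

A fallback is worth having because a text-to-text model is not constrained to the seven labels and can, in principle, generate any token sequence.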