---
language:
- ar
metrics:
- bleu
- accuracy
library_name: transformers
pipeline_tag: text-classification
tags:
- t5
- Classification
- ArabicT5
- Text Classification
widget:
- example_title: >
    الديني
- text: >
    الحمد لله رب العالمين والصلاة والسلام على سيد المرسلين نبينا محمد وآله وصحبه أجمعين،وبعد:فإنه يجب على العبد أن يتجنب الذنوب كلها دقها وجلها صغيرها وكبيرها وأن يتعاهد نفسه بالتوبة الصادقة والإنابة إلى ربه. قال تعالى: (وَتُوبُوا إِلَى اللَّهِ جَمِيعًا أَيُّهَا الْمُؤْمِنُونَ لَعَلَّكُمْ تُفْلِحُونَ)النور 31.
---

# Arabic text classification using deep learning (ArabicT5)

- SANAD: Single-Label Arabic News Articles Dataset for Automatic Text Categorization
- [Paper](https://www.researchgate.net/publication/333605992_SANAD_Single-Label_Arabic_News_Articles_Dataset_for_Automatic_Text_Categorization)
- [Dataset](https://data.mendeley.com/datasets/57zpx667y9/2)

## Their experiment

Results reported in the [original study](https://www.sciencedirect.com/science/article/abs/pii/S0306457319303413):

> "Our experimental results showed that all models did very well on SANAD corpus with a minimum accuracy of 93.43%, achieved by CGRU, and top performance of 95.81%, achieved by HANGRU."


| Model | Accuracy |
| :---------------------: | :---------------------: |
| CGRU | 93.43% |
| HANGRU | 95.81% |

## Our experiment

## The category mapping

The model generates a numeric category ID as text. The IDs map to categories as follows:

    category_mapping = {
        'Politics': 1,
        'Finance': 2,
        'Medical': 3,
        'Sports': 4,
        'Culture': 5,
        'Tech': 6,
        'Religion': 7
    }

## Training parameters

| Parameter | Value |
| :-------------------: | :-----------: |
| Training batch size | `8` |
| Evaluation batch size | `8` |
| Learning rate | `1e-4` |
| Max input length | `200` |
| Max target length | `3` |
| Number of workers | `4` |
| Epochs | `2` |

## Results

| Metric | Value |
| :---------------------: | :-----------: |
| Validation loss | `0.0479` |
| Accuracy | `96.49%` |
| BLEU | `96.49%` |

## Example usage

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "Hezam/ArabicT5_Classification"
model = T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer = T5Tokenizer.from_pretrained(model_name)

text = "الزين فيك القناه الاولي المغربيه الزين فيك القناه الاولي المغربيه اخبارنا المغربيه متابعه تفاجا زوار موقع القناه الاولي المغربي"
# Tokenize with the maximum input length used in training (200).
tokens = tokenizer(text, max_length=200, truncation=True,
                   padding="max_length", return_tensors="pt")
# Generate the category ID; targets are at most 3 tokens long.
output = model.generate(tokens["input_ids"], max_length=3, length_penalty=10)
output = [tokenizer.decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
          for ids in output]
print(output)
```

Output:

```bash
['5']
```
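
The model returns the category ID as a string, so the last step is usually to turn it back into a category name. The following is a minimal sketch of that step, assuming the category mapping listed above. The names `id_to_category` and `decode_label` are hypothetical helpers introduced for illustration and are not part of the model repository.

```python
# Category IDs as listed in the category mapping section of this card.
category_mapping = {
    "Politics": 1,
    "Finance": 2,
    "Medical": 3,
    "Sports": 4,
    "Culture": 5,
    "Tech": 6,
    "Religion": 7,
}

# Inverted lookup: generated ID string -> category name.
id_to_category = {str(idx): name for name, idx in category_mapping.items()}


def decode_label(generated: str) -> str:
    """Hypothetical helper: map a decoded output such as '5' to a category name.

    Returns 'Unknown' when the text is not one of the seven known IDs.
    """
    return id_to_category.get(generated.strip(), "Unknown")


print(decode_label("5"))  # Culture
```

Applied to the example output, `[decode_label(o) for o in ['5']]` gives `['Culture']`. Falling back to `'Unknown'` rather than raising is a design choice for this sketch, since a generative model is not guaranteed to emit a valid ID.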