---
language:
- ar
metrics:
- bleu
- accuracy
library_name: transformers
pipeline_tag: text-classification
tags:
- t5
- Classification
- ArabicT5
- Text Classification
widget:
- text: >
    الحمد لله رب العالمين والصلاة والسلام على سيد المرسلين نبينا محمد وآله وصحبه أجمعين،وبعد:فإنه يجب على العبد أن يتجنب الذنوب كلها دقها وجلها صغيرها وكبيرها وأن يتعاهد نفسه بالتوبة الصادقة والإنابة إلى ربه. قال تعالى: (وَتُوبُوا إِلَى اللَّهِ جَمِيعًا أَيُّهَا الْمُؤْمِنُونَ لَعَلَّكُمْ تُفْلِحُونَ)النور 31.
  example_title: الديني
---

# Arabic text classification using deep learning (ArabicT5)

- SANAD: Single-label Arabic News Articles Dataset for automatic text categorization
- Paper: https://www.researchgate.net/publication/333605992_SANAD_Single-Label_Arabic_News_Articles_Dataset_for_Automatic_Text_Categorization
- Dataset: https://data.mendeley.com/datasets/57zpx667y9/2

## Their experiment

Paper: https://www.sciencedirect.com/science/article/abs/pii/S0306457319303413

> "Our experimental results showed that all models did very well on SANAD corpus with a minimum accuracy of 93.43%, achieved by CGRU, and top performance of 95.81%, achieved by HANGRU."

|         Model           |         Accuracy        | 
| :---------------------: | :---------------------: | 
|           CGRU          |          93.43%         |   
|          HANGRU         |          95.81%         | 

## Our experiment

## The category mapping

The model generates the class as a short numeric label, following this mapping:

```python
category_mapping = {
    'Politics': 1,
    'Finance': 2,
    'Medical': 3,
    'Sports': 4,
    'Culture': 5,
    'Tech': 6,
    'Religion': 7
}
```
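Since the model emits the numeric label as a string, a small inverse lookup can turn predictions back into category names. This is a minimal sketch, not part of the released code; `id_to_category` is an illustrative name:

```python
# Illustrative helper (not part of the released code): invert the
# mapping so generated label strings decode to category names.
id_to_category = {str(v): k for k, v in category_mapping.items()}

print(id_to_category["5"])  # -> Culture
```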
  
## Training parameters

|       Parameter       |    Value     |
| :-------------------: | :----------: |
|  Training batch size  |     `8`      |
| Evaluation batch size |     `8`      |
|     Learning rate     |    `1e-4`    |
|   Max length input    |    `200`     |
|   Max length target   |     `3`      |
|   Number of workers   |     `4`      |
|        Epochs         |     `2`      |
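The training script itself is not published; as a hedged sketch, the table above could map onto Hugging Face `Seq2SeqTrainingArguments` roughly as follows. The `output_dir` name is hypothetical, and the two max-length settings are applied at tokenization time rather than here:

```python
from transformers import Seq2SeqTrainingArguments

# Illustrative only -- not the authors' actual configuration.
training_args = Seq2SeqTrainingArguments(
    output_dir="arabict5-classification",  # hypothetical output path
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=1e-4,
    num_train_epochs=2,
    dataloader_num_workers=4,
)
```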

## Results

|     Metric      |    Value     |
| :-------------: | :----------: |
| Validation loss |   `0.0479`   |
|    Accuracy     |   `96.49%`   |
|      BLEU       |   `96.49%`   |
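Since the labels are generated as text, accuracy here amounts to an exact string match between the generated label and the gold label. A minimal sketch with hypothetical predictions (the authors' evaluation script is not published):

```python
# Hypothetical model outputs and gold labels, for illustration only.
predictions = ["5", "7", "1", "4"]
references  = ["5", "7", "2", "4"]

# Exact-match accuracy over the generated label strings.
accuracy = sum(p == r for p, r in zip(predictions, references)) / len(references)
print(f"accuracy = {accuracy:.2%}")  # -> accuracy = 75.00%
```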

## Example usage
```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "Hezam/ArabicT5_Classification"
model = T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer = T5Tokenizer.from_pretrained(model_name)

text = "الزين فيك القناه الاولي المغربيه الزين فيك القناه الاولي المغربيه اخبارنا المغربيه  متابعه تفاجا زوار موقع القناه الاولي المغربي"

# Tokenize with the same limits used in training (max input length 200).
tokens = tokenizer(text,
                   max_length=200,
                   truncation=True,
                   padding="max_length",
                   return_tensors="pt")

# Generate the class label as text (max target length 3, as in training).
output = model.generate(tokens["input_ids"],
                        attention_mask=tokens["attention_mask"],
                        max_length=3,
                        length_penalty=10)

# Decode the generated ids into label strings.
output = [tokenizer.decode(ids,
                           skip_special_tokens=True,
                           clean_up_tokenization_spaces=True)
          for ids in output]
output

```
```bash
['5']
```

The generated label `5` corresponds to `Culture` in the category mapping above.