alsubari committed on
Commit f323792 · 1 Parent(s): a8ea08d

Update README.md

Files changed (1)
  1. README.md +83 -151
README.md CHANGED
@@ -5,9 +5,6 @@ pipeline_tag: text-generation
  ---
  # Model Card for Model ID
 
- <!-- Provide a quick summary of what the model is/does. -->
-
-
  ## Model Details
 
  ### Model Description
@@ -21,174 +18,109 @@ pipeline_tag: text-generation
  ## Uses
 
 
- 1. The model can be helpful for Arabic language students and researchers, since it provides a full sentence analysis (اعراب الجملة) in Arabic.
- 2.
-
-
- ### Out-of-Scope Use
-
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
- 1. This model cannot be used for grammar checking, since it expects correct, formal Arabic sentences as input.
- 2. Do not use Arabic dialects in the input sentence.
- 3.
- 4.
-
- [More Information Needed]
-
- ## Bias, Risks, and Limitations
 
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
-
- [More Information Needed]
-
- ### Recommendations
-
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
 
  ## How to Get Started with the Model
 
  ```python
  from transformers import GPT2Tokenizer
- from arabert.preprocess import ArabertPreprocessor
  from arabert.aragpt2.grover.modeling_gpt2 import GPT2LMHeadModel
- from pyarabic.araby import strip_tashkeel
- import pyarabic.trans
  model_name='alsubari/aragpt2-mega-pos-msa'
 
 
  tokenizer = GPT2Tokenizer.from_pretrained('alsubari/aragpt2-mega-pos-msa')
  model = GPT2LMHeadModel.from_pretrained('alsubari/aragpt2-mega-pos-msa').to("cuda")
 
- arabert_prep = ArabertPreprocessor(model_name='aubmindlab/aragpt2-mega')
- prml=['اعراب الجملة :', ' صنف الكلمات من الجملة :']
  text='تعلَّمْ من أخطائِكَ'
- text=arabert_prep.preprocess(strip_tashkeel(text))
- generation_args = {
-     'pad_token_id':tokenizer.eos_token_id,
-     'max_length': 256,
-     'num_beams':20,
-     'no_repeat_ngram_size': 3,
-     'top_k': 20,
-     'top_p': 0.1, # nucleus sampling: keep the smallest token set whose cumulative probability reaches 0.1
-     'do_sample': True,
-     'repetition_penalty':2.0
- }
-
- ## POS tagging
- input_text = f'<|startoftext|>Instruction: {prml[1]} {text}<|pad|>Answer:'
- input_ids = tokenizer.encode(input_text, return_tensors='pt').to("cuda")
- output_ids = model.generate(input_ids=input_ids,**generation_args)
- output_text = tokenizer.decode(output_ids[0],skip_special_tokens=True).split('Answer:')[1]
- answer_pose=pyarabic.trans.delimite_language(output_text, start="<token>", end="</token>")
-
- print(answer_pose)
- # <token>تعلم : تعلم</token> : Verb <token>من : من</token> : Relative pronoun <token>أخطائك : اخطا</token> : Noun <token>ك</token> : Personal pronunction
-
- ## Arabic sentence analysis
- input_text = f'<|startoftext|>Instruction: {prml[0]} {text}<|pad|>Answer:'
- input_ids = tokenizer.encode(input_text, return_tensors='pt').to("cuda")
- output_ids = model.generate(input_ids=input_ids,**generation_args)
- output_text = tokenizer.decode(output_ids[0],skip_special_tokens=True).split('Answer:')[1]
-
- print(output_text)
- #تعلم : تعلم : فعل ، مفرد المخاطب للمذكر ، فعل مضارع ، مرفوع من : من : حرف جر أخطائك : اخطا : اسم ، جمع المذكر ، مجرور ك : ضمير ، مفرد المتكلم
  ```
 
- ## Evaluation
-
- <!-- This section describes the evaluation protocols and provides the results. -->
-
- ### Testing Data, Factors & Metrics
-
- #### Testing Data
-
- <!-- This should link to a Data Card if possible. -->
-
- [More Information Needed]
-
- #### Factors
-
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-
- [More Information Needed]
-
- #### Metrics
-
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
-
- [More Information Needed]
 
  ### Results
 
- [More Information Needed]
-
- #### Summary
-
-
-
- ## Model Examination [optional]
-
- <!-- Relevant interpretability work for the model goes here -->
-
- [More Information Needed]
-
- ## Environmental Impact
-
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-
- - **Hardware Type:** [More Information Needed]
- - **Hours used:** [More Information Needed]
- - **Cloud Provider:** [More Information Needed]
- - **Compute Region:** [More Information Needed]
- - **Carbon Emitted:** [More Information Needed]
-
- ## Technical Specifications [optional]
-
- ### Model Architecture and Objective
-
- [More Information Needed]
-
- ### Compute Infrastructure
-
- [More Information Needed]
-
- #### Hardware
-
- [More Information Needed]
-
- #### Software
-
- [More Information Needed]
-
- ## Citation [optional]
-
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-
- **BibTeX:**
-
- [More Information Needed]
-
- **APA:**
-
- [More Information Needed]
-
- ## Glossary [optional]
-
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-
- [More Information Needed]
-
- ## More Information [optional]
-
- [More Information Needed]
-
- ## Model Card Authors [optional]
 
- [More Information Needed]
 
  ## Model Card Contact
 
 
  ---
  # Model Card for Model ID
 
  ## Model Details
 
  ### Model Description
 
  ## Uses
 
 
+ 1. Part-of-speech (POS) tagging for Arabic; it may also be usable for other languages.
+ 2. The model can help Arabic language students and researchers, since it provides sentence analysis (اعراب الجملة) in context.
+ 3. Arabic word tokenization.
+ 4. It may be usable for translating Arabic dialects into Modern Standard Arabic (MSA).
+
+
+
+
+ ## Main Labels
+
+ {'حرف جر': 'preposition',
+ 'اسم': 'noun',
+ 'اسم علم': 'proper noun',
+ 'لام التعريف': 'determiner',
+ 'صفة': 'adjective',
+ 'ضمير': 'personal pronoun',
+ 'فعل': 'verb',
+ 'حرف عطف': 'conjunction',
+ 'اسم موصول': 'relative pronoun',
+ 'حرف نفي': 'negative particle',
+ 'حروف مقطعة': 'quranic initials',
+ 'اسم اشارة': 'demonstrative pronoun',
+ 'حرف استئنافية': 'resumption',
+ 'حرف نصب': 'accusative particle',
+ 'حرف تسوية': 'equalization particle',
+ 'حرف حال': 'circumstantial particle',
+ 'أداة حصر': 'restriction particle',
+ 'ظرف زمان': 'time adverb',
+ 'حرف نهي': 'prohibition particle',
+ 'حرف كاف': 'preventive particle',
+ 'حرف ابتداء': 'inceptive particle',
+ 'حرف زائد': 'supplemental particle',
+ 'حرف استدراك': 'amendment particle',
+ 'حرف مصدري': 'subordinating conjunction',
+ 'حرف استفهام': 'interrogative particle',
+ 'ظرف مكان': 'location adverb',
+ 'حرف شرط': 'conditional particle',
+ 'لام التوكيد': 'emphatic',
+ 'حرف نداء': 'vocative particle',
+ 'حرف واقع في جواب الشرط': 'result particle',
+ 'حرف تفصيل': 'explanation particle',
+ 'أداة استثناء': 'exceptive particle',
+ 'حرف سببية': 'particle of cause',
+ 'التوكيد - النون الثقيلة': 'heavy noon emphasis',
+ 'حرف استقبال': 'future particle',
+ 'حرف تحقيق': 'particle of certainty',
+ 'لام التعليل': 'purpose',
+ 'حرف جواب': 'answer particle',
+ 'حرف اضراب': 'retraction particle',
+ 'حرف تحضيض': 'exhortation particle',
+ 'حرف تفسير': 'particle of interpretation',
+ 'لام الامر': 'imperative',
+ 'واو المعية': 'comitative particle',
+ 'حرف فجاءة': 'surprise particle',
+ 'حرف ردع': 'aversion particle',
+ 'اسم فعل أمر': 'imperative verbal noun'}
 
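The label table above maps the Arabic tags the model emits to English names. The following sketch is not part of the README; it only illustrates one way such a table might be used to translate a tag. The names `LABELS_AR_TO_EN` and `translate_label` are hypothetical, and only a few entries are copied from the table.

```python
# Hypothetical helper: a small subset of the label table, copied from the README.
# Extend it with the remaining entries as needed.
LABELS_AR_TO_EN = {
    'حرف جر': 'preposition',
    'اسم': 'noun',
    'ضمير': 'personal pronoun',
    'فعل': 'verb',
}


def translate_label(arabic_label: str) -> str:
    """Return the English name for an Arabic tag, or the stripped tag itself if unknown."""
    key = arabic_label.strip()
    return LABELS_AR_TO_EN.get(key, key)


print(translate_label('فعل'))
print(translate_label('حرف جر'))
```

Falling back to the original tag keeps unknown labels visible instead of silently dropping them.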
 
  ## How to Get Started with the Model
 
  ```python
+ import re
+
  from transformers import GPT2Tokenizer
+ from pyarabic.araby import strip_diacritics, strip_tatweel
  from arabert.aragpt2.grover.modeling_gpt2 import GPT2LMHeadModel
+ from transformers import pipeline
+
  model_name='alsubari/aragpt2-mega-pos-msa'
 
 
  tokenizer = GPT2Tokenizer.from_pretrained('alsubari/aragpt2-mega-pos-msa')
  model = GPT2LMHeadModel.from_pretrained('alsubari/aragpt2-mega-pos-msa').to("cuda")
 
+ generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)
+
+ def generate(text):
+     # Wrap the input sentence in the instruction/answer prompt format.
+     prompt = f'<|startoftext|>Instruction: {text}<|pad|>Answer:'
+     pred_text = generator(
+         prompt,
+         pad_token_id=tokenizer.eos_token_id,
+         num_beams=20,
+         max_length=256,
+         # min_length=200,
+         do_sample=False,  # beam search; top_p and top_k only take effect when sampling
+         top_p=0.5,
+         top_k=1,
+         repetition_penalty=3.0,
+         # temperature=0.8,
+         no_repeat_ngram_size=3,
+     )[0]['generated_text']
+     # The pipeline returns the prompt plus the completion; keep the text after "Answer:".
+     try:
+         answer = re.findall("Answer:(.*)", pred_text, re.S)[-1]
+     except IndexError:
+         answer = "None"
+     return answer
+
  text='تعلَّمْ من أخطائِكَ'
+ generate(strip_tatweel(strip_diacritics(text)))
+ #' تعلم ( تعلم : فعل ) من ( من : حرف جر ) أخطائك ( اخطاء : اسم ، ك : ضمير )'
  ```
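
The sample output in the snippet above groups each word with its segments as `word ( form : tag ، form : tag )`. The sketch below is not part of the README. It assumes that every output follows the layout of that single sample, which the source does not confirm, and turns the string into a structured list. The function name `parse_analysis` is hypothetical.

```python
import re


def parse_analysis(output: str):
    """Parse 'word ( form : tag ، form : tag )' groups into (word, [(form, tag), ...])."""
    parsed = []
    # Each match is a word followed by one parenthesised analysis.
    for word, body in re.findall(r'(\S+)\s*\(([^()]*)\)', output):
        segments = []
        # Segments inside the parentheses are separated by the Arabic comma.
        for part in body.split('،'):
            form, sep, tag = part.partition(':')
            if sep:  # skip fragments that have no "form : tag" pair
                segments.append((form.strip(), tag.strip()))
        parsed.append((word, segments))
    return parsed


sample = ' تعلم ( تعلم : فعل ) من ( من : حرف جر ) أخطائك ( اخطاء : اسم ، ك : ضمير )'
for word, segments in parse_analysis(sample):
    print(word, segments)
```

A regular expression is enough here because the sample contains no nested parentheses; if real outputs deviate from this layout, the parser would need adjusting.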
 
 
  ### Results
 
+ | Epoch | Training Loss | Validation Loss |
+ |-------|---------------|-----------------|
+ | 1     | 0.108500      | 0.082612        |
 
 
  ## Model Card Contact
 