---
license: apache-2.0
language: fa
widget:
 - text: "این بود [MASK] های ما؟"
 - text: "داداچ داری [MASK] میزنی"
 - text: 'به علی [MASK] میگفتن جادوگر'
 - text: 'آخه محسن [MASK] هم شد خواننده؟'
 - text: 'پسر عجب [MASK] زد'
tags:
- BERTweet
model-index:
- name: BERTweet-FA
  results: []
---

BERTweet-FA: A pre-trained language model for Persian (a.k.a. Farsi) Tweets
---

BERTweet-FA is a transformer-based model trained on 20,665,964 Persian tweets. The model has been trained for only one epoch (322,906 steps), yet it is already able to capture the meaning of most conversational sentences used in Farsi. Note that the architecture of this model follows the original BERT [[Devlin et al.](https://arxiv.org/abs/1810.04805)].

How to use the Model
---
```python
from transformers import BertForMaskedLM, BertTokenizer, pipeline

# Load the pre-trained model and its tokenizer from the Hugging Face Hub
model = BertForMaskedLM.from_pretrained('arm-on/BERTweet-FA')
tokenizer = BertTokenizer.from_pretrained('arm-on/BERTweet-FA')

# Build a fill-mask pipeline and ask the model to fill in the [MASK] token
fill_sentence = pipeline('fill-mask', model=model, tokenizer=tokenizer)
fill_sentence('اینجا جمله مورد نظر خود را بنویسید و کلمه موردنظر را [MASK] کنید')  # "Write your sentence here and [MASK] the target word"
```
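The fill-mask pipeline returns a list of candidate completions, each a dict containing (among other fields) the predicted token (`token_str`) and its probability (`score`). As a minimal sketch (the helper name `top_predictions` is our own, not part of the model card), you can rank and inspect the candidates like this:

```python
def top_predictions(results, k=3):
    """Return the k highest-scoring (token, score) pairs from a
    fill-mask pipeline output (a list of dicts with 'token_str' and 'score')."""
    ranked = sorted(results, key=lambda r: r['score'], reverse=True)
    return [(r['token_str'], r['score']) for r in ranked[:k]]

# Usage with the pipeline defined above:
# for token, score in top_predictions(fill_sentence('این بود [MASK] های ما؟')):
#     print(token, round(score, 4))
```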

The Training Data
---
The first version of the model was trained on the "[Large Scale Colloquial Persian Dataset](https://iasbs.ac.ir/~ansari/lscp/)", which contains more than 20 million Farsi tweets gathered by Khojasteh et al. and published in 2020.

Evaluation
---

| Training Loss | Epoch | Step  |
|:-------------:|:-----:|:-----:|
| 0.0036        | 1.0   | 322,906 |

Contributors
---
- Arman Malekzadeh [[GitHub](https://github.com/arm-on)]