Ahmed Abdelali committed
Commit f089a56 · 1 Parent(s): ffc3109

push farasa base model
.gitattributes CHANGED
@@ -14,3 +14,5 @@
  *.pb filter=lfs diff=lfs merge=lfs -text
  *.pt filter=lfs diff=lfs merge=lfs -text
  *.pth filter=lfs diff=lfs merge=lfs -text
+ pytorch_model.bin filter=lfs diff=lfs merge=lfs -text
+ model.ckpt.data-00000-of-00001 filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,84 @@
+ ---
+ language: ar
+ tags:
+ - pytorch
+ - tf
+ - QARiB
+ - qarib
+ datasets:
+ - arabic_billion_words
+ - open_subtitles
+ - twitter
+ - Farasa
+ metrics:
+ - f1
+ widget:
+ - text: "و+قام ال+مدير [MASK]"
+ ---
+ # QARiB: QCRI Arabic and Dialectal BERT
+ ## About QARiB Farasa
+ The QCRI Arabic and Dialectal BERT (QARiB) model was trained on a collection of ~420 million tweets and ~180 million sentences of text.
+ The tweets were collected using the Twitter API with the language filter `lang:ar`. The text data was a combination of
+ [Arabic GigaWord](url), [Abulkhair Arabic Corpus](), and [OPUS](http://opus.nlpl.eu/).
+ QARiB is the Arabic word for "boat".
+ ## Model and Parameters:
+ - Data size: 14B tokens
+ - Vocabulary: 64k
+ - Iterations: 10M
+ - Number of Layers: 12
+ ## Training QARiB
+ See the details in [Training QARiB](https://github.com/qcri/QARIB/Training_QARiB.md).
+ ## Using QARiB
+ You can use the raw model for either masked language modeling or next sentence prediction, but it is mostly intended to be fine-tuned on a downstream task. See the model hub for fine-tuned versions on a task that interests you. For more details, see [Using QARiB](https://github.com/qcri/QARIB/Using_QARiB.md).
+
+ This model expects the input to be segmented. You may use the [Farasa Segmenter](https://farasa-api.qcri.org/segmentation/) API.
+
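+ For example, a minimal local-segmentation sketch using the third-party `farasapy` package (an assumption; any wrapper around the same Farasa segmenter will do):
+ ```python
+ # A sketch: produce the "+"-delimited segmentation QARiB expects,
+ # assuming the third-party farasapy package (pip install farasapy).
+ from farasa.segmenter import FarasaSegmenter
+
+ segmenter = FarasaSegmenter(interactive=True)
+ print(segmenter.segment("وقام المدير"))  # expected output: "و+قام ال+مدير"
+ ```
+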
+ ### How to use
+ You can use this model directly with a pipeline for masked language modeling:
+ ```python
+ >>> from transformers import pipeline
+ >>> fill_mask = pipeline("fill-mask", model="qarib/bert-base-qarib_far")
+ >>> fill_mask("و+قام ال+مدير [MASK]")
+ [
+ {'sequence': '[CLS] وقام المدير بالعمل [SEP]', 'score': 0.0678194984793663, 'token': 4230, 'token_str': 'بالعمل'},
+ {'sequence': '[CLS] وقام المدير بذلك [SEP]', 'score': 0.05191086605191231, 'token': 984, 'token_str': 'بذلك'},
+ {'sequence': '[CLS] وقام المدير بالاتصال [SEP]', 'score': 0.045264165848493576, 'token': 26096, 'token_str': 'بالاتصال'},
+ {'sequence': '[CLS] وقام المدير بعمله [SEP]', 'score': 0.03732728958129883, 'token': 40486, 'token_str': 'بعمله'},
+ {'sequence': '[CLS] وقام المدير بالامر [SEP]', 'score': 0.0246378555893898, 'token': 29124, 'token_str': 'بالامر'}
+ ]
+ >>> fill_mask("و+قام+ت ال+مدير+ة [MASK]")
+ [{'sequence': '[CLS] وقامت المديرة بذلك [SEP]', 'score': 0.23992691934108734, 'token': 984, 'token_str': 'بذلك'},
+ {'sequence': '[CLS] وقامت المديرة بالامر [SEP]', 'score': 0.108805812895298, 'token': 29124, 'token_str': 'بالامر'},
+ {'sequence': '[CLS] وقامت المديرة بالعمل [SEP]', 'score': 0.06639821827411652, 'token': 4230, 'token_str': 'بالعمل'},
+ {'sequence': '[CLS] وقامت المديرة بالاتصال [SEP]', 'score': 0.05613093823194504, 'token': 26096, 'token_str': 'بالاتصال'},
+ {'sequence': '[CLS] وقامت المديرة المديرة [SEP]', 'score': 0.021778125315904617, 'token': 41635, 'token_str': 'المديرة'}]
+ >>> fill_mask("قللي وشفيييك يرحم [MASK]")
+ [{'sequence': '[CLS] قللي وشفيييك يرحم والديك [SEP]', 'score': 0.4152909517288208, 'token': 9650, 'token_str': 'والديك'},
+ {'sequence': '[CLS] قللي وشفيييك يرحملي [SEP]', 'score': 0.07663793861865997, 'token': 294, 'token_str': '##لي'},
+ {'sequence': '[CLS] قللي وشفيييك يرحم حالك [SEP]', 'score': 0.0453166700899601, 'token': 2663, 'token_str': 'حالك'},
+ {'sequence': '[CLS] قللي وشفيييك يرحم امك [SEP]', 'score': 0.04390475153923035, 'token': 1942, 'token_str': 'امك'},
+ {'sequence': '[CLS] قللي وشفيييك يرحمونك [SEP]', 'score': 0.027349254116415977, 'token': 3283, 'token_str': '##ونك'}]
+ ```
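+ For fine-tuning, a minimal loading sketch (the sequence-classification task and `num_labels=2` are illustrative assumptions, not part of the released recipe):
+ ```python
+ # A sketch: load QARiB Farasa for fine-tuning on a downstream task.
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+ tokenizer = AutoTokenizer.from_pretrained("qarib/bert-base-qarib_far")
+ model = AutoModelForSequenceClassification.from_pretrained(
+     "qarib/bert-base-qarib_far", num_labels=2  # num_labels is task-specific
+ )
+ # Inputs must be Farasa-segmented, matching the pre-training data.
+ inputs = tokenizer("و+قام ال+مدير", return_tensors="pt")
+ outputs = model(**inputs)
+ ```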
+ ## Evaluations:
+ |**Experiment** |**mBERT**|**AraBERT0.1**|**AraBERT1.0**|**ArabicBERT**|**QARiB**|
+ |---------------|---------|--------------|--------------|--------------|---------|
+ |Dialect Identification | 6.06% | 59.92% | 59.85% | 61.70% | **65.21%** |
+ |Emotion Detection | 27.90% | 43.89% | 42.37% | 41.65% | **44.35%** |
+ |Named-Entity Recognition (NER) | 49.38% | 64.97% | **66.63%** | 64.04% | 61.62% |
+ |Offensive Language Detection | 83.14% | 88.07% | 88.97% | 88.19% | **91.94%** |
+ |Sentiment Analysis | 86.61% | 90.80% | **93.58%** | 83.27% | 93.31% |
+ ## Model Weights and Vocab Download
+ From the Hugging Face model hub: https://huggingface.co/qarib/bert-base-qarib_far
+ ## Contacts
+ Ahmed Abdelali, Sabit Hassan, Hamdy Mubarak, Kareem Darwish and Younes Samih
+ ## Reference
+ ```
+ @article{abdelali2021pretraining,
+   title={Pre-Training BERT on Arabic Tweets: Practical Considerations},
+   author={Ahmed Abdelali and Sabit Hassan and Hamdy Mubarak and Kareem Darwish and Younes Samih},
+   year={2021},
+   eprint={2102.10684},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL}
+ }
+ ```
config.json ADDED
@@ -0,0 +1,19 @@
+ {
+   "attention_probs_dropout_prob": 0.1,
+   "directionality": "bidi",
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "max_position_embeddings": 512,
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pooler_fc_size": 768,
+   "pooler_num_attention_heads": 12,
+   "pooler_num_fc_layers": 3,
+   "pooler_size_per_head": 128,
+   "pooler_type": "first_token_transform",
+   "type_vocab_size": 2,
+   "vocab_size": 64000
+ }
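
As a rough sanity check (a sketch, not part of the commit): these sizes imply roughly 135M parameters, about 541 MB in float32, in line with the ~543 MB pytorch_model.bin added below.

```python
# Back-of-the-envelope BERT parameter count from the config above.
V, H, I, L, P = 64000, 768, 3072, 12, 512   # vocab, hidden, intermediate, layers, positions

embeddings = (V + P + 2) * H + 2 * H        # word/position/type embeddings + LayerNorm
per_layer = 4 * H * H + 4 * H               # Q, K, V, O projections with biases
per_layer += 2 * H * I + I + H              # feed-forward up/down with biases
per_layer += 4 * H                          # two LayerNorms
pooler = H * H + H                          # pooler dense layer

total = embeddings + L * per_layer + pooler
print(f"~{total / 1e6:.0f}M parameters, ~{total * 4 / 1e6:.0f} MB in float32")
# -> ~135M parameters, ~541 MB in float32 (the checkpoint also stores the MLM head)
```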
model.ckpt.data-00000-of-00001 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e78e8c1df6e1ad6ff226c4f1a2366264fbb53479fabc3783a8a9dd8663f226e4
+ size 1630212128
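
This LFS pointer records only the object's sha256 and size; a downloaded file can be checked against it (a sketch; the local path is hypothetical):

```python
# Verify a downloaded LFS object against the pointer's oid and size.
import hashlib
import os

path = "model.ckpt.data-00000-of-00001"  # hypothetical local path
expected_oid = "e78e8c1df6e1ad6ff226c4f1a2366264fbb53479fabc3783a8a9dd8663f226e4"
expected_size = 1630212128

h = hashlib.sha256()
with open(path, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):  # hash in 1 MiB chunks
        h.update(chunk)

assert os.path.getsize(path) == expected_size
assert h.hexdigest() == expected_oid
```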
model.ckpt.index ADDED
Binary file (9.38 kB).
 
model.ckpt.meta ADDED
Binary file (4.71 MB).
 
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:32c0d5c9228748942e17413080b9eeca1d7869de6c8d8e9b96daf76216649e5d
+ size 543488365
vocab.txt ADDED
The diff for this file is too large to render.