naghamghanim committed
Commit d33c9e7 · verified · 1 Parent(s): 7a5087b

Upload 6 files

Files changed (6)
  1. README.md +39 -0
  2. config.json +17 -0
  3. special_tokens_map.json +1 -0
  4. tokenizer.json +0 -0
  5. tokenizer_config.json +1 -0
  6. vocab.txt +0 -0
README.md ADDED
@@ -0,0 +1,39 @@
+ ---
+ license: mit
+ datasets:
+ - Wojood
+ tags:
+ - Named Entity Recognition
+ - Arabic NER
+ - Nested NER
+ language:
+ - ar
+ metrics:
+ - f1
+ - precision
+ - recall
+ pipeline_tag: token-classification
+ ---
+
+ ## Wojood - Nested/Flat Arabic NER Models
+ Wojood is a corpus for Arabic nested Named Entity Recognition (NER). Nested entities occur when one entity mention is embedded inside another entity mention. The corpus contains 550K tokens (MSA and dialect). This repo contains the source code to train Wojood nested NER models.
+
+ ### Online Demo
+ You can try our model using the links below:
+
+ Demo: https://sina.birzeit.edu/wojood/
+
+ Paper: https://arxiv.org/abs/2205.09651
+
+ Base model: https://huggingface.co/aubmindlab/bert-base-arabertv2/tree/main
+
+ ### Models
+ * Nested NER (main branch), with a micro-F1 score of 0.909551
+ * Flat NER (flat branch), with a micro-F1 score of 0.883847
+
+ ### Google Colab Notebooks
+ You can test our model using our Google Colab notebooks:
+ * Train flat NER: https://gist.github.com/mohammedkhalilia/72c3261734d7715094089bdf4de74b4a
+ * Evaluate your data using the flat NER model: https://gist.github.com/mohammedkhalilia/c807eb1ccb15416b187c32a362001665
+ * Train nested NER: https://gist.github.com/mohammedkhalilia/a4d83d4e43682d1efcdf299d41beb3da
+ * Evaluate your data using the nested NER model: https://gist.github.com/mohammedkhalilia/9134510aa2684464f57de7934c97138b
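
To load these files for inference or fine-tuning, a minimal `transformers` sketch is below. Note that `config.json` in this commit lists `BertForMaskedLM`, i.e. the AraBERT encoder; the repo id and label count here are placeholders, and a token-classification head loaded this way starts from random weights until it is fine-tuned on Wojood (e.g. via the Colab notebooks above).

```python
# A minimal sketch, not the official training code from the Colab notebooks.
# REPO_ID and NUM_LABELS are placeholders: point REPO_ID at wherever these
# files are hosted, and set NUM_LABELS to the Wojood tag set you train with.
from transformers import AutoTokenizer, AutoModelForTokenClassification

REPO_ID = "path/to/this-repo"  # placeholder
NUM_LABELS = 21                # placeholder: number of Wojood entity labels

tokenizer = AutoTokenizer.from_pretrained(REPO_ID)
model = AutoModelForTokenClassification.from_pretrained(REPO_ID, num_labels=NUM_LABELS)

# Run one Arabic sentence through the encoder + (untrained) classification head.
inputs = tokenizer("ولد أحمد في القدس", return_tensors="pt")
logits = model(**inputs).logits  # shape: (1, sequence_length, NUM_LABELS)
print(logits.shape)
```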
config.json ADDED
@@ -0,0 +1,17 @@
+ {
+ "architectures": [
+ "BertForMaskedLM"
+ ],
+ "attention_probs_dropout_prob": 0.1,
+ "hidden_act": "gelu",
+ "hidden_dropout_prob": 0.1,
+ "hidden_size": 768,
+ "initializer_range": 0.02,
+ "intermediate_size": 3072,
+ "max_position_embeddings": 512,
+ "model_type": "bert",
+ "num_attention_heads": 12,
+ "num_hidden_layers": 12,
+ "type_vocab_size": 2,
+ "vocab_size": 64000
+ }
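
These hyperparameters describe a standard BERT-base encoder (12 layers, 12 heads, hidden size 768) paired with AraBERT's 64,000-entry WordPiece vocabulary. As a cross-check, the same architecture can be restated programmatically; this is just the config above expressed in code, not anything the repo itself ships:

```python
# A minimal sketch: the config.json above expressed as a transformers BertConfig.
from transformers import BertConfig

config = BertConfig(
    vocab_size=64000,                  # AraBERT WordPiece vocabulary size
    hidden_size=768,                   # BERT-base width
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    hidden_act="gelu",
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    max_position_embeddings=512,
    type_vocab_size=2,
    initializer_range=0.02,
)
print(config.model_type)  # "bert"
```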
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"do_lower_case": false, "max_len": 512, "do_basic_tokenize": true, "never_split": ["+ูƒ", "+ูƒู…ุง", "ูƒ+", "+ูˆุง", "+ูŠู†", "ูˆ+", "+ูƒู†", "+ุงู†", "+ู‡ู…", "+ุฉ", "[ุจุฑูŠุฏ]", "ู„ู„+", "+ูŠ", "+ุช", "+ู†", "ุณ+", "ู„+", "[ู…ุณุชุฎุฏู…]", "+ูƒู…", "+ุง", "ุจ+", "ู+", "+ู†ุง", "+ู‡ุง", "+ูˆู†", "+ู‡ู…ุง", "ุงู„+", "+ู‡", "+ู‡ู†", "+ุงุช", "[ุฑุงุจุท]"], "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "special_tokens_map_file": null, "name_or_path": "aubmindlab/bert-large-arabertv2"}
vocab.txt ADDED
The diff for this file is too large to render. See raw diff