fenchri commited on
Commit
4a3c72a
·
1 Parent(s): cb08852

Update README.md

  1. README.md +124 -0
README.md CHANGED
@@ -1,3 +1,127 @@
1
  ---
2
  license: apache-2.0
3
  ---
1
  ---
2
  license: apache-2.0
3
+ language:
4
+ - af
5
+ - ar
6
+ - bg
7
+ - bn
8
+ - de
9
+ - el
10
+ - en
11
+ - es
12
+ - et
13
+ - eu
14
+ - fa
15
+ - fi
16
+ - fr
17
+ - he
18
+ - hi
19
+ - hu
20
+ - id
21
+ - it
22
+ - ja
23
+ - jv
24
+ - ka
25
+ - kk
26
+ - ko
27
+ - ml
28
+ - mr
29
+ - ms
30
+ - my
31
+ - nl
32
+ - pt
33
+ - ru
34
+ - sw
35
+ - ta
36
+ - te
37
+ - th
38
+ - tl
39
+ - tr
40
+ - ur
41
+ - vi
42
+ - yo
43
+ - zh
44
  ---
45
+
46
+ # Model Card for EntityCS-39-PEP_MS-xlmr-base
47
+
48
+ This model has been trained on the EntityCS corpus, a multilingual corpus built from Wikipedia in which entities are replaced with their counterparts in different languages.
49
+ The corpus is available at [https://huggingface.co/huawei-noah/entity_cs](https://huggingface.co/huawei-noah/entity_cs); see the link for more details.
50
+
51
+ To integrate entity-level cross-lingual knowledge into the model, we propose Entity Prediction objectives, where we only mask subwords belonging
52
+ to an entity. By predicting the masked entities in EntityCS sentences, we expect the model to capture the semantics of the same entity in different
53
+ languages.
54
+ Two different masking strategies are proposed for predicting entities: Whole Entity Prediction (`WEP`) and Partial Entity Prediction (`PEP`).
55
+
56
+ In WEP, motivated by [Sun et al. (2019)](https://arxiv.org/abs/1904.09223), who also adopt whole word masking, we consider all the words (and consequently subwords) inside
57
+ an entity as masking candidates. Then, 80% of the time we mask every subword inside an entity, and
58
+ 20% of the time we keep the subwords intact. Note that, as our goal is to predict the entire masked
59
+ entity, we do not allow replacing with Random subwords, since it can introduce noise and result
60
+ in the model predicting incorrect entities. After entities are masked, we remove the entity indicators
61
+ `<e>`, `</e>` from the sentences before feeding them to the model.
62
+
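The WEP procedure described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `wep_mask`, the `(start, end)` span format, and the choice to treat intact entities as prediction targets are all assumptions made for the example. It also assumes the `<e>`, `</e>` indicators have already been removed and that the mask token is XLM-R's `<mask>`.

```python
import random

MASK = "<mask>"  # XLM-R mask token


def wep_mask(subwords, entity_spans, p=0.8, rng=None):
    """Whole Entity Prediction sketch (hypothetical helper).

    subwords: list of subword strings with entity indicators removed.
    entity_spans: list of (start, end) index pairs, end exclusive.
    Returns (inputs, labels), where labels[i] holds the original subword
    at prediction positions and None elsewhere.
    """
    rng = rng or random.Random()
    inputs = list(subwords)
    labels = [None] * len(subwords)
    for start, end in entity_spans:
        # One decision per entity: mask all of its subwords or none of them.
        mask_entity = rng.random() < p
        for i in range(start, end):
            # Assumption: intact entities are still prediction targets.
            labels[i] = subwords[i]
            if mask_entity:
                inputs[i] = MASK
    return inputs, labels
```

Because the decision is drawn once per entity, an entity is never partially masked, and no Random subword replacement is performed.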
63
+ For PEP, we also consider all entities as masking candidates. In contrast to WEP, we do not force
64
+ subwords belonging to one entity to be either all masked or all unmasked. Instead, each individual
65
+ entity subword is masked 80% of the time. For the remaining 20% of the masking candidates, we experiment with three different replacements. First,
66
+ PEP<sub>MRS</sub> corresponds to the conventional 80-10-10 masking strategy, where 10% of the remaining
67
+ subwords are replaced with Random subwords and the other 10% are kept unchanged. In the second
68
+ setting, PEP<sub>MS</sub>, we remove the 10% Random subwords substitution, i.e. we predict the 80% masked
69
+ subwords and 10% Same subwords from the masking candidates. In the third setting, PEP<sub>M</sub>, we
70
+ further remove the 10% Same subwords prediction, essentially predicting only the masked subwords.
71
+
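The three PEP variants above can be sketched as a per-subword decision. This is an illustrative reading of the description, not the authors' code: the names `pep_mask` and `PEP_VARIANTS`, the span format, and the handling of the leftover probability mass (subwords left unchanged and not predicted) are assumptions.

```python
import random

MASK = "<mask>"  # XLM-R mask token

# (mask, random, same) probabilities per entity subword for each variant.
PEP_VARIANTS = {
    "MRS": (0.8, 0.1, 0.1),
    "MS": (0.8, 0.0, 0.1),
    "M": (0.8, 0.0, 0.0),
}


def pep_mask(subwords, entity_spans, variant="MS", vocab=None, rng=None):
    """Partial Entity Prediction sketch (hypothetical helper).

    Returns (inputs, labels), where labels[i] holds the original subword
    at prediction positions and None elsewhere.
    """
    rng = rng or random.Random()
    p_mask, p_rand, p_same = PEP_VARIANTS[variant]
    if p_rand > 0 and not vocab:
        raise ValueError("a vocabulary is needed for Random replacement")
    inputs = list(subwords)
    labels = [None] * len(subwords)
    for start, end in entity_spans:
        for i in range(start, end):
            r = rng.random()  # independent draw for every entity subword
            if r < p_mask:
                inputs[i] = MASK
                labels[i] = subwords[i]
            elif r < p_mask + p_rand:
                inputs[i] = rng.choice(vocab)
                labels[i] = subwords[i]
            elif r < p_mask + p_rand + p_same:
                labels[i] = subwords[i]  # Same: kept intact but predicted
            # Otherwise the subword is left unchanged and is not predicted.
    return inputs, labels
```

With `variant="MS"`, the setting used for this checkpoint, roughly 80% of entity subwords are masked and a further 10% are predicted without being changed.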
72
+ Prior work has shown it is effective to combine
73
+ Entity Prediction with MLM for cross-lingual transfer ([Jiang et al., 2020](https://aclanthology.org/2020.emnlp-main.479/)); therefore, we investigate the
74
+ combination of the Entity Prediction objectives together with MLM on non-entity subwords. Specifically, when combined with MLM, we lower the
75
+ entity masking probability (p) to 50% to roughly keep the same overall masking percentage.
76
+ This results in the following objectives: WEP + MLM, PEP<sub>MRS</sub> + MLM, PEP<sub>MS</sub> + MLM, PEP<sub>M</sub> + MLM.
77
+
78
+ This model was trained with the **PEP<sub>MS</sub>** objective on the EntityCS corpus with 39 languages.
79
+
80
+
81
+ ## Model Details
82
+
83
+ ### Training Details
84
+
85
+ We start from the [XLM-R-base](https://huggingface.co/xlm-roberta-base) model and train for 1 epoch on 8 Nvidia V100 32GB GPUs.
86
+ We set the per-GPU batch size to 16 and gradient accumulation steps to 2, resulting in an effective batch size of 256 (16 × 2 × 8 GPUs).
87
+ To speed up training, we use fp16 mixed precision.
88
+ We use the sampling strategy proposed by [Conneau and Lample (2019)](https://proceedings.neurips.cc/paper/2019/file/c04c19c2c2474dbf5f7ac4372c5b9af1-Paper.pdf), where high-resource languages are down-sampled and low-resource
89
+ languages are sampled more frequently.
90
+ We train only the embeddings and the last two layers of the model.
91
+ We randomly choose 100 sentences from each language to serve as a validation set, on which we measure the perplexity every 10K training steps.
92
+
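The language sampling strategy mentioned above is commonly implemented as exponentially smoothed multinomial sampling, q_i = p_i^α / Σ_j p_j^α, where p_i is the share of sentences in language i. A minimal sketch follows; the function name, the example counts, and the value `alpha=0.5` are assumptions for illustration, since the exponent used for this model is not stated here.

```python
def sampling_probs(sentence_counts, alpha=0.5):
    """Smoothed language sampling probabilities (hypothetical helper).

    sentence_counts: dict mapping language code to number of sentences.
    alpha < 1 flattens the distribution: high-resource languages are
    down-sampled and low-resource languages are sampled more often.
    """
    total = sum(sentence_counts.values())
    smoothed = {lang: (n / total) ** alpha for lang, n in sentence_counts.items()}
    norm = sum(smoothed.values())
    return {lang: w / norm for lang, w in smoothed.items()}


# Made-up counts: English holds 90% of the data, Swahili 10%.
probs = sampling_probs({"en": 900, "sw": 100}, alpha=0.5)
```

In this toy setting English drops from 90% of the raw data to 75% of the sampled sentences, while Swahili rises from 10% to 25%.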
93
+ **This checkpoint corresponds to the one with the lowest perplexity on the validation set.**
94
+
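Checkpoint selection by validation perplexity can be sketched as below. The sketch assumes perplexity is computed as the exponential of the mean negative log-likelihood over predicted subwords; the helper names and the perplexity values are hypothetical and do not come from the actual training run.

```python
import math


def perplexity(token_nlls):
    """Perplexity from per-subword negative log-likelihoods (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def best_checkpoint(val_ppl_by_step):
    """Return the training step whose validation perplexity is lowest."""
    return min(val_ppl_by_step, key=val_ppl_by_step.get)


# Made-up perplexities, measured every 10K steps as described above.
history = {10_000: 12.3, 20_000: 9.8, 30_000: 10.1}
selected_step = best_checkpoint(history)
```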
95
+
96
+ ## Usage
97
+
98
+ The current model can be used for further fine-tuning on downstream tasks.
99
+ In the paper, we focused on entity-related tasks, such as NER, Word Sense Disambiguation, Fact Retrieval and Slot Filling.
100
+
101
+ ## How to Get Started with the Model
102
+
103
+ Use the code in the following repository to get started with the model: https://github.com/huawei-noah/noah-research/tree/master/NLP/EntityCS
104
+
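For a quick check outside the repository above, the checkpoint can likely be loaded with the Transformers `fill-mask` pipeline. This is a sketch under assumptions: the Hub repository id `huawei-noah/EntityCS-39-PEP_MS-xlmr-base` is inferred from the model card title and the corpus organisation and may differ, and `load_fill_mask` and `mask_entity` are hypothetical helpers.

```python
MODEL_ID = "huawei-noah/EntityCS-39-PEP_MS-xlmr-base"  # assumed Hub repository id


def load_fill_mask(model_id=MODEL_ID):
    """Build a fill-mask pipeline; downloads the weights on first use."""
    # Imported lazily so that mask_entity works without transformers installed.
    from transformers import pipeline
    return pipeline("fill-mask", model=model_id)


def mask_entity(sentence, entity, mask_token="<mask>"):
    """Replace the first occurrence of `entity` with one XLM-R mask token."""
    if entity not in sentence:
        raise ValueError(f"{entity!r} not found in sentence")
    return sentence.replace(entity, mask_token, 1)
```

Calling `load_fill_mask()(mask_entity("Paris is the capital of France.", "France"))` should return a list of candidate fills, each a dict with `token_str` and `score` keys. A single mask token covers only one subword, so multi-subword entities need one mask per subword.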
105
+ ## Citation
106
+
107
+ **BibTeX:**
108
+
109
+ ```bibtex
110
+ @inproceedings{whitehouse-etal-2022-entitycs,
111
+ title = "{E}ntity{CS}: Improving Zero-Shot Cross-lingual Transfer with Entity-Centric Code Switching",
112
+ author = "Whitehouse, Chenxi and
113
+ Christopoulou, Fenia and
114
+ Iacobacci, Ignacio",
115
+ booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
116
+ month = dec,
117
+ year = "2022",
118
+ address = "Abu Dhabi, United Arab Emirates",
119
+ publisher = "Association for Computational Linguistics",
120
+ url = "https://aclanthology.org/2022.findings-emnlp.499",
121
+ pages = "6698--6714"
122
+ }
123
+ ```
124
+
125
+ ## Model Card Contact
126
+
127
+ [Fenia Christopoulou]([email protected])