# Contrastive Learning for Many-to-many Multilingual Neural Machine Translation (mCOLT/mRASP2), ACL 2021
The code for training mCOLT/mRASP2, a multilingual neural machine translation training method, implemented on top of [fairseq](https://github.com/pytorch/fairseq).

**mRASP2**: [paper](https://arxiv.org/abs/2105.09501) [blog](https://medium.com/@panxiao1994/mrasp2-multilingual-nmt-advances-via-contrastive-learning-ac8c4c35d63)

**mRASP**: [paper](https://www.aclweb.org/anthology/2020.emnlp-main.210.pdf),
[code](https://github.com/linzehui/mRASP)

---
## News
We have released two versions; this version is the original one. In this implementation:
- You should first merge all data, prepending a language token to each sentence to indicate its language (see the sketch after this list).
- AA/RAS must be done offline (before binarization); see [this toolkit](https://github.com/linzehui/mRASP/blob/master/preprocess).
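
For illustration, here is a minimal sketch of the token-prepending step. The `LANG_TOK_<XX>` format follows the inference command further below; the file names are placeholders, and the official AA/RAS preprocessing lives in the mRASP toolkit linked above:

```bash
# Minimal sketch (not the official pipeline): prepend a language token of the
# form LANG_TOK_<XX> to every sentence of a raw text file before merging and
# binarizing. File names (train.en, train.de) are placeholders.
for lang in en de; do
    tok="LANG_TOK_$(echo ${lang} | tr '[a-z]' '[A-Z]')"
    sed "s/^/${tok} /" train.${lang} > train.tok.${lang}
done
```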

**New implementation**: https://github.com/PANXiao1994/mRASP2/tree/new_impl

* Acknowledgement: This work is supported by [Bytedance](https://bytedance.com). We thank [Chengqi](https://github.com/zhaocq-nlp) for uploading all files and checkpoints.

## Introduction

mRASP2/mCOLT, short for multilingual Contrastive Learning for Transformer, is a multilingual neural machine translation model that supports complete many-to-many multilingual machine translation. It employs both parallel corpora and monolingual corpora in a unified training framework. For detailed information, please refer to the paper.

![img.png](docs/img.png)

## Prerequisites
```bash
pip install -r requirements.txt
# install fairseq
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./
```

## Training Data and Checkpoints
We release our preprocessed training data and checkpoints below.
### Dataset

We merge 32 English-centric language pairs, resulting in 64 directed translation pairs in total. The original corpus of 32 language pairs contains about 197M sentence pairs. After applying RAS we obtain about 262M sentence pairs, since we keep both the original and the substituted sentences. We release both the original dataset and the dataset after applying RAS.

| Dataset | #Pairs |
| --- | --- |
| [32-lang-pairs-TRAIN](https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2021/mrasp2/bin_parallel/download.sh) | 197603294 |
| [32-lang-pairs-RAS-TRAIN](https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2021/mrasp2/bin_parallel_ras/download.sh) | 262662792 |
| [mono-split-a](https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2021/mrasp2/bin_mono_split_a/download.sh) | - |
| [mono-split-b](https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2021/mrasp2/bin_mono_split_b/download.sh) | - |
| [mono-split-c](https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2021/mrasp2/bin_mono_split_c/download.sh) | - |
| [mono-split-d](https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2021/mrasp2/bin_mono_split_d/download.sh) | - |
| [mono-split-e](https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2021/mrasp2/bin_mono_split_e/download.sh) | - |
| [mono-split-de-fr-en](https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2021/mrasp2/bin_mono_de_fr_en/download.sh) | - |
| [mono-split-nl-pl-pt](https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2021/mrasp2/bin_mono_nl_pl_pt/download.sh) | - |
| [32-lang-pairs-DEV-en-centric](https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2021/mrasp2/bin_dev_en_centric/download.sh) | - |
| [32-lang-pairs-DEV-many-to-many](https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2021/mrasp2/bin_dev_m2m/download.sh) | - |
| [Vocab](https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2021/mrasp2/bpe_vocab) | - |
| [BPE Code](https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/emnlp2020/mrasp/pretrain/dataset/codes.bpe.32000) | - |
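
Each dataset entry above links to a `download.sh` script. As a minimal sketch of fetching one split (assuming `wget` is available and that the script downloads the binarized data into the current directory; the target directory name is a placeholder):

```bash
# Fetch and run the download script for the RAS training data (URL taken from the table above).
mkdir -p data/32-lang-pairs-RAS-TRAIN && cd data/32-lang-pairs-RAS-TRAIN
wget https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2021/mrasp2/bin_parallel_ras/download.sh
bash download.sh
```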

### Checkpoints & Results
* **Please note that the provided checkpoints are slightly different from those in the paper.** In the following sections, we report the results of the provided checkpoints.

#### English-centric Directions
We report **tokenized BLEU** in the following table. Click the model links to download the checkpoints, which are in PyTorch format (see eval.sh for details).

| Models | [6e6d-no-mono](https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2021/mrasp2/6e6d_no_mono.pt) | [12e12d-no-mono](https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2021/mrasp2/12e12d_no_mono.pt) | [12e12d](https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2021/mrasp2/12e12d_last.pt) |
| --- | --- | --- | --- |
| en2cs/wmt16 | 21.0 | 22.3 | 23.8 |
| cs2en/wmt16 | 29.6 | 32.4 | 33.2 |
| en2fr/wmt14 | 42.0 | 43.3 | 43.4 |
| fr2en/wmt14 | 37.8 | 39.3 | 39.5 |
| en2de/wmt14 | 27.4 | 29.2 | 29.5 |
| de2en/wmt14 | 32.2 | 34.9 | 35.2 |
| en2zh/wmt17 | 33.0 | 34.9 | 34.1 |
| zh2en/wmt17 | 22.4 | 24.0 | 24.4 |
| en2ro/wmt16 | 26.6 | 28.1 | 28.7 |
| ro2en/wmt16 | 36.8 | 39.0 | 39.1 |
| en2tr/wmt16 | 18.6 | 20.3 | 21.2 |
| tr2en/wmt16 | 22.2 | 25.5 | 26.1 |
| en2ru/wmt19 | 17.4 | 18.5 | 19.2 |
| ru2en/wmt19 | 22.0 | 23.2 | 23.6 |
| en2fi/wmt17 | 20.2 | 22.1 | 22.9 |
| fi2en/wmt17 | 26.1 | 29.5 | 29.7 |
| en2es/wmt13 | 32.8 | 34.1 | 34.6 |
| es2en/wmt13 | 32.8 | 34.6 | 34.7 |
| en2it/wmt09 | 28.9 | 30.0 | 30.8 |
| it2en/wmt09 | 31.4 | 32.7 | 32.8 |
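
For example, to fetch one of the checkpoints above with `wget` (the `checkpoints/` directory is a placeholder):

```bash
# Download the 12e12d checkpoint (PyTorch .pt file; URL taken from the table header above).
mkdir -p checkpoints
wget https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2021/mrasp2/12e12d_last.pt -P checkpoints/
```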

#### Unsupervised Directions
We report **tokenized BLEU** in the following table (see eval.sh for details).

| Direction | 12e12d |
| --- | --- |
| en2pl/wmt20 | 6.2 |
| pl2en/wmt20 | 13.5 |
| en2nl/iwslt14 | 8.8 |
| nl2en/iwslt14 | 27.1 |
| en2pt/opus100 | 18.9 |
| pt2en/opus100 | 29.2 |

#### Zero-shot Directions
* row: source language
* column: target language

We report **[sacreBLEU](https://github.com/mozilla/sacreBLEU)** in the following table.

| 12e12d | ar | zh | nl | fr | de | ru |
| --- | --- | --- | --- | --- | --- | --- |
| ar | - | 32.5 | 3.2 | 22.8 | 11.2 | 16.7 |
| zh | 6.5 | - | 1.9 | 32.9 | 7.6 | 23.7 |
| nl | 1.7 | 8.2 | - | 7.5 | 10.2 | 2.9 |
| fr | 6.2 | 42.3 | 7.5 | - | 18.9 | 24.4 |
| de | 4.9 | 21.6 | 9.2 | 24.7 | - | 14.4 |
| ru | 7.1 | 40.6 | 4.5 | 29.9 | 13.5 | - |

## Training
```bash
export NUM_GPU=4 && bash train_w_mono.sh ${model_config}
```
* We provide an example of `${model_config}` in `${PROJECT_REPO}/examples/configs/parallel_mono_12e12d_contrastive.yml`; a concrete invocation is sketched below.
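
As an illustration, using the example configuration above (`${PROJECT_REPO}` stands for the root of this repository):

```bash
# Train with 4 GPUs using the provided example configuration.
export NUM_GPU=4
bash train_w_mono.sh ${PROJECT_REPO}/examples/configs/parallel_mono_12e12d_contrastive.yml
```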

## Inference
* You must prepend the corresponding language token to the source side before binarizing the test data.
```bash
fairseq-generate ${test_path} \
    --user-dir ${repo_dir}/mcolt \
    -s ${src} \
    -t ${tgt} \
    --skip-invalid-size-inputs-valid-test \
    --path ${ckpts} \
    --max-tokens ${batch_size} \
    --task translation_w_langtok \
    ${options} \
    --lang-prefix-tok "LANG_TOK_"`echo "${tgt} " | tr '[a-z]' '[A-Z]'` \
    --max-source-positions ${max_source_positions} \
    --max-target-positions ${max_target_positions} \
    --nbest 1 | grep -E '[S|H|P|T]-[0-9]+' > ${final_res_file}
python3 ${repo_dir}/scripts/utils.py ${final_res_file} ${ref_file} || exit 1;
```
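
A hypothetical setup of the variables used in the command above (all values are placeholders, not defaults shipped with the repository):

```bash
# Placeholder values for the inference command; adjust to your own paths and language pair.
repo_dir=/path/to/mRASP2                  # root of this repository
test_path=/path/to/binarized_test_data    # binarized test set with language tokens prepended
src=en                                    # source language code
tgt=de                                    # target language code
ckpts=checkpoints/12e12d_last.pt          # downloaded checkpoint
batch_size=4096                           # value passed to --max-tokens
max_source_positions=1024
max_target_positions=1024
options=""                                # any extra fairseq-generate options
final_res_file=result.txt                 # raw S/H/P/T output lines from fairseq-generate
ref_file=/path/to/reference.txt           # reference translations for scoring
```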

## Synonym dictionaries
We use the bilingual synonym dictionaries provided by [MUSE](https://github.com/facebookresearch/MUSE).

We generate multilingual synonym dictionaries using [this script](https://github.com/linzehui/mRASP/blob/master/preprocess/tools/ras/multi_way_word_graph.py), and apply RAS using [this script](https://github.com/linzehui/mRASP/blob/master/preprocess/tools/ras/random_alignment_substitution_w_multi.sh).

| Description | File | Size |
| --- | --- | --- |
| dep=1 | [synonym_dict_raw_dep1](https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2021/mrasp2/synonym_dict_raw_dep1) | 138.0 MB |
| dep=2 | [synonym_dict_raw_dep2](https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2021/mrasp2/synonym_dict_raw_dep2) | 1.6 GB |
| dep=3 | [synonym_dict_raw_dep3](https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2021/mrasp2/synonym_dict_raw_dep3) | 2.2 GB |

## Contact
Please contact me via e-mail `[email protected]`, via [WeChat/Zhihu](https://fork-ball-95c.notion.site/mRASP2-4e9b3450d5aa4137ae1a2c46d5f3c1fa), or join [the Slack group](https://mrasp2.slack.com/join/shared_invite/zt-10k9710mb-MbDHzDboXfls2Omd8cuWqA)!

## Citation
Please cite as:
```
@inproceedings{mrasp2,
  title = {Contrastive Learning for Many-to-many Multilingual Neural Machine Translation},
  author = {Xiao Pan and
            Mingxuan Wang and
            Liwei Wu and
            Lei Li},
  booktitle = {Proceedings of ACL 2021},
  year = {2021},
}
```