# Kinyarwanda-to-English Machine Translation

This is a Kinyarwanda-to-English machine translation model. It was built and trained with the JoeyNMT framework, uses a Transformer encoder-decoder architecture, and was trained on a 47,211-sentence-pair English-Kinyarwanda bitext dataset prepared by Digital Umuganda.

## Model architecture

**Encoder & Decoder**

* Type: Transformer
* Num_layers: 6
* Num_heads: 8
* Embedding_dim: 256
* ff_size: 1024
* Dropout: 0.1
* Layer_norm: post
* Initializer: xavier
* Total params: 12,563,968
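
These hyperparameters correspond to the `model` section of a JoeyNMT YAML config. A minimal sketch is shown below; the key names follow JoeyNMT's config format, but the exact schema should be checked against the installed version:

```yaml
# Sketch of a JoeyNMT `model` config section matching the hyperparameters
# above; verify key names against your JoeyNMT version.
model:
    initializer: "xavier"
    encoder:
        type: "transformer"
        num_layers: 6
        num_heads: 8
        embeddings:
            embedding_dim: 256
        hidden_size: 256
        ff_size: 1024
        dropout: 0.1
        layer_norm: "post"
    decoder:
        type: "transformer"
        num_layers: 6
        num_heads: 8
        embeddings:
            embedding_dim: 256
        hidden_size: 256
        ff_size: 1024
        dropout: 0.1
        layer_norm: "post"
```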

## Pre-processing

* Tokenizer_type: subword-nmt
* num_merges: 4000
* BPE encoding learned on the bitext, with separate vocabularies for each language
* Pretokenizer: None
* No lowercasing applied
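
In a JoeyNMT config these choices live under the `data` section. A minimal sketch, with hypothetical file paths, might look like this:

```yaml
# Sketch of the per-language tokenizer settings in a JoeyNMT `data`
# section; vocabulary and BPE-codes paths are placeholders.
data:
    src:
        lang: "rw"
        level: "bpe"
        lowercase: False
        voc_file: "vocab_rw.txt"          # hypothetical path
        tokenizer_type: "subword-nmt"
        tokenizer_cfg:
            num_merges: 4000
            codes: "bpe_codes_rw.txt"     # hypothetical path
            pretokenizer: "none"
    trg:
        lang: "en"
        level: "bpe"
        lowercase: False
        voc_file: "vocab_en.txt"          # hypothetical path
        tokenizer_type: "subword-nmt"
        tokenizer_cfg:
            num_merges: 4000
            codes: "bpe_codes_en.txt"     # hypothetical path
            pretokenizer: "none"
```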

## Training

* Optimizer: Adam
* Loss: crossentropy
* Epochs: 30
* Batch_size: 256
* Number of GPUs: 1
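
The corresponding `training` section of a JoeyNMT config would look roughly like this (a sketch; only the settings listed above are shown):

```yaml
# Sketch of a JoeyNMT `training` section matching the settings above.
training:
    optimizer: "adam"
    loss: "crossentropy"
    epochs: 30
    batch_size: 256
    use_cuda: True    # one GPU
```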

## Evaluation

* Evaluation_metrics: BLEU, chrF
* Tokenization: None
* Beam_width: 15
* Beam_alpha: 1.0
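
These decoding and metric settings map onto the `testing` section of a JoeyNMT config. A minimal sketch, assuming JoeyNMT 2.0 key names:

```yaml
# Sketch of a JoeyNMT `testing` section matching the settings above;
# verify key names against your JoeyNMT version.
testing:
    beam_size: 15
    beam_alpha: 1.0
    eval_metrics: ["bleu", "chrf"]
    sacrebleu_cfg:
        tokenize: "none"
```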

## Tools

* JoeyNMT 2.0.0
* datasets
* pandas
* numpy
* transformers
* sentencepiece
* PyTorch (with CUDA)
* sacrebleu
* protobuf>=3.20.1

## How to train

[See the JoeyNMT repository for more information](https://github.com/joeynmt/joeynmt)
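
With JoeyNMT installed (see the Translation section below), training is launched from the JoeyNMT CLI with a YAML config; the config path here is illustrative:

> $ python -m joeynmt train configs/args.yaml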

## Translation

To install JoeyNMT run:

> $ git clone https://github.com/joeynmt/joeynmt.git
> $ cd joeynmt
> $ pip install -e .

Interactive translation (stdin):

> $ python -m joeynmt translate configs/args.yaml

File translation:

> $ python -m joeynmt translate configs/args.yaml < src_lang.txt > hypothesis_trg_lang.txt

## Accuracy measurement

Sacrebleu installation:

> $ pip install sacrebleu

Measurement (BLEU, chrF):

> $ sacrebleu reference.tsv -i hypothesis.tsv -m bleu chrf
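
Since evaluation above uses no tokenization, the same behavior can be requested explicitly through sacrebleu's tokenizer flag (a sketch; sacrebleu's default BLEU tokenizer is `13a`):

> $ sacrebleu reference.tsv -i hypothesis.tsv -m bleu chrf -tok none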

## To-do

* Test the model on different datasets, including JW300
* Use the Digital Umuganda dataset with some of the available state-of-the-art (SOTA) models
* Expand the dataset

## Results

The following results were obtained using sacrebleu.

Kinyarwanda-to-English:

> BLEU: 79.87
> chrF: 84.40