---
license: cc-by-nc-sa-4.0
language:
- en
- ko
metrics:
- bleu
pipeline_tag: text2text-generation
tags:
- nmt
- aihub
---

# ENKO-T5-SMALL-V0

This model is an English-to-Korean machine translation model based on the T5-small architecture, trained from scratch.
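
For reference, a minimal inference sketch using the Hugging Face `transformers` API is below. The repo id is a placeholder assumption; substitute the actual model id when loading.

```python
# Minimal inference sketch; "kh-kim/enko-t5-small-v0" is a hypothetical
# repo id, so replace it with this model's actual Hub id.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "kh-kim/enko-t5-small-v0"  # assumption, not confirmed
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

inputs = tokenizer("The weather is nice today.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```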

#### Code

The training code comes from my lecture ([LLM을 위한 김기현의 NLP EXPRESS](https://fastcampus.co.kr/data_online_nlpexpress)), published on [FastCampus](https://fastcampus.co.kr/). You can find the training code in this GitHub [repo](https://github.com/kh-kim/nlp-express-practice).

#### Dataset

The training data mainly comes from [AI-Hub](https://www.aihub.or.kr/) and consists of 11M parallel English-Korean samples.
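
As a rough illustration, one parallel pair can be turned into seq2seq training features as sketched below. The exact preprocessing used for this model is not specified here, so the field names, max length, and repo id are assumptions.

```python
# Hedged sketch of preparing one parallel pair for seq2seq training;
# field names, max_length, and the repo id are assumptions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("kh-kim/enko-t5-small-v0")  # hypothetical id
pair = {"en": "I like coffee.", "ko": "나는 커피를 좋아한다."}
enc = tokenizer(pair["en"], text_target=pair["ko"], truncation=True, max_length=128)
# enc["input_ids"] feed the encoder; enc["labels"] supervise the decoder.
```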

#### Tokenizer

I use a byte-level BPE tokenizer shared between the source and target languages. Since a single vocabulary must cover both languages, the tokenizer vocab size is 60k.
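
A minimal sketch of how such a shared tokenizer could be trained with the `tokenizers` library follows; the corpus file names and special tokens are assumptions, not the exact recipe used for this model.

```python
# Sketch: train a byte-level BPE tokenizer on the combined EN/KO corpus.
# File names and special tokens are assumptions.
from tokenizers import ByteLevelBPETokenizer

bbpe = ByteLevelBPETokenizer()
bbpe.train(
    files=["corpus.en", "corpus.ko"],   # hypothetical corpus files
    vocab_size=60_000,                  # shared vocab covering both languages
    special_tokens=["<pad>", "</s>", "<unk>"],
)
bbpe.save_model("tokenizer")            # writes vocab.json and merges.txt
```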

#### Architecture

The model architecture is based on T5-small, a popular encoder-decoder architecture. Please note that this model is trained from scratch, not fine-tuned.
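
For illustration, initializing a randomly weighted T5-small with the 60k shared vocabulary might look like the sketch below; the hyperparameters mirror the stock `t5-small` configuration and are assumptions for this particular model.

```python
# Sketch: build T5-small from scratch (random weights) rather than
# fine-tuning a pretrained checkpoint. Hyperparameters follow the
# stock t5-small config; treat them as assumptions here.
from transformers import T5Config, T5ForConditionalGeneration

config = T5Config(
    vocab_size=60_000,     # shared byte-level BPE vocab
    d_model=512,
    d_ff=2048,
    num_layers=6,          # encoder depth
    num_decoder_layers=6,  # decoder depth
    num_heads=8,
)
model = T5ForConditionalGeneration(config)  # random init, no pretrained weights
print(f"{model.num_parameters():,} parameters")
```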

#### Evaluation

I evaluated the model on 5 different test sets. The following figures show the BLEU score on each test set and the average across test sets.

![](images/enko.png)

![](images/avg.png)

The DEEPCL model is a private version of this model, trained on much more data.
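
For reference, a hedged sketch of computing BLEU with `sacrebleu` is below; the exact test sets, references, and scoring options behind the figures above are not reproduced here.

```python
# Sketch: corpus-level BLEU with sacrebleu. The sentences are toy
# examples, not drawn from the actual test sets.
import sacrebleu

hyps = ["나는 커피를 좋아한다."]       # model outputs
refs = [["나는 커피를 좋아합니다."]]   # one reference stream, parallel to hyps
print(sacrebleu.corpus_bleu(hyps, refs).score)
```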

#### Contact

Kim Ki Hyun ([email protected])