myeongho-jeong committed
Commit 24d7d3e · verified · 1 Parent(s): 2a3e064

Update README.md

Files changed (1):
  1. README.md +67 -3
README.md CHANGED
@@ -8,7 +8,7 @@ model-index:
  results: []
  ---
  [<img src="https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/OpenAccess-AI-Collective/axolotl)
- # yanolja/EEVE-Korean-2.8B-v1.0
+ # EEVE-Korean-2.8B-v1.0

  ## Join Our Community on Discord!

@@ -27,8 +27,33 @@ If you're passionate about the field of Large Language Models and wish to exchan
  This model is a Korean vocabulary-extended version of [microsoft/phi-2](https://huggingface.co/microsoft/phi-2), specifically fine-tuned on various Korean web-crawled datasets available on HuggingFace. Our approach was to expand the model's understanding of Korean by pre-training the embeddings for new tokens and partially fine-tuning the `lm_head` embeddings for the already existing tokens while preserving the original parameters of the base model.

  ### Technical Deep Dive
- TBU
+ <p align="left">
+ <img src="https://huggingface.co/yanolja/EEVE-Korean-2.8B-v1.0/resolve/main/EEVE_figure.png" width="100%"/>
+ </p>
+
+ To adapt foundational models from English to Korean, we use subword-based embedding with a seven-stage training process involving parameter freezing.
+ This approach progressively trains from input embeddings to full parameters, efficiently extending the model's vocabulary to include Korean.
+ Our method enhances the model's cross-linguistic applicability by carefully integrating new linguistic tokens, focusing on causal language modeling pre-training.
+ We leverage the inherent capabilities of foundational models trained on English to efficiently transfer knowledge and reasoning to Korean, optimizing the adaptation process.
+
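+ As a rough illustration of this progression (the exact seven stages are specified in our report; the two stages below are simplified placeholders, not the actual recipe):
+
+ ```python
+ # Minimal sketch: an early stage trains only the input embeddings,
+ # while the final stage unfreezes every parameter.
+ def set_stage(model, stage: str) -> None:
+     for name, param in model.named_parameters():
+         if stage == "input_embeddings_only":
+             param.requires_grad = "embed_tokens" in name
+         elif stage == "full_parameters":
+             param.requires_grad = True
+
+ # set_stage(model, "input_embeddings_only")  # early stage
+ # ...train...
+ # set_stage(model, "full_parameters")        # final stage
+ ```
+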
+ For details, please refer to our technical report (TBU): [Efficient and Effective Vocabulary Expansion Towards Multilingual Large Language Models](https://arxiv.org).
+
+ Here’s a simplified version of our key approach in code:
+
+ ```python
+ # number_of_old_tokens is the size of the tokenizer before vocab extension.
+ # For example, in the case of EEVE-Korean-10.8B-v1.0, number_of_old_tokens is 32000.
+ def freeze_partial_embedding_hook(grad):
+     # Zero out gradients for the original tokens so only new-token rows are updated.
+     grad[:number_of_old_tokens] = 0
+     return grad
+
+ for name, param in model.named_parameters():
+     if ("lm_head" in name or "embed_tokens" in name) and "original" not in name:
+         param.requires_grad = True
+         if "embed_tokens" in name:
+             param.register_hook(freeze_partial_embedding_hook)
+     else:
+         param.requires_grad = False
+ ```
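+
+ For context, here is a minimal sketch of how such a hook might be wired up end to end. The token list below is a hypothetical stand-in for the actual selected vocabulary; `resize_token_embeddings`, `add_tokens`, and `register_hook` are standard `transformers`/PyTorch APIs:
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
+ model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
+
+ number_of_old_tokens = len(tokenizer)
+ tokenizer.add_tokens(["안녕하세요", "감사합니다"])  # hypothetical new Korean tokens
+
+ # Grow the input/output embedding matrices so the new tokens get trainable rows.
+ model.resize_token_embeddings(len(tokenizer))
+
+ # From here, register freeze_partial_embedding_hook as above so that only
+ # the rows for the new tokens receive gradient updates.
+ ```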

  ### Usage and Limitations

@@ -36,4 +61,43 @@ Keep in mind that this model hasn't been fine-tuned with instruction-based train

  ### Training Details

- TBU
+ Our model’s training was comprehensive and diverse:
+
+ - **Data Sources:**
+   - English to Korean paragraph pairs: 5.86%
+   - Multi-lingual corpus (primarily English): 10.69%
+   - Korean web content: 83.46%
+
+ - **Vocabulary Expansion:**
+   We meticulously selected 8,960 Korean tokens based on their frequency in our Korean web corpus. This process involved multiple rounds of tokenizer training, manual curation, and token frequency analysis, ensuring a rich and relevant vocabulary for our model. (A code sketch of this pipeline follows the numbered steps below.)
+
+ 1. **Initial Tokenizer Training:** We trained an intermediate tokenizer on a Korean web corpus, with a vocabulary of 40,000 tokens.
+
+ 2. **Extraction of New Korean Tokens:** From the intermediate tokenizer, we identified all Korean tokens not present in the base model's tokenizer.
+
+ 3. **Manual Tokenizer Construction:** We then built the target tokenizer, focusing on these new Korean tokens.
+
+ 4. **Frequency Analysis:** Using the target tokenizer, we processed a 100GB Korean corpus to count each token's frequency.
+
+ 5. **Refinement of Token List:** We removed tokens appearing fewer than 6,000 times, ensuring we kept enough tokens to train the model later.
+
+ 6. **Inclusion of Single-Letter Characters:** We identified Korean single-letter characters missing from the target tokenizer and added those that appeared more than 6,000 times in the corpus.
+
+ 7. **Iterative Refinement:** We repeated steps 2 to 6 until there were no tokens left to drop or add.
+
+ 8. **Training Bias Towards New Tokens:** We biased our training data to include more texts containing the new tokens, for more effective learning.
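+
+ As a sketch of steps 1–6 (the file names are hypothetical, and the frequency count is a crude substring proxy rather than our exact procedure):
+
+ ```python
+ from collections import Counter
+
+ from tokenizers import Tokenizer, models, pre_tokenizers, trainers
+ from transformers import AutoTokenizer
+
+ def is_korean(token: str) -> bool:
+     # Hangul syllables block (U+AC00–U+D7A3).
+     return any("\uac00" <= ch <= "\ud7a3" for ch in token)
+
+ # Step 1: train an intermediate 40,000-token tokenizer on the Korean corpus.
+ intermediate = Tokenizer(models.BPE(unk_token="[UNK]"))
+ intermediate.pre_tokenizer = pre_tokenizers.Whitespace()
+ intermediate.train(["korean_web_corpus.txt"],
+                    trainers.BpeTrainer(vocab_size=40000, special_tokens=["[UNK]"]))
+
+ # Step 2: keep Korean tokens absent from the base tokenizer
+ # (ignoring byte-level encoding details for simplicity).
+ base_vocab = set(AutoTokenizer.from_pretrained("microsoft/phi-2").get_vocab())
+ candidates = {t for t in intermediate.get_vocab() if is_korean(t) and t not in base_vocab}
+
+ # Step 4: count candidate occurrences over the corpus.
+ counts = Counter()
+ with open("korean_web_corpus.txt", encoding="utf-8") as f:
+     for line in f:
+         for tok in candidates:
+             counts[tok] += line.count(tok)
+
+ # Steps 5–6: drop rare tokens; single Hangul characters pass the same threshold.
+ selected = sorted(t for t in candidates if counts[t] >= 6000)
+ ```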
+
+ This rigorous approach ensured a comprehensive and contextually rich Korean vocabulary for the model.
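+
+ Step 8 can be approximated by upsampling documents that contain the new tokens. A minimal sketch, where `selected` comes from the previous sketch and `docs` is a hypothetical in-memory document list:
+
+ ```python
+ import random
+
+ def new_token_count(doc: str, new_tokens: list[str]) -> int:
+     # How many times new tokens occur in this document.
+     return sum(doc.count(t) for t in new_tokens)
+
+ docs = ["...", "..."]  # hypothetical corpus
+ weights = [1 + new_token_count(d, selected) for d in docs]
+
+ # Documents rich in new tokens are drawn more often into the training mix.
+ training_mix = random.choices(docs, weights=weights, k=len(docs))
+ ```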
+
+ ## Citation
+
+ ```bibtex
+ @misc{kim2024efficient,
+       title={Efficient and Effective Vocabulary Expansion Towards Multilingual Large Language Models},
+       author={Seungduk Kim and Seungtaek Choi and Myeongho Jeong},
+       year={2024},
+       eprint={2402.XXXXX},
+       archivePrefix={arXiv},
+       primaryClass={cs.CL}
+ }
+ ```