  results: []
---
[<img src="https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/OpenAccess-AI-Collective/axolotl)
# EEVE-Korean-2.8B-v1.0

## Join Our Community on Discord!

This model is a Korean vocabulary-extended version of [microsoft/phi-2](https://huggingface.co/microsoft/phi-2), specifically fine-tuned on various Korean web-crawled datasets available on HuggingFace. Our approach was to expand the model's understanding of Korean by pre-training the embeddings for new tokens and partially fine-tuning the `lm_head` embeddings for the already existing tokens while preserving the original parameters of the base model.

### Technical Deep Dive
<p align="left">
  <img src="https://huggingface.co/yanolja/EEVE-Korean-2.8B-v1.0/resolve/main/EEVE_figure.png" width="100%"/>
</p>

To adapt foundational models from English to Korean, we use subword-based embedding with a seven-stage training process involving parameter freezing.
This approach progressively trains from input embeddings to full parameters, efficiently extending the model's vocabulary to include Korean.
Our method enhances the model's cross-lingual applicability by carefully integrating new linguistic tokens, focusing on causal language modeling pre-training.
We leverage the inherent capabilities of foundational models trained on English to efficiently transfer knowledge and reasoning to Korean, optimizing the adaptation process.
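
As a purely illustrative sketch of this staged idea (not our actual recipe: the real training uses seven stages, detailed in the technical report referenced below, and the stage list and helper here are hypothetical placeholders), progressive unfreezing can be expressed as:

```python
# Illustrative only: progressive unfreezing from input embeddings to full parameters.
# The stage definitions below are placeholders, not the actual seven stages.
ILLUSTRATIVE_STAGES = [
    {"train": ["embed_tokens"]},             # first: input embeddings only
    {"train": ["embed_tokens", "lm_head"]},  # then: output embeddings as well
    {"train": None},                         # finally: all parameters
]


def set_trainable(model, patterns):
    """Freeze every parameter except those whose names match the given patterns.

    patterns=None means the full model is trainable.
    """
    for name, param in model.named_parameters():
        param.requires_grad = patterns is None or any(p in name for p in patterns)


# for stage in ILLUSTRATIVE_STAGES:
#     set_trainable(model, stage["train"])
#     run_training_stage(model)  # placeholder for the per-stage training loop
```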

For details, please refer to our technical report (TBU): [Efficient and Effective Vocabulary Expansion Towards Multilingual Large Language Models](https://arxiv.org).

Here is a simplified version of the code for our key approach:

```python
# number_of_old_tokens is the size of the tokenizer before vocabulary extension.
# For example, in the case of EEVE-Korean-10.8B-v1.0, number_of_old_tokens is 32000.
def freeze_partial_embedding_hook(grad):
    # Zero out gradients for the pre-existing token rows so that only the
    # newly added Korean token embeddings are updated.
    grad[:number_of_old_tokens] = 0
    return grad

for name, param in model.named_parameters():
    if ("lm_head" in name or "embed_tokens" in name) and "original" not in name:
        param.requires_grad = True
        if "embed_tokens" in name:
            param.register_hook(freeze_partial_embedding_hook)
    else:
        param.requires_grad = False
```
57 |
|
58 |
### Usage and Limitations
|
59 |
|
|
|
61 |
|
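
As an illustration only (this is not an official quickstart, and the prompt and generation settings are arbitrary), the model can be loaded with the standard Hugging Face `transformers` API for plain text completion:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "yanolja/EEVE-Korean-2.8B-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Plain completion; remember that this is a base model without instruction tuning.
prompt = "대한민국의 수도는"  # "The capital of South Korea is"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```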

### Training Details

Our model's training was comprehensive and diverse:

- **Data Sources:**
  - English to Korean paragraph pairs: 5.86%
  - Multilingual corpus (primarily English): 10.69%
  - Korean web content: 83.46%

- **Vocabulary Expansion:**
  We meticulously selected 8,960 Korean tokens based on their frequency in our Korean web corpus. This process involved multiple rounds of tokenizer training, manual curation, and token frequency analysis, ensuring a rich and relevant vocabulary for our model.

  1. **Initial Tokenizer Training:** We trained an intermediate tokenizer on a Korean web corpus, with a vocabulary of 40,000 tokens.

  2. **Extraction of New Korean Tokens:** From the intermediate tokenizer, we identified all Korean tokens not present in the base model's tokenizer.

  3. **Manual Tokenizer Construction:** We then built the target tokenizer, focusing on these new Korean tokens.

  4. **Frequency Analysis:** Using the target tokenizer, we processed a 100GB Korean corpus to count each token's frequency (a simplified sketch of this step follows the list below).

  5. **Refinement of Token List:** We removed tokens appearing fewer than 6,000 times, so that every remaining token would occur often enough to be learned during later model training.

  6. **Inclusion of Single-Letter Characters:** We counted Korean single-letter characters missing from the target tokenizer and added those that appeared more than 6,000 times.

  7. **Iterative Refinement:** We repeated steps 2 to 6 until there were no more tokens to drop or add.

  8. **Training Bias Towards New Tokens:** We biased our training data to include more texts containing the new tokens, for more effective learning.

This rigorous approach ensured a comprehensive and contextually rich Korean vocabulary for the model.
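
To make steps 4 and 5 above concrete, here is a minimal, hypothetical sketch of the frequency-counting and refinement pass. The function names and the generic tokenizer callback are illustrative stand-ins for our internal pipeline, not the code we actually ran:

```python
from collections import Counter
from typing import Callable, Iterable, List

MIN_FREQUENCY = 6_000  # step 5: tokens seen fewer times than this are dropped


def count_token_frequencies(tokenize: Callable[[str], List[str]],
                            corpus_lines: Iterable[str]) -> Counter:
    """Step 4: run the target tokenizer over the corpus and count each token."""
    counts: Counter = Counter()
    for line in corpus_lines:
        counts.update(tokenize(line))
    return counts


def refine_token_list(candidate_tokens: List[str], counts: Counter) -> List[str]:
    """Step 5: keep only candidate tokens that appear frequently enough."""
    return [token for token in candidate_tokens if counts[token] >= MIN_FREQUENCY]
```

In the real pipeline this pass runs over the 100GB Korean corpus with the target tokenizer, and the iteration in step 7 simply repeats the counting and filtering after each change to the candidate token list.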

## Citation

```
@misc{kim2024efficient,
  title={Efficient and Effective Vocabulary Expansion Towards Multilingual Large Language Models},
  author={Seungduk Kim and Seungtaek Choi and Myeongho Jeong},
  year={2024},
  eprint={2402.XXXXX},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```