indiejoseph jed351 committed
Commit 72198a1 · verified · 1 Parent(s): bd9162d

Update README.md (#1)

- Update README.md (38ce76f79f03b4618006df89fd8c0198ec39f19f)


Co-authored-by: Jed Cheng <[email protected]>

Files changed (1)
  1. README.md +96 -41
README.md CHANGED
@@ -1,60 +1,115 @@
  ---
- license: other
- base_model: hon9kon9ize/CantoneseLLM-0.5-34b
- tags:
- - llama-factory
- - full
- - generated_from_trainer
- model-index:
- - name: CantonesellmChat-v0.5-34B-sft
- results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->

- # CantonesellmChat-v0.5-34B-sft

- This model is a fine-tuned version of hon9kon9ize/CantoneseLLM-0.5-34b on the yue_sft202404 dataset.

- ## Model description

- More information needed

- ## Intended uses & limitations

- More information needed

- ## Training and evaluation data

- More information needed

- ## Training procedure

- ### Training hyperparameters

- The following hyperparameters were used during training:
- - learning_rate: 1e-05
- - train_batch_size: 2
- - eval_batch_size: 8
- - seed: 42
- - distributed_type: multi-GPU
- - num_devices: 16
- - gradient_accumulation_steps: 4
- - total_train_batch_size: 128
- - total_eval_batch_size: 128
- - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- - lr_scheduler_type: cosine
- - lr_scheduler_warmup_ratio: 0.1
- - num_epochs: 3.0

- ### Training results



- ### Framework versions

- - Transformers 4.43.3
- - Pytorch 2.3.1+cu121
- - Datasets 2.20.0
- - Tokenizers 0.19.1

  ---
+ license: apache-2.0
+ language:
+ - yue
  ---

+ This model is a continually pretrained version of [Yi-34B](https://huggingface.co/01-ai/Yi-1.5-34B) trained on a Cantonese corpus consisting of translated Hong Kong news, Wikipedia articles, subtitles, and open-source dialogue corpora. We also extended the vocabulary to include common Cantonese words.
+
+ The goal of this model was to evaluate whether we could train a language model that is fluent in Cantonese with limited resources (400 million tokens).
+ Surprisingly, the outcome of a [6B model](https://huggingface.co/hon9kon9ize/CantoneseLLMChat-v0.5) was quite good.

+ Although there are still some issues with mirror misalignment between written Chinese and Cantonese, as well as with knowledge transfer across languages, we trained a 34B model using 4 compute nodes, each with 4 Nvidia H100 GPUs, on the [Genkai Supercomputer](https://www.cc.kyushu-u.ac.jp/scp/system/Genkai/hardware/).

+ You can interact with the model in this Space: [CantoneseLLMChat](https://huggingface.co/spaces/hon9kon9ize/CantoneseLLMChat)

+ ### Result

+ ```markdown
+ User: 講一個關於維尼大大嘅故仔畀我聽

+ Assistant: 從前有個程式員叫維尼大大,佢寫咗好多有趣又實用嘅程式。有一日,佢收到一份特殊嘅任務,要佢編寫一款能夠翻譯任何語言嘅程式。呢項任務對維尼大大嚟講唔簡單,但佢決定接受挑戰。經過幾個月嘅努力同探索,維尼大大終於成功噉創造出一款多功能嘅翻譯程式。呢種程式可以將人哋講嘅嘢變成電腦睇到嘅文字、再將電腦睇到嘅文字轉換成人類理解得到嘅語言。呢種翻譯技術令世界各地嘅人民都可以輕易交流,增進彼此之間嘅了解與和諧共處。而維尼大大亦因為自己嘅創新同貢獻獲得咗無數獎項同認可,成為一位備受尊敬嘅科學家及先鋒人物。

+ ###

+ User: 點解香港人咁唔鍾意大陸人?

+ Assistant: 我覺得呢個問題冇一個統一嘅答案。有啲香港人可能會對大陸人有偏見,因為佢哋可能有過負面嘅經歷或者觀察到大陸人有某啲行為習慣令佢哋反感。但係我都要強調下,每一個人都有自己嘅觀點同感受,而且應該得到尊重。每個人都值得被接納同理解。如果你有任何疑問或需要幫助解決任何問題,請隨時提出,我將盡力提供支持同協助。

+ ###

+ User: 我x你!

+ Assistant: 唔好意思,我係個AI語言模型,我唔應該使用粗俗語言。如果你有其他問題,我會盡力幫你。

+ ```


+ ### System Prompt

+ To assign a role to the chat model, specify a system prompt. The default system prompt is "你係由 hon9kon9ize 開發嘅 CantoneseLLM,你係一個好幫得手嘅助理" ("You are CantoneseLLM, developed by hon9kon9ize; you are a helpful assistant"), which steers the model toward responding in Cantonese. We also found that the response language is not determined solely by the system prompt; it also depends on the user input.

+ ```python
+ [
+     {"role": "system", "content": "你係由 hon9kon9ize 開發嘅 CantoneseLLM,你係一個好幫得手嘅助理"}, # this is the default system prompt; this line can be omitted
+     {"role": "user", "content": "你叫咩名?"} # "What is your name?"
+ ]
+
+ # Output: 我係CantoneseLLM,一個由hon9kon9ize開發嘅人工智能助手。我致力於為用戶提供準確、有針對性嘅回答同幫助。
+ ```
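
The remark that the default system prompt may be omitted can be illustrated with a small helper. This is a minimal sketch, not part of the model's tooling: `DEFAULT_SYSTEM_PROMPT` and `with_default_system` are hypothetical names, and the chat template may already apply the default on its own, in which case the helper is only useful for making the prompt explicit.

```python
# Default system prompt quoted from the model card.
DEFAULT_SYSTEM_PROMPT = "你係由 hon9kon9ize 開發嘅 CantoneseLLM,你係一個好幫得手嘅助理"


def with_default_system(messages):
    """Return a copy of `messages` that starts with a system message.

    Hypothetical helper: if the caller already supplied a system message,
    the conversation is returned unchanged; otherwise the default prompt
    is prepended.
    """
    if messages and messages[0].get("role") == "system":
        return list(messages)
    return [{"role": "system", "content": DEFAULT_SYSTEM_PROMPT}] + list(messages)
```

A custom role can still be assigned by passing a conversation whose first message has the `system` role; the helper leaves such conversations untouched.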
+
+
+ ### Chat Template
+
+ The template format is similar to [ChatML](https://github.com/MicrosoftDocs/azure-docs/blob/main/articles/ai-services/openai/includes/chat-markup-language.md#working-with-chat-markup-language-chatml), but we replaced the role names with Yi's reserved tokens to save some context length.
+
+ ```
+ <|im_start|><|System|>
+ Provide some context and/or instructions to the model.
+ <|im_end|>
+ <|im_start|><|Human|>
+ The user’s message goes here
+ <|im_end|>
+ <|im_start|><|Assistant|>
+ ```
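
To make the layout above concrete, the following sketch renders a message list into that format by hand. It is an illustration under assumptions, not the template shipped with the model: `ROLE_TOKENS` and `render_chat` are hypothetical names, the assistant token is written as `<|Assistant|>` (check the tokenizer's vocabulary for the exact spelling), and the newline placement is inferred from the example. In practice `tokenizer.apply_chat_template` should be preferred.

```python
# Assumed mapping from chat roles to Yi's reserved role tokens.
ROLE_TOKENS = {
    "system": "<|System|>",
    "user": "<|Human|>",
    "assistant": "<|Assistant|>",
}


def render_chat(messages, add_generation_prompt=True):
    """Render messages in the ChatML-like layout shown above (sketch only)."""
    parts = []
    for message in messages:
        role_token = ROLE_TOKENS[message["role"]]
        # Each turn: start marker + role token, the content, then the end marker.
        parts.append(f"<|im_start|>{role_token}\n{message['content']}\n<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|><|Assistant|>\n")
    return "".join(parts)
```

Because each role is a single reserved token rather than a spelled-out word, every turn spends fewer tokens on markup, which is the saving the template is designed for.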
+
+ ### Usage
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, BitsAndBytesConfig, LlamaTokenizer
+
+ # Placeholder: the 6B chat model linked above; replace with the id of the checkpoint you want to load
+ model_name = "hon9kon9ize/CantoneseLLMChat-v0.5"
+
+ # bnb_config = BitsAndBytesConfig(
+ #     load_in_4bit=True,
+ #     bnb_4bit_use_double_quant=True,
+ #     bnb_4bit_quant_type="nf4",
+ #     bnb_4bit_compute_dtype=torch.bfloat16
+ # )
+
+ model = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     torch_dtype=torch.bfloat16,
+     device_map='auto',
+     # quantization_config=bnb_config, # uncomment this line and bnb_config above to use 4-bit quantization
+ )
+ tokenizer = LlamaTokenizer.from_pretrained(model_name)
+
+ def chat(messages, temperature=0.9, max_new_tokens=200):
+     # the chat template definition can be found in generation_config.json
+     input_ids = tokenizer.apply_chat_template(conversation=messages, tokenize=True, add_generation_prompt=True, return_tensors='pt').to(model.device)
+     output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens, temperature=temperature, num_return_sequences=1, do_sample=True, top_k=50, top_p=0.95, num_beams=3, repetition_penalty=1.18)
+     # decode only the newly generated tokens; special tokens are kept in the text
+     response = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=False)
+
+     return response
+
+ # "Who is the Chief Executive of Hong Kong?"
+ messages = [{"role": "user", "content": "邊個係香港特首?"}]
+
+ # the chat template includes a default system message, but you can define your own
+ messages = [
+     {"role": "system", "content": "你叫做櫻子,你要同用家北原伊織進行對話,你同北原伊織係情侶關係。"},
+     {"role": "user", "content": "櫻子,今日你會去邊度玩呀?"}
+ ]
+
+ print(chat(messages))
+
+ ```
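
Because `chat` decodes with `skip_special_tokens=False`, the returned text may still carry the end-of-turn marker. The sketch below trims it; `clean_response` is a hypothetical helper, and it assumes `<|im_end|>` closes the assistant turn as in the chat template above.

```python
def clean_response(text, end_token="<|im_end|>"):
    """Cut decoded text at the first end-of-turn token and trim whitespace.

    Hypothetical post-processing step; if the marker is absent the text is
    returned with only surrounding whitespace removed.
    """
    return text.split(end_token, 1)[0].strip()
```

It would typically be applied to the return value of `chat`, for example `clean_response(chat(messages))`.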
+
+ You can also open this [Colab demo](https://colab.research.google.com/drive/1zEEvlCXbwDyQZ2QfrEuuqAQcBVchNL_9?usp=sharing).
+
+ ### Limitation
+
+ The model is intended for Cantonese language understanding and generation tasks and may not be suitable for other Chinese languages. It was trained on a diverse range of Cantonese text, including news, Wikipedia, and textbooks, so it may not be suitable for informal or dialectal Cantonese. It may also contain bias and misinformation; please use it with caution.
+
+ ### Hallucination
+
+ Like most LLMs, this model has hallucination issues: it may generate incorrect or misleading information. Please use it with caution.