---
language:
- yue
license: other
license_name: yi-license
license_link: https://huggingface.co/01-ai/Yi-6B/blob/main/LICENSE
pipeline_tag: text-generation
model-index:
- name: CantoneseLLM-6B-preview202402
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: AI2 Reasoning Challenge (25-Shot)
      type: ai2_arc
      config: ARC-Challenge
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: acc_norm
      value: 55.63
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=hon9kon9ize/CantoneseLLM-6B-preview202402
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag (10-Shot)
      type: hellaswag
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc_norm
      value: 75.8
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=hon9kon9ize/CantoneseLLM-6B-preview202402
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU (5-Shot)
      type: cais/mmlu
      config: all
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 63.07
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=hon9kon9ize/CantoneseLLM-6B-preview202402
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA (0-shot)
      type: truthful_qa
      config: multiple_choice
      split: validation
      args:
        num_few_shot: 0
    metrics:
    - type: mc2
      value: 42.26
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=hon9kon9ize/CantoneseLLM-6B-preview202402
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande (5-shot)
      type: winogrande
      config: winogrande_xl
      split: validation
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 74.11
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=hon9kon9ize/CantoneseLLM-6B-preview202402
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GSM8k (5-shot)
      type: gsm8k
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 30.71
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=hon9kon9ize/CantoneseLLM-6B-preview202402
      name: Open LLM Leaderboard
---

# CantoneseLLM

This model is a further pre-trained version of [01-ai/Yi-6B](https://huggingface.co/01-ai/Yi-6B), trained on 800M tokens of Cantonese text compiled from various sources, including translated zh-yue Wikipedia, translated RTHK news ([datasets/jed351/rthk_news](https://huggingface.co/datasets/jed351/rthk_news)), Cantonese-filtered CC100, and Cantonese textbooks generated by Gemini Pro.

This is a preview version for experimental use only. We will fine-tune it on downstream tasks and evaluate its performance.


### [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)

Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_hon9kon9ize__CantoneseLLM-6B-preview202402).

| Metric              | Value |
|---------------------|-------|
| Avg.                | 56.93 |
| ARC (25-shot)       | 55.63 |
| HellaSwag (10-shot) | 75.8  |
| MMLU (5-shot)       | 63.07 |
| TruthfulQA (0-shot) | 42.26 |
| Winogrande (5-shot) | 74.11 |
| GSM8K (5-shot)      | 30.71 |
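
The `Avg.` row is consistent with a plain unweighted mean of the six benchmark scores. A quick check, assuming that is how the leaderboard aggregates them (the dictionary below is only a transcription of the table):

```python
# Scores copied from the results table; the aggregation rule (unweighted mean)
# is an assumption that happens to reproduce the reported average.
scores = {
    "ARC (25-shot)": 55.63,
    "HellaSwag (10-shot)": 75.8,
    "MMLU (5-shot)": 63.07,
    "TruthfulQA (0-shot)": 42.26,
    "Winogrande (5-shot)": 74.11,
    "GSM8K (5-shot)": 30.71,
}

average = sum(scores.values()) / len(scores)
print(f"{average:.2f}")  # prints 56.93
```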


## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "hon9kon9ize/CantoneseLLM-6B-preview202402"
device = "cuda:0" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Yi-6B is a decoder-only model, so load it with the causal-LM head
# and place it on the same device as the inputs.
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)

prompt = "歷經三年疫情，望穿秋水終於全面復常，隨住各項防疫措施陸續放寬以至取消，香港"

# Example values only; the original snippet did not specify them.
max_length = 200
temperature = 0.7

input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)
output = model.generate(input_ids, max_length=max_length, num_return_sequences=1, repetition_penalty=1.1, do_sample=True, temperature=temperature, top_k=50, top_p=0.95)
output = tokenizer.decode(output[0], skip_special_tokens=True)

# Example output (sampling is random, so results will differ between runs):
# 歷經三年疫情，望穿秋水終於全面復常，隨住各項防疫措施陸續放寬以至取消，香港旅遊業可謂「起死回生」。
# 不過，旅遊業嘅復蘇之路並唔順利，香港遊客數量仍然遠低於疫前水平，而海外旅客亦只係恢復到疫情前約一半。有業界人士認為，當局需要進一步放寬入境檢疫措施，吸引更多國際旅客來港，令旅遊業得以真正復甦。
```
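
The `temperature`, `top_k`, and `top_p` arguments in the call above shape which tokens can be sampled at each step. The following is a toy illustration of that filtering on a hand-written score table, not the `transformers` implementation; the function name `filter_next_token_probs` and the logit values are hypothetical.

```python
import math


def filter_next_token_probs(logits, temperature=1.0, top_k=50, top_p=0.95):
    """Return {token: prob} after temperature scaling, top-k, then top-p filtering."""
    # Temperature scaling: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    # Top-k: keep only the k highest-scoring tokens.
    kept = sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Softmax over the surviving tokens (subtract the max for numerical stability).
    peak = kept[0][1]
    weights = [(tok, math.exp(score - peak)) for tok, score in kept]
    total = sum(w for _, w in weights)
    probs = [(tok, w / total) for tok, w in weights]
    # Top-p (nucleus): keep the smallest prefix whose cumulative probability reaches top_p.
    nucleus, cumulative = [], 0.0
    for tok, p in probs:
        nucleus.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalise so the remaining probabilities sum to 1.
    norm = sum(p for _, p in nucleus)
    return {tok: p / norm for tok, p in nucleus}


# Hypothetical next-token scores, chosen only to make the filtering visible.
toy_logits = {"旅遊業": 3.0, "經濟": 2.0, "市民": 1.0, "天氣": -2.0}
filtered = filter_next_token_probs(toy_logits, temperature=0.7, top_k=3, top_p=0.95)
print(filtered)
```

With these toy values, `top_k=3` drops the lowest-scoring token and `top_p=0.95` then trims the tail further, so sampling only ever chooses among the most likely candidates.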

## Limitations and Bias

The model is intended for Cantonese language understanding and generation tasks and may not be suitable for other Chinese languages. It was trained on a range of Cantonese text, including news, Wikipedia, and textbooks, so it may not handle informal or dialectal Cantonese well. It may also contain bias and misinformation; please use it with caution.

We found that the model has not learned up-to-date Hong Kong knowledge well, possibly because the corpus is not large enough to override what the original model learned. We will continue to improve the model and corpus in the future.